The real productivity jump on Linux isn't memorizing more commands; it's learning to compose small tools into clear data flows. The pipe operator `|` embodies the core Unix philosophy: make each tool do one thing well (`grep` only filters, `awk` only extracts fields, `sort` only sorts), then chain them into a readable, debuggable pipeline. This post starts from the data flow model (stdin/stdout/stderr), explains the semantics of pipes and redirection (what `>`, `>>`, `2>`, `2>&1`, and `<` each do), then fills in typical patterns for log triage, text filtering, statistical aggregation, and batch processing (when to use `grep`/`awk`/`sed`/`sort`/`uniq`/`wc`/`cut`/`tr`, and how to progressively narrow scope). Practical cases (Nginx log analysis, batch file operations, safe deletion) cover pitfalls like spaces and newlines in filenames (the correct use of `find -print0` + `xargs -0`). After reading, you should be able to replace many "need to write a script" tasks with one or two readable command lines, and more easily understand other people's one-liners.
## Data Flow Model: stdin/stdout/stderr and File Descriptors

### Three Standard Streams

Every Linux process has three standard streams:
| Stream | File Descriptor | Default Behavior | Example |
|---|---|---|---|
| stdin | 0 | Read input from keyboard | cat (waits for user input when no args) |
| stdout | 1 | Output to screen | echo "hello" |
| stderr | 2 | Error output to screen | ls /nonexistent 2>&1 |
**Why separate stdout and stderr?**

- Normal output and errors can be handled separately (e.g., save normal output to a file while errors still appear on screen)
- The pipe `|` only passes stdout (not stderr), so error messages don't pollute the data flow
Example:

```bash
ls /nonexistent              # Error message goes to stderr (screen)
ls /nonexistent 2> err.log   # Error message redirected to err.log
ls /nonexistent 2>&1         # stderr redirected to stdout (merged into one stream)
```
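To see that a pipe carries only stdout, here is a minimal, self-contained demo. The `produce` function is a made-up stand-in for any command that writes to both streams:

```shell
# produce: a toy command that writes one line to stdout and one to stderr
produce() { echo "data"; echo "oops" >&2; }

produce 2>/dev/null | wc -l   # prints 1: only the stdout line travels through the pipe
produce 2>&1 | wc -l          # prints 2: stderr was merged into stdout before the pipe
```

The second form is the standard trick when you want `grep` or `less` to see error messages too.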
### File Descriptors (FD)

File descriptors are a process's "handles" for open files, represented by small integers:

- `0`: stdin
- `1`: stdout
- `2`: stderr
- `3+`: files opened by the process itself
View a process's open file descriptors:

```bash
ls -l /proc/$$/fd   # $$ is the current shell's PID
```
Example output:

```
lrwx------ 1 user user 0 /proc/12345/fd/0 -> /dev/pts/0  # stdin
lrwx------ 1 user user 0 /proc/12345/fd/1 -> /dev/pts/0  # stdout
lrwx------ 1 user user 0 /proc/12345/fd/2 -> /dev/pts/0  # stderr
```
## Redirection: Controlling Data Flow Direction

### Output Redirection (stdout)

```bash
echo "hello" > file.txt   # Overwrite (file cleared if it exists)
```
Common usage:

```bash
ls -l > filelist.txt   # Save a file list
date >> log.txt        # Append a timestamp to a log
```
### Error Output Redirection (stderr)

```bash
ls /nonexistent 2> err.log   # Error output redirected to err.log
```
### Redirect Both stdout and stderr

**Method 1: `2>&1` (traditional)**

```bash
command > output.log 2>&1   # Both stdout and stderr go to output.log
```
**Order matters:**

- `> output.log` first redirects stdout to output.log
- `2>&1` then redirects stderr to wherever stdout currently points (also output.log)

Wrong way:

```bash
command 2>&1 > output.log   # Wrong! stderr is pointed at stdout's current target (the screen) first, and only then does stdout go to the file
```
**Method 2: `&>` (modern, recommended)**

```bash
command &> output.log   # Both stdout and stderr go to output.log
```

Note that `&>` is a Bash extension; `> output.log 2>&1` works in any POSIX shell.
### Discard Output (/dev/null)

`/dev/null` is a special "black hole" file; anything written to it is discarded.

```bash
command > /dev/null   # Discard stdout
```
Use cases:

- You don't want to see a command's output (e.g., scripts run from cron)
- You only care whether the command succeeded (check the exit code via `$?`)
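The exit-code pattern looks like this in practice (a small sketch; `grep`'s `-q` flag suppresses output by itself and is an alternative to redirecting):

```shell
# Run a command only for its exit code, discarding all of its output:
if echo "hello world" | grep "world" > /dev/null 2>&1; then
    echo "match found"
fi

# grep -q is a built-in way to achieve the same thing:
echo "hello world" | grep -q "world"
echo "exit code: $?"   # 0 means success
```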
### Input Redirection (stdin)

```bash
sort < input.txt   # Read input from input.txt
```
Here-document (multi-line input):

```bash
cat <<EOF > config.txt
line 1
line 2
line 3
EOF
```
Here-string (single-line input):

```bash
grep "error" <<< "ERROR: something bad"
```
## Pipe Operator: Chaining Commands

### Core Concept of Pipes

**Unix philosophy:** each tool does one thing, does it well, and you combine them via pipes.
Example:

```bash
cat access.log | grep "404" | wc -l
```
Breakdown:

1. `cat access.log`: output the log content (stdout)
2. `grep "404"`: read from stdin, keep lines containing "404" (stdout)
3. `wc -l`: read from stdin, count lines (stdout)
**Why this design?**

- Avoids temporary files (data flows through memory, not disk)
- Highly readable (each step is clear)
- Easy to debug (build the pipeline step by step and inspect each stage's output)
### Debugging Pipes: Using tee

`tee` writes its input to a file and simultaneously passes it through to stdout (like a "T-junction" in plumbing).

```bash
cat access.log | grep "404" | tee filtered.log | wc -l
```

- `tee filtered.log`: saves grep's output to filtered.log while passing it on to the next command
- This lets you inspect intermediate results, which is helpful for debugging
## Text Processing Toolchain: grep/awk/sed/cut/tr/sort/uniq

### grep: Filter Lines

`grep` is the most commonly used text filtering tool: it prints matching lines.
Basic usage:

```bash
grep "pattern" file        # Find lines matching pattern in file
command | grep "pattern"   # Find matches in a command's output
```
Common options:

- `-i`: ignore case
- `-v`: invert match (show only lines NOT containing the pattern)
- `-n`: show line numbers
- `-A N`: show N lines After each match
- `-B N`: show N lines Before each match
- `-C N`: show N lines of Context around each match
- `-E`: extended regex (supports `|`, `+`, `?`, etc.)
- `-r`: search directories recursively
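A quick, self-contained demo of `-E`, `-v`, and `-c` on made-up inline log lines:

```shell
log=$'INFO ok\nERROR bad\nWARN odd'

printf '%s\n' "$log" | grep -E 'ERROR|WARN'   # extended regex: match either level
printf '%s\n' "$log" | grep -v 'INFO'         # invert: drop INFO lines
printf '%s\n' "$log" | grep -c 'ERROR'        # count matching lines: prints 1
```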
Practical examples:

**1. View errors in logs**

```bash
grep -i "error" /var/log/syslog   # Case-insensitive search for "error"
```
**2. View error context**

```bash
grep -C 3 "OutOfMemoryError" app.log   # Show 3 lines before and after each error
```
**3. Recursively search a directory**

```bash
grep -r "TODO" /srv/project   # Recursively find TODO in the project directory
```
**4. Count matches**

```bash
grep -c "ERROR" app.log   # Count lines containing ERROR
```
### awk: Extract Fields and Aggregate

`awk` is a powerful tool for processing columnar text (logs, CSV, tables).

Basic concepts:

- awk processes text line by line, splitting each line into fields by whitespace (or a specified delimiter)
- `$1` is the first field, `$2` the second, and `$0` the entire line
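A minimal illustration of field splitting on a made-up line:

```shell
echo "alice 30 admin" | awk '{print $1}'   # first field: alice
echo "alice 30 admin" | awk '{print $3}'   # third field: admin
echo "alice 30 admin" | awk '{print $0}'   # the whole line, unchanged
```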
Common examples:

**1. Extract fields**

```bash
# Nginx log format: IP - - [time] "GET /path HTTP/1.1" 200 1234
awk '{print $1}' access.log       # Print the first field (the IP)
awk '{print $1, $9}' access.log   # Print the IP and the status code
```
**2. Filter lines (like grep)**

```bash
awk '/404/ {print $0}' access.log   # Only show lines containing 404
```
**3. Statistics and aggregation**

```bash
# Count occurrences of each status code
awk '{count[$9]++} END {for (code in count) print code, count[code]}' access.log
```
**4. Custom delimiter**

```bash
# Comma-separated CSV file
awk -F',' '{print $1}' data.csv   # Print the first comma-separated column
```
### sed: Text Replacement and Editing

`sed` is a stream editor for replacing, deleting, and inserting text.
Common examples:

**1. Replace text**

```bash
sed 's/foo/bar/' file.txt    # Replace the first foo with bar on each line
sed 's/foo/bar/g' file.txt   # Replace every foo on each line (g = global)
```
**2. Delete lines**

```bash
sed '/pattern/d' file.txt   # Delete lines containing pattern
```
**3. Insert and append**

```bash
sed '1i\First Line' file.txt   # Insert text before the first line
```
### cut/tr/sort/uniq: Simple, Efficient Text Tools

**cut: extract fields (simple cases)**

```bash
cut -d',' -f1 data.csv   # Extract the first comma-separated column
```
**tr: character replacement/deletion**

```bash
echo "HELLO" | tr 'A-Z' 'a-z'   # Convert to lowercase
```
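Beyond one-to-one translation, `tr` can also delete characters (`-d`) and squeeze runs of repeats (`-s`):

```shell
echo "hello world" | tr ' ' '_'   # replace spaces: hello_world
echo "a1b2c3" | tr -d '0-9'       # delete all digits: abc
echo "aaabbb" | tr -s 'ab'        # squeeze repeated a's and b's: ab
```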
**sort: sorting**

```bash
sort file.txt   # Sort alphabetically
```
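Watch out for the classic pitfall: the default order is lexicographic, so `10` sorts before `2`; use `-n` for numeric order:

```shell
printf '10\n2\n1\n' | sort      # lexicographic: 1, 10, 2
printf '10\n2\n1\n' | sort -n   # numeric: 1, 2, 10
printf 'b\na\nc\n' | sort -r    # reverse alphabetical: c, b, a
```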
**uniq: remove duplicates (adjacent only)**

```bash
sort file.txt | uniq   # Sort first, then remove duplicates
```

**Important:** `uniq` only removes *adjacent* duplicate lines, so you usually need to `sort` first.
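A quick demo of why sorting first matters:

```shell
printf 'a\nb\na\n' | uniq          # prints a, b, a: the non-adjacent duplicate survives
printf 'a\nb\na\n' | sort | uniq   # prints a, b: duplicates collapsed after sorting
```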
## Practical Case: Nginx Log Analysis

Suppose you have an Nginx log file `access.log` where each line looks like:

```
192.168.1.100 - - [28/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.101 - - [28/Jan/2025:12:00:01 +0000] "POST /api/login HTTP/1.1" 404 567
```
### 1. Count Top Visiting IPs

```bash
awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10
```
Breakdown:

1. `awk '{print $1}'`: extract the IP address (column 1)
2. `sort`: sort (makes identical IPs adjacent)
3. `uniq -c`: collapse duplicates and count occurrences
4. `sort -nr`: sort by count in descending order (`-n` numeric, `-r` reverse)
5. `head -10`: show only the top 10
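You can verify the count-top-IPs pipeline on inline sample data (hypothetical IPs and simplified log lines):

```shell
printf '%s\n' \
  '10.0.0.1 - - [x] "GET / HTTP/1.1" 200 1' \
  '10.0.0.2 - - [x] "GET / HTTP/1.1" 200 1' \
  '10.0.0.1 - - [x] "GET / HTTP/1.1" 404 1' |
  awk '{print $1}' | sort | uniq -c | sort -nr | head -1
# top line: count 2 for 10.0.0.1
```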
### 2. Count Most Visited URLs

```bash
awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -10
```

- `$7` is the request path (like `/api/users`)
### 3. Count Occurrences of Each Status Code

```bash
awk '{print $9}' access.log | sort | uniq -c | sort -nr
```

- `$9` is the status code (like 200, 404, 500)
Example output:

```
1234 200
 567 404
 123 500
```
### 4. Find Errors in a Specific Hour

```bash
grep "28/Jan/2025:12:" access.log | grep -E " (4|5)[0-9]{2} " | tail -n 100
```

- The first `grep` filters by time
- The second `grep` filters 4xx and 5xx status codes
- `tail -n 100` shows only the last 100 matching lines
## xargs: Batch File Processing

`xargs` converts the previous command's output (usually a file list) into arguments for the next command.
### Why xargs Is Needed

**Problem:** some commands (like `rm`, `cp`, `mv`) don't read arguments from stdin.

```bash
find . -name "*.tmp"        # Outputs a file list
find . -name "*.tmp" | rm   # ❌ Wrong: rm ignores stdin
```

**Solution:** use `xargs` to turn the file list into arguments:

```bash
find . -name "*.tmp" | xargs rm   # ✅ Correct
```
### Basic Usage

```bash
echo "file1 file2 file3" | xargs rm   # Delete three files
```
### Advanced Usage: -I and the Placeholder {}

```bash
find . -name "*.log" | xargs -I{} cp {} {}.bak   # Back up each file with a .bak copy
```

- `-I{}`: defines `{}` as the placeholder (the older `-i` spelling is deprecated in GNU xargs)
- `{}`: stands for each input filename
- `{}.bak`: appends `.bak` to the filename
### Handling Filenames with Spaces (Important!)

**Problem:** by default `xargs` splits on whitespace, so a filename containing spaces is treated as multiple arguments.

Wrong example:

```bash
find . -name "*.txt" | xargs rm   # ❌ "my file.txt" is treated as "my" and "file.txt"
```

**Correct approach:** use `find -print0` + `xargs -0`:

```bash
find . -name "*.txt" -print0 | xargs -0 rm
```

- `-print0`: separates filenames with the null character (`\0`) instead of newlines
- `-0`: tells xargs to split on the null character
Or use `find -exec` (simpler):

```bash
find . -name "*.txt" -exec rm {} +
```
## Practical Cases: Batch File Operations

### Case 1: Batch Rename Files

Suppose you have files `img_001.jpg`, `img_002.jpg`, ... and want to rename them to `photo_001.jpg`, `photo_002.jpg`, ...

```bash
for file in img_*.jpg; do
    mv "$file" "${file/img/photo}"   # Bash substring replacement: img -> photo
done
```
Or use the `rename` command (may need installation; this is the Perl `rename` syntax):

```bash
rename 's/img/photo/' img_*.jpg
```
### Case 2: Batch Modify File Permissions

```bash
find /var/www/html -type f -exec chmod 644 {} +   # Files to 644
find /var/www/html -type d -exec chmod 755 {} +   # Directories to 755
```
### Case 3: Batch Delete Empty Files

```bash
find /tmp -type f -empty -delete   # Delete all empty files
```
### Case 4: Batch Compress Old Log Files

```bash
find /var/log -name "*.log" -mtime +7 -exec gzip {} \;
```

- `-mtime +7`: files last modified more than 7 days ago
- `-exec gzip {} \;`: run gzip on each file
## Advanced Techniques

### Process Substitution

**Syntax:** `<(command)`

**Purpose:** treat a command's output as a temporary file.

Example: compare two sorted files (without creating temp files yourself):

```bash
diff <(sort file1.txt) <(sort file2.txt)
```
Equivalent to:

```bash
sort file1.txt > /tmp/sorted1
sort file2.txt > /tmp/sorted2
diff /tmp/sorted1 /tmp/sorted2
rm /tmp/sorted1 /tmp/sorted2
```
### Parallel Processing (xargs -P)

If you have multiple CPU cores, you can process files in parallel:

```bash
find . -name "*.json" -print0 | xargs -0 -P 8 -n 1 jq -c . > /dev/null
```

- `-P 8`: run at most 8 processes at once
- `-n 1`: pass 1 argument per command invocation
## Safety and Best Practices

### 1. Never Parse ls Output

Wrong example:

```bash
ls | xargs rm   # ❌ Filenames with spaces or newlines will break
```

Correct approach:

```bash
find . -maxdepth 1 -type f -print0 | xargs -0 rm
```
### 2. Preview Before Deletion

```bash
find . -name "*.tmp" -print    # First, see which files would be deleted
find . -name "*.tmp" -delete   # Then actually delete them
```
### 3. Use set -e and set -o pipefail (In Scripts)

```bash
#!/usr/bin/env bash
set -e           # Exit immediately if any command fails
set -o pipefail  # A pipeline fails if any stage fails, not just the last
```
### 4. The Importance of Quotes

Wrong example:

```bash
dir="my documents"
rm -rf $dir   # ❌ Word splitting: tries to delete "my" and "documents" as two paths
```

Correct approach:

```bash
rm -rf "$dir"   # ✅ Correctly deletes the "my documents" directory
```
## Summary and Further Reading

This article covered the core of Linux pipes and text processing:

1. ✅ Data flow model (stdin/stdout/stderr, file descriptors)
2. ✅ Redirection (`>`, `>>`, `2>`, `2>&1`, `<`)
3. ✅ The pipe operator (`|` principles and debugging techniques)
4. ✅ The text processing toolchain (grep/awk/sed/cut/tr/sort/uniq)
5. ✅ Practical cases (Nginx log analysis, batch file operations)
6. ✅ Correct xargs usage (handling spaces, parallel processing)
7. ✅ Safety and best practices (don't parse ls, preview before deleting, quote correctly)
**Further Reading:**

- The Art of Command Line: an encyclopedia of command-line tips
- `man bash`: the detailed Bash manual (redirection, pipes, etc.)
- `man 1 awk`: the detailed awk manual
**Next Steps:**

- "Linux User Management": learn how to manage users/groups, `/etc/passwd`, `/etc/shadow`, and sudo configuration

By this point, you should have gone from "can use pipes" to "can write readable, debuggable one-liners, quickly analyze logs, and safely batch-process files." Pipes and text processing are core Linux capabilities; mastering them makes you far more efficient at ops tasks.
- Post title: Linux Pipelines and Text Processing: Composing Tools into Data Flows
- Post author: Chen Kai
- Create time: 2023-01-06 00:00:00
- Post link: https://www.chenk.top/en/linux-pipelines/
- Copyright notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.