Linux Pipelines and Text Processing: Composing Tools into Data Flows
Chen Kai

The real productivity jump on Linux isn't memorizing more commands; it's learning to compose small tools into clear data flows. The pipe operator | embodies the core Unix philosophy: make each tool do one thing well (grep only filters, awk only extracts fields, sort only sorts), then chain them into a readable, debuggable pipeline. This post starts from the data flow model (stdin/stdout/stderr), explains the semantics of pipes and redirection (what >, >>, 2>, 2>&1, and < each do), fills in typical patterns for log triage, text filtering, statistical aggregation, and batch processing (when to reach for grep/awk/sed/sort/uniq/wc/cut/tr, and how to progressively narrow scope), and uses practical cases (Nginx log analysis, batch file operations, safe deletion) to cover pitfalls such as spaces and newlines in filenames (the correct find -print0 + xargs -0 pattern). After reading, you should be able to replace many "this needs a script" tasks with one or two readable command lines, and find it easier to understand other people's one-liners.

Data Flow Model: stdin/stdout/stderr and File Descriptors

Three Standard Streams

Every Linux process has three standard streams:

Stream   FD   Default Behavior                Example
stdin    0    Reads input from the keyboard   cat (waits for input when run with no args)
stdout   1    Writes output to the screen     echo "hello"
stderr   2    Writes errors to the screen     ls /nonexistent

Why separate stdout and stderr?

  • Normal output and error output can be handled separately (like normal output saved to file, error output displayed on screen)
  • Pipe | only passes stdout (doesn't pass stderr), so error messages don't pollute data flow

Example:

ls /nonexistent  # Error message outputs to stderr (screen)
ls /nonexistent 2> err.log # Error message redirected to err.log
ls /nonexistent 2>&1 # stderr redirected to stdout (merged into same stream)

File Descriptors (FD)

File descriptors are a process's "handles" for open files, represented by small integers:

  • 0: stdin
  • 1: stdout
  • 2: stderr
  • 3+: Files opened by process itself

View process's open file descriptors:

ls -l /proc/$$/fd  # $$ is the current shell's PID

Example output:

lrwx------ 1 user user 0 /proc/12345/fd/0 -> /dev/pts/0  # stdin
lrwx------ 1 user user 0 /proc/12345/fd/1 -> /dev/pts/0 # stdout
lrwx------ 1 user user 0 /proc/12345/fd/2 -> /dev/pts/0 # stderr
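Descriptors beyond 0/1/2 can also be opened and used directly from the shell. A minimal sketch (the path /tmp/fd3demo.txt is just an illustrative choice): open FD 3 for writing with exec, write through it, then close it:

```shell
# Open FD 3 for writing; it now points at the file
exec 3> /tmp/fd3demo.txt
echo "written via fd 3" >&3   # redirect echo's stdout to FD 3
exec 3>&-                     # close FD 3
cat /tmp/fd3demo.txt          # → written via fd 3
```

While FD 3 is open, it would appear as an extra entry under /proc/$$/fd alongside 0, 1, and 2.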


Redirection: Controlling Data Flow Direction

Output Redirection (stdout)

echo "hello" > file.txt  # Overwrite (file cleared if exists)
echo "world" >> file.txt # Append (add to end of file)

Common usage:

ls -l > filelist.txt  # Save file list
date >> log.txt # Append timestamp to log

Error Output Redirection (stderr)

ls /nonexistent 2> err.log  # Error output redirected to err.log
ls /nonexistent 2>> err.log # Error output appended to err.log

Redirect Both stdout and stderr

Method 1: 2>&1 (Traditional)

command > output.log 2>&1  # Both stdout and stderr redirected to output.log

Order matters:

  • > output.log first redirects stdout to output.log
  • 2>&1 then redirects stderr to stdout's location (also output.log)

Wrong way:

command 2>&1 > output.log  # Wrong! 2>&1 points stderr at stdout's current target (the screen); only stdout then goes to the file

Method 2: &> (Bash Shorthand)

command &> output.log  # Both stdout and stderr redirected to output.log
command &>> output.log # Append mode

Discard Output (/dev/null)

/dev/null is a special "black hole" file; data written to it is discarded.

command > /dev/null  # Discard stdout
command 2> /dev/null # Discard stderr
command &> /dev/null # Discard both stdout and stderr

Use cases:

  • Don't want to see command output (like scripts in cron jobs)
  • Only care if command succeeded (check exit code via $?)
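For instance, a cron-style check might discard all output and branch only on the exit code (the pattern and file here are just an illustration):

```shell
# Run the check silently; branch only on the exit status
if grep "root" /etc/passwd > /dev/null 2>&1; then
    echo "match found"
else
    echo "no match"
fi
```

grep -q achieves the same silencing of stdout, but the redirection form works for any command.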

Input Redirection (stdin)

sort < input.txt  # Read input from input.txt

Here-document (multi-line input):

cat <<EOF > config.txt
line 1
line 2
line 3
EOF

Here-string (single-line input):

grep "error" <<< "ERROR: something bad"


Pipe Operator: Chaining Commands

Core Concept of Pipes

Unix Philosophy: Each tool does one thing, does it well, then combine them via pipes.

Example:

cat access.log | grep "404" | wc -l

Breakdown:

  1. cat access.log: Output the log content (stdout)
  2. grep "404": Read from stdin, keep lines containing "404" (stdout)
  3. wc -l: Read from stdin, count lines (stdout)

Why this design?

  • Avoids temporary files (data flows in memory, not written to disk)
  • Strong readability (each step is clear)
  • Easy debugging (can add pipes step by step, see each step's output)

Debugging Pipes: Using tee

tee can output data to both screen and file simultaneously (like a "T-junction pipe").

cat access.log | grep "404" | tee filtered.log | wc -l
  • tee filtered.log: Saves grep output to filtered.log, while passing to next command
  • This lets you see intermediate results, helpful for debugging

Text Processing Toolchain: grep/awk/sed/cut/tr/sort/uniq

grep: Filter Lines

grep is the most commonly used text filtering tool for finding matching lines.

Basic usage:

grep "pattern" file  # Find lines matching pattern in file
command | grep "pattern" # Find in command output

Common parameters:

  • -i: Ignore case
  • -v: Invert match (only show lines NOT containing pattern)
  • -n: Show line numbers
  • -A N: Show N lines after match (After)
  • -B N: Show N lines before match (Before)
  • -C N: Show N lines before and after match (Context)
  • -E: Extended regex (supports |, +, ?, etc.)
  • -r: Recursively search directory

Practical examples:

1. View Errors in Logs

grep -i "error" /var/log/syslog  # Case-insensitive search for error
grep -E "error|fail|timeout" /var/log/syslog # Search multiple keywords

2. View Error Context

grep -C 3 "OutOfMemoryError" app.log  # Show 3 lines before and after error

3. Recursively Search Directory

grep -r "TODO" /srv/project  # Recursively find TODO in project directory
grep -rn "import numpy" /srv/project # Find and show line numbers

4. Count Matches

grep -c "ERROR" app.log  # Count lines containing ERROR
grep "ERROR" app.log | wc -l # Same (more common)

awk: Extract Fields and Aggregate

awk is a powerful tool for processing columnar text (like logs, CSV, tables).

Basic concepts:

  • awk processes text line by line, splitting each line by whitespace (or specified delimiter) into fields
  • $1 is first field, $2 is second field, $0 is entire line

Common examples:

1. Extract Fields

# Nginx log format: IP - - [time] "GET /path HTTP/1.1" 200 1234
awk '{print $1}' access.log  # Extract IP address (column 1)
awk '{print $7}' access.log  # Extract request path (column 7)
awk '{print $9}' access.log  # Extract status code (column 9)

2. Filter Lines (Like grep)

awk '/404/ {print $0}' access.log     # Only show lines containing 404
awk '$9 >= 400 {print $0}' access.log # Only show lines with status code >= 400

3. Statistics and Aggregation

# Count occurrences of each status code
awk '{count[$9]++} END {for (k in count) print k, count[k]}' access.log

# Count requests per IP
awk '{count[$1]++} END {for (k in count) print k, count[k]}' access.log | sort -nr -k2

4. Custom Delimiter

# Comma-separated CSV file
awk -F',' '{print $2}' data.csv  # -F specifies the delimiter

sed: Text Replacement and Editing

sed is a stream editor for text replacement, deletion, insertion, etc.

Common examples:

1. Replace Text

sed 's/foo/bar/' file.txt  # Replace first foo with bar on each line
sed 's/foo/bar/g' file.txt # Replace all foo with bar on each line (g=global)
sed 's/foo/bar/gi' file.txt # Replace ignoring case

2. Delete Lines

sed '/pattern/d' file.txt  # Delete lines containing pattern
sed '/^$/d' file.txt # Delete empty lines
sed '1,10d' file.txt # Delete first 10 lines

3. Insert and Append

sed '1i\First Line' file.txt  # Insert text before first line
sed '$a\Last Line' file.txt # Append text after last line
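By default sed writes the edited stream to stdout and leaves the file untouched. GNU sed's -i edits the file in place; giving it a suffix (here .bak, an illustrative choice) keeps a backup. Note that BSD/macOS sed uses a slightly different -i syntax:

```shell
# Create a throwaway file, then edit it in place with a backup
printf 'foo one\nfoo two\n' > /tmp/sed-demo.txt
sed -i.bak 's/foo/bar/g' /tmp/sed-demo.txt
cat /tmp/sed-demo.txt       # bar one / bar two
cat /tmp/sed-demo.txt.bak   # original content preserved
```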

cut/tr/sort/uniq: Simple Efficient Text Tools

cut: Extract Fields (Simple Cases)

cut -d',' -f1 data.csv  # Extract comma-separated first column
cut -d':' -f1,7 /etc/passwd # Extract username and shell (columns 1 and 7)

tr: Character Replacement/Deletion

echo "HELLO" | tr 'A-Z' 'a-z'  # Convert to lowercase
echo "a b c" | tr ' ' '\n' # Replace spaces with newlines
echo "abc123" | tr -d '0-9' # Delete numbers

sort: Sorting

sort file.txt  # Sort alphabetically
sort -n file.txt # Sort numerically
sort -r file.txt # Reverse sort
sort -k2 file.txt # Sort by second column
sort -u file.txt # Sort and remove duplicates (equivalent to sort + uniq)

uniq: Remove Duplicates (Only Adjacent Duplicates)

sort file.txt | uniq  # Sort first, then remove duplicates
sort file.txt | uniq -c # Count occurrences of each line
sort file.txt | uniq -d # Only show duplicate lines

Important: uniq only removes adjacent duplicate lines, so you usually need to sort first.
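A quick demonstration of why the sort matters:

```shell
# uniq only collapses adjacent duplicates
printf 'a\nb\na\n' | uniq          # prints a, b, a — the two a's are not adjacent
printf 'a\nb\na\n' | sort | uniq   # prints a, b — sorting makes duplicates adjacent
```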


Practical Case: Nginx Log Analysis

Suppose you have an Nginx log file access.log, each line formatted like:

192.168.1.100 - - [28/Jan/2025:12:00:00 +0000] "GET /api/users HTTP/1.1" 200 1234
192.168.1.101 - - [28/Jan/2025:12:00:01 +0000] "POST /api/login HTTP/1.1" 404 567

1. Count Top Visiting IPs

awk '{print $1}' access.log | sort | uniq -c | sort -nr | head -10

Breakdown:

  1. awk '{print $1}': Extract the IP address (column 1)
  2. sort: Sort (make identical IPs adjacent)
  3. uniq -c: Collapse duplicates and count occurrences
  4. sort -nr: Sort by count, descending (-n numeric, -r reverse)
  5. head -10: Show only the top 10

2. Count Most Visited URLs

awk '{print $7}' access.log | sort | uniq -c | sort -nr | head -10
  • $7 is request path (like /api/users)

3. Count Each Status Code's Occurrences

awk '{print $9}' access.log | sort | uniq -c | sort -nr
  • $9 is status code (like 200, 404, 500)

Example output:

1234 200
567 404
123 500

4. Find Errors in Last Hour

grep "28/Jan/2025:12:" access.log | grep -E " (4|5)[0-9]{2} " | tail -n 100
  • First grep filters time
  • Second grep filters 4xx and 5xx status codes
  • tail -n 100 shows only last 100 lines
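Building on the same assumed field layout ($9 = status code), a single awk pass can also report the overall error rate:

```shell
# Share of 4xx/5xx responses across the whole log (assumes status in column 9)
awk '{total++; if ($9 >= 400) err++} END {printf "%.1f%%\n", 100 * err / total}' access.log
```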

xargs: Batch File Processing

xargs converts the previous command's output (usually a file list) into command-line arguments for the next command.

Why xargs Is Needed

Problem: Some commands (like rm, cp, mv) don't support reading arguments from stdin.

find . -name "*.tmp"  # Outputs file list
find . -name "*.tmp" | rm # ❌ Wrong! rm doesn't read from stdin

Solution: Use xargs to convert file list to arguments

find . -name "*.tmp" | xargs rm  # ✅ Correct

Basic Usage

echo "file1 file2 file3" | xargs rm  # Delete three files

Advanced Usage: -I and Placeholder {}

find . -name "*.log" | xargs -I {} cp {} {}.bak  # Copy each file to a .bak backup
  • -I {}: Defines {} as the placeholder for each input item (the older -i form is deprecated in GNU xargs)
  • {}: Represents each input filename
  • {}.bak: The filename with .bak appended

Handle Filenames with Spaces (Important!)

Problem: Filenames with spaces cause xargs to treat them as multiple arguments.

Wrong example:

find . -name "*.txt" | xargs rm  # If filename is "my file.txt", treated as "my" and "file.txt"

Correct approach: Use find -print0 + xargs -0

find . -name "*.txt" -print0 | xargs -0 rm

  • -print0: Use null character (\0) to separate filenames (instead of newline)
  • -0: xargs uses null character as delimiter

Or use find -exec (simpler):

find . -name "*.txt" -exec rm {} +
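The trailing \; runs the command once per matched file, while + batches many filenames into a single invocation (fewer process launches, so cheaper on large trees). Substituting echo for rm makes the difference visible; /tmp/exec-demo is just a scratch directory:

```shell
mkdir -p /tmp/exec-demo && touch /tmp/exec-demo/a.txt /tmp/exec-demo/b.txt
find /tmp/exec-demo -name "*.txt" -exec echo {} \;   # echo runs once per file: two output lines
find /tmp/exec-demo -name "*.txt" -exec echo {} +    # echo runs once with both names: one output line
```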


Practical Cases: Batch File Operations

Case 1: Batch Rename Files

Suppose you have files img_001.jpg, img_002.jpg, want to rename to photo_001.jpg, photo_002.jpg.

for file in img_*.jpg; do
    mv "$file" "${file/img/photo}"
done

Or use rename command (needs installation):

rename 's/img/photo/' img_*.jpg

Case 2: Batch Modify File Permissions

find /var/www/html -type f -exec chmod 644 {} +  # Files to 644
find /var/www/html -type d -exec chmod 755 {} + # Directories to 755

Case 3: Batch Delete Empty Files

find /tmp -type f -empty -delete  # Delete all empty files

Case 4: Batch Compress Log Files

find /var/log -name "*.log" -mtime +7 -exec gzip {} \;
  • -mtime +7: Files modified more than 7 days ago
  • -exec gzip {} \;: Execute gzip compression on each file

Advanced Techniques

Process Substitution

Syntax: <(command)

Purpose: Treat command output as a temporary file.

Example: Compare two sorted files (without creating temp files)

diff <(sort file1.txt) <(sort file2.txt)

Equivalent to:

sort file1.txt > /tmp/sorted1
sort file2.txt > /tmp/sorted2
diff /tmp/sorted1 /tmp/sorted2
rm /tmp/sorted1 /tmp/sorted2

Parallel Processing (xargs -P)

If you have multiple CPU cores, you can process several files in parallel.

find . -name "*.json" -print0 | xargs -0 -P 8 -n 1 jq -c . > /dev/null
  • -P 8: Run max 8 processes simultaneously
  • -n 1: Pass 1 argument to command each time

Safety and Best Practices

1. Never Parse ls Output

Wrong example:

ls | xargs rm  # ❌ Filenames with spaces will error

Correct approach:

find . -maxdepth 1 -type f -print0 | xargs -0 rm

2. Preview Before Deletion

find . -name "*.tmp" -print  # First see which files to delete
find . -name "*.tmp" -delete # Confirm correct then delete

3. Use set -e and set -o pipefail (In Scripts)

#!/bin/bash
set -e # Exit if any command fails
set -o pipefail # Exit if any command in pipeline fails

# Now if any step fails, script immediately exits
cat file.log | grep "error" | process_errors.sh
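The effect of pipefail is easy to demonstrate with a pipeline whose first stage fails (run this in bash):

```shell
false | true; echo "without pipefail: $?"   # 0: only the last command's status counts
set -o pipefail
false | true; echo "with pipefail: $?"      # 1: any failing stage fails the pipeline
```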

4. Importance of Quotes

Wrong example:

dir="my documents"
rm -rf $dir  # ❌ Unquoted: the shell splits this into two arguments, "my" and "documents"

Correct approach:

rm -rf "$dir"  # ✅ Correctly deletes "my documents" directory


Summary and Further Reading

This article covers the core content of Linux pipelines and text processing:

  1. ✅ Data flow model (stdin/stdout/stderr, file descriptors)
  2. ✅ Redirection (>, >>, 2>, 2>&1, <)
  3. ✅ The pipe operator (| principles and debugging techniques)
  4. ✅ Text processing toolchain (grep/awk/sed/cut/tr/sort/uniq)
  5. ✅ Practical cases (Nginx log analysis, batch file operations)
  6. ✅ Correct xargs usage (handling spaces, parallel processing)
  7. ✅ Safety and best practices (don't parse ls, preview before delete, correct quoting)

Further Reading:

  • The Art of Command Line: Command-line tips encyclopedia
  • man bash: View detailed Bash manual (redirection, pipes, etc.)
  • man 1 awk: View detailed awk manual

Next Steps:

  • "Linux User Management": Learn how to manage users/groups, /etc/passwd, /etc/shadow, and sudo configuration

By this point, you should have upgraded from "can use pipes" to "can write readable debuggable one-liners, can quickly analyze logs, can safely batch-process files." Pipes and text processing are core Linux capabilities; mastering them makes you much more efficient at ops tasks.

  • Post title: Linux Pipelines and Text Processing: Composing Tools into Data Flows
  • Post author: Chen Kai
  • Create time: 2023-01-06 00:00:00
  • Post link: https://www.chenk.top/en/linux-pipelines/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.