In production troubleshooting, the most critical skill isn't "memorizing commands" but quickly mapping symptoms to resources and processes: is the CPU maxed out, is memory really exhausted or just being used as cache, is disk I/O blocking, and exactly which process, file, or port is slowing the system down. This post starts with the basic concepts of processes, threads, and parent-child relationships, then explains Linux's resource perspective (especially the meaning of buffer/cache and the common "out of memory" misjudgment). It then walks through a commonly used monitoring and locating toolchain (top/htop/ps/pstree/lsof, ports/network, I/O, load) and fills in the "process control" operations: signals and background tasks, nice/renice priority, and the causes and handling of orphan and zombie processes. Finally, a complete troubleshooting case (what to do when an Nginx log file is accidentally deleted) applies the "resource perspective" to a practical scenario so you can run through a full troubleshooting workflow. If you're a sysadmin or need to troubleshoot performance issues, this article should take you from "can view top" to "can quickly locate resource bottlenecks, optimize process priorities, and handle abnormal process states."
Basic Concepts: Process vs Program vs Thread
The Three Concepts Often Confused
Understanding the differences between these three is important for grasping Linux systems:
| Concept | Definition | Metaphor |
|---|---|---|
| Program | Static executable file stored on disk (like /usr/bin/vim) | Architectural blueprint |
| Process | Running instance after program is loaded into memory (has PID, memory space, open files) | Construction site in progress (foreman) |
| Thread | Execution unit within a process (shares process memory but has independent execution flow) | Workers on site |
Examples:
- When you run vim myfile.txt, the vim program loads from disk into memory, creating a process responsible for editing myfile.txt.
- The same program can start multiple processes simultaneously; for example, when opening multiple browser tabs, each tab might correspond to an independent process (or multiple threads).
- A process can contain multiple threads; for example, a music player process might have two threads: one downloading music, one playing songs.
Why have threads?
- Threads are lighter than processes (lower creation/destruction overhead)
- Threads share process memory space (easier communication)
- Multi-threading can fully utilize multi-core CPUs (parallel computing)
Five Key Process Characteristics
- Independence: Each process has its own memory space and system resources, isolated from each other (process A's variables won't affect process B)
- Concurrency: OS allows multiple processes to run simultaneously, achieving concurrent processing through multi-task scheduling
- Dynamism: Processes continuously create, execute, terminate; state changes in real-time (OS operation is continuously creating and destroying processes)
- Parent-Child Relationship: Processes are created by parent processes via the fork() call, forming a parent-child structure (the PPID field indicates the parent process)
- Schedulability: OS uses scheduling algorithms (like time-slice rotation, priority scheduling) to determine process execution order
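The difference between "threads share process memory" and "processes are independent" can be seen in a few lines. The following is a minimal sketch, not taken from the original post; the names counter, bump, and demo are illustrative. A thread's write to a variable is visible to the rest of the process, while a forked child only changes its own copy.

```python
import os
import threading

counter = {"value": 0}  # lives in this process's memory


def bump():
    """Increment the counter in whichever process or thread calls this."""
    counter["value"] += 1


def demo():
    counter["value"] = 0

    # A thread runs inside the same process, so its write is visible here.
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    after_thread = counter["value"]

    # A forked child gets its own copy of memory; its write stays in the child.
    pid = os.fork()
    if pid == 0:
        try:
            bump()
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    after_child = counter["value"]
    return after_thread, after_child


if __name__ == "__main__":
    print(demo())
```

Calling demo() returns the counter as seen by the parent after the thread ran and again after the child ran; the second value does not change because the child's increment happened in a separate address space.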
Process Parent-Child Relationship (PID and PPID)
Every process has two important IDs:
- PID (Process ID): Uniquely identifies a running process
- PPID (Parent Process ID): The PID of the parent process that created this process
Example:

    ps -ef | grep bash

Example output:

    UID        PID  PPID  C STIME TTY          TIME CMD
    root      1234     1  0 12:00 ?        00:00:00 /bin/bash /usr/local/bin/startup.sh
    user      5678  1234  0 12:05 pts/0    00:00:00 bash

- The process with PID 5678 is bash; its PPID is 1234 (the parent process is /bin/bash /usr/local/bin/startup.sh)
- All user-space processes can ultimately be traced back to PID 1 (systemd or init)
View process tree:

    pstree -p   # Display process parent-child relationships in tree structure (-p shows PID)

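The PID/PPID chain can also be followed programmatically. This is a small sketch that assumes a Linux system with /proc mounted; the helper names parent_of and ancestry are made up for illustration. It reads the PPID from /proc/<pid>/stat and walks upward until it reaches a process whose parent is reported as 0.

```python
import os


def parent_of(pid):
    """Return the PPID of `pid` by parsing /proc/<pid>/stat (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')'.
    fields = data[data.rindex(")") + 2:].split()
    return int(fields[1])  # fields[0] is the state, fields[1] is the PPID


def ancestry(pid):
    """Return [pid, parent, grandparent, ...] until a PPID of 0 is reached."""
    chain = [pid]
    while True:
        ppid = parent_of(chain[-1])
        if ppid == 0:
            return chain
        chain.append(ppid)


if __name__ == "__main__":
    print(" <- ".join(str(p) for p in ancestry(os.getpid())))
```

On a typical host the chain ends at PID 1; inside a container it ends at whatever process is the root of that PID namespace.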
Linux Resource Management Overview: CPU/Memory/Disk/Network
Operations work revolves around hardware and software resources; properly managing these resources ensures system runs efficiently and stably.
Four Major Hardware Resource Categories
1. CPU Resources
- Core count: Modern CPUs are typically multi-core (like 4-core, 8-core, 16-core)
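- Load: Average number of processes running or waiting to run (load average); on Linux this also counts processes in uninterruptible sleep, typically waiting on disk I/O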
- Utilization: Percentage of CPU occupied by processes
Check CPU core count:

    lscpu | grep '^CPU(s)'   # Output: CPU(s): 4
    nproc                    # Output core count: 4

Check CPU load:

    uptime   # Output: 06:56:12 up 12 days, 3:45, 3 users, load average: 0.22, 0.45, 0.56

load average interpretation (for a 4-core CPU example):
- load average: 0.22, 0.45, 0.56: Average loads over the last 1, 5, and 15 minutes
- Load < core count (like load < 4 for 4-core): System has spare capacity
- Load = core count (like load = 4 for 4-core): System at full capacity
- Load > core count (like load > 4 for 4-core): System overloaded (processes waiting for CPU)
High load but low CPU usage? This usually indicates processes are blocked waiting for I/O (typically disk, or a network filesystem), not a CPU bottleneck.
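The rule of thumb above (compare load average with core count) can be sketched as a small function. The labels and the tolerance parameter are assumptions chosen for illustration, since a real load is never exactly equal to the core count; os.getloadavg() and os.cpu_count() are standard-library calls.

```python
import os


def classify_load(load, cores, tolerance=0.1):
    """Label a load average relative to the number of cores.

    `tolerance` is an illustrative band around 1.0 load per core.
    """
    per_core = load / cores
    if per_core < 1.0 - tolerance:
        return "headroom"
    if per_core <= 1.0 + tolerance:
        return "near capacity"
    return "overloaded"


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()   # same numbers uptime prints
    cores = os.cpu_count() or 1
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        print(f"{label}: {value:.2f} on {cores} cores -> {classify_load(value, cores)}")
```

Comparing the 1-minute and 15-minute labels gives a rough sense of whether pressure is rising or fading.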
2. Memory Resources
- Total memory: Total physical memory
- Used memory: Allocated memory
- Available memory: Actually available memory (includes reclaimable buffer/cache)
- Swap: Swap space (virtual memory on hard drive, slow)
Check memory usage:

    free -h   # -h human-readable display (MB/GB)

Example output:

                  total        used        free      shared  buff/cache   available
    Mem:           15Gi       2.5Gi       8.0Gi       100Mi       4.5Gi        12Gi
    Swap:         2.0Gi          0B       2.0Gi

Important concept: buffer and cache (detailed in next section)
3. Disk Resources
- Capacity: Total storage space provided by hard drive or SSD
- Read/Write Performance:
- Mechanical Hard Drive (HDD): Large capacity, low price, slow speed (100-200 MB/s)
- Solid State Drive (SSD): Fast speed, high price, relatively small capacity (500-3000 MB/s)
- NVMe SSD: Even faster (3000-7000 MB/s)
Check disk usage:

    df -h       # View partition usage
    lsblk       # List block devices and mount points
    du -sh /*   # View space occupied by each directory under root

Check disk I/O:

    iostat -x 1   # Refresh every second (requires sysstat package)
    iotop         # Real-time view of each process's disk I/O (requires root privileges)

4. Network Resources
- Bandwidth: Network interface maximum transfer rate (like 1 Gbps, 10 Gbps)
- Throughput: Actual transfer rate
- Latency: Packet round-trip time (RTT)
Check network traffic:

    iftop -i eth0   # Real-time display of network traffic (needs iftop installation)
    ip -s link      # View interface statistics (sent/received packets, dropped packets)

Check network connections:

    ss -tulnp     # View listening TCP/UDP ports and the processes that own them (replaces netstat)
    lsof -i :80   # See which process is using port 80

Buffer and Cache Explained: Why "Out of Memory" Is Often a Misjudgment
Linux memory management is aggressive: use as much memory as
possible to cache data, improving performance. So you'll find
free shows very little free, but this
doesn't mean you're out of memory.
Buffer vs Cache
| Type | Function | Example |
|---|---|---|
| Buffer | Write buffer (temporary storage before data is written from memory to disk) | When writing files, data first stored in buffer, then batch-written to disk |
| Cache | Read cache (data read from disk cached in memory, next read directly from memory) | When reading files, content cached in cache, next read instant |
Why this design?
- Buffer: Reduces disk write operations. If every write went directly to disk, too slow (especially for lots of small files). First accumulate a batch of data, then write to disk all at once, much faster.
- Cache: Reduces disk read operations. Frequently accessed files cached in memory, reading speed hundreds of times faster.
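Important: Buffer and cache are largely reclaimable. When programs need more memory, the kernel automatically releases buffer/cache to them. So the available column in free is the figure to watch: it is the kernel's estimate of how much memory can be handed to programs without swapping, including reclaimable buffer/cache. (Strictly speaking, the buffers figure in free counts kernel buffers for block-device I/O and the cache figure counts the page cache and slab; the write/read split in the table above is a simplification.)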
Misjudgment example:

                  total        used        free      shared  buff/cache   available
    Mem:           15Gi       2.5Gi       1.0Gi       100Mi      11.5Gi        12Gi

Beginners see free only has 1.0Gi and think memory is
running out. But actually available is 12Gi (because the
11.5Gi buff/cache is reclaimable).
When is memory truly running out?
- available approaches 0
- Swap usage is very high (indicates insufficient memory, starting to use hard drive as memory)
- Processes are killed by OOM killer (kernel's out-of-memory killer)
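A script can apply the same judgment by looking at available rather than free. On Linux, the available column of free is derived from the MemAvailable field in /proc/meminfo, so the sketch below parses that format. The function names and the sample text are illustrative, not part of any tool.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: value} (values are in kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


def available_ratio(info):
    """Fraction of total memory the kernel estimates is still available."""
    return info["MemAvailable"] / info["MemTotal"]


# Made-up numbers mirroring the misjudgment example: little "free", lots "available".
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:         1000000 kB
MemAvailable:   12000000 kB
"""

if __name__ == "__main__":
    sample = parse_meminfo(SAMPLE)
    print(f"sample: free={sample['MemFree']} kB, available ratio={available_ratio(sample):.2f}")
    with open("/proc/meminfo") as f:        # Linux only
        live = parse_meminfo(f.read())
    print(f"this host: available ratio={available_ratio(live):.2f}")
```

An alert based on a low available ratio (the threshold is a judgment call) avoids the false alarms that a low free value would trigger.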
Process Monitoring Toolchain: From Overview to Details
1. top: The "Swiss Army Knife" of Real-Time Monitoring
top is the most commonly used real-time monitoring tool,
displaying CPU, memory, process info, etc.
Basic usage:

    top

Interface interpretation:

    top - 12:00:00 up 10 days,  3:45,  2 users,  load average: 1.23, 0.87, 0.45
    Tasks: 150 total,   2 running, 148 sleeping,   0 stopped,   0 zombie
    %Cpu(s):  5.2 us,  2.1 sy,  0.0 ni, 92.3 id,  0.3 wa,  0.0 hi,  0.1 si,  0.0 st
    MiB Mem :  15872.0 total,   8234.5 free,   3456.2 used,   4181.3 buff/cache
    MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  11234.5 avail Mem

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
     1234 root      20   0  123456  12345   1234 R  50.0   0.8   1:23.45 python3
     5678 www-data  20   0  234567  23456   2345 S  10.0   1.5   0:12.34 nginx

Key metrics:
- load average: 1/5/15-minute average load (approaching CPU core count means system at full capacity)
- Tasks: Total processes, running/sleeping/stopped/zombie process counts
- %Cpu(s):
  - us (user): User-space CPU usage
  - sy (system): Kernel-space CPU usage
  - ni (nice): Low-priority process CPU usage
  - id (idle): Idle CPU (higher is better)
  - wa (wait): CPU time waiting for I/O (high indicates slow disk/network)
- Mem/Swap: Memory and swap space usage
Common hotkeys:
- P: Sort by CPU usage
- M: Sort by memory usage
- k: Enter PID to send signal to terminate process
- 1: Show each CPU core's usage rate
- q: Quit
2. htop: Enhanced Version of top
htop is a colorful enhanced version of top,
supporting mouse operations, tree view, direct process termination.
Install and use:

    sudo apt install htop   # Debian/Ubuntu
    sudo dnf install htop   # Fedora; on CentOS/RHEL the package comes from the EPEL repository
    htop

Advantages:
- Colorful interface, more intuitive
- Supports mouse clicking to select processes
- Displays process tree (F5 to toggle tree view)
- Can directly select and terminate processes (F9 to send signal)
3. ps: Static Process Snapshot
ps provides a static snapshot of current processes
(doesn't refresh in real-time like top).
Common usage:

    ps -ef   # Unix style: every process (-e) in full format (-f)
    ps aux   # BSD style: processes of all users (a), user-oriented format (u), including those without a terminal (x)

Output field interpretation (ps aux):
- USER: Process owner
- PID: Process ID
- %CPU: CPU usage
- %MEM: Memory usage
- VSZ: Virtual memory size (total memory requested by process)
- RSS: Resident memory size (actual physical memory occupied)
- STAT: Process state
  - R: Running
  - S: Sleeping (waiting for event)
  - D: Uninterruptible sleep (usually waiting for disk I/O)
  - Z: Zombie (exited but not reaped by parent)
  - T: Stopped (usually paused by Ctrl+Z)
- TIME: Process cumulative CPU time
- COMMAND: Process command
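To get a quick overview of how many processes are in each state, the STAT column can be tallied. This is a rough sketch: count_states and current_stat_values are illustrative names, and the second helper assumes a procps-style ps is installed. Only the first letter of STAT is the state; trailing characters such as s, + or < are modifiers.

```python
import subprocess
from collections import Counter


def count_states(stat_values):
    """Count processes by the first letter of each STAT value (R, S, D, Z, T, ...)."""
    return Counter(value[0] for value in stat_values if value)


def current_stat_values():
    """Ask ps for the STAT column only; 'stat=' prints it without a header."""
    result = subprocess.run(
        ["ps", "-eo", "stat="], capture_output=True, text=True, check=True
    )
    return result.stdout.split()


# Hypothetical STAT values, as they might appear in ps output.
example = ["Ss", "R+", "S", "Z", "D<"]
print(count_states(example))
```

On a live system, count_states(current_stat_values()) gives the real tally; a growing D or Z count is usually worth a closer look.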
Advanced usage:

    ps -ef | grep nginx                  # View nginx-related processes
    ps -ef | grep nginx | grep -v grep   # Same, with grep's own process removed
    ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -10   # Sort by CPU usage, show top 10

4. pstree: Process Tree
pstree displays process parent-child relationships in
tree structure, helping understand process hierarchy.
Basic usage:

    pstree -p    # -p displays PID
    pstree -ap   # -a displays command parameters

5. lsof: View Open Files
lsof (List Open Files) lists all open files in the
system, including regular files, network connections, devices, etc.
Why use lsof?
- View which files a process has opened (like config files, log files, database files)
- View which process is using a port (like who's using port 80)
- View which process is using a file (like if a file can't be deleted, might be in use by a process)
- Recover accidentally deleted files (if process is still running, file handle still exists, can recover via /proc/<pid>/fd/)
Common usage:

    lsof                # List all open files (very long output)
    lsof -p <PID>       # View files opened by specified process
    lsof -u <user>      # View files opened by specified user
    lsof -c <command>   # View files opened by specified command
    lsof -i :80         # See which process is using port 80
    lsof -i tcp         # View all TCP connections
    lsof +D /var/log    # View files opened under /var/log directory
    lsof +L1            # View files with link count < 1 (usually deleted but still occupied by process)

Example: View all files opened by nginx

    lsof -c nginx

Example output:

    COMMAND  PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
    nginx   1234 root  cwd    DIR    8,1     4096    2 /
    nginx   1234 root  txt    REG    8,1   123456  789 /usr/sbin/nginx
    nginx   1234 root   1w    REG    8,1    12345 1011 /var/log/nginx/access.log
    nginx   1234 root   2w    REG    8,1     6789 1012 /var/log/nginx/error.log
    nginx   1234 root   6u   IPv4  12345      0t0  TCP *:80 (LISTEN)

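- FD: File descriptor (cwd is the current directory, txt is the program file; 1w and 2w are descriptors 1 and 2, conventionally stdout and stderr, opened for writing; 6u is descriptor 6 opened for reading and writing, here a socket)
- TYPE: Type (DIR is directory, REG is regular file, IPv4 is network socket)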
6. Network Port Monitoring
ss: View Network Connections (Replaces netstat)

    ss -tulnp   # View listening TCP/UDP ports and the processes that own them

- -t: TCP connections
- -u: UDP connections
- -l: Listening ports (LISTEN state)
- -n: Display numeric addresses and ports (don't resolve hostnames)
- -p: Display process PID and name
Example: Check port 80 listening status

    ss -tulnp | grep :80

Example output:

    tcp   LISTEN   0   128   *:80   *:*   users:(("nginx",pid=1234,fd=6))

lsof: View Port Usage

    lsof -i :80   # See which process is using port 80

7. Disk I/O Monitoring
iostat: Disk I/O Statistics

    iostat -x 1   # Refresh every second, show extended info

Key metrics:
- %util: Disk utilization (approaching 100% means disk is very busy)
- await: Average wait time (milliseconds)
- r/s, w/s: Read/write operations per second
iotop: Real-Time View of Process Disk I/O

    sudo iotop -o   # -o only shows processes with I/O

Process Control: Signals, Background Tasks, Priorities
1. kill: Send Signals
kill isn't just "kill process"; its essence is
sending signals to processes.
Common signals:
| Signal Number | Signal Name | Function | Example |
|---|---|---|---|
| 1 | SIGHUP | Hangup; many daemons treat it as "reload config" without exiting (the default action is to terminate) | kill -1 <PID> |
| 2 | SIGINT | Interrupt (equivalent to Ctrl+C) | kill -2 <PID> |
| 9 | SIGKILL | Force terminate (process can't catch, immediate termination) | kill -9 <PID> |
| 15 | SIGTERM | Gentle terminate (process can catch, cleanup then exit) | kill <PID> (default) |
| 20 | SIGTSTP | Pause (equivalent to Ctrl+Z) | kill -20 <PID> |
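The "process can catch" versus "process can't catch" distinction in the table can be tried from inside a program. The sketch below uses Python's standard signal module; the function names are illustrative. It installs a SIGTERM handler, sends SIGTERM to itself, and separately shows that the kernel refuses a handler for SIGKILL.

```python
import os
import signal
import time

received = []


def on_term(signum, frame):
    # Runs inside the process when the signal arrives; a real service would
    # flush data and close connections here before exiting.
    received.append(signal.Signals(signum).name)


def _restore(sig, previous):
    signal.signal(sig, previous if previous is not None else signal.SIG_DFL)


def catch_own_sigterm(timeout=2.0):
    """Install a SIGTERM handler, signal ourselves, and report what arrived."""
    received.clear()
    previous = signal.signal(signal.SIGTERM, on_term)
    try:
        os.kill(os.getpid(), signal.SIGTERM)
        deadline = time.monotonic() + timeout
        while not received and time.monotonic() < deadline:
            time.sleep(0.01)  # give the interpreter a chance to run the handler
    finally:
        _restore(signal.SIGTERM, previous)
    return list(received)


def can_install_handler(sig):
    """Return True if this process is allowed to catch `sig`."""
    try:
        previous = signal.signal(sig, on_term)
    except OSError:
        return False  # SIGKILL cannot be caught, blocked, or ignored
    _restore(sig, previous)
    return True


if __name__ == "__main__":
    print("caught:", catch_own_sigterm())
    print("SIGTERM catchable:", can_install_handler(signal.SIGTERM))
    print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
```

This is why kill -9 should be a last resort: the target never gets the chance to run cleanup code like on_term.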
Best practices:
1. First use kill <PID> (SIGTERM) to give the process a chance to clean up (like saving data, closing connections)
2. If the process doesn't respond, then use kill -9 <PID> (SIGKILL) to force termination
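The "SIGTERM first, SIGKILL only if needed" practice can be sketched for a child process started from Python. stop_gracefully and the grace period are assumptions for illustration; Popen.terminate() sends SIGTERM and Popen.kill() sends SIGKILL on POSIX systems.

```python
import subprocess


def stop_gracefully(proc, grace_seconds=5.0):
    """Send SIGTERM, wait, and fall back to SIGKILL if the process ignores it.

    Returns the exit status; on POSIX a negative value -N means
    "terminated by signal N".
    """
    proc.terminate()                      # SIGTERM: polite request
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                       # SIGKILL: cannot be caught or ignored
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen(["sleep", "60"])
    print("exit status:", stop_gracefully(child))
```

Because wait() is always called, the child is also reaped, so this pattern does not leave a zombie behind.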
Example: Reload nginx config (without stopping service)

    sudo kill -HUP "$(pgrep -o nginx)"   # Send SIGHUP to the oldest nginx process (normally the master)
    # Or
    sudo nginx -s reload                 # Nginx's convenient command

2. Background Task Running
Method 1: Use &

    ./long_task.sh &   # Run in background (but may be terminated when the SSH session exits)

Method 2: Use nohup (Recommended)

    nohup ./long_task.sh &   # Run in background, continues after exiting SSH

- nohup: No Hangup, ignores SIGHUP signal (signal sent when SSH disconnects)
- Output defaults to redirecting to nohup.out
To discard the output instead of saving it to a file:

    nohup ./long_task.sh > /dev/null 2>&1 &   # Discard output, don't save to file

Method 3: Use screen or tmux (Best Practice)

    screen -S mysession   # Create a screen session

Manage Background Tasks

    jobs   # View background tasks

3. Adjust Process Priority (nice/renice)
Linux uses nice values to control process priority:
- nice value range: -20 (highest priority) to 19 (lowest priority)
- Default nice value: 0
- Lower nice value = higher priority (easier to grab CPU)
Specify Priority at Startup (nice)

    nice -n 10 ./cpu_intensive_task.sh   # Start with nice value 10 (lower priority)

Adjust Running Process Priority (renice)

    renice -n 10 -p <PID>   # Change process PID's nice value to 10

Use cases:
- Background backup tasks: Start with nice -n 19 so they don't affect normal business
- Critical business processes: Use renice -n -10 to increase priority (negative values require root)
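A program can also lower its own priority. The sketch below, with the illustrative name niceness_after, forks a child, raises the child's nice value with os.nice(), and reports the result through a pipe. Doing it in a child is a deliberate choice: an unprivileged process can raise its nice value but generally cannot lower it again.

```python
import os


def niceness_after(increment):
    """Fork a child, add `increment` to its nice value, and return the result."""
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:
        try:
            os.close(read_end)
            # os.nice() adds to the current niceness and returns the new value.
            os.write(write_end, str(os.nice(increment)).encode())
        finally:
            os._exit(0)
    os.close(write_end)
    with os.fdopen(read_end) as pipe:
        data = pipe.read()
    os.waitpid(pid, 0)
    return int(data)


if __name__ == "__main__":
    current = os.getpriority(os.PRIO_PROCESS, 0)   # this process's nice value
    print(f"parent nice={current}, child after +10 -> {niceness_after(10)}")
```

The nice value is capped at 19, so asking for more than that simply lands on 19.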
Special Process States: Orphan and Zombie Processes
Orphan Process
Definition: After parent process exits, child process is adopted by PID 1 (systemd or init).
Example code (Python):

    import os
    import time

    def child_process():
        print(f"Child: PID={os.getpid()}, PPID={os.getppid()}")
        time.sleep(3)  # Wait for parent to exit
        print(f"Child after parent exit: PID={os.getpid()}, PPID={os.getppid()}")

    if __name__ == "__main__":
        pid = os.fork()
        if pid > 0:
            # Parent process
            print(f"Parent: PID={os.getpid()}, Child PID={pid}")
            os._exit(0)  # Parent exits immediately
        else:
            # Child process
            child_process()

Example output:

    Parent: PID=1234, Child PID=1235
    Child: PID=1235, PPID=1234
    Child after parent exit: PID=1235, PPID=1  # PPID becomes 1 (adopted by systemd)

Are orphan processes harmful? Not necessarily. systemd adopts orphan processes and manages them normally.
Zombie Process
Definition: Process has exited, but parent hasn't
called wait() to reap its exit status, causing process info
to remain in process table.
Characteristics:
- Doesn't occupy CPU or memory (already exited)
- But occupies process table entry (too many can exhaust system process table)
- State shows as Z (Zombie)
View zombie processes:

    ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'   # STAT starts with Z (it may appear as Z, Z+, Zs, ...)

Example code (Python):

    import os
    import time

    if __name__ == '__main__':
        pid = os.fork()
        if pid > 0:
            # Parent process
            print(f"Parent: PID={os.getpid()}, Child PID={pid} (will become zombie)")
            time.sleep(15)  # Parent pauses 15 seconds, child exits but not reaped (zombie state)
            os.wait()  # Reap zombie process
            print("Zombie child has been reaped.")
        else:
            # Child process
            print(f"Child: PID={os.getpid()}, PPID={os.getppid()}")
            os._exit(0)  # Child exits immediately, enters zombie state

How to resolve zombie processes?
1. Make the parent call wait() (if the parent is your program, fix the code)
2. Kill the parent process (after the parent exits, the child is adopted by systemd and reaped)
3. Reboot the system (last resort)
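For the first option, a long-running parent usually cannot afford to block in wait(). One common pattern, sketched here with illustrative function names, is to call os.waitpid(-1, os.WNOHANG) periodically (or from a SIGCHLD handler) to collect whichever children have already exited.

```python
import os
import time


def spawn_children(count):
    """Fork `count` children that exit immediately; return their PIDs."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        pids.append(pid)
    return pids


def reap_finished():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break        # this process has no children at all
        if pid == 0:
            break        # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


if __name__ == "__main__":
    started = set(spawn_children(3))
    collected = set()
    while not started <= collected:
        collected.update(reap_finished())
        time.sleep(0.01)
    print(f"reaped {len(collected)} children, no zombies left behind")
```

Calling reap_finished() from the parent's main loop keeps the process table clean without pausing the parent's real work.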
Hands-On: Complete Performance Troubleshooting Workflow
Scenario: System Slowed Down, How to Troubleshoot
1. Check Overall Load

    uptime   # Check load average

Diagnosis:
- load average high? Possibly CPU maxed or I/O slow
- CPU idle low? CPU bottleneck
- CPU wa high? Disk I/O slow
2. Find Resource-Consuming Processes

    top   # Press P to sort by CPU, M to sort by memory

3. View Process Details

    lsof -p <PID>   # View files opened by process

4. Check Disk I/O

    iostat -x 1   # View disk I/O

5. Check Network Connections

    ss -tulnp   # View listening ports

6. Optimize or Terminate Process

    renice -n 10 -p <PID>   # Lower process priority

Real Case: What to Do When Nginx Log File Is Accidentally Deleted
Scenario
An ops person accidentally ran
rm -rf /var/log/nginx/access.log, but nginx process is
still running.
Problem
Although file was deleted, nginx process still holds file handle
(under /proc/<pid>/fd/), continues writing data to
"deleted file". At this point:
- df -h shows disk usage hasn't decreased (because file still occupies space)
- ls /var/log/nginx/ doesn't show access.log (because directory entry was deleted)
Solution
1. Find nginx Process PID

    pidof nginx   # Or ps aux | grep nginx

Suppose main process PID is 1234.
2. View Files Opened by Process

    lsof -p 1234 | grep access.log

Example output:

    nginx 1234 root 6w REG 8,1 123456789 /var/log/nginx/access.log (deleted)

- 6w: File descriptor is 6, mode is w (write)
- (deleted): File was deleted but process still holds handle
3. Recover File

    sudo cp /proc/1234/fd/6 /var/log/nginx/access.log

4. Reload nginx

    sudo nginx -s reload   # Make nginx reopen log file

Principle:
- Deleting a file only deletes the directory entry (filename); the inode and data blocks remain (because the process is still using them)
- Via /proc/<pid>/fd/<fd> you can access files opened by a process (even if deleted)
- After copying the file with cp, reload nginx to make it reopen the log file
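The principle can be reproduced safely with a temporary file instead of a real log. This is a sketch assuming Linux with /proc mounted; read_after_unlink is an illustrative name. The script deletes a file it still holds open, then reads the content back through /proc/self/fd/<fd>.

```python
import os
import tempfile


def read_after_unlink(payload):
    """Write `payload` to a temp file, delete it, then read it back via /proc."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)
        os.unlink(path)                               # directory entry is gone
        gone = not os.path.exists(path)
        proc_path = f"/proc/self/fd/{fd}"
        target = os.readlink(proc_path)               # the kernel appends " (deleted)"
        with open(proc_path, "rb") as recovered:      # the inode is still reachable
            data = recovered.read()
        return gone, target, data
    finally:
        os.close(fd)   # once the last descriptor closes, the blocks are freed


if __name__ == "__main__":
    gone, target, data = read_after_unlink(b"GET /index.html 200\n")
    print("path removed:", gone)
    print("fd points to:", target)
    print("recovered   :", data)
```

The same mechanism explains why df keeps counting the space: the blocks are only released when the last open descriptor is closed.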
Summary and Further Reading
This article covers the core content of Linux process and resource management:
1. ✅ Basic concepts of processes and programs (process vs program vs thread, parent-child relationships)
2. ✅ Linux resource management overview (CPU/memory/disk/network)
3. ✅ Buffer and Cache explained (why "out of memory" is often a misjudgment)
4. ✅ Process monitoring toolchain (top/htop/ps/pstree/lsof/ss/iostat)
5. ✅ Process control (kill signals, background tasks, priority adjustment)
6. ✅ Special process states (orphan processes, zombie processes)
7. ✅ Real cases (performance troubleshooting workflow, Nginx log recovery)
Further Reading:
- Linux Performance (Brendan Gregg): http://www.brendangregg.com/linuxperf.html
- man proc: View detailed explanation of the /proc filesystem
- man 7 signal: View explanation of all signals
Next Steps:
- "Linux Disk Management": Learn partitioning, formatting, mounting, LVM, RAID, etc.
- "Linux User Management": Learn how to manage users/groups/permissions
By this point, you should have upgraded from "can view top" to "can quickly locate resource bottlenecks, can optimize process priorities, can handle abnormal process states." Process and resource management is a core Linux ops skill; mastering it allows you to better troubleshoot performance issues.
- Post title: Linux Process and Resource Management: Monitoring, Troubleshooting, and Optimization
- Post author: Chen Kai
- Create time: 2023-01-04 00:00:00
- Post link: https://www.chenk.top/en/linux-process-resource-management/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.