In production troubleshooting, the critical skill is not memorizing commands but quickly mapping symptoms to resources and processes: is the CPU saturated, is memory really exhausted or just held by cache, is disk I/O blocking, and exactly which process, file, or port is slowing the system down? This post starts with the basics of processes, threads, and parent-child relationships, then explains how Linux views resources (especially what buffer/cache means and why "out of memory" is often a misjudgment). It then walks through a commonly used monitoring and diagnosis toolchain (top/htop/ps/pstree/lsof, ports and network, disk I/O, and load), followed by process control: signals and background tasks, nice/renice priorities, and the causes and handling of orphan and zombie processes. Finally, a complete troubleshooting case (what to do when an Nginx log file is accidentally deleted) applies the resource perspective to a real scenario. If you are a sysadmin or need to troubleshoot performance issues, this article aims to take you from "can view top" to "can quickly locate resource bottlenecks, optimize process priorities, and handle abnormal process states."

Basic Concepts: Process vs Program vs Thread

The Three Concepts Often Confused

Understanding the differences between these three is important for grasping Linux systems:

| Concept | Definition | Metaphor |
| --- | --- | --- |
| Program | Static executable file stored on disk (like `/usr/bin/vim`) | Architectural blueprint |
| Process | Running instance after program is loaded into memory (has PID, memory space, open files) | Construction site in progress (foreman) |
| Thread | Execution unit within a process (shares process memory but has independent execution flow) | Workers on site |

Examples:

When you run vim myfile.txt, the vim program loads from disk into memory, creating a process responsible for editing myfile.txt.
The same program can start multiple processes simultaneously; for example, when opening multiple browser tabs, each tab might correspond to an independent process (or multiple threads).
A process can contain multiple threads; for example, a music player process might have two threads: one downloading music, one playing songs.

Why have threads?

Threads are lighter than processes (lower creation/destruction overhead)
Threads share process memory space (easier communication)
Multi-threading can fully utilize multi-core CPUs (parallel computing)
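The memory-sharing point can be seen in a few lines. The following is a minimal Python sketch for Linux (the names `counter`, `bump`, `run_in_thread`, and `run_in_child_process` are made up for illustration): an update made by a thread is visible to the main program, while a forked child process only changes its own private copy.

```python
import os
import threading

# Shared state living in this process's memory
counter = {"value": 0}

def bump():
    counter["value"] += 1

def run_in_thread():
    """Run bump() in a thread: the thread shares our memory, so we see the change."""
    worker = threading.Thread(target=bump)
    worker.start()
    worker.join()
    return counter["value"]

def run_in_child_process():
    """Run bump() in a forked child: it modifies its own copy, ours is unchanged."""
    pid = os.fork()
    if pid == 0:
        bump()        # changes only the child's private copy of counter
        os._exit(0)   # leave the child immediately
    os.waitpid(pid, 0)  # reap the child so it does not linger as a zombie
    return counter["value"]

if __name__ == "__main__":
    print("after thread:", run_in_thread())
    print("after child process:", run_in_child_process())
```

This is also why threads are convenient for communication but need locking, while processes need explicit mechanisms (pipes, sockets, shared memory) to exchange data.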

Five Key Process Characteristics

Independence: Each process has its own memory space and system resources, isolated from other processes (process A's variables don't affect process B)
Concurrency: The OS lets multiple processes make progress at the same time through multi-task scheduling
Dynamism: Processes are continuously created, executed, and terminated, and their states change in real time
Parent-Child Relationship: Processes are created by a parent process via the fork() call, forming a parent-child structure (the PPID field identifies the parent)
Schedulability: The OS uses scheduling algorithms (like round-robin time slicing and priority scheduling) to decide which process runs next

Process Parent-Child Relationship (PID and PPID)

Every process has two important IDs:

PID (Process ID): Uniquely identifies a running process
PPID (Parent Process ID): The PID of the parent process that created this process

Example:

ps -ef | grep bash

Example output:

UID   PID  PPID  C STIME TTY      TIME CMD
root  1234 1     0 12:00 ?        00:00:00 /bin/bash /usr/local/bin/startup.sh
user  5678 1234  0 12:05 pts/0    00:00:00 bash

The process with PID 5678 is bash; its PPID is 1234, so its parent is /bin/bash /usr/local/bin/startup.sh
All user-space processes can ultimately be traced back to PID 1 (systemd or init)
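The PID/PPID chain can also be walked programmatically. Below is a minimal sketch, assuming a Linux system with /proc mounted; the helper names `parent_of` and `ancestry` are made up for illustration. It reads the `PPid:` line from /proc/<pid>/status and follows it upward.

```python
import os

def parent_of(pid):
    """Return the PPID of pid, read from the PPid: line of /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("PPid:"):
                return int(line.split()[1])
    raise ValueError(f"no PPid line found for PID {pid}")

def ancestry(pid):
    """Return [pid, parent, grandparent, ...], stopping when the PPID is 0."""
    chain = [pid]
    while True:
        try:
            ppid = parent_of(chain[-1])
        except OSError:
            break  # the process disappeared or is not readable
        if ppid == 0:
            break  # reached the top of the tree (or of this PID namespace)
        chain.append(ppid)
    return chain

if __name__ == "__main__":
    print(" -> ".join(str(p) for p in ancestry(os.getpid())))
```

On a typical host the chain ends at PID 1; inside a container it ends at whatever process is at the top of that container's PID namespace.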

View process tree:

pstree -p  # Display process parent-child relationships in tree structure (-p shows PID)

Linux Resource Management Overview: CPU/Memory/Disk/Network

Operations work revolves around hardware and software resources; managing them properly keeps the system running efficiently and stably.

Four Major Hardware Resource Categories

1. CPU Resources

Core count: Modern CPUs are typically multi-core (like 4-core, 8-core, 16-core)
Load: Average number of tasks that are running, waiting for a CPU, or in uninterruptible sleep such as waiting for disk I/O (load average)
Utilization: Percentage of CPU occupied by processes

Check CPU core count:

lscpu | grep '^CPU(s)'  # Output: CPU(s): 4
nproc  # Output core count: 4

Check CPU load:

uptime  # Output: 06:56:12 up 12 days, 3:45, 3 users, load average: 0.22, 0.45, 0.56

Interpreting load average (using a 4-core CPU as the example):

load average: 0.22, 0.45, 0.56: Average load over the last 1, 5, and 15 minutes
Load < core count (like load < 4 for 4-core): System has spare capacity
Load ≈ core count (like load ≈ 4 for 4-core): System is at full capacity
Load > core count (like load > 4 for 4-core): System is overloaded (tasks are queuing for CPU or stuck waiting on I/O)

High load but low CPU usage? On Linux, load average also counts tasks in uninterruptible sleep (state D), so this usually means processes are waiting on I/O (typically disk or network filesystems) rather than on the CPU.
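The rule of thumb above amounts to comparing load against the core count. Here is a minimal sketch of that comparison; the function name `classify_load` and the 10% tolerance band around "full capacity" are assumptions chosen for illustration, not standard thresholds.

```python
import os

def classify_load(load, cores, tolerance=0.1):
    """Classify a load average relative to the number of CPU cores."""
    ratio = load / cores
    if ratio < 1.0 - tolerance:
        return "spare capacity"
    if ratio > 1.0 + tolerance:
        return "overloaded"
    return "at full capacity"

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    load1, load5, load15 = os.getloadavg()  # same numbers uptime prints
    for label, load in (("1 min", load1), ("5 min", load5), ("15 min", load15)):
        print(f"{label}: {load:.2f} on {cores} cores -> {classify_load(load, cores)}")
```

Comparing the three windows is also useful: a 1-minute value well above the 15-minute value suggests load is rising, the reverse suggests it is subsiding.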

2. Memory Resources

Total memory: Total physical memory
Used memory: Allocated memory
Available memory: Actually available memory (includes reclaimable buffer/cache)
Swap: Swap space (virtual memory on hard drive, slow)

Check memory usage:

free -h  # -h human-readable display (MiB/GiB)

Example output:

              total        used        free      shared  buff/cache   available
Mem:           15Gi       2.5Gi       8.0Gi       100Mi       4.5Gi        12Gi
Swap:         2.0Gi          0B       2.0Gi

Important concept: buffer and cache (detailed in next section)

3. Disk Resources

Capacity: Total storage space provided by hard drive or SSD
Read/Write Performance (rough sequential throughput; actual numbers vary by model and workload):
- Mechanical Hard Drive (HDD): Large capacity, low price, slow speed (100-200 MB/s)
- Solid State Drive (SSD): Fast speed, high price, relatively small capacity (500-3000 MB/s)
- NVMe SSD: Even faster (3000-7000 MB/s)

Check disk usage:

df -h  # View partition usage
lsblk  # List block devices and mount points
du -sh /*  # View space occupied by each directory under root
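For scripts and alerts, the same capacity numbers are available without parsing df output. This is a minimal sketch using Python's standard `shutil.disk_usage`; the helper name `usage_percent` is made up for illustration, and the result can differ slightly from df's Use% column because df excludes filesystem-reserved blocks from its calculation.

```python
import shutil

def usage_percent(path="/"):
    """Return the used percentage of the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    return usage.used / usage.total * 100

if __name__ == "__main__":
    print(f"/ is {usage_percent('/'):.1f}% used")
```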

Check disk I/O:

iostat -x 1  # Refresh every second (requires sysstat package)
iotop  # Real-time view of each process's disk I/O (requires root privileges)

4. Network Resources

Bandwidth: Network interface maximum transfer rate (like 1 Gbps, 10 Gbps)
Throughput: Actual transfer rate
Latency: Packet round-trip time (RTT)

Check network traffic:

iftop -i eth0  # Real-time display of network traffic (needs iftop installation)
ip -s link  # View interface statistics (sent/received packets, dropped packets)

Check network connections:

ss -tulnp  # View listening TCP/UDP ports and their owning processes (ss replaces netstat)
lsof -i :80  # See which process is using port 80

Buffer and Cache Explained: Why "Out of Memory" Is Often a Misjudgment

Linux memory management is deliberately aggressive: it uses otherwise idle memory to cache data and improve performance. As a result, free often shows very little in the free column, but that does not mean you are out of memory.

Buffer vs Cache

| Type | Function | Example |
| --- | --- | --- |
| Buffer | Kernel buffers for raw block-device I/O, mostly filesystem metadata (`Buffers` in `/proc/meminfo`) | Filesystem metadata blocks that were recently read or are waiting to be written |
| Cache | Page cache holding file contents: data read from disk, plus modified ("dirty") pages waiting to be written back (`Cached` in `/proc/meminfo`) | After a file is read once, the next read is served from memory |

Why this design?

Write-back buffering: Reduces disk write operations. If every write went straight to disk, it would be too slow (especially for lots of small writes). The kernel accumulates modified data in memory first, then flushes it to disk in batches.
Read caching: Reduces disk read operations. Frequently accessed files stay in memory, and reading from memory is orders of magnitude faster than reading from disk.

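Important: Most of buffer/cache is reclaimable. When programs need more memory, the kernel automatically drops clean cache pages (writing dirty ones back to disk first) and hands the memory to programs. So the available column in free, which is the kernel's estimate of free memory plus reclaimable buffer/cache, is the number that tells you how much memory is truly available.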

Misjudgment example:

              total        used        free      shared  buff/cache   available
Mem:           15Gi       2.5Gi       1.0Gi       100Mi      11.5Gi        12Gi

Beginners see that free is only 1.0Gi and think memory is running out. But available is actually 12Gi (because most of the 11.5Gi buff/cache is reclaimable).

When is memory truly running out?

available approaches 0
Swap usage is very high and keeps growing (indicates insufficient memory: the system has started using the disk as memory)
Processes are killed by OOM killer (kernel's out-of-memory killer)
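To make the "look at available, not free" rule concrete, here is a minimal sketch that parses /proc/meminfo-style text. The function names and the `SAMPLE` text are made up for illustration; the sample mirrors the misjudgment example above (15 GiB total, 1 GiB free, 12 GiB available), and the Buffers/Cached split in it is an arbitrary assumption.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value_in_KiB} dict."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info

def available_ratio(info):
    """Fraction of total memory the kernel estimates is available to programs."""
    return info["MemAvailable"] / info["MemTotal"]

# Illustrative sample, not real output
SAMPLE = """\
MemTotal:       15728640 kB
MemFree:         1048576 kB
MemAvailable:   12582912 kB
Buffers:          524288 kB
Cached:         11534336 kB
"""

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        live = parse_meminfo(f.read())
    print(f"free:      {live['MemFree'] / live['MemTotal']:.0%} of total")
    print(f"available: {available_ratio(live):.0%} of total")
```

With the sample above, only about 7% of memory is "free" while 80% is available, which is exactly the gap that causes false alarms.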

Process Monitoring Toolchain: From Overview to Details

1. top: The "Swiss Army Knife" of Real-Time Monitoring

top is the most commonly used real-time monitoring tool, displaying CPU, memory, process info, etc.

Basic usage:

top

Interface interpretation:

top - 12:00:00 up 10 days,  3:45,  2 users,  load average: 1.23, 0.87, 0.45
Tasks: 150 total,   2 running, 148 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.2 us,  2.1 sy,  0.0 ni, 92.3 id,  0.3 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  15872.0 total,   8234.5 free,   3456.2 used,   4181.3 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  11234.5 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 1234 root      20   0  123456  12345   1234 R  50.0   0.8   1:23.45 python3
 5678 www-data  20   0  234567  23456   2345 S  10.0   1.5   0:12.34 nginx

Key metrics:

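load average: 1/5/15-minute average load (approaching the CPU core count means the system is at full capacity)
Tasks: Total processes, with running/sleeping/stopped/zombie counts
%Cpu(s):
- us (user): User-space CPU usage
- sy (system): Kernel-space CPU usage
- ni (nice): CPU used by user processes running with a raised nice value (lowered priority)
- id (idle): Idle CPU (a high value means spare CPU capacity)
- wa (iowait): Time the CPU was idle while waiting for I/O to complete (a high value usually points to slow disk or network storage)
- hi/si: Time spent servicing hardware/software interrupts
- st (steal): Time taken by the hypervisor for other virtual machines
Mem/Swap: Memory and swap space usage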
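The %Cpu(s) line is derived from cumulative tick counters in the first line of /proc/stat: take two samples and express each counter's increase as a share of the total increase. The sketch below shows that calculation; the function names are made up for illustration, and it is a simplified approximation rather than top's exact implementation.

```python
import time

# Order of the first eight counters on the "cpu" line of /proc/stat:
# user, nice, system, idle, iowait, irq, softirq, steal
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")

def parse_cpu_line(line):
    """Turn 'cpu  100 0 50 ...' into a {label: ticks} dict."""
    parts = line.split()
    return dict(zip(FIELDS, map(int, parts[1:9])))

def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent in each category between two samples."""
    deltas = {key: after[key] - before[key] for key in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {key: 0.0 for key in FIELDS}
    return {key: 100.0 * delta / total for key, delta in deltas.items()}

def read_cpu_sample():
    with open("/proc/stat") as f:
        return parse_cpu_line(f.readline())

if __name__ == "__main__":
    first = read_cpu_sample()
    time.sleep(1)
    second = read_cpu_sample()
    print(", ".join(f"{v:.1f} {k}" for k, v in cpu_percentages(first, second).items()))
```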

Common hotkeys:

P: Sort by CPU usage
M: Sort by memory usage
k: Prompt for a PID and a signal (default SIGTERM) to send to that process
1: Show each CPU core's usage rate
q: Quit

2. htop: Enhanced Version of top

htop is a colorful enhanced version of top, supporting mouse operations, tree view, direct process termination.

Install and use:

sudo apt install htop  # Debian/Ubuntu
sudo dnf install htop  # Fedora/CentOS/RHEL (on CentOS/RHEL, htop is typically provided by the EPEL repository)
htop

Advantages:

Colorful interface, more intuitive
Supports mouse clicking to select processes
Displays process tree (F5 to toggle tree view)
Can directly select and terminate processes (F9 to send signal)

3. ps: Static Process Snapshot

ps provides a static snapshot of current processes (doesn't refresh in real-time like top).

Common usage:

ps -ef  # Unix style: all processes (-e) in full format (-f)
ps aux  # BSD style: processes of all users (a), user-oriented format (u), including processes without a terminal (x)

Output field interpretation (ps aux):

USER: Process owner
PID: Process ID
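%CPU: CPU usage (CPU time used divided by the time the process has been running, not an instantaneous value)
%MEM: Memory usage (RSS as a percentage of physical memory)
VSZ: Virtual memory size in KiB (address space the process has mapped; much of it may not be resident)
RSS: Resident set size in KiB (physical memory actually occupied)
STAT: Process state
- R: Running or runnable
- S: Sleeping (interruptible, waiting for an event)
- D: Uninterruptible sleep (usually waiting for disk I/O)
- Z: Zombie (exited but not reaped by parent)
- T: Stopped (by a job control signal, for example Ctrl+Z)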
TIME: Process cumulative CPU time
COMMAND: Process command
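The state letters come from the third field of /proc/<pid>/stat. As a rough sketch (Linux only; `parse_stat` and `count_states` are illustrative names), the following counts how many processes are in each state, which is a quick way to spot a pile-up of D or Z processes.

```python
import os
from collections import Counter

def parse_stat(text):
    """Extract (pid, command name, state letter) from /proc/<pid>/stat content."""
    # The command name is wrapped in parentheses and may itself contain spaces
    # or parentheses, so split on the LAST ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid_str, _, comm = head.partition("(")
    return int(pid_str), comm, tail.split()[0]

def count_states():
    """Count processes by state letter (R, S, D, Z, T, ...)."""
    counts = Counter()
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(f"/proc/{entry}/stat") as f:
                _, _, state = parse_stat(f.read())
        except OSError:
            continue  # the process exited while we were scanning
        counts[state] += 1
    return counts

if __name__ == "__main__":
    for state, number in sorted(count_states().items()):
        print(state, number)
```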

Advanced usage:

ps -ef | grep nginx  # View nginx-related processes
ps -ef | grep nginx | grep -v grep  # Same, but exclude the grep process itself
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -n 11  # Sort by CPU usage, show top 10 (plus the header line)

4. pstree: Process Tree

pstree displays process parent-child relationships in tree structure, helping understand process hierarchy.

Basic usage:

pstree -p  # -p displays PID
pstree -ap  # -a displays command-line arguments

5. lsof: View Open Files

lsof (List Open Files) lists all open files in the system, including regular files, network connections, devices, etc.

Why use lsof?

View which files a process has opened (like config files, log files, database files)
View which process is using a port (like who's using port 80)
View which process is using a file (like if a file can't be deleted, might be in use by a process)
Recover accidentally deleted files (if process is still running, file handle still exists, can recover via /proc/<pid>/fd/)

Common usage:

lsof  # List all open files (very long output)
lsof -p <PID>  # View files opened by specified process
lsof -u <user>  # View files opened by specified user
lsof -c <command>  # View files opened by specified command
lsof -i :80  # See which process is using port 80
lsof -i tcp  # View all TCP connections
lsof +D /var/log  # View files opened under /var/log directory
lsof +L1  # View files with link count < 1 (usually deleted but still occupied by process)

Example: View all files opened by nginx

lsof -c nginx

Example output:

COMMAND   PID     USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
nginx    1234     root  cwd    DIR    8,1     4096    2 /
nginx    1234     root  txt    REG    8,1   123456  789 /usr/sbin/nginx
nginx    1234     root    1w   REG    8,1    12345 1011 /var/log/nginx/access.log
nginx    1234     root    2w   REG    8,1     6789 1012 /var/log/nginx/error.log
nginx    1234     root    6u  IPv4  12345      0t0  TCP *:80 (LISTEN)

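FD: File descriptor (cwd is the current working directory, txt is the program's executable; otherwise the number is the descriptor and the trailing letter is the access mode: r for read, w for write, u for read/write. Here 1w and 2w are descriptors 1 and 2, conventionally stdout and stderr, opened for writing, and 6u is the listening socket)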
TYPE: Type (DIR is directory, REG is regular file, IPv4 is network socket)
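Much of what lsof -p reports can be read from /proc/<pid>/fd, where each entry is a symbolic link from a descriptor number to its target. Below is a minimal sketch, assuming Linux; `open_files` is an illustrative helper name, and inspecting other users' processes requires root.

```python
import os
import tempfile

def open_files(pid="self"):
    """Map descriptor number -> target (path, socket:[inode], pipe:[inode], ...)."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the descriptor was closed while we were listing
    return result

if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        for fd, target in sorted(open_files().items()):
            print(fd, target)
```

A target that ends in " (deleted)" is the same situation lsof flags as deleted: the file has no name anymore but the process still holds it open.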

6. Network Port Monitoring

ss: View Network Connections (Replaces netstat)

ss -tulnp  # View listening TCP/UDP ports and their owning processes

-t: TCP sockets
-u: UDP sockets
-l: Only listening sockets (LISTEN state)
-n: Display numeric addresses and ports (don't resolve hostnames or service names)
-p: Display the PID and name of the owning process (needs root to see other users' processes)

Example: Check port 80 listening status

ss -tulnp | grep :80

Example output:

tcp   LISTEN 0      128    *:80    *:*    users:(("nginx",pid=1234,fd=6))
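Under the hood, tools like this get their data from the kernel; one readable source is /proc/net/tcp, where addresses and ports are hexadecimal and state `0A` means LISTEN. The sketch below parses that format (IPv4 only; /proc/net/tcp6 uses the same layout). The function name and the `SAMPLE` text are made up for illustration.

```python
LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp

def listening_ports(proc_net_tcp_text):
    """Return the sorted TCP ports in LISTEN state from /proc/net/tcp-style text."""
    ports = set()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 4:
            continue
        local_address, state = fields[1], fields[3]
        if state == LISTEN:
            # local_address looks like 00000000:0050 (hex IP : hex port)
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return sorted(ports)

# Illustrative sample: one listener on port 80, one established connection
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:0050 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0100007F:9C40 0100007F:0050 01 00000000:00000000 00:00000000 00000000  1000        0 23456 1 0000000000000000 20 4 30 10 -1
"""

if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        print(listening_ports(f.read()))
```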

lsof: View Port Usage

lsof -i :80  # See which process is using port 80
lsof -i tcp  # View all TCP connections

7. Disk I/O Monitoring

iostat: Disk I/O Statistics

iostat -x 1  # Refresh every second, show extended info

Key metrics:

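%util: Percentage of time the device was busy (approaching 100% means the disk is very busy; SSD/NVMe devices serve requests in parallel, so also check await before concluding they are saturated)
await: Average time per I/O request, including queueing (milliseconds)
r/s, w/s: Read/write requests completed per second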
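These metrics are differences between two samples of the cumulative counters in /proc/diskstats. The following is a simplified sketch of how such figures can be derived; the function names and sample lines are illustrative, and recent iostat versions report separate r_await and w_await values rather than the single combined await computed here.

```python
def parse_diskstats_line(line):
    """Pick the counters we need from one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),      # reads completed
        "read_ms": int(f[6]),    # milliseconds spent reading
        "writes": int(f[7]),     # writes completed
        "write_ms": int(f[10]),  # milliseconds spent writing
        "busy_ms": int(f[12]),   # milliseconds the device had I/O in flight
    }

def io_metrics(before, after, interval_s):
    """Derive r/s, w/s, await (ms) and %util from two samples interval_s apart."""
    reads = after["reads"] - before["reads"]
    writes = after["writes"] - before["writes"]
    ios = reads + writes
    wait_ms = (after["read_ms"] - before["read_ms"]) + (after["write_ms"] - before["write_ms"])
    busy_ms = after["busy_ms"] - before["busy_ms"]
    return {
        "r/s": reads / interval_s,
        "w/s": writes / interval_s,
        "await": wait_ms / ios if ios else 0.0,
        "%util": 100.0 * busy_ms / (interval_s * 1000),
    }

if __name__ == "__main__":
    # Illustrative samples taken one second apart
    first = parse_diskstats_line("   8       0 sda 1000 0 8000 500 2000 0 16000 1500 0 900 2000")
    second = parse_diskstats_line("   8       0 sda 1100 0 8800 700 2300 0 18400 2300 0 1400 3000")
    print(io_metrics(first, second, 1.0))
```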

iotop: Real-Time View of Process Disk I/O

sudo iotop -o  # -o only shows processes with I/O

Process Control: Signals, Background Tasks, Priorities

1. kill: Send Signals

kill isn't just "kill process"; its essence is sending signals to processes.

Common signals:

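| Signal Number | Signal Name | Function | Example |
| --- | --- | --- | --- |
| 1 | SIGHUP | Hangup; terminates the process by default, but many daemons (like nginx) handle it as "reload config" | `kill -1 <PID>` |
| 2 | SIGINT | Interrupt (equivalent to Ctrl+C) | `kill -2 <PID>` |
| 9 | SIGKILL | Force terminate (cannot be caught or ignored, so the process gets no chance to clean up) | `kill -9 <PID>` |
| 15 | SIGTERM | Graceful terminate (can be caught, so the process can clean up and then exit) | `kill <PID>` (default) |
| 20 | SIGTSTP | Pause (equivalent to Ctrl+Z; the number is 20 on x86 and ARM) | `kill -20 <PID>` |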

Best practices:

1. First use kill <PID> (SIGTERM) to give the process a chance to clean up (like saving data, closing connections)
2. If the process doesn't respond, then use kill -9 <PID> (SIGKILL) to force terminate
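The "SIGTERM first, SIGKILL only if needed" practice is easy to automate. Here is a minimal sketch using Python's standard subprocess module; `stop_gracefully` and `start_sleeper` are illustrative names and the 5-second grace period is an arbitrary assumption.

```python
import signal
import subprocess
import sys

def stop_gracefully(proc, timeout=5.0):
    """Send SIGTERM, wait up to timeout seconds, then fall back to SIGKILL.

    Returns the exit status; on Linux a negative value -N means
    "terminated by signal N".
    """
    proc.terminate()  # sends SIGTERM
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # sends SIGKILL, which cannot be caught or ignored
        return proc.wait()

def start_sleeper():
    """Start a throwaway child process that just sleeps (stand-in for a real service)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

if __name__ == "__main__":
    child = start_sleeper()
    status = stop_gracefully(child)
    print("terminated by SIGTERM" if status == -signal.SIGTERM else f"exit status {status}")
```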

Example: Reload nginx config (without stopping service)

sudo kill -HUP "$(cat /run/nginx.pid)"  # Send SIGHUP to the nginx master process (the PID file path may differ on your system)
# Or
sudo nginx -s reload  # Nginx's convenient command
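Reloading on SIGHUP is a convention that each daemon implements itself by installing a signal handler. The sketch below shows the idea in Python; it is not how nginx is implemented, and `state`, `handle_sighup`, and `send_hup_to_self` are illustrative names. A real service would re-read its configuration where the counter is incremented.

```python
import os
import signal
import time

# Stand-in for real configuration state
state = {"reloads": 0}

def handle_sighup(signum, frame):
    """Called when SIGHUP arrives; a real daemon would re-read its config here."""
    state["reloads"] += 1

# Without this line, the default action for SIGHUP would terminate the process
signal.signal(signal.SIGHUP, handle_sighup)

def send_hup_to_self():
    """Send SIGHUP to our own PID, like `kill -HUP <PID>` from a shell."""
    os.kill(os.getpid(), signal.SIGHUP)
    time.sleep(0.1)  # give the interpreter a moment to run the handler
    return state["reloads"]

if __name__ == "__main__":
    print("reload count:", send_hup_to_self())
```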

2. Background Task Running

Method 1: Use `&`

./long_task.sh &  # Run in background (but it may be killed by SIGHUP when the SSH session ends)

Method 2: Use `nohup` (Recommended)

nohup ./long_task.sh &  # Run in background, continues after exiting SSH

nohup: "No hangup"; runs the command with SIGHUP ignored (SIGHUP is the signal sent when the SSH session disconnects)
If stdout is a terminal, output is appended to nohup.out in the current directory by default

If you don't need the output:

nohup ./long_task.sh > /dev/null 2>&1 &  # Discard output, don't save to file
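When a long-running task is launched from a script rather than typed into a shell, a similar effect can be achieved by detaching the child from the terminal. Note that this is a different technique from nohup: nohup makes the command ignore SIGHUP, while the sketch below starts the child in its own session so the terminal's hangup is not delivered to it. `start_detached` is an illustrative name.

```python
import os
import subprocess
import sys

def start_detached(argv):
    """Start argv in a new session with all standard streams on /dev/null."""
    return subprocess.Popen(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,   # same idea as > /dev/null
        stderr=subprocess.DEVNULL,   # same idea as 2>&1
        start_new_session=True,      # child calls setsid() and leaves our terminal's session
    )

if __name__ == "__main__":
    task = start_detached([sys.executable, "-c", "import time; time.sleep(30)"])
    print("started PID", task.pid, "in session", os.getsid(task.pid))
```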

Method 3: Use `screen` or `tmux` (Best Practice)

screen -S mysession  # Create a screen session
./long_task.sh  # Run task in screen
# Press Ctrl+A, then D, to detach the session (task continues running)
# After exiting SSH, task still runs

screen -r mysession  # Reconnect to session

Manage Background Tasks

jobs  # View background tasks
fg %1  # Bring task 1 to foreground
bg %1  # Continue task 1 in background (usually used after Ctrl+Z pause)

3. Adjust Process Priority (nice/renice)

Linux uses nice values to control process priority:

nice value range: -20 (highest priority) to 19 (lowest priority)
Default nice value: 0
Lower nice value = higher priority (easier to grab CPU)

Specify Priority at Startup (nice)

nice -n 10 ./cpu_intensive_task.sh  # Start with nice value 10 (lower priority)
nice -n -10 ./important_task.sh  # Start with nice value -10 (higher priority, needs root)

Adjust Running Process Priority (renice)

renice -n 10 -p <PID>  # Change the nice value of process <PID> to 10
renice -n -5 -p <PID>  # Increase priority (needs root)

Use cases:

Background backup tasks: Start with nice -n 19, don't affect normal business
Critical business processes: Use renice -n -10 to increase priority
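The same adjustment is available programmatically through the getpriority/setpriority system calls, which Python exposes in the os module. Below is a minimal sketch; `lower_priority` is an illustrative name and the increment of 5 is arbitrary. Raising a nice value (lowering priority) works for your own processes without root, while lowering it generally requires root.

```python
import os
import subprocess
import sys

def lower_priority(pid, increment=5):
    """Raise the nice value of pid by increment (capped at 19); return the new value."""
    current = os.getpriority(os.PRIO_PROCESS, pid)
    target = min(19, current + increment)
    os.setpriority(os.PRIO_PROCESS, pid, target)  # same effect as renice -n <target> -p <pid>
    return os.getpriority(os.PRIO_PROCESS, pid)

if __name__ == "__main__":
    # Throwaway child standing in for a CPU-heavy background job
    job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(5)"])
    print("nice before:", os.getpriority(os.PRIO_PROCESS, job.pid))
    print("nice after: ", lower_priority(job.pid))
    job.kill()
    job.wait()
```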

Special Process States: Orphan and Zombie Processes

Orphan Process

Definition: A child process whose parent has exited. It is adopted by PID 1 (systemd or init) or, on some systems, by a designated "subreaper" process such as a per-user systemd instance.

Example code (Python):

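import os
import time

def child_process():
    # First report: the parent is still alive, so PPID is the parent's PID
    print(f"Child: PID={os.getpid()}, PPID={os.getppid()}", flush=True)
    time.sleep(3)  # Wait for parent to exit
    # Second report: the parent is gone, so PPID is now the adopting process
    print(f"Child after parent exit: PID={os.getpid()}, PPID={os.getppid()}", flush=True)

if __name__ == "__main__":
    pid = os.fork()
    if pid > 0:
        # Parent process
        print(f"Parent: PID={os.getpid()}, Child PID={pid}", flush=True)
        time.sleep(1)  # Stay alive briefly so the child's first report sees the original PPID
        os._exit(0)  # Parent exits without waiting for the child
    else:
        # Child process
        child_process()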

Example output:

Parent: PID=1234, Child PID=1235
Child: PID=1235, PPID=1234
Child after parent exit: PID=1235, PPID=1  # PPID becomes 1 (adopted by systemd)

Are orphan processes harmful? Not necessarily. systemd adopts orphan processes and reaps them when they exit, so they do not turn into long-lived zombies.

Zombie Process

Definition: Process has exited, but parent hasn't called wait() to reap its exit status, causing process info to remain in process table.

Characteristics:

Uses no CPU and almost no memory (it has already exited)
But still occupies a process table entry and a PID (too many can exhaust the available PIDs)
State shows as Z (Zombie)

View zombie processes:

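ps -eo pid,ppid,stat,cmd | awk '$3 ~ /^Z/'  # List processes whose state starts with Z; PPID shows the parent that should reap them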

Example code (Python):

import os
import time

if __name__ == '__main__':
    pid = os.fork()
    if pid > 0:
        # Parent process
        print(f"Parent: PID={os.getpid()}, Child PID={pid} (will become zombie)")
        time.sleep(15)  # Parent pauses 15 seconds, child exits but not reaped (zombie state)
        os.wait()  # Reap zombie process
        print("Zombie child has been reaped.")
    else:
        # Child process
        print(f"Child: PID={os.getpid()}, PPID={os.getppid()}", flush=True)  # Flush: os._exit() skips buffer cleanup
        os._exit(0)  # Child exits immediately, enters zombie state

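How to resolve zombie processes?

1. Make the parent call wait() (if the parent is your program, fix the code)
2. Kill the parent process (after the parent exits, the child is adopted by systemd and reaped)
3. Reboot the system (last resort)

Note that a zombie itself cannot be killed, not even with kill -9, because it has already exited; only its parent (or the process that adopts it) can remove it.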
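The life cycle can be observed directly: after the child exits its state is Z, and once the parent calls wait() the entry disappears. This is a minimal sketch for Linux; `process_state` and `zombie_demo` are illustrative names, and the state is read from /proc/<pid>/stat.

```python
import os
import time

def process_state(pid):
    """Return the state letter of pid from /proc/<pid>/stat, or None if it is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except OSError:
        return None
    # The state is the first field after the parenthesised command name
    return data.rpartition(")")[2].split()[0]

def zombie_demo():
    """Fork a child that exits at once; return its state before and after reaping."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child exits immediately
    # Give the kernel a moment to finish tearing the child down
    deadline = time.monotonic() + 5
    while process_state(pid) != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
    before = process_state(pid)  # exited but not yet reaped
    os.waitpid(pid, 0)           # the parent reaps the child
    after = process_state(pid)   # the process table entry is gone
    return before, after

if __name__ == "__main__":
    print(zombie_demo())
```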

Hands-On: Complete Performance Troubleshooting Workflow

Scenario: System Slowed Down, How to Troubleshoot

1. Check Overall Load

uptime  # Check load average
top  # Real-time view of CPU, memory, processes

Diagnosis:

load average high? Possibly CPU maxed or I/O slow
CPU idle low? CPU bottleneck
CPU wa high? Disk I/O slow

2. Find Resource-Consuming Processes

top  # Press P to sort by CPU, M to sort by memory
ps aux --sort=-%cpu | head -n 11  # View top 10 CPU-consuming processes (plus the header line)
ps aux --sort=-%mem | head -n 11  # View top 10 memory-consuming processes (plus the header line)

3. View Process Details

lsof -p <PID>  # View files opened by process
ls -l /proc/<PID>/fd/  # View process file descriptors
cat /proc/<PID>/status  # View detailed process status

4. Check Disk I/O

iostat -x 1  # View disk I/O
sudo iotop -o  # See which process is reading/writing disk

5. Check Network Connections

ss -tulnp  # View listening ports
lsof -i  # View all network connections

6. Optimize or Terminate Process

renice -n 10 -p <PID>  # Lower process priority
kill <PID>  # Gentle terminate
kill -9 <PID>  # Force terminate

Real Case: What to Do When Nginx Log File Is Accidentally Deleted

Scenario

An ops engineer accidentally ran rm -rf /var/log/nginx/access.log while the nginx process was still running.

Problem

Although the file was deleted, the nginx process still holds an open file handle to it (visible under /proc/<pid>/fd/) and keeps writing data to the "deleted" file. At this point:

df -h shows disk usage hasn't decreased (because the file still occupies space)
ls /var/log/nginx/ doesn't show access.log (because the directory entry was deleted)

Solution

1. Find nginx Process PID

pidof nginx
# Or
ps aux | grep nginx

Suppose the master process PID is 1234.

2. View Files Opened by Process

lsof -p 1234 | grep access.log

Example output:

nginx 1234 root 6w REG 8,1 123456789 /var/log/nginx/access.log (deleted)

6w: File descriptor is 6, mode is w (write)
(deleted): File was deleted but process still holds handle

3. Recover File

sudo cp /proc/1234/fd/6 /var/log/nginx/access.log

4. Reload nginx

sudo nginx -s reload  # Make nginx reopen its log file (nginx -s reopen does this without reloading the config)

Principle:

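Deleting a file only removes its directory entry (the filename); the inode and data blocks remain until the last process holding the file open closes it
Via /proc/<pid>/fd/<fd> you can access files a process has open (even if they were deleted)
After copying the file back with cp, reload nginx so it reopens the log file at its path; anything nginx writes between the copy and the reload still goes to the old, deleted file and is lost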
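The principle can be reproduced safely in a scratch directory without touching a real server. In this minimal sketch (Linux only; `recover_deleted` and `demo` are illustrative names, and the log line is made up), a file is opened, deleted, and then recovered through /proc/self/fd while the descriptor is still open.

```python
import os
import shutil
import tempfile

def recover_deleted(fd, destination):
    """Copy the data behind an open descriptor of this process to destination."""
    shutil.copyfile(f"/proc/self/fd/{fd}", destination)

def demo():
    workdir = tempfile.mkdtemp()
    path = os.path.join(workdir, "access.log")

    log = open(path, "w")             # plays the role of nginx holding its log open
    log.write("GET / 200\n")
    log.flush()

    os.remove(path)                   # the "accidental rm": only the name is removed
    missing = not os.path.exists(path)

    recover_deleted(log.fileno(), path)  # same idea as cp /proc/<pid>/fd/<fd> <path>
    log.close()

    with open(path) as restored:
        content = restored.read()
    shutil.rmtree(workdir)            # clean up the scratch directory
    return missing, content

if __name__ == "__main__":
    print(demo())
```

Recovery only works while some process still has the file open; once the last descriptor is closed, the kernel frees the inode and its data blocks.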

Summary and Further Reading

This article covers the core content of Linux process and resource management:

1. ✅ Basic concepts of processes and programs (process vs program vs thread, parent-child relationships)
2. ✅ Linux resource management overview (CPU/memory/disk/network)
3. ✅ Buffer and Cache explained (why "out of memory" is often a misjudgment)
4. ✅ Process monitoring toolchain (top/htop/ps/pstree/lsof/ss/iostat)
5. ✅ Process control (kill signals, background tasks, priority adjustment)
6. ✅ Special process states (orphan processes, zombie processes)
7. ✅ Real cases (performance troubleshooting workflow, Nginx log recovery)

Further Reading:

Linux Performance (Brendan Gregg): http://www.brendangregg.com/linuxperf.html
man proc: View detailed explanation of /proc filesystem
man 7 signal: View explanation of all signals

Next Steps:

"Linux Disk Management": Learn partitioning, formatting, mounting, LVM, RAID, etc.
"Linux User Management": Learn how to manage users/groups/permissions

By this point, you should have upgraded from "can view top" to "can quickly locate resource bottlenecks, can optimize process priorities, can handle abnormal process states." Process and resource management is a core Linux ops skill; mastering it allows you to better troubleshoot performance issues.
