Linux Process and Resource Management: Monitoring, Troubleshooting, and Optimization
Chen Kai

In production troubleshooting, the most critical skill isn't memorizing commands but quickly mapping symptoms to resources and processes: is the CPU maxed out, is memory really exhausted or just being used as cache, is disk I/O blocking, and exactly which process, file, or port is slowing the system down. This post starts with the basic concepts of processes, threads, and parent-child relationships, then explains Linux's view of resources (especially the meaning of buffer/cache and why "out of memory" is often a misjudgment). It then walks through a commonly used monitoring and diagnosis toolchain (top/htop/ps/pstree/lsof, ports and network, I/O, load). Next it covers process control: signals and background tasks, nice/renice priorities, and the causes and handling of orphan and zombie processes. Finally, a complete troubleshooting case (what to do when an Nginx log file is accidentally deleted) applies the resource perspective to a real scenario. If you're a sysadmin or need to troubleshoot performance issues, this article aims to take you from "can read top" to "can quickly locate resource bottlenecks, adjust process priorities, and handle abnormal process states."

Basic Concepts: Process vs Program vs Thread

The Three Concepts Often Confused

Understanding the differences between these three is important for grasping Linux systems:

Concept | Definition | Metaphor
Program | Static executable file stored on disk (like /usr/bin/vim) | Architectural blueprint
Process | Running instance of a program loaded into memory (has a PID, memory space, open files) | Construction site in progress (foreman)
Thread | Execution unit within a process (shares the process's memory but has its own execution flow) | Workers on site

Examples:

  • When you run vim myfile.txt, the vim program loads from disk into memory, creating a process responsible for editing myfile.txt.
  • The same program can start multiple processes simultaneously; for example, when opening multiple browser tabs, each tab might correspond to an independent process (or multiple threads).
  • A process can contain multiple threads; for example, a music player process might have two threads: one downloading music, one playing songs.

Why have threads?

  • Threads are lighter than processes (lower creation/destruction overhead)
  • Threads share process memory space (easier communication)
  • Multi-threading can fully utilize multi-core CPUs (parallel computing)
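
The difference between "threads share memory" and "processes are isolated" can be seen in a few lines. This is a minimal sketch, assuming a Unix-like system where os.fork() is available; the counter dictionary and the bump function are illustrative names, not something from a real tool.

```python
import os
import threading

counter = {"value": 0}

def bump(amount):
    """Modify the module-level counter in whichever process runs this."""
    counter["value"] += amount

# A thread runs inside the same process, so its change is visible here.
worker = threading.Thread(target=bump, args=(1,))
worker.start()
worker.join()
after_thread = counter["value"]

# A forked child gets its own copy of memory; its change stays in the child.
pid = os.fork()
if pid == 0:
    bump(100)
    os._exit(0)      # leave the child without running the parent's code
os.waitpid(pid, 0)   # reap the child so it does not linger as a zombie
after_fork = counter["value"]

print(f"after thread: {after_thread}, after fork: {after_fork}")
```

The thread's increment shows up in the parent, while the child's increment of 100 does not, because the child modified its own copy of the dictionary.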

Five Key Process Characteristics

  1. Independence: Each process has its own memory space and system resources, isolated from each other (process A's variables won't affect process B)
  2. Concurrency: The OS lets multiple processes make progress at the same time through multitasking scheduling
  3. Dynamism: Processes are continuously created, executed, and terminated; their state changes in real time
  4. Parent-Child Relationship: Processes are created by a parent process via the fork() call, forming a parent-child structure (the PPID field identifies the parent)
  5. Schedulability: The OS uses scheduling policies (such as time slicing and priority scheduling) to decide which process runs next

Process Parent-Child Relationship (PID and PPID)

Every process has two important IDs:

  • PID (Process ID): Process ID, uniquely identifies a process
  • PPID (Parent Process ID): Parent process ID, identifies the parent process that created this process

Example:

ps -ef | grep bash

Example output:

UID    PID  PPID  C STIME TTY      TIME     CMD
root  1234     1  0 12:00 ?        00:00:00 /bin/bash /usr/local/bin/startup.sh
user  5678  1234  0 12:05 pts/0    00:00:00 bash

  • The process with PID 5678 is bash; its PPID is 1234 (the parent is /bin/bash /usr/local/bin/startup.sh)
  • All user-space processes can ultimately be traced back to PID 1 (systemd or init); kernel threads descend from PID 2 (kthreadd)

View process tree:

pstree -p  # Display process parent-child relationships in tree structure (-p shows PID)
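
The same parent chain can be followed by hand through /proc, which is where ps and pstree get their data. A minimal sketch, assuming a Linux system with /proc mounted; parse_stat and ancestry are hypothetical helper names chosen for this example.

```python
import os

def parse_stat(stat_text):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, rest = stat_text.split("(", 1)
    comm, tail = rest.rsplit(")", 1)
    fields = tail.split()
    return int(head), comm, fields[0], int(fields[1])

def ancestry(pid):
    """Walk the PPID chain from pid upward, returning [(pid, comm), ...]."""
    chain = []
    while pid > 0:
        with open(f"/proc/{pid}/stat") as f:
            _, comm, _, ppid = parse_stat(f.read())
        chain.append((pid, comm))
        pid = ppid  # PID 1 reports PPID 0, which ends the loop
    return chain

if __name__ == "__main__":
    for pid, comm in ancestry(os.getpid()):
        print(pid, comm)
```

Run from an interactive shell, the chain typically goes from python3 through the shell and the terminal or sshd up to PID 1.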


Linux Resource Management Overview: CPU/Memory/Disk/Network

Operations work revolves around hardware and software resources; managing them properly keeps the system running efficiently and stably.

Four Major Hardware Resource Categories

1. CPU Resources

  • Core count: Modern CPUs are typically multi-core (like 4-core, 8-core, 16-core)
  • Load: Average number of tasks running or waiting to run (load average); on Linux this also counts tasks blocked in uninterruptible I/O wait
  • Utilization: Percentage of CPU occupied by processes

Check CPU core count:

lscpu | grep '^CPU(s)'  # Output: CPU(s): 4
nproc # Output core count: 4

Check CPU load:

uptime  # Output: 06:56:12 up 12 days, 3:45, 3 users, load average: 0.22, 0.45, 0.56

Interpreting load average (example for a 4-core CPU):

  • load average: 0.22, 0.45, 0.56: Average load over the last 1, 5, and 15 minutes
  • Load < core count (load < 4 on 4 cores): The system has spare capacity
  • Load ≈ core count (load ≈ 4 on 4 cores): The system is at full capacity
  • Load > core count (load > 4 on 4 cores): The system is overloaded (tasks are queuing)

High load but low CPU usage? On Linux, load average counts tasks in uninterruptible sleep (state D, usually waiting on disk or network-filesystem I/O) as well as runnable tasks, so this pattern usually points to an I/O bottleneck rather than a CPU bottleneck.
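
The comparison against core count is easy to automate. The following is a small sketch rather than a monitoring tool; classify_load and its three labels are names invented for this example, and the cut-off at exactly 1.0 load per core is a simplification of the rule of thumb above.

```python
import os

def classify_load(load, cores):
    """Compare a load average with the number of CPU cores."""
    per_core = load / cores
    if per_core < 1.0:
        return "headroom"
    if per_core > 1.0:
        return "overloaded"
    return "at capacity"

if __name__ == "__main__":
    cores = os.cpu_count() or 1
    one, five, fifteen = os.getloadavg()   # same numbers that uptime prints
    for label, value in (("1 min", one), ("5 min", five), ("15 min", fifteen)):
        print(f"{label}: {value:.2f} on {cores} cores -> {classify_load(value, cores)}")
```

Comparing the three windows also shows the trend: a 1-minute value well above the 15-minute value means load is rising.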

2. Memory Resources

  • Total memory: Total physical memory
  • Used memory: Allocated memory
  • Available memory: Actually available memory (includes reclaimable buffer/cache)
  • Swap: Swap space (virtual memory on hard drive, slow)

Check memory usage:

free -h  # -h human-readable display (MB/GB)

Example output:

               total        used        free      shared  buff/cache   available
Mem:            15Gi       2.5Gi       8.0Gi       100Mi       4.5Gi        12Gi
Swap:          2.0Gi          0B       2.0Gi

Important concept: buffer and cache (detailed in next section)

3. Disk Resources

  • Capacity: Total storage space provided by hard drive or SSD
  • Read/Write Performance:
    • Mechanical Hard Drive (HDD): Large capacity, low price, slow speed (100-200 MB/s)
    • Solid State Drive (SSD): Fast speed, high price, relatively small capacity (500-3000 MB/s)
    • NVMe SSD: Even faster (3000-7000 MB/s)

Check disk usage:

df -h  # View partition usage
lsblk # List block devices and mount points
du -sh /* # View space occupied by each directory under root
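
For scripts that need the same numbers as df, Python's standard library exposes them directly. A minimal sketch; the report function and its dictionary keys are illustrative. Note that df computes Use% against the space available to unprivileged users, so its percentage can differ slightly from the one below on filesystems with reserved blocks.

```python
import shutil

def percent_used(total, used):
    """Used space as a percentage of total space."""
    return 100.0 * used / total if total else 0.0

def report(path="/"):
    """Return capacity figures (in bytes) for the filesystem containing path."""
    usage = shutil.disk_usage(path)   # named tuple: total, used, free
    return {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "percent_used": percent_used(usage.total, usage.used),
    }

if __name__ == "__main__":
    info = report("/")
    print(f"/ is {info['percent_used']:.1f}% full, "
          f"{info['free'] / 2**30:.1f} GiB free")
```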

Check disk I/O:

iostat -x 1  # Refresh every second (requires sysstat package)
iotop # Real-time view of each process's disk I/O (requires root privileges)

4. Network Resources

  • Bandwidth: Network interface maximum transfer rate (like 1 Gbps, 10 Gbps)
  • Throughput: Actual transfer rate
  • Latency: Packet round-trip time (RTT)

Check network traffic:

iftop -i eth0  # Real-time display of network traffic (needs iftop installation)
ip -s link # View interface statistics (sent/received packets, dropped packets)

Check network connections:

ss -tulnp  # List listening TCP/UDP sockets and their owning processes (replaces netstat)
lsof -i :80 # See which process is using port 80


Buffer and Cache Explained: Why "Out of Memory" Is Often a Misjudgment

Linux memory management is aggressive: use as much memory as possible to cache data, improving performance. So you'll find free shows very little free, but this doesn't mean you're out of memory.

Buffer vs Cache

Type | What it holds | Example
Buffer | Kernel buffers for raw block-device I/O and filesystem metadata | Superblocks and inode tables read from, or waiting to be written to, the disk
Cache | Page cache: file contents kept in memory (plus reclaimable kernel slab memory) | A file you just read is served from memory next time; data you write sits in the cache as "dirty" pages until it is flushed to disk

Why this design?

  • Writes: If every write went straight to disk it would be too slow (especially with many small writes). The kernel first collects modified data in memory, then writes it to disk in batches, which is much faster.
  • Reads: Frequently accessed files stay in memory, where reads are orders of magnitude faster than going to disk.

Important: Buffer and cache memory is largely reclaimable. When programs need more memory, the kernel shrinks the cache automatically (writing dirty pages back to disk first). That is why the available column in free, the kernel's estimate of how much memory new programs can get without swapping, is the number to watch rather than free.

Misjudgment example:

               total        used        free      shared  buff/cache   available
Mem:            15Gi       2.5Gi       1.0Gi       100Mi      11.5Gi        12Gi

Beginners see that free is only 1.0Gi and think memory is running out. In fact available is 12Gi, because most of the 11.5Gi of buff/cache can be reclaimed.

When is memory truly running out?

  • available approaches 0
  • Swap usage is very high (indicates insufficient memory, starting to use hard drive as memory)
  • Processes are killed by OOM killer (kernel's out-of-memory killer)
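
The columns that free prints come from /proc/meminfo, so the "free versus available" distinction can be checked directly. A minimal sketch, assuming a reasonably recent Linux kernel that exposes MemAvailable; parse_meminfo and summarize are hypothetical helper names, and the buff/cache sum follows the definition in the free man page (Buffers + Cached + SReclaimable).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of {field: value in kB}."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values

def summarize(info):
    """Reproduce the free columns that matter for the 'out of memory' question."""
    buff_cache = info["Buffers"] + info["Cached"] + info.get("SReclaimable", 0)
    return {
        "total_kb": info["MemTotal"],
        "free_kb": info["MemFree"],
        "buff_cache_kb": buff_cache,
        "available_kb": info["MemAvailable"],
        "available_pct": round(100 * info["MemAvailable"] / info["MemTotal"], 1),
    }

if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        summary = summarize(parse_meminfo(f.read()))
    for key, value in summary.items():
        print(f"{key:>14}: {value}")
```

A low free_kb next to a high available_pct is the harmless case described above; a low available_pct is the one worth investigating.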

Process Monitoring Toolchain: From Overview to Details

1. top: The "Swiss Army Knife" of Real-Time Monitoring

top is the most commonly used real-time monitoring tool, displaying CPU, memory, process info, etc.

Basic usage:

top

Interface interpretation:

top - 12:00:00 up 10 days,  3:45,  2 users,  load average: 1.23, 0.87, 0.45
Tasks: 150 total,   2 running, 148 sleeping,   0 stopped,   0 zombie
%Cpu(s):  5.2 us,  2.1 sy,  0.0 ni, 92.3 id,  0.3 wa,  0.0 hi,  0.1 si,  0.0 st
MiB Mem :  15872.0 total,   8234.5 free,   3456.2 used,   4181.3 buff/cache
MiB Swap:   2048.0 total,   2048.0 free,      0.0 used.  11234.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   1234 root      20   0  123456  12345   1234 R  50.0   0.8   1:23.45 python3
   5678 www-data  20   0  234567  23456   2345 S  10.0   1.5   0:12.34 nginx

Key metrics:

  • load average: 1/5/15-minute average load (approaching CPU core count means system at full capacity)
  • Tasks: Total processes, running/sleeping/stopped/zombie process counts
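  • %Cpu(s):
    • us (user): User-space CPU usage
    • sy (system): Kernel-space CPU usage
    • ni (nice): CPU used by user processes with a raised nice value (lowered priority)
    • id (idle): Idle CPU (higher means more headroom)
    • wa (iowait): Time the CPU sat idle waiting for I/O to complete (high values point to slow disk or network storage)
    • hi / si: Time spent servicing hardware / software interrupts
    • st (steal): Time taken by the hypervisor for other virtual machines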
  • Mem/Swap: Memory and swap space usage
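
The %Cpu(s) line is derived from the counters on the first line of /proc/stat: take two samples and express each counter's increase as a share of the total increase. A minimal sketch of that calculation, assuming Linux; read_cpu_times and cpu_percentages are names made up for this example, and top's own accounting may differ in small details.

```python
import time

# Order of the first eight counters on the aggregate "cpu" line of /proc/stat,
# labelled with the abbreviations top uses.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")

def read_cpu_times(stat_text):
    """Return the first eight counters of the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(x) for x in line.split()[1:9]]
    raise ValueError("no aggregate cpu line found")

def cpu_percentages(prev, cur):
    """Share of each CPU state between two samples, in percent."""
    deltas = [c - p for p, c in zip(prev, cur)]
    total = sum(deltas) or 1
    return {name: round(100.0 * d / total, 1) for name, d in zip(FIELDS, deltas)}

if __name__ == "__main__":
    with open("/proc/stat") as f:
        first = read_cpu_times(f.read())
    time.sleep(1)
    with open("/proc/stat") as f:
        second = read_cpu_times(f.read())
    print(cpu_percentages(first, second))
```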

Common hotkeys:

  • P: Sort by CPU usage
  • M: Sort by memory usage
  • k: Enter PID to send signal to terminate process
  • 1: Show each CPU core's usage rate
  • q: Quit

2. htop: Enhanced Version of top

htop is a colorful enhanced version of top, supporting mouse operations, tree view, direct process termination.

Install and use:

sudo apt install htop  # Debian/Ubuntu
sudo dnf install htop  # CentOS/RHEL (htop comes from the EPEL repository, which may need enabling first)
htop

Advantages:

  • Colorful interface, more intuitive
  • Supports mouse clicking to select processes
  • Displays process tree (F5 to toggle tree view)
  • Can directly select and terminate processes (F9 to send signal)

3. ps: Static Process Snapshot

ps provides a static snapshot of current processes (doesn't refresh in real-time like top).

Common usage:

ps -ef  # Unix style: every process (-e) in full format (-f)
ps aux  # BSD style: processes of all users (a), user-oriented columns (u), including processes without a terminal (x)

Output field interpretation (ps aux):

  • USER: Process owner
  • PID: Process ID
  • %CPU: CPU usage
  • %MEM: Memory usage
  • VSZ: Virtual memory size in KiB (address space the process has mapped; not all of it is backed by RAM)
  • RSS: Resident set size in KiB (physical memory the process actually occupies)
  • STAT: Process state
    • R: Running
    • S: Sleeping (waiting for event)
    • D: Uninterruptible sleep (usually waiting for disk I/O)
    • Z: Zombie (exited but not reaped by parent)
    • T: Stopped (usually paused by Ctrl+Z)
  • TIME: Process cumulative CPU time
  • COMMAND: Process command
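
Most of these fields can also be read per process from /proc/<pid>/status, which is handy in scripts. A minimal sketch, assuming Linux; read_status, parse_status, and summarize_status are hypothetical helper names. Kernel threads have no VmSize or VmRSS lines, hence the defaults.

```python
import os

def parse_status(text):
    """Turn the 'Key:\\tvalue' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value.strip()
    return fields

def summarize_status(fields):
    """Pick out the values that correspond to ps columns."""
    return {
        "name": fields["Name"],
        "state": fields["State"].split()[0],                    # STAT letter, e.g. S
        "ppid": int(fields["PPid"]),
        "vsz_kb": int(fields.get("VmSize", "0 kB").split()[0]),  # VSZ
        "rss_kb": int(fields.get("VmRSS", "0 kB").split()[0]),   # RSS
    }

def read_status(pid="self"):
    with open(f"/proc/{pid}/status") as f:
        return parse_status(f.read())

if __name__ == "__main__":
    print(summarize_status(read_status(os.getpid())))
```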

Advanced usage:

ps -ef | grep nginx                 # View nginx-related processes (the grep itself also shows up)
ps -ef | grep nginx | grep -v grep  # Same, with grep's own process filtered out
ps -eo pid,ppid,cmd,%cpu,%mem --sort=-%cpu | head -11  # Sort by CPU usage: header line plus top 10

4. pstree: Process Tree

pstree displays process parent-child relationships in tree structure, helping understand process hierarchy.

Basic usage:

pstree -p  # -p displays PID
pstree -ap # -a displays command parameters

5. lsof: View Open Files

lsof (List Open Files) lists all open files in the system, including regular files, network connections, devices, etc.

Why use lsof?

  • View which files a process has opened (like config files, log files, database files)
  • View which process is using a port (like who's using port 80)
  • View which process is using a file (like if a file can't be deleted, might be in use by a process)
  • Recover accidentally deleted files (if process is still running, file handle still exists, can recover via /proc/<pid>/fd/)

Common usage:

lsof  # List all open files (very long output)
lsof -p <PID> # View files opened by specified process
lsof -u <user> # View files opened by specified user
lsof -c <command> # View files opened by specified command
lsof -i :80 # See which process is using port 80
lsof -i tcp # View all TCP connections
lsof +D /var/log # View files opened under /var/log directory
lsof +L1 # View files with link count < 1 (usually deleted but still occupied by process)

Example: View all files opened by nginx

lsof -c nginx

Example output:

COMMAND  PID USER  FD   TYPE DEVICE SIZE/OFF NODE NAME
nginx   1234 root  cwd  DIR     8,1     4096    2 /
nginx   1234 root  txt  REG     8,1   123456  789 /usr/sbin/nginx
nginx   1234 root  1w   REG     8,1    12345 1011 /var/log/nginx/access.log
nginx   1234 root  2w   REG     8,1     6789 1012 /var/log/nginx/error.log
nginx   1234 root  6u   IPv4  12345      0t0  TCP *:80 (LISTEN)

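  • FD: File descriptor. cwd is the current working directory and txt is the program's executable; a number is the descriptor number, followed by the access mode (r read, w write, u read/write). So 1w and 2w are descriptors 1 and 2 (stdout and stderr) opened for writing, and 6u is a socket opened for reading and writing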
  • TYPE: Type (DIR is directory, REG is regular file, IPv4 is network socket)
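
Much of what lsof -p reports is visible under /proc/<pid>/fd, where every open descriptor appears as a symlink to its target. A minimal sketch, assuming Linux and permission to read the target process's fd directory (your own processes, or root); list_open_files is a name invented for this example.

```python
import os

def list_open_files(pid="self"):
    """Return {fd: target} for every descriptor under /proc/<pid>/fd."""
    fd_dir = f"/proc/{pid}/fd"
    result = {}
    for name in os.listdir(fd_dir):
        try:
            result[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except FileNotFoundError:
            continue  # descriptor was closed while we were listing
    return result

if __name__ == "__main__":
    for fd, target in sorted(list_open_files().items()):
        print(f"{fd:>3} -> {target}")
```

Targets of unlinked files end in " (deleted)", the same marker lsof shows in its NAME column.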

6. Network Port Monitoring

ss: View Network Connections (Replaces netstat)

ss -tulnp  # List listening TCP/UDP sockets with their owning processes

  • -t: TCP sockets
  • -u: UDP sockets
  • -l: Listening sockets only
  • -n: Display numeric addresses and ports (don't resolve names)
  • -p: Display the owning process's PID and name (run with sudo to see other users' processes)

Example: Check port 80 listening status

sudo ss -tulnp 'sport = :80'  # Filter on the exact port (grep :80 would also match :8080)

Example output:

tcp   LISTEN 0   128   *:80   *:*   users:(("nginx",pid=1234,fd=6))

lsof: View Port Usage

lsof -i :80  # See which process is using port 80
lsof -i tcp # View all TCP connections
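
From a script, a quick way to ask "is anything listening on this port?" is simply to try connecting. This is only a sketch: port_is_listening is a hypothetical helper, it covers TCP over IPv4 only, and unlike ss -p or lsof -i it cannot tell you which process owns the port.

```python
import socket

def port_is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0   # 0 means the connect succeeded

if __name__ == "__main__":
    for port in (22, 80, 443):
        state = "open" if port_is_listening(port) else "closed"
        print(f"port {port}: {state}")
```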

7. Disk I/O Monitoring

iostat: Disk I/O Statistics

iostat -x 1  # Refresh every second, show extended info

Key metrics:

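  • %util: Percentage of time the device was busy (near 100% means a single disk is saturated; SSDs and RAID arrays that serve requests in parallel may still have headroom)
  • await: Average time per I/O request in milliseconds, including time spent queued (shown as r_await/w_await in newer sysstat versions)
  • r/s, w/s: Read/write requests completed per second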
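
These metrics are differences between two readings of /proc/diskstats, divided by the sampling interval. The sketch below shows the arithmetic under that assumption; parse_diskstats_line and disk_metrics are names chosen for this example, and iostat's exact rounding and per-direction await values are not reproduced.

```python
def parse_diskstats_line(line):
    """Extract the counters used below from one /proc/diskstats line."""
    f = line.split()
    return {
        "name": f[2],
        "reads": int(f[3]),       # reads completed
        "read_ms": int(f[6]),     # milliseconds spent reading
        "writes": int(f[7]),      # writes completed
        "write_ms": int(f[10]),   # milliseconds spent writing
        "io_ms": int(f[12]),      # milliseconds the device had I/O in flight
    }

def disk_metrics(prev, cur, interval_s):
    """Derive iostat-style rates from two samples taken interval_s apart."""
    reads = cur["reads"] - prev["reads"]
    writes = cur["writes"] - prev["writes"]
    wait_ms = (cur["read_ms"] - prev["read_ms"]) + (cur["write_ms"] - prev["write_ms"])
    busy_ms = cur["io_ms"] - prev["io_ms"]
    ios = reads + writes
    return {
        "r/s": reads / interval_s,
        "w/s": writes / interval_s,
        "await_ms": wait_ms / ios if ios else 0.0,
        "util_pct": min(100.0, 100.0 * busy_ms / (interval_s * 1000)),
    }
```

To use it on a live system, read the line for one device from /proc/diskstats, sleep for the interval, read it again, and pass both parsed samples to disk_metrics.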

iotop: Real-Time View of Process Disk I/O

sudo iotop -o  # -o only shows processes with I/O

Process Control: Signals, Background Tasks, Priorities

1. kill: Send Signals

kill isn't just "kill process"; its essence is sending signals to processes.

Common signals:

Number | Name | Function | Example
1 | SIGHUP | Hangup; the default action is to terminate, but many daemons (nginx included) handle it as "reload config" | kill -1 <PID> or kill -HUP <PID>
2 | SIGINT | Interrupt (equivalent to Ctrl+C) | kill -2 <PID>
9 | SIGKILL | Force terminate (cannot be caught or ignored; no cleanup) | kill -9 <PID>
15 | SIGTERM | Graceful terminate (process can catch it, clean up, then exit) | kill <PID> (default)
20 | SIGTSTP | Pause (equivalent to Ctrl+Z; number shown is for x86/ARM Linux) | kill -20 <PID>

Best practices:

  1. First use kill <PID> (SIGTERM) to give the process a chance to clean up (save data, close connections)
  2. If the process doesn't respond, then use kill -9 <PID> (SIGKILL) to force termination
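
The "SIGTERM first, SIGKILL only if needed" rule translates directly into code when a script manages its own child processes. A minimal sketch using Python's subprocess module; stop_process and the 5-second grace period are assumptions made for this example, not a standard API.

```python
import subprocess
import sys

def stop_process(proc, grace_seconds=5.0):
    """Ask proc to exit with SIGTERM; escalate to SIGKILL after grace_seconds.

    Returns the name of the last signal that had to be sent.
    """
    proc.terminate()                    # sends SIGTERM
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        proc.kill()                     # sends SIGKILL, cannot be caught
        proc.wait()                     # reap the child
        return "SIGKILL"

if __name__ == "__main__":
    sleeper = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("stopped with", stop_process(sleeper))
```

Calling wait() after either signal also reaps the child, so the helper does not leave a zombie behind.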

Example: Reload nginx config (without stopping service)

sudo kill -HUP "$(cat /run/nginx.pid)"  # Send SIGHUP to the master process (the pid file path may differ on your system)
# Or
sudo nginx -s reload  # Nginx's own shortcut for the same thing

2. Background Task Running

Method 1: Use &

./long_task.sh &  # Run in background (but it is usually terminated when the SSH session ends)

Method 2: Use nohup

nohup ./long_task.sh &  # Run in background, keeps running after the SSH session ends

  • nohup: "No Hangup"; makes the command ignore SIGHUP (the signal sent when the SSH session disconnects)
  • If stdout is a terminal, output is appended to nohup.out by default

If you don't need the output, discard it explicitly:

nohup ./long_task.sh > /dev/null 2>&1 &  # Discard output instead of writing nohup.out
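
A script that launches long-running helpers can get a similar effect without the nohup command. The sketch below is an approximation, not an equivalent: instead of ignoring SIGHUP, it starts the child in its own session so that it is no longer part of the terminal's session. start_detached is a hypothetical helper name.

```python
import os
import subprocess
import sys

def start_detached(argv, log_path=os.devnull):
    """Start argv in its own session with output appended to log_path."""
    log = open(log_path, "ab")
    try:
        return subprocess.Popen(
            argv,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,   # child calls setsid(): no controlling terminal
        )
    finally:
        log.close()   # the child keeps its own copy of the descriptor

if __name__ == "__main__":
    child = start_detached([sys.executable, "-c", "import time; time.sleep(5)"])
    print("started PID", child.pid, "in session", os.getsid(child.pid))
```

Passing a real file as log_path instead of the default keeps the output for later inspection, much like nohup.out does.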

Method 3: Use screen or tmux (Best Practice)

screen -S mysession  # Create a screen session
./long_task.sh       # Run the task inside screen
# Press Ctrl+A, then D, to detach the session (the task keeps running)
# The task also keeps running after you exit SSH

screen -r mysession  # Reattach to the session

Manage Background Tasks

jobs  # View background tasks
fg %1 # Bring task 1 to foreground
bg %1 # Continue task 1 in background (usually used after Ctrl+Z pause)

3. Adjust Process Priority (nice/renice)

Linux uses nice values to control process priority:

  • nice value range: -20 (highest priority) to 19 (lowest priority)
  • Default nice value: 0
  • Lower nice value = higher priority (easier to grab CPU)

Specify Priority at Startup (nice)

nice -n 10 ./cpu_intensive_task.sh    # Start with nice value 10 (lower priority)
sudo nice -n -10 ./important_task.sh  # Start with nice value -10 (higher priority, needs root)

Adjust Running Process Priority (renice)

renice -n 10 -p <PID>       # Set the process's nice value to 10
sudo renice -n -5 -p <PID>  # Raise priority (negative values need root)

Use cases:

  • Background backup tasks: Start with nice -n 19, don't affect normal business
  • Critical business processes: Use renice -n -10 to increase priority
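
Both operations are available from Python's os module, which is convenient when a job scheduler starts the tasks. A minimal sketch, assuming a Unix-like system; run_with_nice and renice are helper names invented here. As with the shell commands, an unprivileged user can only raise the nice value (lower the priority).

```python
import os
import subprocess
import sys

def run_with_nice(argv, increment=10):
    """Run argv with its nice value raised by increment (lower priority)."""
    return subprocess.run(
        argv,
        preexec_fn=lambda: os.nice(increment),  # runs in the child before exec
        capture_output=True,
        text=True,
    )

def renice(pid, nice_value):
    """Set the nice value of a running process, like renice -n VALUE -p PID."""
    os.setpriority(os.PRIO_PROCESS, pid, nice_value)
    return os.getpriority(os.PRIO_PROCESS, pid)

if __name__ == "__main__":
    show = [sys.executable, "-c",
            "import os; print(os.getpriority(os.PRIO_PROCESS, 0))"]
    print("child nice value:", run_with_nice(show, 10).stdout.strip())
```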

Special Process States: Orphan and Zombie Processes

Orphan Process

Definition: A process whose parent has exited before it. The kernel re-parents it, normally to PID 1 (systemd or init).

Example code (Python):

import os
import time

def child_process():
    print(f"Child: PID={os.getpid()}, PPID={os.getppid()}")
    time.sleep(3)  # Wait for the parent to exit
    print(f"Child after parent exit: PID={os.getpid()}, PPID={os.getppid()}")

if __name__ == "__main__":
    pid = os.fork()
    if pid > 0:
        # Parent process
        print(f"Parent: PID={os.getpid()}, Child PID={pid}", flush=True)
        time.sleep(1)  # Let the child print its original PPID first
        os._exit(0)    # Parent exits without waiting for the child
    else:
        # Child process
        child_process()

Example output:

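Parent: PID=1234, Child PID=1235
Child: PID=1235, PPID=1234
Child after parent exit: PID=1235, PPID=1  # PPID becomes 1 (adopted by systemd)

Are orphan processes harmful? Usually not. The adopting process reaps them when they exit, so they don't turn into zombies. On some systems the adopter is a "subreaper" such as a per-user systemd instance, in which case the new PPID is that process's PID rather than 1.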

Zombie Process

Definition: Process has exited, but parent hasn't called wait() to reap its exit status, causing process info to remain in process table.

Characteristics:

  • Doesn't occupy CPU or memory (already exited)
  • But occupies process table entry (too many can exhaust system process table)
  • State shows as Z (Zombie)

View zombie processes:

ps -eo pid,ppid,stat,cmd | awk 'NR==1 || $3 ~ /^Z/'  # Header plus every process whose state starts with Z

Example code (Python):

import os
import time

if __name__ == "__main__":
    pid = os.fork()
    if pid > 0:
        # Parent process
        print(f"Parent: PID={os.getpid()}, Child PID={pid} (will become zombie)")
        time.sleep(15)  # Child has exited but is not reaped yet: run ps now to see state Z
        os.wait()       # Reap the zombie
        print("Zombie child has been reaped.")
    else:
        # Child process
        print(f"Child: PID={os.getpid()}, PPID={os.getppid()}", flush=True)
        os._exit(0)  # Child exits immediately and stays a zombie until the parent calls wait()

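How to resolve zombie processes? A zombie is already dead, so kill -9 on the zombie itself does nothing; the fix has to go through its parent:

  1. Make the parent call wait() (if the parent is your program, fix the code)
  2. Kill the parent process (once it exits, the zombie is adopted by systemd and reaped)
  3. Reboot the system (last resort)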
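
For the "fix the code" option, the usual pattern is a non-blocking waitpid() loop that the parent runs periodically or from a SIGCHLD handler. The following is a minimal sketch, assuming Linux with /proc mounted; proc_state and reap_finished_children are names invented for this example.

```python
import os
import time

def proc_state(pid):
    """Return the one-letter state from /proc/<pid>/stat, or None if the PID is gone."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    return data.rsplit(")", 1)[1].split()[0]

def reap_finished_children():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break              # no children left at all
        if pid == 0:
            break              # children exist, but none has exited yet
        reaped.append(pid)
    return reaped

if __name__ == "__main__":
    child = os.fork()
    if child == 0:
        os._exit(0)            # child exits at once
    time.sleep(0.5)            # give the child time to exit
    print("state before reaping:", proc_state(child))
    print("reaped:", reap_finished_children())
    print("state after reaping:", proc_state(child))
```

Because of WNOHANG, the parent can call reap_finished_children() from its main loop without ever stalling on a child that is still running.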


Hands-On: Complete Performance Troubleshooting Workflow

Scenario: System Slowed Down, How to Troubleshoot

1. Check Overall Load

uptime  # Check load average
top # Real-time view of CPU, memory, processes

Diagnosis:

  • load average high? Possibly CPU maxed or I/O slow
  • CPU idle low? CPU bottleneck
  • CPU wa high? Disk I/O slow
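
These three checks can be written down as a small decision helper. This is only a sketch of the reasoning above: first_pass_diagnosis is a hypothetical function, and the thresholds (20% idle, 10% iowait) are illustrative assumptions that should be tuned for your own systems.

```python
def first_pass_diagnosis(load1, cores, idle_pct, iowait_pct,
                         busy_idle_pct=20.0, high_iowait_pct=10.0):
    """Map load, CPU idle, and iowait readings to a list of likely causes."""
    hints = []
    overloaded = load1 > cores
    cpu_busy = idle_pct < busy_idle_pct
    io_bound = iowait_pct > high_iowait_pct
    if overloaded:
        hints.append("load exceeds core count: tasks are queuing")
    if cpu_busy:
        hints.append("CPU idle is low: likely CPU-bound")
    if io_bound:
        hints.append("iowait is high: likely waiting on disk I/O")
    if overloaded and not cpu_busy and not io_bound:
        hints.append("high load with idle CPU: look for tasks in D state")
    return hints or ["no obvious CPU or I/O bottleneck"]

if __name__ == "__main__":
    # Values taken from the sample top output earlier in this article,
    # on a machine assumed to have 4 cores.
    for hint in first_pass_diagnosis(load1=1.23, cores=4, idle_pct=92.3, iowait_pct=0.3):
        print(hint)
```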

2. Find Resource-Consuming Processes

top  # Press P to sort by CPU, M to sort by memory
ps aux --sort=-%cpu | head -11  # Header line plus top 10 CPU-consuming processes
ps aux --sort=-%mem | head -11  # Header line plus top 10 memory-consuming processes

3. View Process Details

lsof -p <PID>  # View files opened by process
ls -l /proc/<PID>/fd/ # View process file descriptors
cat /proc/<PID>/status # View detailed process status

4. Check Disk I/O

iostat -x 1  # View disk I/O
sudo iotop -o # See which process is reading/writing disk

5. Check Network Connections

ss -tulnp  # View listening ports
lsof -i # View all network connections

6. Optimize or Terminate Process

renice -n 10 -p <PID>  # Lower process priority
kill <PID> # Gentle terminate
kill -9 <PID> # Force terminate

Real Case: What to Do When Nginx Log File Is Accidentally Deleted

Scenario

An ops engineer accidentally ran rm -rf /var/log/nginx/access.log while the nginx process was still running.

Problem

Although the file was deleted, the nginx process still holds an open file handle to it (visible under /proc/<pid>/fd/) and keeps writing to the "deleted" file. At this point:

  • df -h shows disk usage hasn't decreased (because file still occupies space)
  • ls /var/log/nginx/ doesn't show access.log (because directory entry was deleted)

Solution

1. Find nginx Process PID

pidof nginx  # Or ps aux | grep nginx

pidof prints the master and worker PIDs. Suppose the master process PID is 1234.

2. View Files Opened by Process

lsof -p 1234 | grep access.log

Example output:

nginx  1234 root  6w  REG  8,1  123456789  /var/log/nginx/access.log (deleted)

  • 6w: File descriptor is 6, mode is w (write)
  • (deleted): File was deleted but process still holds handle

3. Recover File

sudo cp /proc/1234/fd/6 /var/log/nginx/access.log

4. Reload nginx

sudo nginx -s reload  # Make nginx reopen the log file (nginx -s reopen also works and only reopens logs)

Principle:

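  • Deleting a file only removes its directory entry (the filename); the inode and data blocks remain while a process still has the file open
  • Via /proc/<pid>/fd/<fd> you can access files a process has open (even deleted ones)
  • After copying the file back with cp, reloading nginx makes it reopen the log file by path; lines written between the cp and the reload went to the old inode and are not in the restored copy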
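
The principle can be tried safely on a throwaway file before you need it in production. In the sketch below the Python process itself plays the role of nginx: it opens a file, the file is deleted, and the content is copied back through /proc. It assumes Linux; recover_deleted is a hypothetical helper name and the temporary paths are created just for the demonstration.

```python
import os
import shutil
import tempfile

def recover_deleted(pid, fd, destination):
    """Copy the contents behind /proc/<pid>/fd/<fd> to destination."""
    with open(f"/proc/{pid}/fd/{fd}", "rb") as src, open(destination, "wb") as dst:
        shutil.copyfileobj(src, dst)

# Set up a stand-in "log file" that this process keeps open.
workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "access.log")
log = open(log_path, "w")
log.write("GET / 200\n")
log.flush()

os.unlink(log_path)                    # the "accidental rm"
gone = not os.path.exists(log_path)    # directory entry is gone, data is not

recover_deleted(os.getpid(), log.fileno(), log_path)
with open(log_path) as f:
    recovered = f.read()

log.close()
shutil.rmtree(workdir)                 # clean up the demonstration files
print("deleted:", gone, "recovered:", repr(recovered))
```

Once the last process holding the file closes it, the data is really gone, so recovery has to happen while the process is still running.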

Summary and Further Reading

This article covers the core content of Linux process and resource management:

  1. ✅ Basic concepts of processes and programs (process vs program vs thread, parent-child relationships)
  2. ✅ Linux resource management overview (CPU/memory/disk/network)
  3. ✅ Buffer and Cache explained (why "out of memory" is often a misjudgment)
  4. ✅ Process monitoring toolchain (top/htop/ps/pstree/lsof/ss/iostat)
  5. ✅ Process control (kill signals, background tasks, priority adjustment)
  6. ✅ Special process states (orphan processes, zombie processes)
  7. ✅ Real cases (performance troubleshooting workflow, Nginx log recovery)

Further Reading:

  • Linux Performance (Brendan Gregg): http://www.brendangregg.com/linuxperf.html
  • man proc: View detailed explanation of /proc filesystem
  • man 7 signal: View explanation of all signals

Next Steps:

  • "Linux Disk Management": Learn partitioning, formatting, mounting, LVM, RAID, etc.
  • "Linux User Management": Learn how to manage users/groups/permissions

By this point, you should have upgraded from "can view top" to "can quickly locate resource bottlenecks, can optimize process priorities, can handle abnormal process states." Process and resource management is a core Linux ops skill; mastering it allows you to better troubleshoot performance issues.

  • Post title: Linux Process and Resource Management: Monitoring, Troubleshooting, and Optimization
  • Post author: Chen Kai
  • Create time: 2023-01-04 00:00:00
  • Post link: https://www.chenk.top/en/linux-process-resource-management/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stating additionally.