Introduction: Why Performance Troubleshooting Matters
Slow servers cost money. Whether it’s a web application responding sluggishly, a database struggling under load, or a batch job taking hours longer than expected, performance issues directly impact user experience and business outcomes.
This guide teaches you systematic Linux performance troubleshooting using both classic tools and modern observability platforms. By the end, you’ll be able to quickly identify and resolve bottlenecks in any Linux system.
What you’ll learn:
- Systematic troubleshooting methodology
- CPU, memory, disk I/O, and network diagnostics
- Essential tools: htop, iostat, perf, eBPF
- Real-world case studies and solutions
Time required: 20 minutes to read, a lifetime to master
Skill level: Intermediate
The Troubleshooting Methodology
Before diving into tools, understand the systematic approach:
USE Method (Utilization, Saturation, Errors)
For every resource, check:
- Utilization: How busy is the resource?
- Saturation: Is there a queue of waiting work?
- Errors: Are there hardware or software errors?
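As a sketch, the three USE questions for the CPU resource can be answered directly from /proc on any Linux system (the 1-second sampling window is arbitrary, and the EDAC error-counter path is an assumption that only exists on hardware exposing memory-error counters):

```shell
#!/bin/sh
# Sketch: USE-method snapshot for the CPU resource, using only /proc and /sys.

# Utilization: busy fraction of CPU time over a 1-second window.
read -r cpu u1 n1 s1 i1 rest < /proc/stat
sleep 1
read -r cpu u2 n2 s2 i2 rest < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
total=$(( busy + (i2 - i1) ))
echo "utilization: $(( 100 * busy / total ))%"

# Saturation: 1-minute load average vs. number of CPUs.
read -r load rest < /proc/loadavg
echo "saturation: load ${load} on $(nproc) CPU(s)"

# Errors: corrected-memory-error counters, if the platform exposes EDAC.
grep -H . /sys/devices/system/edac/mc/mc*/ce_count 2>/dev/null \
    || echo "errors: no EDAC counters exposed"
```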
Top-Down Approach
1. System level: Is the whole system slow?
2. Subsystem level: CPU, memory, disk, or network?
3. Process level: Which process is the culprit?
4. Code level: What's the specific bottleneck?
Quick Health Check
Start with this 30-second assessment:
# System overview
uptime # Load averages
free -h # Memory usage
df -h # Disk space
iostat -x 1 1 # Disk I/O
vmstat 1 1 # Overall system stats
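The same 30-second assessment can be wrapped in one small script. As a sketch (the health-check.sh name is illustrative), the sysstat tools are guarded with command -v because they are not installed by default on every distribution:

```shell
#!/bin/sh
# health-check.sh - 30-second system overview (script name is illustrative).
# Core utilities first; sysstat tools only if present.
uptime                    # load averages
free -h                   # memory usage
df -h /                   # root filesystem space
if command -v iostat >/dev/null 2>&1; then
    iostat -x 1 1         # disk I/O (needs the sysstat package)
fi
if command -v vmstat >/dev/null 2>&1; then
    vmstat 1 1            # overall system stats
fi
```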
CPU Performance Troubleshooting
Understanding CPU Metrics
| Metric | What It Means | Good Value |
|---|---|---|
| us (user) | Application CPU usage | < 70% |
| sy (system) | Kernel CPU usage | < 30% |
| wa (iowait) | Waiting for I/O | < 10% |
| id (idle) | Unused CPU | > 20% |
| load average | Processes waiting for CPU | < CPU cores |
Essential CPU Tools
htop - Interactive Process Viewer
# Install
sudo apt install htop # Debian/Ubuntu
sudo yum install htop # RHEL/CentOS
# Run
htop
# Key shortcuts:
# P  - Sort by CPU
# M  - Sort by memory
# F4 - Filter processes
# k  - Kill process
# h  - Show help
Interpreting htop output:
CPU[|||||||||| 75%] Tasks: 245, 1256 thr; 4 running
Mem[||||||||||||||| 12.4G/16G] Load average: 2.45 1.89 1.67
Advanced CPU Diagnostics
perf - Linux Performance Counters
# Install
sudo apt install linux-tools-$(uname -r)
# Record CPU samples for 10 seconds
sudo perf record -g -a sleep 10
# View results
sudo perf report
# Top functions by CPU time
sudo perf top
pidstat - Per-process Statistics
# Install (sysstat package)
sudo apt install sysstat
# Monitor CPU every 1 second
pidstat -u 1
# Monitor specific PID
pidstat -p 1234 -u 1
# Show threads
pidstat -t -p 1234 1
Case Study: High CPU Usage
Scenario: Server running at 100% CPU
Diagnosis:
# 1. Check load average
uptime
# Output: load average: 8.45, 6.21, 4.89
# 2. Identify top CPU consumers
top -bn1 | head -20
# 3. Deep dive into process
pidstat -p 1234 -u 1
# 4. Profile with perf
sudo perf top -p 1234
Solution:
# If it's a run-away process
renice +10 -p 1234 # Lower priority
# or
kill -15 1234 # Graceful termination
kill -9 1234 # Force kill (last resort)
# If it's a service
sudo systemctl restart service-name
Memory Performance Troubleshooting
Understanding Memory Metrics
free -h
# total used free shared buff/cache available
# Mem: 15Gi 8.2Gi 4.1Gi 256Mi 3.2Gi 6.9Gi
# Swap: 4.0Gi 0B 4.0Gi
Key columns:
- used: Actual memory in use
- free: Completely unused memory
- buff/cache: File system cache (reclaimable)
- available: Memory available for new applications
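The difference between free and available can be read straight out of /proc/meminfo, where the kernel publishes MemAvailable as its own estimate of memory obtainable without swapping; a minimal sketch:

```shell
#!/bin/sh
# Compare MemFree with MemAvailable: "available" includes reclaimable page
# cache, so it is usually much larger than "free" on a busy system.
awk '/^MemTotal:|^MemFree:|^MemAvailable:/ {
    printf "%-14s %8.2f GiB\n", $1, $2 / 1048576
}' /proc/meminfo
```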
Memory Troubleshooting Tools
vmstat - Virtual Memory Statistics
# Run every 1 second
vmstat 1
# Output interpretation:
# procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
# r b swpd free buff cache si so bi bo in cs us sy id wa st
# 2 0 0 4123456 256789 3456789 0 0 12 45 234 567 25 5 68 2 0
#
# r: Processes waiting for CPU (high = CPU bottleneck)
# b: Processes in uninterruptible sleep (high = I/O bottleneck)
# si/so: Swap in/out (non-zero = memory pressure)
# bi/bo: Blocks read/written
# wa: I/O wait percentage
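The si/so signal can also be sampled without vmstat by diffing the kernel's swap counters in /proc/vmstat; a minimal sketch (the 1-second window is arbitrary):

```shell
#!/bin/sh
# Detect swap activity by sampling /proc/vmstat counters (units are pages).
s1_in=$(awk '/^pswpin/ {print $2}' /proc/vmstat)
s1_out=$(awk '/^pswpout/ {print $2}' /proc/vmstat)
sleep 1
s2_in=$(awk '/^pswpin/ {print $2}' /proc/vmstat)
s2_out=$(awk '/^pswpout/ {print $2}' /proc/vmstat)
echo "swap-in:  $(( s2_in - s1_in )) pages/s"
echo "swap-out: $(( s2_out - s1_out )) pages/s"
# Any sustained non-zero rate here means the system is under memory pressure.
```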
Identify Memory Hogs
# Top processes by memory
ps aux --sort=-%mem | head -20
# Detailed memory info per process (resident set size in kB, then name)
for file in /proc/[0-9]*/status ; do
  awk '/^Name/{name=$2} /^VmRSS/{print $2, name}' "$file" 2>/dev/null
done | sort -rn | head -20
Monitor Memory in Real-time
# Watch memory changes
watch -n 1 'free -h'
# Detailed monitoring
watch -n 1 'cat /proc/meminfo | head -20'
Understanding Swap and OOM
Check swap usage:
swapon --show
cat /proc/sys/vm/swappiness # Default: 60
OOM (Out of Memory) Killer:
# Check OOM events
dmesg | grep -i "out of memory"
journalctl -k | grep -i "oom"
# View OOM score
cat /proc/1234/oom_score
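The per-process scores can be ranked to see which process the OOM killer would target first; a sketch using only /proc (the highest score is the first victim):

```shell
#!/bin/sh
# Rank processes by oom_score: columns are score, PID, command name.
for pid_dir in /proc/[0-9]*; do
    score=$(cat "$pid_dir/oom_score" 2>/dev/null) || continue
    name=$(cat "$pid_dir/comm" 2>/dev/null) || continue
    printf '%6s %6s %s\n' "$score" "${pid_dir#/proc/}" "$name"
done | sort -rn | head -10
```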
Case Study: Memory Leak
Scenario: Application slowly consuming all memory
Diagnosis:
# 1. Track memory over time
vmstat 5 12
# 2. Identify growing process
ps aux --sort=-%mem | head -10
# 3. Monitor specific process
watch -n 5 'cat /proc/1234/status | grep -E "VmSize|VmRSS|VmSwap"'
# 4. Check for memory fragmentation
cat /proc/buddyinfo
Solution:
# Short term: Restart service
sudo systemctl restart application
# Long term: Set memory limits
# In systemd service file:
[Service]
MemoryMax=2G    # MemoryLimit= is the older cgroup v1 name
# Or as a transient cgroup scope
sudo systemd-run --scope -p MemoryMax=2G /path/to/application
Disk I/O Performance Troubleshooting
Understanding I/O Metrics
iostat output:
Device r/s w/s rkB/s wkB/s rrqm/s wrqm/s %util await r_await w_await
sda 45.2 23.1 2345.6 1234.5 2.3 15.6 78.5 12.3 8.5 18.2
| Metric | What It Means | Good Value |
|---|---|---|
| %util | Device utilization | < 80% |
| await | Average I/O wait time (ms) | < 10ms |
| r_await | Read wait time | < 5ms |
| w_await | Write wait time | < 10ms |
| rkB/s, wkB/s | Throughput | Depends on workload |
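When iostat is unavailable, %util can be approximated from /proc/diskstats. As a sketch under the assumption that field 13 (milliseconds spent doing I/O) is present, its delta over a roughly 1000 ms window is the utilization percentage; the device-name pattern only matches whole sd/vd/nvme disks:

```shell
#!/bin/sh
# Approximate iostat's %util: field 13 of /proc/diskstats counts ms the
# device spent doing I/O, so its delta over ~1000 ms is utilization in %.
tmp=$(mktemp)
awk '$3 ~ /^(sd[a-z]+|vd[a-z]+|nvme[0-9]+n[0-9]+)$/ {print $3, $13}' \
    /proc/diskstats > "$tmp"
sleep 1
echo "device  %util(approx)"
awk 'NR == FNR { prev[$1] = $2; next }
     $3 in prev { printf "%-7s %d\n", $3, $13 - prev[$3] }' \
    "$tmp" /proc/diskstats
rm -f "$tmp"
```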
Disk I/O Tools
iostat - I/O Statistics
# Install (sysstat package)
sudo apt install sysstat
# Extended statistics every 1 second
iostat -x 1
# For specific device
iostat -x /dev/sda 1
iotop - Real-time I/O Monitoring
# Install
sudo apt install iotop
# Run (requires root)
sudo iotop
# Show only processes doing I/O
sudo iotop -o
# Batch mode for logging
sudo iotop -b -n 10 > iotop.log
pidstat for I/O
# Per-process I/O stats
pidstat -d 1
# Specific process
pidstat -d -p 1234 1
Check Disk Health
# SMART status
sudo smartctl -a /dev/sda
# Quick health check
sudo smartctl -H /dev/sda
Case Study: High I/O Wait
Scenario: System sluggish, high iowait
Diagnosis:
# 1. Check overall I/O wait
vmstat 1
# Look at 'wa' column (> 20% indicates problem)
# 2. Identify I/O heavy processes
sudo iotop -o
# 3. Check disk utilization
iostat -x 1
# 4. Find processes with open files
sudo lsof +D /path
Common Solutions:
# 1. Reduce swappiness (if swapping)
sudo sysctl vm.swappiness=10
# 2. Increase read-ahead
sudo blockdev --setra 4096 /dev/sda
# 3. Schedule I/O intensive tasks for off-peak
sudo nice -n 19 backup-script.sh
# 4. Use ionice for process priority
sudo ionice -c2 -n7 -p 1234
Network Performance Troubleshooting
Network Metrics to Monitor
| Metric | Tool | Good Value |
|---|---|---|
| Bandwidth usage | iftop, nethogs | < 80% of capacity |
| Packet loss | ping, mtr | < 0.1% |
| Latency | ping, mtr | < 50ms (local), < 200ms (global) |
| Connections | ss, netstat | Depends on application |
| Retransmissions | netstat -s | < 1% |
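The retransmission ratio in the table above can be computed from the raw counters in /proc/net/snmp rather than eyeballing netstat -s; a minimal sketch:

```shell
#!/bin/sh
# TCP retransmission ratio: RetransSegs / OutSegs from /proc/net/snmp.
# The first Tcp: line holds field names, the second holds the values.
awk '/^Tcp:/ {
    if (seen++ == 0) {
        for (i = 1; i <= NF; i++) col[$i] = i
    } else {
        out = $col["OutSegs"]; ret = $col["RetransSegs"]
        printf "OutSegs=%d RetransSegs=%d (%.3f%%)\n",
               out, ret, out ? 100 * ret / out : 0
    }
}' /proc/net/snmp
```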
Network Diagnostic Tools
ss - Socket Statistics
# All connections
ss -tunap
# Listening ports
ss -tlnp
# Established connections
ss -tn state established
# Connection statistics
ss -s
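When ss is not installed, connection states can still be counted straight from /proc; a sketch that decodes the kernel's hex state codes (the three mapped codes come from the kernel's TCP state table; unmapped codes are shown raw):

```shell
#!/bin/sh
# Count TCP socket states from /proc/net/tcp{,6}. Field 4 ("st") is the
# kernel state code: 01=ESTABLISHED, 06=TIME_WAIT, 0A=LISTEN.
cat /proc/net/tcp /proc/net/tcp6 2>/dev/null | awk '
    BEGIN { map["01"] = "ESTABLISHED"; map["06"] = "TIME_WAIT"; map["0A"] = "LISTEN" }
    $4 ~ /^[0-9A-F][0-9A-F]$/ {
        count[($4 in map) ? map[$4] : "OTHER(" $4 ")"]++
    }
    END { for (s in count) printf "%5d %s\n", count[s], s }' | sort -rn
```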
iftop - Real-time Bandwidth
# Install
sudo apt install iftop
# Run on interface
sudo iftop -i eth0
# Show port numbers
sudo iftop -np
mtr - Network Diagnostics
# Install
sudo apt install mtr
# Run diagnostic
mtr -rwb -c 100 google.com
# Output shows packet loss at each hop
nethogs - Per-process Bandwidth
# Install
sudo apt install nethogs
# Run
sudo nethogs eth0
Case Study: Network Latency
Scenario: Application experiencing timeouts
Diagnosis:
# 1. Test basic connectivity
ping -c 10 target-server.com
# 2. Trace route with statistics
mtr -rwb -c 100 target-server.com
# 3. Check for connection issues
ss -tn state time-wait | wc -l
# 4. Monitor network errors
netstat -i
Solutions:
# Increase TCP buffer sizes
sudo sysctl -w net.ipv4.tcp_rmem="4096 87380 67108864"
sudo sysctl -w net.ipv4.tcp_wmem="4096 65536 67108864"
# Enable TCP BBR congestion control (requires the tcp_bbr kernel module)
sudo modprobe tcp_bbr
sudo sysctl -w net.ipv4.tcp_congestion_control=bbr
# Reduce TIME_WAIT connections
sudo sysctl -w net.ipv4.tcp_tw_reuse=1
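Changes made with sysctl -w are lost on reboot. To persist them, the same keys can go into a drop-in file and be applied with sudo sysctl --system; a sketch (the 90-network-tuning.conf filename is illustrative):

```
# /etc/sysctl.d/90-network-tuning.conf (filename is illustrative)
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
net.ipv4.tcp_congestion_control = bbr
net.ipv4.tcp_tw_reuse = 1
```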
Comprehensive Monitoring Solutions
Prometheus + Grafana Stack
Installation:
# Install Prometheus
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
# Install Node Exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
# Start services
./prometheus --config.file=prometheus.yml &
./node_exporter &
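For the step above to produce data, prometheus.yml needs a scrape job pointing at node_exporter. A minimal sketch, assuming the default ports (9090 for Prometheus, 9100 for node_exporter); the job name is illustrative:

```yaml
# prometheus.yml - minimal config scraping the local node_exporter
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node                      # illustrative job name
    static_configs:
      - targets: ['localhost:9100']     # node_exporter default port
```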
eBPF-Based Tools (Modern Approach)
bcc-tools installation:
# Ubuntu/Debian
sudo apt install bpfcc-tools
# RHEL/CentOS
sudo yum install bcc-tools
Useful eBPF tools (Ubuntu's bpfcc-tools package names them with a -bpfcc suffix; on RHEL they live under /usr/share/bcc/tools/):
# CPU profiling
sudo profile-bpfcc -p 1234
# Disk I/O latency histogram
sudo biolatency-bpfcc
# Per-connection TCP throughput
sudo tcptop-bpfcc
# Trace outstanding memory allocations (suspected leaks)
sudo memleak-bpfcc -p 1234
Performance Troubleshooting Checklist
Use this checklist for systematic diagnosis:
Quick Assessment (5 minutes)
- [ ] uptime - Check load average
- [ ] free -h - Check memory
- [ ] df -h - Check disk space
- [ ] iostat -x 1 1 - Check disk I/O
- [ ] vmstat 1 1 - Overall system health
CPU Investigation
- [ ] htop - Identify top CPU consumers
- [ ] pidstat -u 1 - Per-process CPU stats
- [ ] perf top - Profile CPU usage
Memory Investigation
- [ ] vmstat 1 - Check for swapping
- [ ] ps aux --sort=-%mem - Top memory consumers
- [ ] cat /proc/meminfo - Detailed memory info
Disk I/O Investigation
- [ ] iostat -x 1 - Device utilization
- [ ] iotop - Per-process I/O
- [ ] pidstat -d 1 - Process I/O stats
Network Investigation
- [ ] ss -tunap - Active connections
- [ ] iftop - Bandwidth usage
- [ ] mtr target - Network path analysis
Summary
Linux performance troubleshooting requires:
- Systematic approach: Use USE method and top-down diagnosis
- Right tools: Master htop, vmstat, iostat, perf, and eBPF tools
- Understanding metrics: Know what good looks like
- Practice: Build intuition through experience
Key takeaways:
- Start with quick health checks before deep diving
- Correlate metrics across CPU, memory, disk, and network
- Use modern eBPF tools for deeper insights
- Establish baselines to identify anomalies