Linux System Troubleshooting

From ben.goodacre.name/tech

Jump to: navigation, search

This page shows various commands to ascertain how a Linux system is performing and why it may be running slowly if there are other problems.

Contents

vmstat

Details that amount pages-in and pages-out every x seconds among other information. See Vmstat.

top

Shows continuously-updating information for processes. Processes are sorted in CPU usage descending by default.

Load averages are shown for the last 1, 5 and 15 minutes. Every system is different but as a guide a single CPU system should normally not have a 15 minute load average over 2.

Interactive commands

Command Function
ADisplays 4 windows:
Window 1: Sorted by CPU descending.
Window 2: Sorted by PID descending - shows newest porcesses.
Window 3: Sorted by Memory usage descending.
Window 4: Shows your processes.
SpaceUpdate now
PSort by CPU usage descending
MSort by Memory usage descending
TSort by cumulative time
fShow/Hide fields
oChange column order
1 Show individual CPUs
c Toggle show command-lines

http://linux.die.net/man/1/top

CPU Usage summary fields

us = user mode
sy = system mode
ni = low priority user mode (nice)
id = idle task
wa = I/O waiting - waiting for disk to complete operation usually
hi = servicing hardware interrupts
si = servicing software interrupts
t = steal (time given to other DomU instances)

ps

To show all processes including their command-lines:

ps aux

lsof

View current open files for all processes:

lsof -n
View all open files in /var/log:
lsof +D /var/log

See Also

Open file handling

sar

Sar shows system activity history and is very useful for ascertaining activity when a cron job runs when it cannot be run in daytime hours.

Show CPU activity history:

sar
Show IO history:
sar -b
Show paging history:
sar -B
Show load average and queue history
 sar -q
Show memory usage history:
sar -r

free

To show the current free and in-use physical RAM and swapped with low/high values in MB:

free -ml

fuser

Fuser shows what processes are using a given file, socket or filesystem.

Show processes that have locked files in /var:
sudo fuser -v /var

iostat

Iostat gives statistics on I/O and CPU usage. iostat without any arguments displays transfers per second since boot as 'tps' although a single logical transfer could be any size and can therefore be misleading. The number of blocks read and written since boot and the blocks read and written per second 'since boot' can also be viewed. iostat -x gives more information, but this again is not very useful as it only shows the total since boot.

Iostat is more useful when it is giving repeating output as each output after the first output shows stats 'since the last output' rather than since boot. iostat 1 5 will give 5 outputs with a 1 second delay. The 2nd - 5th output will look different to the first for this reason.

Get extended IO statistics, updating once per second:
iostat 1 -x
The column that will indicate a bottleneck in the I/O are shown below:
Column Definition
avgqu-szThe average size (in sectors) of the requests that were issued to the device.
awaitThe average time spent servicing I/O requests, including the time spent in a queue.
 %utilCPU utilisation of the device when requests were issued.

tcpdump

Tcpdump gives detailed verbose network traffic output. Some examples are shown below, which all refer to the interface eth0. Use ifconfig -a to find all interfaces.

Show HTTP traffic:

tcpdump -i eth0 'tcp port 80'
Show DNS traffic:
tcpdump -i eth1 'udp port 53'

Show DNS traffic, and HTTP traffic on ports and 80, 81 and 82:

tcpdump -i eth0 udp port 53 or tcp \( 80 or 81 or 82 \)
Show traffic on port 80 for 1.1.1.1:
tcpdump -i eth0 dst 1.1.1.1 and tcp port 80

Show traffic on port 80 for 1.1.1.1 with an increased sniff size of 1024 bytes:

tcpdump -i eth0 -s 1024 dst 1.1.1.1 and tcp port 80

Send output to a pcap file

The -w parameter captures the output to a file that can be viewed later.

Capture SSH traffic for 2.2.2.2 and output to a pcap file named ssh.pcap:

tcpdump -i eth0 -w ssh.pcap dst 2.2.2.2 tcp port 22

This output can then be viewed later:

tcpdump -ttttnnr ssh.pcap

See Also

netcat

Netcat can be best described as a replacement for using telnet to truobleshoot specific network connections. In many scenarios telnet is just as useful as netcat but unlike telnet netcat can connect to UDP ports, has no problems interpreting certain characters and a source interface can be specified.

Connect to server:port
netcat -vv server port

See Also

strace

strace shows the systems calls for a process or programme. Try strace echo hello. Strace can either invoke a process and monitor system calls end-to-end or attach itself to a process already running.

Monitor system calls for pid 12345:

strace -p 12345

Ascertain what config files are read on startup

This is useful for ascertaining why a particular config file is not being read from the location it should be in your experience and where it could be reading it from instead. The -e parameter allows the displaying of a particular system call only.

For example the SQL query tool tsql refused to read its configuration on a monitoring server when it worked fine on another. The command below opens tsql and captured all the open system calls:

strace -e open tsql -S testconfig -U test -P test

View the output Line 45 shows that it tries to open the /home/exampleusername/.freetds.conf, finds that it does not exist and then line 46 shows the /usr/local/etc/freetds.conf file being read. The file that I had been editing was /etc/freetds.conf .

This can be expanded to see whether or not a programme has access to a file by viewing the access system call too for a particular filename:

strace -e open,access executable | grep filename.conf

Ascertain what a process is doing

If a particular process is using a lot of CPU or I/O log files along may give no information as to what is causing the problem. The -p parameter described above achieve this.

Ascertain what system calls are being used the most

Statistics on what system calls are being used can be shown with the -c parameter. This can more easily show why a process is using high CPU or I/O by showing which system calls are being used the most.

Show systems calls by the ls / command
[user@host /]$ strace -c ls /usr >/dev/null
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 55.00    0.000055           3        17           read
 45.00    0.000045           2        19           fstat64
  0.00    0.000000           0         1           write
  0.00    0.000000           0        19           open
  0.00    0.000000           0        25           close
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         2         1 access
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         3         3 ioctl
  0.00    0.000000           0         8           munmap
  0.00    0.000000           0         1           uname
  0.00    0.000000           0         6           mprotect
  0.00    0.000000           0         2           readv
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0        33           mmap2
  0.00    0.000000           0         1           stat64
  0.00    0.000000           0         2           getdents64
  0.00    0.000000           0        13           fcntl64
  0.00    0.000000           0         1           futex
  0.00    0.000000           0         1           set_thread_area
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
  0.00    0.000000           0         5           socket
  0.00    0.000000           0         5         4 connect
  0.00    0.000000           0         1           sendmsg
------ ----------- ----------- --------- --------- ----------------
100.00    0.000100                   175         8 total
Instead of capturing statistics in an end-to-end manner a snapshot can be taken strace attaching itself to a currently running process, until Ctrl+C is used. The example below captures statistics for Nagios:
[user@host /]$ ps aux|grep nagios
nagios    1071  0.0  0.3  12988  1616 ?        Ssl   2009  57:35 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
benyg    28856  0.0  0.1   3912   672 pts/6    R+   15:50   0:00 grep nagios
[user@host /]$ sudo strace -c -p 1071
Password:
Process 1071 attached - interrupt to quit
Process 1071 detached
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.61    0.000048          48         1           waitpid
 48.39    0.000045           1        46           gettimeofday
  0.00    0.000000           0         1           restart_syscall
  0.00    0.000000           0        12           write
  0.00    0.000000           0         3           open
  0.00    0.000000           0         3           close
  0.00    0.000000           0        77           time
  0.00    0.000000           0         1           rename
  0.00    0.000000           0         2           umask
  0.00    0.000000           0         2           munmap
  0.00    0.000000           0         1           fchmod
  0.00    0.000000           0         1         1 fsync
  0.00    0.000000           0         1           clone
  0.00    0.000000           0         2           _llseek
  0.00    0.000000           0         2           getdents
  0.00    0.000000           0        32           nanosleep
  0.00    0.000000           0         2           mmap2
  0.00    0.000000           0         1           stat64
  0.00    0.000000           0         3           fstat64
  0.00    0.000000           0         3           fcntl64
------ ----------- ----------- --------- --------- ----------------
100.00    0.000093                   196         1 total

This shows that in the 5 seconds whilst strace was running Nagios spent most of its time waiting, followed by getting the time of day.

Ascertain why a process cannot connect to a server

tcpdump can be used instead although strace will show information on a specific process and thereby avoid most of the 'chatter.' Below is an exanple of curl connecting to ben.goodacre.name without any problems:

  1 [user@host /]$ strace -e poll,select,connect,recvfrom,sendto curl ben.goodacre.name >/dev/null
  2 connect(3, {sa_family=AF_FILE, path="/var/run/setrans/.setrans-unix"...}, 110) = 0
  3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
  4 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
  5 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 28) = 0
  6 poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
  7 poll([{fd=3, events=POLLIN}], 1, 5000)  = 1 ([{fd=3, revents=POLLIN}])
  8 recvfrom(3, "\223\203\201\200\0\1\0\1\0\1\0\0\3ben\10goodacre\4name\0\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, [16]) = 120
  9 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 28) = 0
 10 poll([{fd=3, events=POLLOUT}], 1, 0)    = 1 ([{fd=3, revents=POLLOUT}])
 11 poll([{fd=3, events=POLLIN}], 1, 5000)  = 1 ([{fd=3, revents=POLLIN}])
 12 recvfrom(3, "\f\177\201\200\0\1\0\2\0\r\0\f\3ben\10goodacre\4name\0\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, [16]) = 504
 13 connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("67.223.225.228")}, 16) = -1 EINPROGRESS (Operation now in progress)
 14 poll([{fd=3, events=POLLOUT}], 1, 300000) = 1 ([{fd=3, revents=POLLOUT}])
 15 poll([{fd=3, events=POLLIN}], 1, 1000)  = 1 ([{fd=3, revents=POLLIN}])
 16 poll([{fd=3, events=POLLIN}], 1, 0)     = 1 ([{fd=3, revents=POLLIN}])
 17   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
 18                                  Dload  Upload   Total   Spent    Left  Speed
 19 100   701  100   701    0     0   1609      0 --:--:-- --:--:-- --:--:--     0
Try strace -e poll,select,connect,recvfrom,sendto curl ben.goodacre.name:111 >/dev/null to see what the POLLOUT line looks like when curl cannot connect. Port 111 is not open.

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox