Linux System Troubleshooting
From ben.goodacre.name/tech
This page shows various commands to ascertain how a Linux system is performing and why it may be running slowly if there are other problems.
Contents |
vmstat
Details that amount pages-in and pages-out every x seconds among other information. See Vmstat.
top
Shows continuously-updating information for processes. Processes are sorted in CPU usage descending by default.
Load averages are shown for the last 1, 5 and 15 minutes. Every system is different but as a guide a single CPU system should normally not have a 15 minute load average over 2.
Interactive commands
| Command | Function |
| A | Displays 4 windows: Window 1: Sorted by CPU descending. Window 2: Sorted by PID descending - shows newest porcesses. Window 3: Sorted by Memory usage descending. Window 4: Shows your processes. |
| Space | Update now |
| P | Sort by CPU usage descending |
| M | Sort by Memory usage descending |
| T | Sort by cumulative time |
| f | Show/Hide fields |
| o | Change column order |
| 1 | Show individual CPUs |
| c | Toggle show command-lines |
http://linux.die.net/man/1/top
CPU Usage summary fields
us = user mode
sy = system mode
ni = low priority user mode (nice)
id = idle task
wa = I/O waiting - waiting for disk to complete operation usually
hi = servicing hardware interrupts
si = servicing software interrupts
t = steal (time given to other DomU instances)
ps
To show all processes including their command-lines:
ps aux
lsof
View current open files for all processes:
lsof -nView all open files in /var/log:
lsof +D /var/log
See Also
sar
Sar shows system activity history and is very useful for ascertaining activity when a cron job runs when it cannot be run in daytime hours.Show CPU activity history:
sarShow IO history:
sar -bShow paging history:
sar -BShow load average and queue history
sar -qShow memory usage history:
sar -r
free
To show the current free and in-use physical RAM and swapped with low/high values in MB:
free -ml
fuser
Fuser shows what processes are using a given file, socket or filesystem.
Show processes that have locked files in /var:sudo fuser -v /var
iostat
Iostat gives statistics on I/O and CPU usage. iostat without any arguments displays transfers per second since boot as 'tps' although a single logical transfer could be any size and can therefore be misleading. The number of blocks read and written since boot and the blocks read and written per second 'since boot' can also be viewed. iostat -x gives more information, but this again is not very useful as it only shows the total since boot.
Iostat is more useful when it is giving repeating output as each output after the first output shows stats 'since the last output' rather than since boot. iostat 1 5 will give 5 outputs with a 1 second delay. The 2nd - 5th output will look different to the first for this reason.
Get extended IO statistics, updating once per second:iostat 1 -xThe column that will indicate a bottleneck in the I/O are shown below:
| Column | Definition |
| avgqu-sz | The average size (in sectors) of the requests that were issued to the device. |
| await | The average time spent servicing I/O requests, including the time spent in a queue. |
| %util | CPU utilisation of the device when requests were issued. |
tcpdump
Tcpdump gives detailed verbose network traffic output. Some examples are shown below, which all refer to the interface eth0. Use ifconfig -a to find all interfaces.
Show HTTP traffic:
tcpdump -i eth0 'tcp port 80'Show DNS traffic:
tcpdump -i eth1 'udp port 53'
Show DNS traffic, and HTTP traffic on ports and 80, 81 and 82:
tcpdump -i eth0 udp port 53 or tcp \( 80 or 81 or 82 \)Show traffic on port 80 for 1.1.1.1:
tcpdump -i eth0 dst 1.1.1.1 and tcp port 80
Show traffic on port 80 for 1.1.1.1 with an increased sniff size of 1024 bytes:
tcpdump -i eth0 -s 1024 dst 1.1.1.1 and tcp port 80
Send output to a pcap file
The -w parameter captures the output to a file that can be viewed later.
Capture SSH traffic for 2.2.2.2 and output to a pcap file named ssh.pcap:
tcpdump -i eth0 -w ssh.pcap dst 2.2.2.2 tcp port 22
This output can then be viewed later:
tcpdump -ttttnnr ssh.pcap
See Also
netcat
Netcat can be best described as a replacement for using telnet to truobleshoot specific network connections. In many scenarios telnet is just as useful as netcat but unlike telnet netcat can connect to UDP ports, has no problems interpreting certain characters and a source interface can be specified.
Connect to server:portnetcat -vv server port
See Also
strace
strace shows the systems calls for a process or programme. Try strace echo hello. Strace can either invoke a process and monitor system calls end-to-end or attach itself to a process already running.
Monitor system calls for pid 12345:
strace -p 12345
Ascertain what config files are read on startup
This is useful for ascertaining why a particular config file is not being read from the location it should be in your experience and where it could be reading it from instead. The -e parameter allows the displaying of a particular system call only.
For example the SQL query tool tsql refused to read its configuration on a monitoring server when it worked fine on another. The command below opens tsql and captured all the open system calls:
strace -e open tsql -S testconfig -U test -P test
View the output Line 45 shows that it tries to open the /home/exampleusername/.freetds.conf, finds that it does not exist and then line 46 shows the /usr/local/etc/freetds.conf file being read. The file that I had been editing was /etc/freetds.conf .
This can be expanded to see whether or not a programme has access to a file by viewing the access system call too for a particular filename:
strace -e open,access executable | grep filename.conf
Ascertain what a process is doing
If a particular process is using a lot of CPU or I/O log files along may give no information as to what is causing the problem. The -p parameter described above achieve this.
Ascertain what system calls are being used the most
Statistics on what system calls are being used can be shown with the -c parameter. This can more easily show why a process is using high CPU or I/O by showing which system calls are being used the most.
Show systems calls by the ls / command[user@host /]$ strace -c ls /usr >/dev/null % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 55.00 0.000055 3 17 read 45.00 0.000045 2 19 fstat64 0.00 0.000000 0 1 write 0.00 0.000000 0 19 open 0.00 0.000000 0 25 close 0.00 0.000000 0 1 execve 0.00 0.000000 0 2 1 access 0.00 0.000000 0 3 brk 0.00 0.000000 0 3 3 ioctl 0.00 0.000000 0 8 munmap 0.00 0.000000 0 1 uname 0.00 0.000000 0 6 mprotect 0.00 0.000000 0 2 readv 0.00 0.000000 0 2 rt_sigaction 0.00 0.000000 0 1 rt_sigprocmask 0.00 0.000000 0 1 getrlimit 0.00 0.000000 0 33 mmap2 0.00 0.000000 0 1 stat64 0.00 0.000000 0 2 getdents64 0.00 0.000000 0 13 fcntl64 0.00 0.000000 0 1 futex 0.00 0.000000 0 1 set_thread_area 0.00 0.000000 0 1 set_tid_address 0.00 0.000000 0 1 set_robust_list 0.00 0.000000 0 5 socket 0.00 0.000000 0 5 4 connect 0.00 0.000000 0 1 sendmsg ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000100 175 8 totalInstead of capturing statistics in an end-to-end manner a snapshot can be taken strace attaching itself to a currently running process, until Ctrl+C is used. The example below captures statistics for Nagios:
[user@host /]$ ps aux|grep nagios nagios 1071 0.0 0.3 12988 1616 ? Ssl 2009 57:35 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg benyg 28856 0.0 0.1 3912 672 pts/6 R+ 15:50 0:00 grep nagios [user@host /]$ sudo strace -c -p 1071 Password: Process 1071 attached - interrupt to quit Process 1071 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 51.61 0.000048 48 1 waitpid 48.39 0.000045 1 46 gettimeofday 0.00 0.000000 0 1 restart_syscall 0.00 0.000000 0 12 write 0.00 0.000000 0 3 open 0.00 0.000000 0 3 close 0.00 0.000000 0 77 time 0.00 0.000000 0 1 rename 0.00 0.000000 0 2 umask 0.00 0.000000 0 2 munmap 0.00 0.000000 0 1 fchmod 0.00 0.000000 0 1 1 fsync 0.00 0.000000 0 1 clone 0.00 0.000000 0 2 _llseek 0.00 0.000000 0 2 getdents 0.00 0.000000 0 32 nanosleep 0.00 0.000000 0 2 mmap2 0.00 0.000000 0 1 stat64 0.00 0.000000 0 3 fstat64 0.00 0.000000 0 3 fcntl64 ------ ----------- ----------- --------- --------- ---------------- 100.00 0.000093 196 1 total
This shows that in the 5 seconds whilst strace was running Nagios spent most of its time waiting, followed by getting the time of day.
Ascertain why a process cannot connect to a server
tcpdump can be used instead although strace will show information on a specific process and thereby avoid most of the 'chatter.' Below is an exanple of curl connecting to ben.goodacre.name without any problems:
1 [user@host /]$ strace -e poll,select,connect,recvfrom,sendto curl ben.goodacre.name >/dev/null
2 connect(3, {sa_family=AF_FILE, path="/var/run/setrans/.setrans-unix"...}, 110) = 0
3 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
4 connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"...}, 110) = -1 ENOENT (No such file or directory)
5 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 28) = 0
6 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
7 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
8 recvfrom(3, "\223\203\201\200\0\1\0\1\0\1\0\0\3ben\10goodacre\4name\0\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, [16]) = 120
9 connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, 28) = 0
10 poll([{fd=3, events=POLLOUT}], 1, 0) = 1 ([{fd=3, revents=POLLOUT}])
11 poll([{fd=3, events=POLLIN}], 1, 5000) = 1 ([{fd=3, revents=POLLIN}])
12 recvfrom(3, "\f\177\201\200\0\1\0\2\0\r\0\f\3ben\10goodacre\4name\0\0"..., 1024, 0, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("127.0.0.1")}, [16]) = 504
13 connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("67.223.225.228")}, 16) = -1 EINPROGRESS (Operation now in progress)
14 poll([{fd=3, events=POLLOUT}], 1, 300000) = 1 ([{fd=3, revents=POLLOUT}])
15 poll([{fd=3, events=POLLIN}], 1, 1000) = 1 ([{fd=3, revents=POLLIN}])
16 poll([{fd=3, events=POLLIN}], 1, 0) = 1 ([{fd=3, revents=POLLIN}])
17 % Total % Received % Xferd Average Speed Time Time Time Current
18 Dload Upload Total Spent Left Speed
19 100 701 100 701 0 0 1609 0 --:--:-- --:--:-- --:--:-- 0
- Lines 3 and 4 show curl connecting to the NSCD - Name Service Cache Daemon - which is used for LDAP lookups among other things. *The connect fails so it moves onto DNS on lines 5 and 9 where it connects to port 53 which is shown by sin_port=htons(53). It connects to localhost - sin_addr=inet_addr("127.0.0.1") as a DNS server listens to localhost and is configured in /etc/resolv.conf.
- The DNS connection is done twice as ben.goodacre.name is a CNAME, so a second DNS lookup is required. Line 8 is where the CNAME is received and line 12 is where the IP address is returned.
- The curl client is seen connecting to the IP 67.223.225.228 of this website on line 13. EINPROGRESS shows the connection has been attempted but not that it has succeeded.
- The complete POLLOUT line shows a sucessful connect, on line 14.