Back when I was a professional sysadmin (now I just do it for fun), I came up with a few simple tests to perform on misbehaving hosts. These tests are obvious and easy to check, but they're worth remembering because too often we're tempted to look for complex solutions to problems that, initially, look complex. It's humbling how often what looks like a complex software issue really isn't complex at all.
So when things go wrong, before reverting the last change, before breaking out gdb and strace, and before tweaking your software on your production host, spend 5 minutes running through these quick, simple tests - there's a high likelihood that you'll solve your problem quickly. (I'm sure there are other tests you can do - these are the ones burnt into my mind.)
#1 - Disk space
Don't laugh - running out of disk space can cause you pain in so many ways it'll make your head spin. Check all partitions, including /tmp, /var/tmp and /var. Running out of space in /tmp means applications can't write temporary files which, depending on the app, may make them behave very strangely. /var is used for many things, including logging in /var/log. Not being able to log will make some software cry like a baby: it may crash and you'll have no idea why, because the reason certainly won't be in the log file. Databases like MySQL don't like having no room to write in /var/lib/mysql, so don't be surprised if you get some database corruption. With MySQL, you may still be able to start the database and even connect to it with the mysql client, which leads you to look elsewhere - but checking disk space takes only seconds.
    dkam@vihko:~$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1       9.4G  7.1G  1.8G  80% /
    udev             10M  116K  9.9M   2% /dev
    shm             128M     0  128M   0% /dev/shm
Don't forget to check inodes too - running out of inodes can cause the same issues as running out of disk space but is less obvious. Checking for it is just as easy though:
    dkam@vihko:~$ df -ih
    Filesystem     Inodes IUsed IFree IUse% Mounted on
    /dev/sda1        1.2M  445K  772K   37% /
    udev              32K  1.1K   31K    4% /dev
    shm               32K     1   32K    1% /dev/shm
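If you'd rather not eyeball the columns, here is a minimal sketch that flags anything at or above a threshold. The function name flag_full is my own invention, and it assumes the six-column layout that GNU df prints with -P (and with -Pi for inodes). Mount points containing spaces will confuse it.

```shell
# flag_full LIMIT: read `df -P` style output on stdin and print the mount
# point and usage of every filesystem at or above LIMIT percent.
# Assumes the usage percentage is column 5 and the mount point is column 6.
flag_full() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # "95%" -> "95"; "-" counts as 0
            if (use + 0 >= limit + 0)
                print $6, $5
        }'
}
```

Run it as `df -P | flag_full 90` for disk space and `df -Pi | flag_full 90` for inodes. No output means nothing is over the limit.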
#2 - DNS resolution
DNS resolution problems can cause your system and applications to hang or time out in very strange ways.
Some applications log the name of inbound network connections by performing reverse lookups. If no name server is available, these connections may start to take a long time to establish, as the software waits for the resolver to time out. If only the first listed name server has failed, this timeout may be variable in length, but is probably around 15 seconds. If you see weird lags or delays, check your name servers. This can happen when you're trying to ssh into the host - if you're getting delays connecting via ssh, check DNS. Likewise, if your software makes connections to external databases, for example, and is configured to address them by name, you'll see these timeouts.
This one can be tricky because some software will cache the name resolution and some local resolvers may cache - meaning you'll see delays or timeouts sometimes, but not consistently.
Name lookups should take under a second, preferably in the low hundreds of milliseconds.
    dkam@vihko:~$ time host www.apple.com
    www.apple.com is an alias for www.apple.com.akadns.net.
    www.apple.com.akadns.net has address 220.127.116.11
    real    0m0.132s
    user    0m0.000s
    sys     0m0.000s
    dkam@vihko:~$ time host www.apple.com
    www.apple.com is an alias for www.apple.com.akadns.net.
    www.apple.com.akadns.net has address 18.104.22.168
    real    0m0.011s
    user    0m0.000s
    sys     0m0.000s
You can see that in the second run, the name server had cached the value and returned much faster.
It also pays to check each name server listed in /etc/resolv.conf:
    dig www.google.com @22.214.171.124
Naturally, replace the address after the @ with your name server's IP.
Check "man resolv.conf" for more information.
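Checking every listed name server by hand gets old quickly, so here is a rough sketch that loops over them. The function names are hypothetical, and it assumes dig is installed and that /etc/resolv.conf uses plain "nameserver ADDRESS" lines. The +time and +tries options keep a dead server from stalling the loop for long.

```shell
# list_nameservers [FILE]: print the address from every "nameserver" line.
# FILE defaults to /etc/resolv.conf.
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# check_nameservers [NAME]: query each name server directly and show the
# "Query time" line dig reports, or "no answer" if the query failed.
check_nameservers() {
    name=${1:-www.google.com}
    for ns in $(list_nameservers); do
        printf '%s: ' "$ns"
        dig +time=2 +tries=1 "$name" "@$ns" | grep 'Query time' || echo 'no answer'
    done
}
```

Call `check_nameservers` with no arguments, or pass a name your application actually looks up. A server that prints "no answer" or a query time well above its neighbours is the one to look at.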
#3 - Ulimits
The most common ulimit I've come across is the maximum number of open files, but you may hit others, including max user processes. This one is generally obvious if the software is running as a regular user - when you try to connect as that user you will see error messages about being unable to allocate resources. Editing a file or trying to read a man page will error out if you're at the maximum number of open files. Network connections count as open files too, so you may not be able to open network connections either.
The more likely scenario is that the software is running as a different user - one that people don't log in as. Try logging in as, or su -'ing to, that user. If you can't, or you can but the user can't open files, check the ulimits. In Bash, try "ulimit -a" to view your limits. Different OSs limit these values in different ways - check your OS documentation for details.
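When the software runs as a user you can't easily become, you can ask the kernel about the running process instead. This is a Linux-only sketch that relies on /proc/PID/fd and /proc/PID/limits. The name fd_usage is made up for illustration, and you will generally need to be root to inspect another user's process.

```shell
# fd_usage PID: compare the number of open file descriptors of a process
# with its soft "Max open files" limit (column 4 of /proc/PID/limits).
fd_usage() {
    pid=$1
    used=$(ls "/proc/$pid/fd" | wc -l)
    limit=$(awk '/^Max open files/ { print $4 }' "/proc/$pid/limits")
    echo "$used of $limit open files"
}
```

Try `fd_usage $$` on your own shell first, then point it at the PID of the misbehaving daemon. If the two numbers are close, you've probably found your problem.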
#4 - /dev/random
This one is a little esoteric and pretty unlikely, but /dev/random is used for lots of things. The use you're most likely to have problems with is logging in to software that uses something like [CRAM-MD5](http://en.wikipedia.org/wiki/CRAM-MD5). Random data is used as part of the authentication process, and when there's not enough of it, logging in will be slow or may time out completely. Most software should probably fall back to using /dev/urandom. You can time how long it takes to read a few kB of random data like this (the example reads 10 kB from /dev/urandom; point if= at /dev/random to test the blocking device):
    dkam@vihko:~$ dd if=/dev/urandom of=/dev/null bs=1K count=10
    10+0 records in
    10+0 records out
    10240 bytes (10 kB) copied, 0.002011 s, 5.1 MB/s
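On Linux you can also ask the kernel how much entropy it thinks it has, and put a time limit on a read from the blocking device so a starved pool doesn't hang your terminal. This is only a sketch: the function names are mine, the /proc path is Linux-specific, and timeout comes from GNU coreutils. Newer kernels rarely block on /dev/random at all, so treat this as a check for older systems.

```shell
# entropy_available: print the kernel's current entropy estimate
# (Linux-specific path).
entropy_available() {
    cat /proc/sys/kernel/random/entropy_avail
}

# random_read_ok [SECONDS]: try to read 1 kB from /dev/random, giving up
# after SECONDS (default 5). timeout(1) exits with status 124 if the read
# did not finish in time.
random_read_ok() {
    timeout "${1:-5}" dd if=/dev/random of=/dev/null bs=1 count=1024 2>/dev/null
}
```

For example, `random_read_ok 5 || echo "reads from /dev/random are slow"` will tell you whether logins that depend on it are likely to stall.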
#5 - Permissions
Generally this will only bite you when you've made changes or updated software - check that config files are readable, data directories are readable and writable, and that executables are executable.
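A quick way to run those checks is with the shell's built-in test operators. The helper below is a small sketch with a made-up name. Keep in mind that it answers for whoever runs it, so run it as the user the software runs as. Root will pass the read and write checks regardless of the permission bits.

```shell
# check_access MODE PATH: MODE is r, w or x. Prints "ok" or "FAIL"
# depending on whether the current user has that access to PATH.
check_access() {
    if test "-$1" "$2"; then
        echo "ok   $1 $2"
    else
        echo "FAIL $1 $2"
    fi
}
```

For instance, `check_access r /etc/myapp.conf`, `check_access w /var/lib/myapp` and `check_access x /usr/local/bin/myapp`, where the myapp paths are placeholders for your own.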