Blag o' dkam

p0wning tubez

Archive for the ‘SysAdmin’ Category

On Daemons

without comments

New post on Daemons over on the Booko blog.

Written by dkam

August 7th, 2010 at 10:59 pm

Making Booko work better with Google

without comments

Over on the Booko Blog.

Written by dkam

August 24th, 2009 at 11:57 am

Sysadmin triage

without comments

Back when I was a professional sysadmin (now I just do it for fun) I came up with a few simple tests to perform on misbehaving hosts. They're obvious and easy to check, but they're worth remembering, because too often we reach for complex solutions to problems that merely look complex. It's humbling just how often what looks like a complex software issue really isn't complex at all.

So when things go wrong, before reverting the last change, before breaking out gdb and strace, and before tweaking your software on your production host, spend 5 minutes running through these quick, simple tests – there's a good chance you'll solve your problem on the spot. (I'm sure there are other tests you could run – these are just the ones burnt into my mind.)

#1 – Disk space

Don't laugh – running out of disk space can cause you pain in so many ways it'll spin your head. Check all partitions, including /tmp, /var/tmp and /var. Running out of /tmp means applications won't be able to write temporary files, which, depending on the app, may make it behave very strangely. /var is used for many things, including logging in /var/log – not being able to log will make some software cry like a baby, i.e. it may crash and you'll have no idea why, because it certainly won't be in the log file. Databases like MySQL don't like having no room to write in /var/lib/mysql – don't be surprised if you get some db corruption. With MySQL, you may even be able to start the database and connect to it with the mysql client, leading you to look elsewhere – but checking disk space only takes seconds.

dkam@vihko:~$ df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             9.4G  7.1G  1.8G  80% /
udev                   10M  116K  9.9M   2% /dev
shm                   128M     0  128M   0% /dev/shm

Don't forget to check inodes too – running out of inodes can cause the same issues as running out of disk space but is less obvious. Checking for it is just as easy though:

dkam@vihko:~$ df -ih
Filesystem            Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1               1.2M    445K    772K   37% /
udev                     32K    1.1K     31K    4% /dev
shm                      32K       1     32K    1% /dev/shm

#2 – DNS resolution

DNS resolution problems can cause your system and application to hang or timeout in very strange ways.

Some applications perform reverse lookups so they can log the names of inbound network connections. If no name server is available, these connections may take a long time to establish while the software waits for the resolver to time out. If only the first listed name server has failed, the delay may vary in length, but it's typically around 15 seconds. If you see weird lags or delays, check your name servers. This can happen when you're trying to ssh into the host – if you're getting delays connecting via ssh, check DNS. If your software connects to external databases, for example, and is configured to address them by name, you'll see the same timeouts.

This one can be tricky because some software will cache the name resolution and some local resolvers may cache – meaning you’ll see delays or timeouts sometimes, but not consistently.

Name lookups should be under a second, preferably in the low hundreds of milliseconds.

dkam@vihko:~$ time host www.apple.com
www.apple.com is an alias for www.apple.com.akadns.net.
www.apple.com.akadns.net has address 17.251.200.32

real	0m0.132s
user	0m0.000s
sys	0m0.000s
dkam@vihko:~$ time host www.apple.com
www.apple.com is an alias for www.apple.com.akadns.net.
www.apple.com.akadns.net has address 17.251.200.32

real	0m0.011s
user	0m0.000s
sys	0m0.000s

You can see that in the second run, the name server had cached the value and returned much faster.

It also pays to check each nameserver listed in /etc/resolv.conf:

dig www.google.com @208.78.97.155

Naturally replace 208.78.97.155 with your name server’s IP.
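To check all of them in one hit, a quick shell loop does the job – a sketch only, assuming dig and awk are installed and that your resolvers live in /etc/resolv.conf:

for ns in $(awk '/^nameserver/ {print $2}' /etc/resolv.conf); do
  echo "== $ns =="
  # +time/+tries stop a dead server hanging the loop for too long
  dig +time=2 +tries=1 www.google.com @"$ns" | grep -E 'status:|Query time:'
done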

Check “man resolv.conf” for more information.

#3 – Ulimits

The most common ulimit I've come across is the maximum number of open files, but you may hit others, including the maximum number of user processes. This one is generally obvious if the software runs as a regular user – when you try to connect as that user you'll see error messages about being unable to allocate resources. Editing a file or trying to read a man page will error out if you're at the maximum number of open files. Network connections count towards this limit too – so you may not be able to open network connections either.

The more likely scenario is that the software runs as a different user – one that people don't log in as. Try logging in as, or su -'ing to, that user – if you can't, or you can but the user can't open files, check the ulimits. In Bash, try "ulimit -a" to view your limits. Different OSs limit these values in different ways – check your OS doco for details.
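As a quick sketch – check your own shell's limits with ulimit -a, and on reasonably recent Linux kernels you can peek at a running daemon's limits via /proc (the mysqld lookup below is just a hypothetical example):

# limits for the current shell/user
ulimit -a

# limits for a running process – here, the oldest mysqld
cat /proc/$(pgrep -o mysqld)/limits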

#4 – /dev/random

This one is a little esoteric and pretty unlikely, but /dev/random gets used for lots of things – the most common place you'll hit problems is logging in with software that uses something like CRAM-MD5. Random data is used as part of the authentication process, and when there isn't enough of it, logging in will be slow or may time out completely. Most software should probably fall back to using /dev/urandom. You can time how long it takes to read a small amount of random data like this:

dkam@vihko:~$ dd if=/dev/urandom of=/dev/null bs=1K count=10
10+0 records in
10+0 records out
10240 bytes (10 kB) copied, 0.002011 s, 5.1 MB/s
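Note that /dev/urandom never blocks, so the dd above mostly proves the device is readable and fast – on Linux it's the blocking /dev/random pool that runs dry. A quick sketch for seeing how much entropy is available (a number near zero means reads from /dev/random will stall):

cat /proc/sys/kernel/random/entropy_avail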

#5 – Permissions

Generally this will only bite you when you’ve made changes or updated software – check that config files are readable, data directories are read/writable and that executables are executable.
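A couple of quick checks, as a sketch – the path and the service user here are made up, so substitute your own:

# show owner and permissions for every component of the path
namei -l /etc/myapp/config.yml

# confirm the user the service runs as can actually read the file
sudo -u myapp head -n 1 /etc/myapp/config.yml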

Written by dkam

June 8th, 2009 at 12:33 pm

Posted in Geeky,SysAdmin

USB2 Vs FW400 Vs FW800

without comments

Had access to a 1TB WD drive recently. Comes with USB2, FireWire 400 & FireWire 800. Thought I’d check out the performance of the various connection methods. It had two internal 500GB drives arranged in RAID 0 (striped). I tested it by running:

The movie was 525MB and I did each test 3 times. The very first run was the slowest — presumably the file (or parts of it) was in the disk cache for subsequent runs. Here are the times:

Just to make sure my laptop drive wasn’t affecting the test, I also performed this test (several times for each):

For comparison:
Local SATA drive: 20.3 seconds

My laptop hard drive's results were a bit erratic — peaking up to 27 seconds and dipping down to 20. No doubt due in part to the 21 applications I'm currently running. Stopping iTunes playing helped things ;-)

I was surprised that FireWire 400 was that much faster than USB2 — I'd always assumed they were on par. Anyway, it looks like FW800 is clearly the king for connecting an external HD.
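If you want to run a similar test yourself, the recipe is just a timed copy of a big file onto the external volume, repeated a few times to see the cache effect – a sketch, with a made-up file name and mount point:

time cp big-movie.m4v /Volumes/External/
time cp big-movie.m4v /Volumes/External/copy2.m4v
time cp big-movie.m4v /Volumes/External/copy3.m4v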

Written by dkam

June 25th, 2008 at 5:55 pm

Posted in Geeky,SysAdmin

Migration to Passenger ( mod_rails )

without comments

Ruby On Rails apps are finally easy to install — mod_rails is here. I've just installed it for the Blag (Typo 5.0.1 on Gentoo Unstable) and it looks to be working quite nicely. The only road bump I encountered was getting static content served. Kept getting:

The problem was pretty straightforward — now that Mongrel isn't serving up the static content, I had to make sure that Apache was configured to allow access to the app's public directory, so I added a stanza for it.
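Something along these lines does the trick – a sketch only, using Apache 2.2 syntax and a made-up path, so point it at wherever your app's public directory actually lives:

# /var/www/blag/public is hypothetical – use your app's public directory
<Directory /var/www/blag/public>
    Options FollowSymLinks
    AllowOverride None
    Order allow,deny
    Allow from all
</Directory>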

Naturally it feels snappier. Nice.

Written by dkam

June 22nd, 2008 at 3:49 pm

Posted in Ruby,SysAdmin

Frustrating

without comments

I spent several hours yesterday fighting with RubyGems — I'd even written a vitriolic post about it — but I … did something … and bam, just like that, it was gone. RubyGems is Ruby's version of Perl's CPAN. It has one very annoying trait — its prodigious use of memory. For each gem (a gem is a Ruby module — like rails or hpricot, for example), RubyGems loads the spec into memory in order — I'm guessing, after reading a bunch of forum posts — to build a dependency tree. On a 256MB slice, this pushes you into swap hell. On a 512MB host it would use up to 68% of memory.

So what can you do but rent a bigger Slicehost slice? Moe Szyslak said it best: “I’m choking on my own rage over here!”

Naturally, it’s been fixed. Today.

If only I’d done something more constructive yesterday. Like played COD.

Written by dkam

June 22nd, 2008 at 1:55 pm

SSH Tricks

with one comment

Read this tip here – if you turn on connection sharing in your .ssh/config, ssh will reuse the network connection to any host where you already have an established session – saving a bit of time and reducing the number of network sockets you've got.
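The options involved are OpenSSH's ControlMaster and ControlPath – a minimal sketch (the socket path is only an example):

Host *
    ControlMaster auto
    ControlPath ~/.ssh/master-%r@%h:%p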

Check the original tip for the details. You learn something new every week, eh?

Written by dkam

June 16th, 2008 at 8:56 pm

Posted in Geeky,SysAdmin

I might be too old for Gentoo Unstable. It could be time for Debian. Or Ubuntu. Or Gentoo stable maybe. :-(

with one comment

Perhaps my blood sugar is just too low.

Seeing this type of error in Gentoo?

Apparently touch no longer works in my distro. Can’t build anything.

http://bugs.gentoo.org/show_bug.cgi?id=224483
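A quick way to see whether you're in the same boat is to compare the running kernel with the installed headers package (equery comes with gentoolkit):

# kernel actually running
uname -r

# linux-headers version installed
equery list sys-kernel/linux-headers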

It appears my kernel (2.6.21-gentoo-r3) is out of date compared with my linux-headers package (2.6.25-r3), and this screws up the latest version of touch, which ships with coreutils. The fix is to update your kernel – but touch doesn't work, so first you need to go and get an old version:

Now go update your kernel.

Written by dkam

June 12th, 2008 at 8:16 pm

Posted in Geeky,Rant,SysAdmin

Windows Hatoraide

with 3 comments

Before my current job, my last crack at using MS Windows as a desktop was in 2000. It lasted about 6 months before I decided Linux made a better desktop. Back then I was using Windows 2000. I didn't have any particular gripe with it – I just didn't really get along with it. I found it to be reasonably stable – at least as stable as a Mac of that era, but less stable than Linux. I used Linux until November 2005, when I moved back to the Mac. The change was dramatic: when I left I was running Mac OS 9, and I came back to Mac OS X 10.4 – a modern Unix-based OS which, after years on Linux, felt like a good mix of Unix and lickable buttons.

At my current job, I don't have the benefit of choosing the OS of my desktop, and the imagined red tape involved in connecting my laptop to their network means that, for the first time in a long time, I've got Windows on the desktop. I've been using Windows at work for almost 4 months. Windows XP, SP2.

After all this time, I figured Windows would have improved. Surely, I thought, after almost 8 years of development things must be good – after all, consider the improvements in Linux since Red Hat 6. Leaps and bounds by every possible measure. Mac OS also went from an OS with no preemptive multitasking (it used cooperative multitasking) and no memory protection to a modern Unix with all the associated goodness.

Windows XP, to me, doesn't seem to have improved at all since 2000, aside from that annoying dark blue Crayola “theme”. It might be marginally more stable, although that could just be me treating it with kid gloves.

When my PC started playing up recently, I called in tech support. They asked when I'd last rebooted it. I said, “Oh, a week or two ago?” – everyone looked at me like I was King of the Muppet people. I retorted to the laughter with “Seriously? I thought that was a joke! It's 2008!” They laughed, closed the ticket and told me to reboot. It didn't fix the problem. Next they suggested I'd installed too many applications and that was slowing it down. I'd installed Safari and the related Apple stuff it dragged in (iTunes, QuickTime, Bonjour, Software Update), Firefox, Wireshark and a couple of Jabber clients to test a Jabber server. That's it.

I agreed to uninstall the stuff I wasn't using or didn't really need. The tech support guy opened up the Add/Remove Programs control panel and I deleted a few bits of software I'd been playing with (Safari for Windows). After removing software and rebooting again, the problem persisted. So I figured I'd uninstall some more software – but you need admin rights to open the Add/Remove Programs control panel! So, you can install software, but you can't uninstall it?! Apparently this is all you can do to fix problems. The next step, in the estimation of my Windows support guy, is rebuilding the box from scratch.

Now, I admit, I’m no Windows administrator, but how can you seriously run websites or email servers or anything with a modicum of importance on this stuff? It’s garbage. No wonder MS are so into clustering – rather than fixing problems on servers, you simply rebuild them. Just take it out of the cluster and nuke it.

It doesn't stop there. I've been documenting the more ridiculous aspects of life on Windows. Windows Explorer, one of the most used apps, seems to have stood still. It doesn't keep the files on the right in alphabetical order. Copying or saving a file from an application into a directory simply adds that item to the end of the list, and it seems you have to refresh the folder list to get them back in order. Which is odd, because the list of files & servers constantly flickers – I guess because the attached network shares are updating in the background. If you're going to have an annoying, flickering Explorer, at least it should be updating the file list to keep it correctly sorted. It's kind of like an old fluorescent tube with a bad starter. It seems you can hide files beginning with a dot on the right-hand side, but not folders on the left. Not only that, but Explorer doesn't seem to be able to create a folder or file beginning with a “.”. To cap it off, Explorer has no duplicate function, only copy/cut & paste.

Opening documents in Excel, Word or Visio sometimes opens them in separate windows, sometimes in the same parent window. This is some sort of brain-dead MDI behaviour. I can't figure out how to make individual documents consistently open in their own window. To top it off, in both cases there are two buttons in the task bar. Alt-Tab shows you two Word icons – naturally they're different – one probably represents the “parent”.

Of course, both windows show up in the task bar, and naturally, each window from the same app responds differently to mouse-down / mouse-up. Sometimes you get the window on mouse-down, sometimes you have to wait for mouse-up. Sometimes it works like you expect. Usually when you call someone over to show them.

Say you have Word open, then you open another Word window, then a third. The layering of windows is inconsistent. Click and hold on each of the different windows in the task bar and you'll get different responses – sometimes the window jumps straight to the top, sometimes a different window jumps to the top on mouse-down, only to disappear behind the correct window when you mouse-up.

The number of reboots is funny – like most jokes about Windows, this one is true. Ian just bought a new laptop – the first thing it did, after starting up, was to reboot. And then to reboot. And then one more reboot. Just in case. I think the total was higher than this, but I lost count. You have to reboot when upgrading Acrobat. I try not to use caps lock. Just in case.

Want to look for a file? I tried the inbuilt search function on the file “services” and it initially returned no results. Once I'd opened the C:\WINDOWS\system32\drivers\etc directory – so I could see the services file myself – the next search found it. Naturally this file can live in either C:\WINDOWS\system32\drivers\etc\services or C:\WINNT\system32\drivers\etc\services. For consistency, they've capped it at two locations. Brilliant.

Windows security sucks. You can have your ability to change the desktop image removed – yet you can still set the desktop image with an application such as Paint, which hasn't been locked down. You can install applications, but have your access to “Add/Remove Programs” denied so you can't uninstall them. Security in this sense seems to apply only to the method of doing something, rather than the result. So, instead of locking the desktop image, they lock the ways they know of changing it. Instead of stopping you installing software, they block access to Add/Remove Programs.

I have this fond memory of Visio – the excellent drawing program. It doesn't seem to have changed since I first saw it in 1999. Compared to OmniGraffle, it's rubbish. Lines are never straight – they often have kinks in them because the snap-to function doesn't seem to consider a straight line useful. Nothing seems to be anti-aliased; there are jagged lines all over the shop. Fonts look especially craptacular – but I'll accept that I'm just used to seeing them nicely anti-aliased on a Mac, and that you can “get used to it”. They do look better if you turn on the ClearType stuff (just a simple download, install and reboot for smooth fonts). Ctrl-W doesn't close a window in Visio – it zooms to fill the window with the document. You need good old Alt-F4 or Alt-F-C. Naturally, neither of these shortcuts is documented against the “File->Close” menu option.

It’s this inconsistency that really gets to me. Just when you think you’ve got it sorted, it randomly changes.

I hear tell that Vista may actually be better, but reports are conflicted. For the foreseeable future, Windows for me will be simply a boot loader for Call of Duty 4.

Written by dkam

May 7th, 2008 at 10:35 pm

Operations as a competitive advantage

without comments

I love reading articles like these, mostly because they deal with issues that I see almost every day in my day job. Adding a new server to your deployment should be as simple as doing a base install and then pointing your configuration management system at it. The hard work should be done once, defining services, their configuration and their relationships.
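As a sketch of what defining services, their configuration and their relationships looks like in practice – hypothetical Puppet code, not from any particular site – one class can install a package, manage its config file and keep the service running:

class ntp {
  package { 'ntp':
    ensure => installed,
  }

  file { '/etc/ntp.conf':
    ensure  => file,
    source  => 'puppet:///modules/ntp/ntp.conf',
    require => Package['ntp'],
  }

  service { 'ntpd':    # service name varies by distro
    ensure    => running,
    enable    => true,
    subscribe => File['/etc/ntp.conf'],
  }
}

Point a freshly built host at the puppetmaster, include a class like that, and the box converges on the desired state without anyone logging in.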

The operational efficiencies gained from an automated configuration management system should extend beyond growing your current server farm. The time taken to track down bugs and reproduce problems should fall substantially when you know all your servers have the correct configuration. No more diffing the config across multiple servers to figure out why one behaves differently to another. No more checking software version numbers across hosts because, sometimes, a host gets missed during an upgrade. No more wondering if Apache is supposed to be installed on one of your mail servers.

Once your operational staff are relieved of these tedious tasks, their time can be spent more effectively on improving the service. All those tasks that should be done “one day” – implementing or improving backups, capacity planning, or monitoring and reporting – can finally get some love.

As we move towards virtualised hardware, automating the provisioning, building and management of servers will become ever more critical. Businesses with advanced operational practices will gain a competitive edge over organisations that still manually build, configure and maintain their hosts.

Written by dkam

November 4th, 2007 at 2:38 pm

Posted in Puppet,SysAdmin