Timeouts and failing fast

Sunday, 23. 01. 2011  –  Category: sw, web

“Integration points are the number-one killer of systems” – Release It!, Michael Nygard

Last week I had two different web systems fail in a similar way. One was a single box running two busy WordPress sites, another was a largish multi-tier publishing cluster. Both dropped off air because they didn’t handle the failure of a remote system very well. In particular, both had the webserver waiting on a HTTP request to a foreign site to complete before returning a page to the client.

The architecture issues with this kind of request amplification are reasonably clear, and can sometimes be avoided. Data ingest from other systems can often execute asynchronously in another process. The cluster mentioned has a set of machines dedicated to just this, distributing the data they retrieve from remote systems via database and shared filesystem ready for the webservers to use when building pages.

However, in the situations where it is necessary for a webserver to dial out to another system at request time then it’s worth being really paranoid about how that outbound request works and untrusting of the reply. The particular problem here was lack of timeouts.

Here, both systems were running mpm_prefork Apache. Both were serving a request that made a HTTP call somewhere else before returning the page to the client1. Last week the remote sites that both these systems contact had outages, leaving Apache waiting for the reply up until the request timed out.

However, with no timeout configured that wait is effectively infinite. Long enough for these hanging Apache processes to consume all the available slots of the webserver, resulting an interesting set of observations:

  • New TCP connections to the server just hang.
  • Server load is low, because the load measures runnable processes (and, on Linux, uninterruptible sleeping processes), and these hanging processes are just sleeping, waiting on poll(2) or select(2).
  • No errors are logged by Apache or the application code because they haven’t failed yet.2

Here’s some vmstat during such a wedge:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 140580 56528 58168 516184 0 0 0 0 1026 78 0 0 100 0 0
0 0 140580 56528 58168 516184 0 0 0 0 1032 80 0 0 100 0 0
0 0 140580 56528 58168 516184 0 0 0 0 1031 78 0 0 100 0 0

“It was quiet. Too quiet”. Hook up gdb and you can see what’s going on:

(gdb) bt
#0 0x00002aafffb7d14f in poll () from /lib64/libc.so.6
#1 0x00002ab0090b3930 in Curl_select () from /usr/lib64/libcurl.so.3
#2 0x00002ab0090ac4bc in Curl_perform () from /usr/lib64/libcurl.so.3
#3 0x00002ab00850591b in zif_curl_exec () from /etc/httpd/modules/libphp5.so
#4 0x00002ab0086663b2 in ?? () from /etc/httpd/modules/libphp5.so
#5 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#6 0x00002ab00865dc09 in ?? () from /etc/httpd/modules/libphp5.so
#7 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#8 0x00002ab008688750 in ?? () from /etc/httpd/modules/libphp5.so
#9 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#10 0x00002ab00865dc09 in ?? () from /etc/httpd/modules/libphp5.so
#11 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#12 0x00002ab00865c11e in ?? () from /etc/httpd/modules/libphp5.so
#13 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#14 0x00002ab0086395de in zend_execute_scripts () from /etc/httpd/modules/libphp5.so
#15 0x00002ab0085fe697 in php_execute_script () from /etc/httpd/modules/libphp5.so
#16 0x00002ab0086b6ad6 in ?? () from /etc/httpd/modules/libphp5.so
#17 0x00002aaffdbb7a4a in ap_run_handler ()
#18 0x00002aaffdbbaed8 in ap_invoke_handler ()
#19 0x00002aaffdbc578a in ap_internal_redirect ()
#20 0x00002ab0069ffbc0 in ap_make_dirstr_parent () from /etc/httpd/modules/mod_rewrite.so
#21 0x00002aaffdbb7a4a in ap_run_handler ()
#22 0x00002aaffdbbaed8 in ap_invoke_handler ()
#23 0x00002aaffdbc5938 in ap_process_request ()
#24 0x00002aaffdbc2b70 in ?? ()
#25 0x00002aaffdbbecd2 in ap_run_process_connection ()
#26 0x00002aaffdbc9789 in ?? ()
#27 0x00002aaffdbc9a1a in ?? ()
#28 0x00002aaffdbc9ad0 in ?? ()
#29 0x00002aaffdbca7bb in ap_mpm_run ()
#30 0x00002aaffdba4e48 in main ()

Another kink is that a graceful restart of Apache, via SIGUSR1, doesn’t work since a graceful restart waits for a request to finish – and these ones aren’t finishing. apachectl or service(8) scripts will exit but the hung processes remain. These long running wedged processes are also visible in ps:

apache 580 0.0 0.3 30332 5492 ? S Jan02 0:00
/usr/sbin/httpd

and can be matched up with the hung connections via netstat -anp

tcp 0 0 172.18.74.113:50129 192.0.32.10:80
ESTABLISHED 580/httpd

This is a mostly a long winded way of pointing out the importance of timeouts on code that executes during a request, particularly when doing something non-local. Fail fast!

To keep the WordPress sites running I ran through the code adding CURLOPT_CONNECTTIMEOUT, CURLOPT_TIMEOUT settings to all the cURL calls I could find, and the development team took care of the code running on the cluster. Both have been stable since.

There are plenty of other pitfalls in these integration points between systems. Michael Nygard’s book, linked at top, makes a good survey of them alongside other stability antipatterns in complex systems. It’s a recommended read.

  1. One of the WordPress sites was doing this in every page’s footer. And not caching the result. Better still, it was contacting the other WordPress site on the same server. This almost guarantees the request won’t return when the shared webserver gets busy since there may be no spare server slots to accept the second request. Ouch. []
  2. If you don’t run Apache hot, then you’d see a “I’ve reached MaxClients” message in the error log, but not much else. []

FreeBSD, gettext, libintl and bash

Monday, 25. 10. 2010  –  Category: sw

Login surprise!

Copyright (c) 1980, 1983, 1986, 1988, 1990, 1991, 1993, 1994
        The Regents of the University of California.  All rights reserved.

FreeBSD 8.0-RELEASE-p4 (GENERIC) #0: Mon Jul 12 20:22:27 UTC 2010

/libexec/ld-elf.so.1: Shared object "libintl.so.8" not found, required by "bash"
$

What’s happened here is that gettext (which provides libintl.so) has been upgraded without its dependencies noticing (read: I messed up some package maintenance). These dependencies are linked against a specific version of this library that’s now gone. One of these dependencies is bash, my preferred shell.

Having your shell asplode like that is ungood. If bash has been set as my default shell then I wouldn’t be able to log in at all! This is a risk of relying on software that isn’t part of FreeBSD’s core to absolutely always work across system maintenance. bash is a third-party port, and it’s prudent to anticipate package management fail1.

To side-step this risk I tend to set my shell to something that is native to a FreeBSD release (plain ole’ /bin/sh) and adding the following in my .profile, to test if bash is indeed working before running it:

/usr/local/bin/bash -c true && exec /usr/local/bin/bash

And to fix all the other ports that are now broken:

$ sudo portmaster -r -R devel/gettext

  1. because all package management sucks at some point, yes? []

devopsdays Hamburg 2010

Monday, 18. 10. 2010  –  Category: conference

Recently back from devopsdays in Hamburg. Here’s a quick braindump whilst it’s all still fresh.

People

I went with a bunch of colleagues from the BBC, hooked up with some people I’d not seen for ages, unexpectedly met in real life people I’d worked with online previously and met a whole load of talented and interested engineers.

Talks

The conference was split between set talks and open space sessions. The talk I got the most out of was Stephen Nelson Smith‘s talk on his experience bringing devops-shaped thinking to a large government site refresh. His thoughts about what worked and what didn’t were interesting and well made.

John Willis gave an excellent presentation on Chef. I’ve monkeyed a bit with Chef but am only using Puppet in real world situations so far. John’s a powerful evangelist and definitely got me enthused about Chef. I deliberately didn’t ask about Chef’s lack of dry-run mode since I’ve banged on about it on the chef-users list twice now1, and because I wanted to see if anyone else flagged it as a concern – they did. All that aside, I am certainly going to get it up in a lab soon.

Link dump

  • Marionette Collective (mcollective) – message bus framework for mass host control. I’m really interested by this. Seems similar to Engine Yard‘s Vertebrae thing, but perhaps more immediately useable for me.
  • Fabric – a more traditional (ie: centralised) mass control mechanism. We’ve all rolled our own variant of this at some point, right?
  • Zookeeper – mentioned in passing, need to check this out.
  • cucumber-puppet – not currently doing anything like this, probably should!
  • Circonus – flagged in passing, monitoring.
  • Web Operations – seems to be a good compliment to Release It!, will grab and read.

Après-ski

The conference laid on a boat trip around Hamburg’s canals and harbour on the Friday night with beer and soup. Brilliant! Afterwards we maxed out the lounge area of the Chilli Club. After sufficient beers and cocktails we headed off to the Bunker club for an epic all nighter2. Clearly we didn’t know about the school gym dress theme but hey that didn’t matter. The utterly bizarre story that unfolded on stage throughout the night was hilarious. House of Pain MCd by a large sweaty man in a baby costume? Oh yes! (more outrageous moments omitted…)

Saturday night included more Brownian motion around the Reeperbahn, plus Mike and I hanging out in a bar 0wned by St Pauli supporters. I don’t know much about football but these guys are clearly very dedicated to their cause! Edgy. More drinks into the dawn, then tactical sleep before flying home.

Hamburgers

I was consistently amazed how friendly everyone is. Wandering around Hamburg you could strike up conversation with pretty much anyone and find something of interest to jib about. At several points strangers would change the course of their evening just to steer our slightly cat-herdesque group of geeks towards fun.

Thanks

Huge thanks to everyone who helped make this an excellent weekend, especially Marcel and Patrick.

  1. and, for the record, I understand why such a mode would be imperfect and not aligned with Chef’s overall philosophy []
  2. but no, I didn’t quite last the whole night! []