Timeouts and failing fast

Sunday, 23. 01. 2011  –  Category: sw, web

“Integration points are the number-one killer of systems” – Release It!, Michael Nygard

Last week I had two different web systems fail in a similar way. One was a single box running two busy WordPress sites; the other was a largish multi-tier publishing cluster. Both dropped off the air because they didn’t handle the failure of a remote system very well. In particular, both had the webserver waiting on an HTTP request to a foreign site to complete before returning a page to the client.

The architectural issues with this kind of request amplification are reasonably clear, and it can sometimes be avoided altogether. Data ingest from other systems can often run asynchronously in another process. The cluster mentioned above has a set of machines dedicated to just this, distributing the data they retrieve from remote systems via a database and a shared filesystem, ready for the webservers to use when building pages.

However, in the situations where a webserver must dial out to another system at request time, it’s worth being really paranoid about how that outbound request behaves and distrustful of the reply. The particular problem here was a lack of timeouts.
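The vulnerable pattern is very easy to write without noticing. A minimal PHP sketch of it (the URL is a placeholder, not from either of the systems involved):

```php
<?php
// With no timeout options set, curl_exec() will happily wait on the
// remote server for as long as it takes, subject only to TCP-level
// timeouts. The whole Apache child blocks while it waits.
$ch = curl_init('http://remote.example.com/feed');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$body = curl_exec($ch);   // may never return if the remote end hangs
curl_close($ch);
```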

Here, both systems were running mpm_prefork Apache. Both were serving a request that made an HTTP call somewhere else before returning the page to the client1. Last week the remote sites that both these systems contact had outages, leaving Apache blocked, waiting for a reply.

However, with no timeout configured that wait is effectively infinite. Long enough for these hanging Apache processes to consume all the available slots of the webserver, resulting in an interesting set of observations:

  • New TCP connections to the server just hang.
  • Server load is low, because the load average counts runnable processes (and, on Linux, processes in uninterruptible sleep), and these hanging processes are just sleeping, waiting on poll(2) or select(2).
  • No errors are logged by Apache or the application code because they haven’t failed yet.2

Here’s some vmstat during such a wedge:

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0 140580  56528  58168 516184    0    0     0     0 1026   78  0  0 100  0  0
 0  0 140580  56528  58168 516184    0    0     0     0 1032   80  0  0 100  0  0
 0  0 140580  56528  58168 516184    0    0     0     0 1031   78  0  0 100  0  0

“It was quiet. Too quiet”. Hook up gdb and you can see what’s going on:

(gdb) bt
#0  0x00002aafffb7d14f in poll () from /lib64/libc.so.6
#1  0x00002ab0090b3930 in Curl_select () from /usr/lib64/libcurl.so.3
#2  0x00002ab0090ac4bc in Curl_perform () from /usr/lib64/libcurl.so.3
#3  0x00002ab00850591b in zif_curl_exec () from /etc/httpd/modules/libphp5.so
#4  0x00002ab0086663b2 in ?? () from /etc/httpd/modules/libphp5.so
#5  0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#6  0x00002ab00865dc09 in ?? () from /etc/httpd/modules/libphp5.so
#7  0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#8  0x00002ab008688750 in ?? () from /etc/httpd/modules/libphp5.so
#9  0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#10 0x00002ab00865dc09 in ?? () from /etc/httpd/modules/libphp5.so
#11 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#12 0x00002ab00865c11e in ?? () from /etc/httpd/modules/libphp5.so
#13 0x00002ab00865651c in execute () from /etc/httpd/modules/libphp5.so
#14 0x00002ab0086395de in zend_execute_scripts () from /etc/httpd/modules/libphp5.so
#15 0x00002ab0085fe697 in php_execute_script () from /etc/httpd/modules/libphp5.so
#16 0x00002ab0086b6ad6 in ?? () from /etc/httpd/modules/libphp5.so
#17 0x00002aaffdbb7a4a in ap_run_handler ()
#18 0x00002aaffdbbaed8 in ap_invoke_handler ()
#19 0x00002aaffdbc578a in ap_internal_redirect ()
#20 0x00002ab0069ffbc0 in ap_make_dirstr_parent () from /etc/httpd/modules/mod_rewrite.so
#21 0x00002aaffdbb7a4a in ap_run_handler ()
#22 0x00002aaffdbbaed8 in ap_invoke_handler ()
#23 0x00002aaffdbc5938 in ap_process_request ()
#24 0x00002aaffdbc2b70 in ?? ()
#25 0x00002aaffdbbecd2 in ap_run_process_connection ()
#26 0x00002aaffdbc9789 in ?? ()
#27 0x00002aaffdbc9a1a in ?? ()
#28 0x00002aaffdbc9ad0 in ?? ()
#29 0x00002aaffdbca7bb in ap_mpm_run ()
#30 0x00002aaffdba4e48 in main ()

Another kink is that a graceful restart of Apache, via SIGUSR1, doesn’t work, since a graceful restart waits for each request to finish – and these ones aren’t finishing. apachectl or the service(8) scripts will exit, but the hung processes remain. These long-running wedged processes are also visible in ps:

apache     580  0.0  0.3  30332  5492 ?        S    Jan02   0:00

and can be matched up with the hung connections via netstat -anp:

tcp        0      0
    ESTABLISHED 580/httpd

This is mostly a long-winded way of pointing out the importance of timeouts on any code that executes during a request, particularly when it does something non-local. Fail fast!

To keep the WordPress sites running I went through the code adding CURLOPT_CONNECTTIMEOUT and CURLOPT_TIMEOUT settings to all the cURL calls I could find, and the development team took care of the code running on the cluster. Both have been stable since.
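For reference, a sketch of the kind of change involved – the timeout values and URL here are illustrative, not the ones actually deployed:

```php
<?php
$ch = curl_init('http://remote.example.com/feed');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
// Give up if the TCP connection can't be established quickly...
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 2);   // seconds
// ...and cap the total time allowed for the whole transfer.
curl_setopt($ch, CURLOPT_TIMEOUT, 5);          // seconds
$body = curl_exec($ch);
if ($body === false) {
    // Fail fast: log it and fall back to something sensible,
    // rather than tying up a server slot indefinitely.
    error_log('remote fetch failed: ' . curl_error($ch));
    $body = '';
}
curl_close($ch);
```

With these set, the worst case for a remote outage is a few seconds of delay per request rather than a wedged Apache child.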

There are plenty of other pitfalls in these integration points between systems. Michael Nygard’s book, linked at top, makes a good survey of them alongside other stability antipatterns in complex systems. It’s a recommended read.

  1. One of the WordPress sites was doing this in every page’s footer, and not caching the result. Better still, it was contacting the other WordPress site on the same server, which all but guarantees the request won’t return when the shared webserver gets busy, since there may be no spare server slots to accept the second request. Ouch.
  2. If you don’t run Apache hot, then you’d see an “I’ve reached MaxClients” message in the error log, but not much else.
