Status (by robm at Sun Jul 5 02:44 UTC)
Annoyingly, two web servers crashed about 6 hours ago in a really peculiar way. Because of how they failed, our regular two-minute checks didn’t detect the failures, and the frontend routing servers didn’t automatically route to the alternate web servers.
The net result is that for the last 6 hours or so, the web interface has been randomly "flaky": working fine for some people, not working at all for others, and working only intermittently for the rest.
We’ve now got the servers back up, and will be adding new code to our regular two-minute checks to handle this new failure mode, so the servers are correctly taken out of service if it happens again.
Update (by robm at Sun Jul 5 03:14 UTC)
The problem only affected web services. All other services (IMAP, POP, LDAP, etc.) have been running fine.
As mentioned, we’ll make this a high-priority item this week to make sure this doesn’t happen again.