Short outage

Status (by brong at Fri May 13 11:49 UTC)
A misconfiguration has forced us to restart all our IMAP servers with no advance notice, so we’re in the middle of a rolling outage across all servers.

Update (by brong at Fri May 13 12:40 UTC)
This looks like a bigger problem than we first thought. We estimate the outage will last about 1 hour.

Update (by robm at Fri May 13 13:57 UTC)
OK, it’s taking a bit longer than we expected.

The configuration error caused a lot of processes to crash and generate core dumps, which filled up some partitions. Once those partitions were full, the regular repack processes for the IMAP servers’ mailbox listings corrupted some of the files. We can’t reliably start the IMAP servers while these files are corrupted.
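For illustration only, here is a minimal sketch of a guard that skips a rewrite job when a partition is nearly full. The function name `has_headroom`, the 10% threshold, and the idea of gating a repack on it are assumptions made for this example, not a description of how the servers are actually configured.

```python
import shutil


def has_headroom(path: str, min_free_fraction: float = 0.10) -> bool:
    """Return True if the partition holding `path` has enough free space.

    The default threshold is a hypothetical value chosen for illustration.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return False  # cannot judge free space, so treat it as unsafe
    return usage.free / usage.total >= min_free_fraction
```

A job that rewrites files in place could call `has_headroom()` on its target directory first and defer the rewrite when it returns `False`.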

We’re restoring the affected files from backup data and making sure they’re up to date. We’ll then restart all servers.

Update (by robm at Fri May 13 14:47 UTC)
A disk-full condition can truncate files at arbitrary points, and this left some databases in a corrupted state. We’ve written a script to find those databases and restore their data from regular backups and replica servers.
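As a rough sketch of the detection half of such a script (this is not the script that was actually run), the example below scans a directory for files that end in the middle of a record. It assumes a hypothetical file format made of records that each start with a 4-byte big-endian length. The names `is_truncated` and `find_truncated` and the `*.db` pattern are also assumptions.

```python
import struct
from pathlib import Path

# Hypothetical framing: each record is a 4-byte big-endian length + payload.
HEADER = struct.Struct(">I")


def is_truncated(path: Path) -> bool:
    """Return True if the file ends in the middle of a record."""
    data = path.read_bytes()
    pos = 0
    while pos < len(data):
        if pos + HEADER.size > len(data):
            return True  # the length prefix itself is cut off
        (length,) = HEADER.unpack_from(data, pos)
        pos += HEADER.size + length
    # Overshooting the end means the last payload is shorter than declared.
    return pos != len(data)


def find_truncated(root: Path, pattern: str = "*.db") -> list[Path]:
    """List files under `root` whose record framing is incomplete."""
    return sorted(p for p in root.rglob(pattern) if is_truncated(p))
```

A check like this only catches files cut off partway through a record. A file truncated exactly on a record boundary (including an empty file) passes, so comparing against a backup or replica is still needed before trusting the data.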

That script is running now and should take another 30 minutes or so. We’ll then check that all data looks consistent and bring up a few services at a time. We want to do that carefully to check that everything is definitely OK, so gradually bringing things back online should take a further 30 minutes or so.

All email is being accepted and queued as normal, and will be delivered once the IMAP servers are back up.

Update (by robm at Fri May 13 15:55 UTC)
All IMAP servers are back up again now.

All users should be able to log in and access their email.

There is still a large queue of backlogged email that will be delivered over the next few hours.
