This is a fairly long post describing what happened and where things currently stand.
As we’ve mentioned recently in our newsletters, we’re in the process of moving to a replicated server setup. We have bought a number of pairs of new servers, where each email delivered and each action occurring on one server is replicated to the other. We’ve been setting this up and doing some testing, but there have been some problems with the replication code in the IMAP server we’re using. We’ve been helping to track down and fix those problems where we can.
As part of this process, we have been moving users onto one of the new server pairs. This has been a gradual process over the last month or so, as we’ve become more confident with the hardware and software. We obviously test all new hardware and software heavily before we put it into production, but even once we’re happy with it, we only move users across gradually so that we can monitor everything at each stage as we add more users. In this case, the new server (imap5) has been running fine for the last month. We’ve also had replication running successfully to its pair (imap6).
It appears that around 1pm EST, the RAID controller in the new imap5 server died in some way (we’re still working with the hardware people to try and work out what happened). At this point, we would normally have switched across to the replica. However, some quick testing showed that there was a problem with the replica that we hadn’t noticed. We had observed that all emails were being replicated, but it appears that the “index” data in the replica was incorrect in some way, causing messages to appear as “empty” to any IMAP client logging in (including the web interface).
At this point we stopped the replica and looked at restoring the original server. The RAID controllers we bought explicitly store their configuration data on the drives. This means that if a controller fails, you can pull it out and replace it with a new one, and all configuration data is preserved. So we removed the controller and replaced it. Unfortunately, there is a problem with this. To improve system performance, we used cards with a write-back memory cache, including a battery backup unit. This is great for performance, but if the controller dies completely, data that the OS thought it had written to the disks but that was still sitting in the controller’s cache never actually reaches the disks. This leaves the disks in a corrupted state requiring an “fsck”. Even though we’ve reduced the volume sizes, an fsck will still take over a day, and no access is possible at all until it is complete.
So we went back to look at the replica. Since all the emails were actually there and it was just the index that was incomplete, it should be possible to rebuild the index for each mailbox and enable access as each one completes. The IMAP server provides a utility to do this, so we’re now rebuilding the indexes for each mailbox and re-enabling access to the mailbox as each rebuild completes. We’re not sure at this point exactly how long it will take for all mailboxes to complete, but as mailboxes are completed, they will automatically come back online, and IMAP and web logins will be immediately available again.
If access is urgent, please email webmaster@fastmail.fm with the subject “Priority rebuild” and make sure you include your account name in the email. We’ll ensure your rebuild occurs ASAP.
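For the technically curious, here is a rough Python sketch of the rebuild process described above: work through the mailboxes one at a time, handle priority requests first, and re-enable access to each mailbox as soon as its rebuild succeeds. This is an illustration only, not our actual tooling. The `rebuild` and `enable_access` callables are hypothetical stand-ins for the IMAP server's index rebuild utility and for whatever switches logins back on.

```python
from typing import Callable, Iterable, List, Set


def rebuild_order(mailboxes: Iterable[str], priority: Set[str]) -> List[str]:
    """Put priority mailboxes first, keeping the original order otherwise."""
    # sorted() is stable, and False sorts before True, so mailboxes with a
    # priority request move to the front without being reshuffled.
    return sorted(mailboxes, key=lambda name: name not in priority)


def rebuild_all(
    mailboxes: Iterable[str],
    priority: Set[str],
    rebuild: Callable[[str], bool],        # hypothetical: True if rebuild succeeded
    enable_access: Callable[[str], None],  # hypothetical: re-enables logins
) -> List[str]:
    """Rebuild each mailbox index in turn; return the mailboxes restored."""
    restored = []
    for name in rebuild_order(mailboxes, priority):
        if rebuild(name):
            # Access comes back per mailbox, not after the whole run.
            enable_access(name)
            restored.append(name)
    return restored
```

A mailbox whose rebuild fails is simply left disabled here, so it can be looked at by hand later.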
This whole episode seems to have been a confluence of events conspiring against us. We’ve never had a RAID card fail on us before, but we knew we could just swap a failed card for a working one. What we didn’t take into account was that the battery-backed cache would cause us problems in that case. Even then, we were prepared for total hardware failure, because we have our replica for use when the main server fails. We had been monitoring it to see that email was being replicated, but hadn’t realised that the IMAP indexes weren’t being replicated correctly. That’s definitely something we’ll be adding to our monitoring scripts, and we’ll be working with the IMAP software people to work out what went wrong and fix it.
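To give an idea of the kind of check we mean, here is a minimal Python sketch. It assumes a monitoring script can somehow obtain, for each mailbox on each server, the set of message UIDs stored on disk and the set of UIDs recorded in the index. How that data is collected is not shown, and the `MailboxState` type and problem messages are made up for illustration. The point is that comparing message files alone, as we were effectively doing, would not have caught this failure.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MailboxState:
    """Hypothetical snapshot of one mailbox on one server."""
    stored_uids: FrozenSet[int]   # messages present as files on disk
    indexed_uids: FrozenSet[int]  # messages the index knows about


def replication_problems(primary: MailboxState, replica: MailboxState) -> List[str]:
    """List the ways the replica disagrees with the primary, if any."""
    problems = []
    if primary.stored_uids != replica.stored_uids:
        problems.append("message files differ")
    # The failure we hit: the emails were all there, but the index was wrong.
    if replica.indexed_uids != replica.stored_uids:
        problems.append("replica index does not match its message files")
    if primary.indexed_uids != replica.indexed_uids:
        problems.append("index differs between primary and replica")
    return problems
```

An empty list means the replica looks consistent for that mailbox; anything else should raise an alert.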
We’ll update in a couple of hours when we have more details on the rebuild progress, as well as on working out what went wrong with the index replication.