More updates

Over three-quarters of users have now been rebuilt and re-activated, so the remainder should only take a few more hours. Several hundred people have now asked for priority rebuilds, and I’ve just completed another big batch of those.
Over the last couple of hours we’ve discovered two issues that may be affecting some users.

  1. It appears that some less regularly used folders may be missing some messages from the last couple of days
  2. A small number of users were not replicated at all, and thus priority rebuilding has not been possible because there was no data to rebuild

The good news is that we’ve completed the “fsck” of the drives from the original server, and so we have essentially all of the original data. We’re now replicating the users that had no replicated data. Over the next couple of days, we’ll be finding all the messages on the original servers that weren’t replicated correctly (and so are missing) and copying them over. If you’re missing a few messages from some folders, please be patient; we should be able to find and restore them shortly.

Again, sorry for any inconvenience people have had. We’ve been trying our hardest to set up servers and systems so that hardware failures like this are invisible because we have a replicated setup. Unfortunately, things didn’t work out in this case. We’ll definitely be making some more changes…

imap5 status update

About half the accounts have now been rebuilt and re-activated. Almost everyone who has requested a priority rebuild (as described below) has had their account rebuilt and activated. There appear to be a few accounts which hadn’t replicated properly; I’m looking into those.
We’re “fscking” the disks from the original master server so that we can restore it, and replication, ASAP. We haven’t yet had time to investigate why the IMAP replica index was incorrect. We’ll do that once all users are back online.

Some more investigation suggests that the problem was not the RAID card but the motherboard in the machine. The hardware vendor is sending us a new motherboard ASAP. The drives from the machine have been placed into another housing cabinet and are being fscked there.

imap5 status

This is a bit of a long post to give a description of where things are currently at and what happened.

As we’ve mentioned recently in our newsletters, we’re in the process of moving to a replicated server setup. We have bought a number of pairs of new servers where each email delivered and action occurring on one server is replicated to the other server. We’ve been setting this up and doing some testing, but there have been some problems with the replication code in the IMAP server we’re using. We’ve been helping to track down and fix those problems where we can.

As part of this process, we have been moving users on to one of the new server pairs. This has been a gradual process over the last month or so, as we’ve become more confident with the hardware and software. We obviously test all new hardware and software heavily before we put it in production, but even once we’re happy we still only move users onto new hardware gradually, so that we can monitor everything at each stage as we add more users. So in this case, the new server (imap5) has been running fine for the last month. We’ve also had replication running successfully to its pair (imap6).

It appears that around 1pm EST, the RAID controller in the new imap5 server died in some way (we’re still working with the hardware people to try and work out what happened). At this point, we would normally have switched across to the replica. However, some quick testing showed that there was in fact a problem with the replica that we hadn’t noticed. We had observed that all emails were being replicated, but it appears that the “index” data in the replica was incorrect in some way, causing messages to appear as “empty” to any IMAP client logging in (including the web interface).

At this point we stopped the replica, and looked at restoring the original server again. The RAID controllers we bought explicitly store their configuration data on the drives. What this means is that if a controller fails, you can pull out the controller, and replace it with a new one and all configuration data is preserved. So we removed the controller and replaced it. Unfortunately there is a problem with this. To improve system performance, we used cards with a write-back memory cache including a battery backup unit. This is great for our system performance, but unfortunately in this case if the controller becomes totally dead, it means that data that the OS thought it wrote to the disks is actually not written. This leaves the disks in a corrupted state requiring a “fsck”. Even though we’ve reduced the volume sizes, a fsck will still take over a day and no access is possible at all until the fsck is complete.

So we went back to look at the replica. Since all the emails were actually there and it was just the index that was incomplete, it should be possible to rebuild the index for each mailbox and enable access as it completes. The IMAP server does provide a utility to do this, so we’re now going through rebuilding the indexes for each mailbox and re-enabling access to the mailbox as each rebuild is complete. We’re not sure at this point exactly how long it will take for all mailboxes to complete, but as mailboxes are completed, they will automatically come back online and IMAP and web logins will be immediately available again.

If access is urgent, please email webmaster@fastmail.fm with the subject “Priority rebuild” and make sure you include your account name in the email. We’ll ensure your rebuild occurs ASAP.
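For the curious, the ordering described above (rebuild every mailbox, but let priority requests jump the line, and re-enable each account as soon as its own rebuild finishes) can be sketched roughly as follows. This is only an illustration, not our actual tooling: `RebuildQueue`, `rebuild_all`, and the `rebuild_index` and `enable_access` callables are all hypothetical names standing in for the IMAP server’s rebuild utility and our account activation step.

```python
import heapq
from itertools import count

PRIORITY, NORMAL = 0, 1  # lower value is rebuilt first


class RebuildQueue:
    """Orders accounts for index rebuild; priority requests jump the line."""

    def __init__(self):
        self._heap = []
        self._seq = count()  # tie-breaker keeps first-come-first-served order
        self._queued = {}    # user -> best (lowest) level queued so far
        self._done = set()   # users already handed out for rebuild

    def add(self, user, level=NORMAL):
        if user in self._done:
            return  # already rebuilt, nothing to do
        if user in self._queued and self._queued[user] <= level:
            return  # already queued at the same or a better level
        self._queued[user] = level
        heapq.heappush(self._heap, (level, next(self._seq), user))

    def pop(self):
        """Return the next user to rebuild, or None when the queue is empty."""
        while self._heap:
            _, _, user = heapq.heappop(self._heap)
            if user in self._done:
                continue  # stale entry left behind by a later priority request
            self._done.add(user)
            return user
        return None


def rebuild_all(queue, rebuild_index, enable_access):
    """Rebuild each queued account and re-enable it as soon as it is done.

    rebuild_index and enable_access are hypothetical callables taking a user.
    """
    while (user := queue.pop()) is not None:
        rebuild_index(user)
        enable_access(user)
```

A user who is already in the normal queue and then sends a “Priority rebuild” email is simply added again with `PRIORITY`; the older entry is skipped when it eventually comes up.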

This whole episode seems to have been a confluence of events conspiring against us. We’ve never had a RAID card fail on us before, but we knew we could just swap a failed card for a working one. What we didn’t take into account was that the battery-backed cache would actually cause us problems in that case. Even then, we were prepared for total hardware failure because we also have our replica for use when the main server fails. We had been monitoring it to see that email was being replicated, but hadn’t realised that the IMAP indexes weren’t being correctly replicated. That’s definitely something we’ll be adding to our monitoring scripts, and we’ll be working with the IMAP software people to work out what went wrong and fix it.
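To give an idea of the kind of check we have in mind, here is a minimal sketch that compares a summary of each mailbox index on the master with the same summary on the replica. It is an assumption-laden illustration rather than our real monitoring script: the index is modelled as a list of `(uid, size_in_bytes)` pairs per mailbox, which is a hypothetical layout and not the IMAP server’s actual index format, and both function names are invented for this example.

```python
import hashlib


def index_fingerprint(records):
    """Summarise a mailbox index as (message count, digest of its records).

    `records` is an iterable of (uid, size_in_bytes) pairs; this layout is
    hypothetical and only stands in for whatever the real index stores.
    """
    digest = hashlib.sha256()
    n = 0
    for uid, size in sorted(records):  # sort so record order does not matter
        digest.update(f"{uid}:{size}\n".encode())
        n += 1
    return n, digest.hexdigest()


def find_index_mismatches(master, replica):
    """Return the sorted names of mailboxes whose replica index is missing
    or differs from the master's.

    master and replica map mailbox name -> records (see index_fingerprint).
    """
    bad = []
    for mailbox, records in master.items():
        if mailbox not in replica:
            bad.append(mailbox)
        elif index_fingerprint(records) != index_fingerprint(replica[mailbox]):
            bad.append(mailbox)
    return sorted(bad)
```

A check along these lines would flag a replica where the messages exist but the index says they are “empty”, since the recorded sizes would no longer match.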

We’ll update in a couple of hours when we have more details on the rebuild progress, and on working out what went wrong with the index replication.

Quick restart of imap5

I’ve had to quickly restart imap5 to deal with a failed service. Estimated downtime 5 minutes.

update: estimate more like 15 minutes now. Sorry. It looks like a full host restart is required

update2: and the host didn’t start back up again. We have techs looking at it now. Unfortunately, replication isn’t at the 100% working stage, so we can’t just cut over to the replica machine.

update3: the RAID card in the unit has failed. These are a cold-swappable part, and we have a spare on hand. We’re getting the techs to switch it now

update4: RAID cards may be cold-swappable, but they also have 1GB of battery-backed memory, and when they fail they leave the drive arrays looking a little worse for uncommitted data. We have recovered enough from it to switch users over to the replica though. We’re doing it one user at a time (actually, 4 users at a time). Will give an ETA when I have one.
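The “4 users at a time” approach is essentially a small worker pool. A rough sketch, assuming a hypothetical `switch_one(user)` function that moves a single user to the replica and raises an exception on failure (neither it nor `switch_users` is our real tooling):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def switch_users(users, switch_one, workers=4):
    """Run switch_one(user) for every user, at most `workers` at a time.

    Returns (switched, failed) lists so failures can be retried by hand.
    switch_one is a hypothetical callable that raises on failure.
    """
    switched, failed = [], []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(switch_one, user): user for user in users}
        for future in as_completed(futures):
            user = futures[future]
            try:
                future.result()  # re-raises any exception from switch_one
            except Exception:
                failed.append(user)
            else:
                switched.append(user)
    return switched, failed
```

Keeping the pool small limits the load on the replica while it is also serving users who have already been switched over.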
