Here’s an update of what happened earlier today.
To summarise the overall effect: for about 30 minutes everyone would have had problems and timeouts accessing the web interface, and users on the affected server would have been down for about 2 hours.
What actually happened is that a drive in the imap6 server failed. In itself that is not a problem, because all drives are in RAID arrays, so we contacted NYI to get the dead drive removed and replaced with a new one. Unfortunately there was a mix-up. In the RAID controller interface the drives are labelled 1 to 12, but on the physical case they are labelled 0 to 11. When we told NYI to remove drive 1, they did exactly that, but because the labelling on the case starts at 0, they actually removed the second drive rather than the failed first one. This caused the RAID array to go completely dead. Since the RAID array that failed was the one with the operating system on it, the machine ended up in a terribly stuck state.
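The numbering mismatch can be written down as a small conversion. This is only an illustrative sketch: the function names and constants are hypothetical and not part of any tooling we run, and it assumes nothing beyond the two labelling schemes described above (1 to 12 in the controller interface, 0 to 11 on the case).

```python
# Illustrative sketch only: names and constants are hypothetical.
CONTROLLER_FIRST = 1   # RAID controller interface labels drives 1..12
CASE_FIRST = 0         # physical case labels drives 0..11
DRIVE_COUNT = 12


def controller_to_case(controller_number: int) -> int:
    """Convert a RAID controller drive number to the label printed on the case."""
    last = CONTROLLER_FIRST + DRIVE_COUNT - 1
    if not CONTROLLER_FIRST <= controller_number <= last:
        raise ValueError(
            f"controller drive number must be {CONTROLLER_FIRST}..{last}, "
            f"got {controller_number}"
        )
    return controller_number - CONTROLLER_FIRST + CASE_FIRST


def removal_instruction(controller_number: int) -> str:
    """Build a removal request that states both numbering schemes explicitly."""
    case_label = controller_to_case(controller_number)
    return (
        f"Remove the drive labelled {case_label} on the case "
        f"(drive {controller_number} in the RAID controller interface)"
    )
```

Spelling out both numbers in the request, as `removal_instruction(1)` does, means the person at the rack never has to guess which scheme is being used.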
At that point, we marked the machine as down, which should normally just point affected users to the status blog and let everyone else keep working. Unfortunately, because of the state the machine was in, IMAP connections to the machine did not time out immediately, but instead froze. This caused all the web processes to be used up, causing access and timeout problems for everyone. Once we realised this was happening, we put a special check in place to stop the web servers even trying to contact the dead IMAP server, which restored access to everyone not on the affected server.
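As a rough idea of what such a check might look like, here is a minimal sketch. It is not our actual code: the `DOWN_SERVERS` set, the `BackendUnavailable` exception and the `connect_to_backend` function are all hypothetical names, and the 5 second timeout is an arbitrary assumption. The two ideas it shows are refusing to contact a server that is marked down, and putting a hard timeout on every connection so that a frozen server cannot tie up a web process indefinitely.

```python
import socket

# Hypothetical registry of servers marked as down; not a real config source.
DOWN_SERVERS = {"imap6"}


class BackendUnavailable(Exception):
    """Raised when a backend is marked down or does not respond in time."""


def connect_to_backend(name, host, port=143, timeout=5.0, down=DOWN_SERVERS):
    """Open a connection to an IMAP backend, failing fast instead of hanging."""
    # Fail immediately, without touching the network, if the server is marked down.
    if name in down:
        raise BackendUnavailable(f"{name} is marked down")
    try:
        # The timeout bounds how long the connection attempt may take.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:  # includes connection timeouts and refusals
        raise BackendUnavailable(f"could not connect to {name}: {exc}") from exc
    # Also bound later reads and writes, so a frozen server cannot block forever.
    sock.settimeout(timeout)
    return sock
```

A caller would catch `BackendUnavailable` and send the user to the status blog, leaving the web process free for everyone else.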
Back on imap6, the dead server, we got the drives reinserted and got the machine restarted to find out what state it was in. After getting it back up, we got the drives checked and remounted and did some tests to see the extent of the damage, and whether we could start using the machine again or would have to switch to the replica. Unfortunately the removed drives had caused severe corruption on the main partition, so we then switched over all services for users on that server to the corresponding replica servers.
We’re now cleaning up imap6 so that we can replicate data back onto it.
So to summarise the good and bad.
Bad
- One server being down affected all users using the web interface. We’ll work to make sure that doesn’t happen in the future
- A miscommunication meant that the wrong drive was pulled. Had the correct drive been pulled, there would have been no visible user downtime at all. We’ve made a clear note so that, in the future, both ends know that the drives are numbered differently in the controller interface and on the case
- Since this was only the second time we’ve had to use the replication system in a true failure scenario, we were still cautious about doing the switch over and checking that it was required. This meant that it took longer than it should have to get users back online
Good
- The replication setup worked as expected. Once we determined that the problem wasn’t a short-term fixable one, we switched all affected users to their replica. The result was an outage of an hour or two, rather than having to fix a corrupted filesystem, which could have taken days. We believe that this outage time can be reduced further in the future as well.