Every user on replicated system

Every single user is now on a replicated server. Some of the users from server3 are not yet fully replicated because we prioritized getting you up and running over getting replication running; however, there’s now a task replicating those users. The need to get replication up as fast as possible was a major factor in choosing the recovery technique we finally went with: delete missing messages from your index files now and re-deliver them with new UIDs to your folders when they become available, rather than leave them in the index and have them appear “blank” in the web interface. This technique means replication will work correctly.
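To make the recovery technique concrete, here is a minimal sketch of the idea. It is not the actual Cyrus index format or our recovery script: the `Folder` class and the `prune_missing` and `redeliver` functions are hypothetical names invented for illustration. The one real constraint it models is that IMAP UIDs in a folder only ever increase and are never reused, so a recovered message comes back under a new UID.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    # Hypothetical in-memory stand-in for a folder index:
    # uid -> message body, or None when the message file is missing on disk.
    index: dict = field(default_factory=dict)
    next_uid: int = 1

    def append(self, body):
        """Add a message under the next unused UID and return that UID."""
        uid = self.next_uid
        self.index[uid] = body
        self.next_uid += 1
        return uid


def prune_missing(folder):
    """Remove index entries whose message is missing; return the dropped UIDs."""
    missing = sorted(uid for uid, body in folder.index.items() if body is None)
    for uid in missing:
        del folder.index[uid]
    return missing


def redeliver(folder, recovered_bodies):
    """Re-deliver recovered messages; each one gets a fresh, higher UID."""
    return [folder.append(body) for body in recovered_bodies]
```

After pruning, the index only lists messages that really exist, so there is nothing “blank” for the web interface to show and nothing inconsistent for replication to trip over.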

The ‘sync_all_users’ program has resource rate controls that ensure it doesn’t overload either the source or destination servers, so I’m leaving it running through the US day. Based on the rate it’s been running at, I estimate that all users will be fully replicated about 18 hours from now, and of course backups will still run as usual for all users.
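As a rough illustration of what rate control like this can look like, here is a minimal sketch. It is not the real ‘sync_all_users’ code: the function names, the `max_load` threshold, the `pause` interval and the load-checking callback are all assumptions made up for the example.

```python
import time


def sync_all(users, sync_one, load_of, max_load=0.8, pause=5.0, sleep=time.sleep):
    """Sync users one at a time, backing off while either server is busy.

    load_of(name) is a hypothetical callback returning a load figure for
    "source" or "destination"; max_load and pause are illustrative values.
    """
    synced = 0
    for user in users:
        # Wait until both ends have headroom before starting the next user.
        while load_of("source") > max_load or load_of("destination") > max_load:
            sleep(pause)
        sync_one(user)
        synced += 1
    return synced


def hours_remaining(users_left, users_per_hour):
    """Simple linear estimate of the time left at the observed rate."""
    if users_per_hour <= 0:
        raise ValueError("users_per_hour must be positive")
    return users_left / users_per_hour
```

The completion estimate is just the remaining work divided by the observed rate, so it moves around as the load on the servers changes during the day.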

There are no users left on the old server3 now; the last guest migrated off about half an hour ago. I’ve been running these migrations concurrently with the other tasks so we can safely do whatever is needed to server3 to recover the missing messages.

The filesystem check over our last 2TB volume looks like it still has a couple of days to go before we can deliver the messages from the gap between the last backup and the failure. It’s hard to be precise, because some sections run faster than others depending on the disk layout of the filesystem.

The good news: this will never happen again (update: see below). The bad news: one more week of moving users at the rate they were moving (over 2/3 of the non-guest users on that drive had already been moved) and it wouldn’t have happened in the first place!

Update: To clarify this statement and our current setup a bit more.

  1. All users now have their email stored on a system with RAID disks (4 drive x RAID 5 for email data, 2 drive x RAID 1 for metadata), and all servers and RAID arrays have dual power supplies. A single drive or power supply failure should cause no interruption to service at all; we just replace the drive or power supply while the system is live and online. Hard drives and power supplies are the most common failing hardware components in computer systems.
  2. All users now have their email replicated to an identical replica system (RAID drives, dual power supplies, etc). Each system is completely separate, with its own operating system, filesystem, drives, power, connections, etc. The replication is performed at the semantic email level, not at the filesystem level, so a filesystem corruption on the source server will not be replicated. This means if there is a disk or filesystem corruption on a single machine, we can just switch to the replica and it won’t cause a multi-day outage. The failover is manual, not automatic. Depending on the actual problem that occurs and our ability to analyse and respond, it should take on the order of minutes to an hour to fail over to a replica if we decide it’s needed. In some cases, we may decide it’s easier and safer to reboot a “frozen” machine than to fail over, so outages of up to an hour are still possible. If we believe an outage is going to last longer than that, we will most likely fail over to the replicas.

    We can also use the failover ability to do maintenance on machines more easily. If we decide a machine needs servicing (kernel upgrade, hardware change, etc), we can just fail over all master processes to their corresponding replica machines safely, do the work, start the machine up again and wait for replication to catch up, then restore master processes back to the machine. For users, the only visible downtime will be the controlled failover portion, which is usually on the order of 1 minute or so.

  3. All users have their email store backed up incrementally each night to a separate system and RAID array. The backups of email are kept for 1 week after the email is deleted, to allow restoring in case of accident.

We still can’t guarantee that multi-machine crashes or corruptions, which might cause problems, won’t occur, but they are far, far less likely.
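The maintenance sequence described in point 2 (fail over, do the work, wait for replication to catch up, switch back) can be sketched roughly as follows. This is only a toy model under stated assumptions, not our actual tooling: `Machine`, `fail_over`, `catch_up` and `service_machine` are hypothetical names, and replication progress is reduced to a single counter.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    is_master: bool
    applied: int = 0  # toy counter: number of mailbox changes applied so far


def fail_over(master, replica):
    """Controlled failover: only switch once the replica has caught up."""
    if replica.applied < master.applied:
        raise RuntimeError("replica is behind; let replication catch up first")
    master.is_master, replica.is_master = False, True


def catch_up(source, target):
    """Stand-in for replication replaying the changes the target missed."""
    target.applied = source.applied


def service_machine(machine, partner, do_work):
    """Move mastership away, service the machine, then move it back."""
    fail_over(machine, partner)  # the brief, controlled downtime users see
    do_work(machine)             # kernel upgrade, hardware change, etc.
    catch_up(partner, machine)   # wait for replication to catch up
    fail_over(partner, machine)  # restore master processes to the machine
```

The check inside `fail_over` is the reason a controlled failover is safe: mastership only moves once the other side already holds every change.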

Server3 users access restored

We’ve now restored web and IMAP access to most server3 users; the final ones will follow shortly.

For now we have had to restore all users from backups. There are three main issues users will notice because of this:

  1. Some emails deleted in the last week may have been restored. In a few cases, we couldn’t determine correctly whether an email had been deleted or not, and decided it was safer to restore these from the backup than not restore them.
  2. Junk Mail folders will initially be empty. We don’t back up Junk Mail folders nightly.
  3. Email delivered in the time between the last backup and the crash will not currently be present (approximately 10 hours, from 6:30 PM EST Oct 26 to 4:30 AM EST Oct 27). We’re currently still rebuilding the RAID array with those emails on it. When that’s complete, we will copy the emails into place in users’ mailboxes.

All emails delivered on Friday and over the weekend were queued on our mail servers, and have now been delivered to the new mailboxes. It’s only email delivered to the existing mailboxes between the times listed above, after the backup and before the crash, that still needs to be restored.

External IMAP access blocked, but web interface access coming back for server3 refugees

We have restored all metadata and some backups from server3. We’re now re-enabling accounts, but those who had messages delivered since the last backup will only be enabled for web access, not IMAP access.

The reason for this restriction is that Cyrus will return an IO error if you try to access a message which hasn’t been restored yet. This doesn’t affect the web interface, but could cause an IMAP client to get confused and lose your email. As we restore backups we will enable IMAP access again.
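Put as a rule, the access policy above looks something like the sketch below. The function name and the idea of tracking a per-user count of messages still waiting to be restored are assumptions for illustration only, not how our account system actually stores this.

```python
def allowed_access(pending_restore):
    """Return the access methods to enable for one user.

    pending_restore is a hypothetical count of messages listed in the
    user's index that have not been restored from backup yet.
    """
    if pending_restore < 0:
        raise ValueError("pending_restore cannot be negative")
    if pending_restore > 0:
        # Reading an unrestored message gives an IO error, which could
        # confuse an IMAP client, so only the web interface is enabled.
        return {"web"}
    return {"web", "imap"}
```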

Server3 restarting

For those few on server3 who aren’t on the dead partition, we’re just restarting it so we can attach the new SATA enclosure with the old disks in it (we had a spare enclosure shipped recently so we could move disks in case of an emergency) to do the rebuild.

Update: server3 is back up and the drives have been detected fine. Back to checking them.

server3 status update

Unfortunately the restore is taking longer than expected. After the corruption on the RAID array crashed the kernel, we restarted the server and started a fsck on the volume to fix the corruption. This would normally take about 12 hours, after which we should be able to remount the volume and give users access to their email again.

Sometime during the fsck, however, the RAID array started reporting multiple errors from different drives. Then two drives in the unit “died” in quick succession and started returning IO errors to the operating system, causing the fsck to stop.

We replaced the drives, waited a while for the volume to rebuild and restarted the fsck. Sometime in the last 2 hours the RAID array has now totally died. It’s not responding on its SCSI interface or on its web interface.

At this point we’re about to:

  1. Start doing restores from our nightly backup onto another server
  2. We’ve still got the existing metadata on a separate volume which is intact, so we’re going to copy that across. This means message lists, seen state, etc will be preserved
  3. New messages delivered between the last backup and the crash will appear in message lists, but attempts to read them will fail as “empty” messages

In the meantime, we have spare RAID array units. Fortunately the RAID arrays we use store their configuration data on the drives themselves. So we’re hoping that by pulling out the drives from the dead array, and putting them into a new spare array, they should just come back online and in their correct configuration. We’ll then run a fsck on the volume with the new unit, which should then allow us to recover the data between the last backup and the crash, and we can copy that into place for a complete restore.

There was no sign that this RAID array was about to fail in any way. This RAID array has been running fine for 2 years and it was only in the last 24 hours that things have started going wrong with it. It’s because of these single points of failure that we have been moving to a replicated setup. It’s unfortunate that the one unit that died was the one which wasn’t part of our replicated user infrastructure.

server3 down again

Server3 crashed again. We’re going to have to take it down over the weekend for a full cleanup, but we’re hoping it keeps running through the US day.

Update: The server is too unstable and crashed again. We’re taking it offline and doing a full clean and then moving users off.

The good news: We were already in the process of moving users off this server, so there aren’t too many left on there.
The bad news: We have to fsck the volume first before we can start moving users, which will take a day.
The good news: Once we’ve moved these users, all users will be on replicated storage.
The bad news: After the fsck, it will probably take another day or two to move the remaining users.
The good news: It’s very likely that everything will be fixed and restored and moved by the end of the weekend.

Affected paying users will be compensated for the downtime. We’ll post updates as we have more information.

Another server is offline briefly, I’m restarting it now

Update: I need to get NYI to reboot the RAID controller as well, which is what’s taking so long. I’ll update you once that’s done.

Update: They’ve restarted the RAID controller. Looks like one drive failed, and the array got confused rather than keeping on working as RAID6 promises it will. I’m having the drive replaced now.

Update: Back up again – it will run a little slower as the drive array rebuilds itself onto the new disk, but otherwise it’s working fine as far as I can tell.

Server down

One of our IMAP servers is down. We’re investigating.

Update: The server appears to have crashed with kernel errors. We’re getting NYI to reboot it.

Update: The server has been rebooted and all services have been started again. Everything should be back to normal.

One of our servers down

One of our IMAP servers has gone down. There are full replicas on other servers, and I am in the process of transferring to them. It will take up to 10 minutes.

Bron.

Update: All IMAP services are back up, and we’ve restarted the down server and got replication back up and running.

Site outage

Almost all services were down for about 1 hour. We’ve got everything running again now and are working out what happened.

Update: We’ve found the root cause of the problem. An overload on the web servers caused a chain restart of systems which didn’t restart properly, leaving many in a down state. We’re updating the code to ensure this doesn’t happen again.
