Every user is now on a replicated server. Some of the users from server3 are not yet fully replicated, because we prioritized getting you up and running over getting replication running, but there is now a task replicating those users. The need to get replication up as fast as possible was a major factor in choosing the recovery technique we finally went with: delete missing messages from your index files now and re-deliver them with new UIDs to your folders when they become available, rather than leave them in the index and have them appear “blank” in the web interface. This technique means replication will work correctly.
The ‘sync_all_users’ program has resource rate controls that ensure it doesn’t overload either the source or destination servers, so I’m leaving it running through the US day. Based on the rate it has been running at, I estimate that all users will be fully replicated about 18 hours from now. Backups will, of course, still run as usual for all users.
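To illustrate the kind of rate control described above, here is a minimal sketch of a sync loop that backs off whenever either server is busy. It is not the actual ‘sync_all_users’ code; the function name `sync_users_throttled`, the load callbacks and the threshold values are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, List


def sync_users_throttled(
    users: Iterable[str],
    sync_one: Callable[[str], None],
    source_load: Callable[[], float],
    dest_load: Callable[[], float],
    max_load: float = 4.0,        # hypothetical load threshold
    pause_seconds: float = 5.0,   # hypothetical back-off interval
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Replicate users one at a time, pausing while either server is too busy."""
    synced = []
    for user in users:
        # Back off until both the source and the destination are under the threshold.
        while source_load() > max_load or dest_load() > max_load:
            sleep(pause_seconds)
        sync_one(user)
        synced.append(user)
    return synced
```

Checking load before each user keeps the loop simple: a busy period on either machine just delays the next user rather than aborting the run.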
There are no users left on the old server3 now; the last guest migrated off about half an hour ago. I’ve been running these migrations concurrently with the other tasks so that we can safely do whatever is needed to server3 to recover the missing messages.
The filesystem check over our last 2TB volume looks like it still has a couple of days to go before we can deliver the messages from the gap between the last backup and the failure. (It’s hard to be precise; some sections run faster than others depending on the disk layout of the filesystem.)
The good news: this will never happen again (update: see below). The bad news: one more week of moving users at the rate they were moving (over 2/3 of the non-guest users on that drive had already been moved) and it wouldn’t have happened in the first place!
Update: to clarify this statement and our current setup a bit more:
- All users now have their email stored on a system with RAID disks (4 drive x RAID 5 for email data, 2 drive x RAID 1 for metadata), and all servers and RAID arrays have dual power supplies. A single drive or power supply failure should therefore cause no interruption to service at all; we just replace the drive or power supply while the system is live and online. Hard drives and power supplies are the most common failing hardware components in computer systems.
- All users now have their email replicated to an identical replica system (RAID drives, dual power supplies, etc). Each system is completely separate, with its own operating system, filesystem, drives, power, connections, etc. The replication is performed at the semantic email level, not at the filesystem level, so a filesystem corruption on the source server will not be replicated. This means that if there is a disk or filesystem corruption on a single machine, we can just switch to the replica and it won’t cause a multi-day outage. The failover is manual, not automatic. Depending on the actual problem that occurs and our ability to analyse and respond, it should take on the order of minutes to an hour to fail over to a replica if we decide it’s needed. In some cases we may decide it’s easier and safer to reboot a “frozen” machine than to fail over, so outages of up to an hour are still possible. If we believe an outage is going to last longer than that, we will most likely fail over to the replicas.
We can also use the failover ability to do maintenance on machines more easily. If we decide a machine needs servicing (kernel upgrade, hardware change, etc), we can safely fail over all master processes to their corresponding replica machines, do the work, start the machine up again, wait for replication to catch up, and then restore the master processes back to the machine. For users, the only visible downtime will be the controlled failover portion, which is usually on the order of 1 minute or so.
- All users have their email store backed up incrementally each night to a separate system and RAID array. Backups of an email are kept for 1 week after the email is deleted, so it can be restored after an accidental deletion.
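The maintenance failover described in the replication point above follows a fixed order, which can be sketched roughly as below. This is only an illustration of the sequence; the function `service_machine` and every callback passed to it are hypothetical stand-ins, not our actual tooling.

```python
import time
from typing import Callable


def service_machine(
    fail_over_masters: Callable[[], None],
    do_maintenance: Callable[[], None],
    start_machine: Callable[[], None],
    replication_lag: Callable[[], int],
    restore_masters: Callable[[], None],
    poll_seconds: float = 30.0,   # hypothetical polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run the maintenance steps in the order described in the text."""
    # 1. Move master processes to the replica machines (the only user-visible step).
    fail_over_masters()
    # 2. Do the work (kernel upgrade, hardware change, ...) while the replicas serve users.
    do_maintenance()
    # 3. Start the serviced machine up again.
    start_machine()
    # 4. Wait until replication has caught up before moving anything back.
    while replication_lag() > 0:
        sleep(poll_seconds)
    # 5. Restore the master processes to the serviced machine.
    restore_masters()
```

The important design point is step 4: masters are only moved back once the serviced machine holds everything that arrived while it was down.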
We still can’t guarantee that multi-machine crashes or corruptions won’t occur and cause problems, but they are far, far less likely.