server3 down for 20 minutes

One of the backend servers (server3) was just down for about 20 minutes. We had been upgrading kernels on our frontend servers and had temporarily stopped the regular 2-minute checks, so when one of the backends experienced problems it took a bit longer than normal before we got to it. Everything should be running normally again now.

Replication and server status update

After the server problem about 3 weeks ago (see previous posts on this blog for more details), we switched all users over to the replica of the server that failed. Unfortunately, at the time we found that, due to two separate bugs:

  1. The replica was a few days behind the master
  2. The data in the replica couldn’t be “accessed” without a reconstruct of the mailbox

Since that change over, we have:

  1. Fixed the original server (the problem turned out to be a motherboard problem in the end, not a RAID controller problem after all)
  2. Found all the messages missing from the replication process and copied them across
  3. Found and fixed the bug that caused the messages not to be “accessible” in the replica
  4. Restarted replication back to the original server and brought everything up to date again
  5. Monitored the replication for a couple of days and done folder-by-folder compares to check that everything had been replicated back and was being kept up to date

Since that was all good, we have just swapped back replication so that the old master is now again the real master server. This all went pretty smoothly which is nice to see and what should have happened in the first place!

I will shortly be setting up processes so that we continuously monitor the replication engine and are notified if it ever bails out for any reason. Every week we will also do a complete folder-to-folder compare between the master and replica to look for any discrepancies (and hopefully there should be none).
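As a rough illustration of what a folder-to-folder compare could look like, here is a minimal sketch in Python. It is not the script we actually run. It assumes each server can be summarised as a mapping from folder name to a list of message identifiers; the function name `compare_folders` and the sample data are made up for the example.

```python
def compare_folders(master, replica):
    """Compare two {folder: [message ids]} mappings.

    Returns {folder: (missing_on_replica, extra_on_replica)} for every
    folder where the two sides differ. An empty dict means no discrepancies.
    """
    report = {}
    for folder in sorted(set(master) | set(replica)):
        master_ids = set(master.get(folder, ()))
        replica_ids = set(replica.get(folder, ()))
        missing = sorted(master_ids - replica_ids)
        extra = sorted(replica_ids - master_ids)
        if missing or extra:
            report[folder] = (missing, extra)
    return report


# Hypothetical data: one message never made it to the replica.
master = {"INBOX": ["m1", "m2", "m3"], "Sent": ["s1"]}
replica = {"INBOX": ["m1", "m3"], "Sent": ["s1"]}
print(compare_folders(master, replica))
```

A weekly job would alert someone whenever the returned report is not empty.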

One backend server (imap5) down

One of our backend servers will be down shortly while we switch back to its replication pair. This should only take a couple of minutes.

Update: All done. Everything should be working normally again now.

Web site down for a little while

One of our web servers froze up, but was taking a long time to time out connections, causing lots of incoming connections to “freeze up”. I’ve taken the web server out of production while we work out what went wrong. Everything should be working normally again.

Email restore status

The email restore has completed, and any emails that were missing should now be restored. In some cases, there may be doubled-up email, or old email that had been deleted but was restored. Unfortunately, working out whether an email had been deleted wasn’t possible in some cases, so we restored those emails. Where we didn’t know whether the email hadn’t replicated or had been subsequently deleted, we felt it was safer to restore it and let people delete it again, rather than not restoring it and possibly losing it.
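The rule we applied can be written down as a tiny sketch. This is only an illustration of the reasoning, not the restore code itself: the `DeletionStatus` names are invented for the example, and skipping messages that are known to have been deleted is an assumption.

```python
from enum import Enum


class DeletionStatus(Enum):
    """What we could determine about a message missing from the new store."""
    KNOWN_DELETED = "known_deleted"    # we can tell the user deleted it
    NOT_REPLICATED = "not_replicated"  # we can tell it simply never replicated
    UNKNOWN = "unknown"                # we cannot tell which of the two it was


def should_restore(status):
    """Err on the side of restoring: only skip mail we know was deleted."""
    return status is not DeletionStatus.KNOWN_DELETED


for status in DeletionStatus:
    print(status.name, should_restore(status))
```

The cost of a wrong guess is lopsided: an unwanted restored email can be deleted again in seconds, while an email that was wrongly left out is gone.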

We’re about to start replicating all data back to what was the old server, to bring it back up to date with all emails that have been delivered since the cut over. We’ve also upgraded the software on both the master and replica server, so that if we need to cut back over, we won’t have the same rebuild problem that caused the long outage last time.

Missing email restore

The restoration of missing emails is now about 50% complete. Most users don’t have any missing emails, and for those that do, it’s usually only a few.

Server2 rebooted

Server2 has been restarted. Estimated downtime is about 10 minutes.

Update: Back up and running now.

Copying missing emails

The process of copying in missing emails was going slower than we expected. Our initial calculation suggested it would only take about 12 hours, but after 12 hours we were only about 1/5 of the way done. We’re not entirely sure why the initial calculation was so far out, but we’ve now started a number of the processes in parallel, so hopefully it should complete faster. I’ll update in a few hours with more status details.
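To put rough numbers on that: if progress were linear, 1/5 done after 12 hours implies about 60 hours in total, so about 48 hours still to go with a single process. The sketch below does that arithmetic. It assumes linear progress and perfect scaling across parallel processes, which is optimistic since the processes share the same disks, and the function name `remaining_hours` is just for illustration.

```python
def remaining_hours(elapsed_hours, fraction_done, workers=1):
    """Estimate hours left, assuming linear progress and ideal scaling."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    total_hours = elapsed_hours / fraction_done
    return (total_hours - elapsed_hours) / workers


print(remaining_hours(12, 1 / 5))             # about 48 hours with one process
print(remaining_hours(12, 1 / 5, workers=4))  # about 12 hours if 4 scale ideally
```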

Started copying missing emails, original replication problem found

We’ve completed the script that finds emails that were on the master server but weren’t replicated correctly. This is now running and copying the missing emails directly into the correct folders so they should appear as soon as we check your account. I’m not sure how long this will take to run, but it will be at least a couple of hours.
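In outline, the copy step looks something like the sketch below. It is a simplified illustration, not the real script: it assumes each store is an in-memory mapping of folder name to a mapping of message id to message content, and `copy_missing` is a made-up name.

```python
def copy_missing(old_store, new_store):
    """Copy messages present in old_store but absent from new_store.

    Both stores are {folder: {message_id: content}}. Messages are copied
    into the same folder they came from. Returns the number copied.
    """
    copied = 0
    for folder, messages in old_store.items():
        target = new_store.setdefault(folder, {})
        for message_id, content in messages.items():
            if message_id not in target:
                target[message_id] = content
                copied += 1
    return copied


# Hypothetical data: one INBOX message and the whole Archive folder are missing.
old_store = {"INBOX": {"m1": "hello", "m2": "world"}, "Archive": {"a1": "old"}}
new_store = {"INBOX": {"m1": "hello"}}
print(copy_missing(old_store, new_store), new_store)
```

Because existing messages are never overwritten, running it a second time copies nothing.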

We’ve also tracked down the problem that caused the original IMAP replica indexes to be corrupt. We’ve fixed the IMAP server so that this isn’t a problem in the future, and are also submitting the changes back to the maintainers of the server code (cyrus).

We discovered one other issue: some users’ rules (e.g. sieve scripts) were a few days to a few weeks out of date as well. That means an old set of rules was being used rather than the most current set. This has now been fixed as well, and all accounts are using the correct, up-to-date rule set for delivering email.

All rebuilds complete

All user rebuilds have now completed, and all users should be able to log in again.

As mentioned below, there may still be some missing emails from less frequently visited folders. We’ve started work on a program to find these emails from the rescued existing email store and copy them across to the new store. I’ll update with more information when it’s done.

We’ll also start work on finding out why the replication created bad indexes, and work with the IMAP server people (cyrus) to fix this.
