mail flood causing some delayed emails

We were just hit by a mail flood, which is delaying some emails by up to half an hour while the mail servers clear their backlog of messages. The flood has stopped now, and the queues are clearing.
Bron.

short outages for some users

Apologies for the couple of short outages we’ve just had for some users. The monitoring system got a little over-eager: it noticed errors in the new stores I’m setting up (basically, our internal test users weren’t in place yet) and decided to restart services on an entire server.

imap5 restart

I need to restart imap5 briefly to repartition parts of the drive for slots. This will mean a downtime of about 20 minutes for users on imap5m.internal.

edit: make time to upgrade the kernel while I’m here – seems it’s running an older kernel than all our other imap hosts thanks to being the oldest of the new bunch!

update a while later: it all came back up fine, though I wound up transferring the master store over to the replica to balance the load out and shorten the outage.

Bron.

server4 short downtime

There was a short outage due to a mistake I made while moving users to our new replicated systems.

All running again now.

Bron.

Server3 problem – summary

Now that we’ve restored service to all the users affected by the disk failure which happened last Thursday, we’ve got time to talk in more detail about what happened.

Apology

First of all we’d like to apologize for the interruption to service. We understand the inconvenience that this caused, and we’re continuing to improve our infrastructure to prevent this from happening again.

A summary of the problem and its resolution

The problem began on Thursday, when a partition on server3 started giving errors.

On finding that the affected partition failed to mount, we needed to run a filesystem check (‘fsck’). This would hopefully fix up any errors in the filesystem, and we’d then be able to continue to use that partition. In any case, we would certainly be unable to use the partition until the fsck was completed, so we started the fsck.
We set the status of the affected users to a special value (“Being moved”) in order to stop their mailboxes being used, and then changed the configuration on the server to allow normal access for all the other users on server3. Once this was done, delivery of messages was able to proceed again.

When it became apparent that the fsck would take a long time (it ended up taking two and a half days), with no guarantee of fixing the problem, we decided to restore our latest backups of the affected users onto some spare disk space. So, in the worst case situation where the fsck failed to make the drive readable, we would have a significant head start in getting email up and running again. This wasn’t our first preference, though, because it would mean that any emails received since the latest backups would be lost.

Meanwhile we were monitoring the other servers, and making sure that all the messages banking up didn’t overwhelm our incoming mail servers. The best strategy turned out to be to move messages which were addressed to the affected users to a ‘hold’ queue, where they wouldn’t time out and generate bounce messages to the sender.

The fsck finally completed, and told us that there were no problems with the file system, and everything was clean. However, whenever we tried to mount the infamous partition, the filesystem driver in the kernel would give an error. Curses! After some investigation we found that we were able to mount the partition read-only, so at least we could read the most up to date version of the mailboxes.

So the plan then became to synchronise the already restored backups with the read-only partition. This was relatively fast because the lion’s share of the data was already in place; only modifications made since the last daily backup needed to be copied.
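To illustrate why having the backups already in place makes the synchronisation fast, here is a minimal Python sketch. It is not the tooling that was actually used: the function names, the idea of two plain directory trees with matching relative paths, and the size/modification-time comparison are all assumptions made for the example.

```python
import os
import shutil


def files_needing_copy(source_root, restored_root):
    """Yield relative paths that are missing or stale under restored_root.

    Hypothetical helper: 'stale' here means a different size or an older
    modification time than the copy on the read-only source.
    """
    for dirpath, _dirnames, filenames in os.walk(source_root):
        for name in filenames:
            src = os.path.join(dirpath, name)
            rel = os.path.relpath(src, source_root)
            dst = os.path.join(restored_root, rel)
            if not os.path.exists(dst):
                yield rel
                continue
            s, d = os.stat(src), os.stat(dst)
            if s.st_size != d.st_size or int(s.st_mtime) > int(d.st_mtime):
                yield rel


def sync(source_root, restored_root):
    """Copy only the changed files; return the relative paths copied."""
    copied = []
    for rel in files_needing_copy(source_root, restored_root):
        dst = os.path.join(restored_root, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        # copy2 keeps the modification time, so a second run finds nothing to do
        shutil.copy2(os.path.join(source_root, rel), dst)
        copied.append(rel)
    return copied
```

The sketch only handles new and changed files. It ignores anything deleted since the backup, which a real mailbox synchronisation would also have to deal with.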

We started this process, and set it up so that each user would become active as soon as that user was synchronised. We prioritised the users so that the users who were currently trying to log in and access their mail would be restored before users who were making no attempts to log in.

Once this process had begun, affected users began to have service restored to them. So the first users suffered an outage of a bit under 3 days.

After we verified that this was working, we began to release queued mail from the ‘hold’ queue on our incoming mail servers, where the destination user was now active.
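The hold-and-release idea can be modelled in a few lines of Python. This is a toy sketch only: the `IncomingQueue` class and its method names are invented for illustration and do not correspond to our mail server software or its configuration.

```python
from collections import deque


class IncomingQueue:
    """Toy model of an incoming mail queue with a separate 'hold' area."""

    def __init__(self, held_users):
        self.held_users = set(held_users)  # users whose mailboxes are unavailable
        self.active = deque()              # messages eligible for delivery attempts
        self.hold = deque()                # parked messages: no retries, no bounces

    def accept(self, recipient, message):
        """Park mail for held users; queue everything else for delivery."""
        target = self.hold if recipient in self.held_users else self.active
        target.append((recipient, message))

    def release(self, recipient):
        """Mark a user as restored and move their held mail back to active.

        Returns the number of messages released.
        """
        self.held_users.discard(recipient)
        still_held = deque()
        released = 0
        for rcpt, msg in self.hold:
            if rcpt == recipient:
                self.active.append((rcpt, msg))
                released += 1
            else:
                still_held.append((rcpt, msg))
        self.hold = still_held
        return released
```

In this model, calling `release("alice")` once alice's mailbox is active again moves only her messages back into the delivery queue, and mail for users who are still being restored stays parked.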

At a bit over 5 days since the original problem, all users were online again, apart from a handful (fewer than 10) that required manual intervention. All users are now online and working again.

Reliability and the future

We take reliability seriously, and we are taking steps to prevent something like this happening again.

We have been engaged for some time in a programme to move our users onto servers which are replicated in real time, so that in the event of a problem, we can just switch over to using the replica server with no loss of data, and only a very short interruption to service. Unfortunately, it’s not as easy as buying twice as many servers and announcing “We’re replicated”. There have been a number of software and hardware problems, and we obviously don’t want to move users onto a system which is less reliable than the current setup.

Some users were already on replicated servers, and we were in the process of setting up and testing another pair of servers when this problem happened. The quickest way to get the affected users running again was to restore them onto the new pair of servers. This forced us to spend a lot of time (while the fsck was running) making sure the new pair was configured, stable and ready to go. When users were restored, they were moved to these new servers.

So the upshot is that the restored users are now enjoying real time replication. We have more servers on order, and are waiting for them to arrive so that we can move all of our users to this replicated setup.

If you’d like some more technical details of the problem and its resolution, please see this forum post.

Final user copied from server3’s failed partition.

The final user has been copied off server3, some 20 hours after the second last user.  I’ve spent the day trying to work out why their move failed.
The answer – a single folder with over 180,000 messages in it.  The copy process was timing out and the server dying.  I had to copy the rest of the user’s mailboxes and meta files across manually then hand edit the mailboxes database so it all matched up.  Finally it passed the replication check and I could enable the user.

Apologies for the lack of updates through this site – at the time I felt there was nothing that had changed and nothing I could add that would make it any better for the people still in the queue. I’m glad it’s over.  A more detailed post about what happened will be made soon, detailing the steps we took and the reasons why we did things the way we did.

Bron.

restores progressing, mail being delivered as users are restored

We’ve got everything pretty much automated and are going to sleep so we’re in a sane space tomorrow to finish up.  We’ll have Jeremy back then too, he’s just got off a flight.

Bron.

Server3 refugees – IMAP should now work

The users who had been moved from server3 could access their email via the web, but not via IMAP after they were recovered onto the new servers. This was due to our IMAP proxy machines not knowing about the new IMAP server addresses in their proxy map. I’ve fixed that now.

Adding yourself to the list of accounts to be restored:

You can now log in to the web interface without needing to use /beta/ – just go to https://www.fastmail.fm/ and log in with your usual username (or full email) and password.
The list is in strict first-come-first-served order, and people who try to log in are automatically pushed above other users in the queue. The only restriction is that users are replicated randomly to one of 8 smaller “virtual servers”, and we can only copy one user at a time to each of these because of locking. So you may discover that a later request actually gets processed first, because there were fewer requests for users who have been allocated to that server.

Server allocations were made when I restored the backups yesterday in preparation for this eventuality, which is why they are pre-determined before you get on this recovery list.
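For the curious, the queueing behaviour described above can be sketched in Python. This is an illustration under assumptions, not the real MoveServer code: the `RestoreQueue` class, its method names and the example server names are all made up.

```python
import heapq
import itertools

REQUESTED, BACKGROUND = 0, 1  # lower number is served first


class RestoreQueue:
    """Toy scheduler: one queue per virtual server, one copy at a time each."""

    def __init__(self, allocation):
        self.allocation = dict(allocation)  # user -> server, fixed in advance
        self.counter = itertools.count()    # arrival order, used as tie-breaker
        self.queues = {}                    # server -> heap of (priority, seq, user)
        self.entry = {}                     # user -> that user's live heap entry
        self.done = set()                   # users whose copy has started
        self.busy = set()                   # servers with a copy in progress

    def _push(self, user, priority):
        item = (priority, next(self.counter), user)
        self.entry[user] = item             # any older entry is now stale
        heapq.heappush(self.queues.setdefault(self.allocation[user], []), item)

    def enqueue_background(self, user):
        if user not in self.done and user not in self.entry:
            self._push(user, BACKGROUND)

    def login_attempt(self, user):
        """A login attempt bumps the user above all background work."""
        if user in self.done:
            return
        current = self.entry.get(user)
        if current is None or current[0] == BACKGROUND:
            self._push(user, REQUESTED)

    def next_for(self, server):
        """Start the next copy on a server, or return None if busy or empty."""
        if server in self.busy:
            return None
        heap = self.queues.get(server, [])
        while heap:
            item = heapq.heappop(heap)
            user = item[2]
            if self.entry.get(user) == item:  # skip stale entries
                del self.entry[user]
                self.done.add(user)
                self.busy.add(server)
                return user
        return None

    def finish(self, server):
        self.busy.discard(server)
```

With users "bob" and "dan" allocated to one server and "cat" to another, a login by "dan" followed by a login by "cat" can still see "cat" copied first: her server is idle, while dan's is busy copying "bob".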

Bron.

Server3 status update

The filesystem check finished and said there were no errors, but it still wouldn’t mount. Thanks heaps filesystem developers. I managed to convince it to mount it read-only so I can copy files off. I’ve already restored backups on to two other servers as “plan B”, and have now started using the MoveServer infrastructure to copy users across to those machines (the backups already being there makes it significantly faster).

I’m not going to attempt to estimate a timeframe at this point, because I don’t have enough data yet – and I know how unpopular I was last time I tried to make estimates with insufficient data. One good thing about this process: it should be possible to prioritise certain accounts. I’ll work with Richard to make this really easy somehow (as in automated: press a button on the web interface to get put in the priority queue).

Bron.

Ok, any login attempts on the Beta site now cause a priority request. I’m about to add the same feature to the main website as soon as I’m happy it’s working correctly. I’ll also keep working through the other accounts in the background.

servers 1, 2 and 4 all being restarted

I’ve set all the “old” servers up with the new kernel version that’s already on server3. This contains some fixes to the filesystem code that should avoid the problem that caused the server3 outage (journal transaction overflow on massive delete).

Expected downtime is 5 minutes.

Update: all servers were back up in about 5 minutes, but I’ve spent a while making sure all the checks work well before coming back to update this.
