On the night of August the 19th, FastMail.FM went down due to corruption of the main database. The primary backup of the main database is a realtime replica, which can provide near-instantaneous recovery from errors caused by hardware failure and from many types of software failure. Unfortunately, in this case the corruption was also replicated to the primary backup.
As a result, we had to recover from the secondary backup, which turned out to be a very lengthy process, causing FastMail.FM to be down all night. We also found that the secondary backup had consistency problems in the addresses and billing tables. This meant that we had to use an outdated addresses table until the consistency problems were resolved, so addresses that you had recently added to your address book were not available for a couple of days. Furthermore, all billing-related services (such as renewals, upgrades, and signups) were down for this time. Because so much mail was queued up on our secondary US and European mail servers overnight, the mail queue took much of the day to clear, resulting in mail delivery delays.
All of these problems are now fully resolved, with the exception that some groups stored in some users’ address books are missing some addresses. We are currently in the process of e-mailing the small number of users affected by this.
After four years of operation with no extended, unscheduled outages of this type, this week’s power outage and database corruption problems are most disappointing. We understand the inconvenience that this caused many of you, and we are doing everything that we can to ensure that it does not happen again. We have already instituted a new secondary backup policy for the database: we take a complete backup every night and store it on our backup server in Texas, which will result in much simpler and faster restores than using incremental backups to tape. If we ever have to restore from secondary backup again, we would not expect it to take more than one hour.