Status (by brong at Wed Apr 20 07:49 UTC)
We just had a short outage due to our internal DNS servers dropping out temporarily. Everything is back to normal now.
Status (by brong at Fri Apr 15 14:34 UTC)
We’re experiencing much higher than normal incoming email volume at the moment, which is delaying message delivery. It’s currently taking about 15 minutes from receiving an email until delivery, where it’s usually under 5 seconds. We are investigating to see if there are any abuse patterns in the incoming email.
Update (by brong at Fri Apr 15 15:09 UTC)
The backlog is cleared – it looks like a large spam run, but it’s being blocked now.
Message delivery times are under 10 seconds again.
Status (by robm at Thu Apr 14 00:45 UTC)
An upgrade to one of our servers caused a problem with non-English characters when sending SMS messages.
Affected users will have received an "Invalid message content/user data" error message.
This has now been fixed.
Status (by brong at Tue Apr 12 07:53 UTC)
We have a short outage on one IMAP server; estimated downtime is 5 minutes.
Update (by brong at Tue Apr 12 08:08 UTC)
Everything is running normally again.
Status (by brong at Mon Apr 4 20:35 UTC)
We’re having a short outage due to an overload on one server. Fixing it now.
Update (by brong at Mon Apr 4 21:00 UTC)
The outage is affecting approximately 10% of users and is estimated to be resolved in about 15 minutes.
Update (by brong at Wed Apr 6 05:11 UTC)
I should complete the story on this one…
The initial problem was caused by somebody uploading a file via IMAP which had been re-encoded strangely (MIME encoding in the Received headers) – this confused things due to a bug in our mail server.
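For readers unfamiliar with the term, "MIME encoding in the Received headers" most likely means RFC 2047 encoded-words (the `=?charset?encoding?text?=` form) appearing in a header that normally holds plain ASCII. The sketch below is only an illustration of how such headers could be spotted in a message. It is not our mail server's code, and the function name, the sample message, and the host names in it are all made up for the example.

```python
import re
from email import message_from_string

# RFC 2047 encoded-word shape: =?charset?B-or-Q?encoded-text?=
ENCODED_WORD = re.compile(r"=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=")


def received_headers_with_encoded_words(raw_message):
    """Return the Received header values that contain an encoded-word.

    Hypothetical helper for illustration only.
    """
    msg = message_from_string(raw_message)
    received = msg.get_all("Received", [])
    return [value for value in received if ENCODED_WORD.search(value)]


# Made-up sample message: the first Received header has been re-encoded,
# the second is an ordinary one.
SAMPLE = (
    "Received: from =?utf-8?q?mail=2Eexample=2Ecom?= by mx.example.net;"
    " Sat, 1 Jan 2000 00:00:00 +0000\n"
    "Received: from relay.example.org by mx.example.net;"
    " Sat, 1 Jan 2000 00:00:00 +0000\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

suspicious = received_headers_with_encoded_words(SAMPLE)
for value in suspicious:
    print("Encoded-word found in Received header:", value)
```

A check like this only flags the unusual encoding. What a server should do with such a message (accept it, normalise it, or reject it) is a separate decision.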
In the process of fixing the first problem, many cache files were corrupted on the same server due to another bug. It’s sounding like there are a pile of bugs, but they’re all in quite rarely used recovery paths except the first one… anyway.
Fixing this caused very high IO load – and meanwhile we were returning incorrect results to the web servers, causing THEM to lock up. So I shut down the server with the corrupt caches on it so that I could rebuild them. This was the actual visible outage – first to everyone while the web servers were locked, then just to the users on that one server.
Finally, while we were doing this, the "reconstruct" command was also causing more replication traffic to the other servers, because as soon as it’s finished it replicates the mailbox to make sure the other end is in sync. One of our other servers chose this moment to lock up due to excessive IO load.
So it was multiple cascading failures. We have two brand new IMAP servers arriving today, which will help spread the load a bit more – absorbing all of operamail.com did add some overhead. The bugs identified have been fixed.
Apologies for the lack of an update at the end of the issues. It was 2:30am for me by then, and once things were stable I collapsed!
Bron.