Various outages today

Unfortunately, as part of our server and user migration over the last 4 hours, we have had a couple of short and long outages to various services. The most notable is that for the last 2 hours bandwidth usage and the login log have been unavailable. These have now been restored; however, approximately 2 hours of login log data is unfortunately missing. Failed logins were still correctly recorded as occurring, but the details of successful and failed logins for those 2 hours are not available in the logs.

We apologise for any inconvenience caused by these outages. We are working hard to set up our new replicated infrastructure and have been hitting some roadblocks along the way that we hadn’t anticipated. We’ve been trying to do any work that might cause problems at off-peak times to reduce the number of people affected, but obviously many people are still using the service at all times.

Backend servers down

Two separate backend servers are currently down. We’re investigating why.

Update: One server is back online; we’re just waiting for the other to come back.

Update: The second server is not responding to any soft restart attempts, so we’re going to have to hard reset it. It should be back in about 10 minutes, hopefully.

Update: The second server is back up. We’ve temporarily disabled IMAP logins while some of the mail queue flushes. You can access your email via the web interface. We’ll re-enable IMAP shortly.

Update: IMAP logins have been re-enabled. All services should be normal again.

server2 outage

There’s currently a problem with server2. We’re restarting it now… hopefully it should be back in a few minutes.

Update: all should be OK again now.

Outage for 30 minutes for some users

One of our new frontend replicated servers crashed, but didn’t give up its IP to the other server. This meant that 50% of users (depending randomly on which IP you had been given) would have had trouble connecting to web, IMAP, POP and SMTP for the last 30 minutes.

We’re trying to find out why the automatic failover didn’t kick in, and also update our warning systems so that we’re paged faster if a problem like this with the new replicated servers occurs again.

Server1 short outage

Server1 will be offline for about 5 minutes while a software upgrade is done.

Sessions lost

A change in the internal format of the session file meant that everyone’s current sessions were lost. This only affected people using the web interface about 10 minutes ago (sorry, had to fix it before posting!) and is a one-off. We deleted all the sessions and restarted everything.

You may have noticed an approx 2 minute outage of the web interface as well as needing to log in again after the outage, even if you had already logged in recently.

Bron.

One backend server (server3) responding slowly, rebooting

One of our backend servers appears to be responding very slowly. We’re going to reboot it.

Update: The server has been rebooted, but is still acting quite slow. We’re investigating further.

Update: We’re still not sure what is causing this server to become extremely slow. We’re working on trying some solutions.

Update: We’ve resolved the main performance problem now. There’s still some email left to flush from the mail queue into mailboxes, but that should be completed shortly.

Some services down

Web or IMAP services may currently be down for some users. It appears one of the frontend servers has frozen in a way that isn’t allowing the other frontend to take over its services. We’re investigating.

Update: All services back to normal again.

All services down

Due to confusion between our two heartbeat servers about which should be running services, they’ve both dropped all services. I’m working on getting them back again!

Bron.

edit: everything’s back again, and I’m doing a lot of log file reading to see why it didn’t work!

Restarts on all backend servers again

Finally removing the split-brain configuration! We’re back to one single repository again.

Bron.