server3 status update

One of the filesystems on one of our servers (server3) failed last night due to a filesystem bug. After a bit of work we were able to bring up the users on other partitions on server3. However, mail delivery was suspended while we were doing this and so there was a period (most of the US day) where incoming mail was being delayed by around an hour. The queues have all emptied back out now, and delivery should be happening normally.

Incoming mail for users on the affected partition is still being queued, but these users are currently unable to log in via the web interface or IMAP.

We’re currently running a filesystem check on the affected disk. Unfortunately, there is no faster option to get that set of mailboxes back online. We apologise for the inconvenience and are working hard to restore service.

I’ll post more information here as it becomes available.

Update: 02:00 on the 2nd of September GMT

As of now, I estimate between 17 and 24 hours remain before the filesystem check is complete.

Update: 08:30, 2nd of September GMT

My latest estimate is somewhere between 10 and 14 hours remaining on the disk check.

server3 has been down for about 1/2 hour

Server3 died about 1/2 hour ago. I am looking at it now.

Bron.

UPDATE: it was a filesystem bug, most frustrating. I needed to reboot the server, which takes about 8 minutes, so probably another 20 minutes down altogether while I check and start everything.

UPDATE 2: it seems that the filesystem will need a full check, which will take a couple of hours.

Quick restart of all backend servers

All the “old style” servers, server1-server4, need a quick restart to change their config files. Downtime will be about 2 minutes on each.

Server2 being rebooted

server2 needed to be rebooted; it will be down for about 5 minutes.

edit: server2 is back up

Backend imap5/imap6 down

Two servers are currently down. We’re investigating.

Update: Servers should be back now. Performance may be a bit slow for the next 15 minutes as some queues clear.

Update 2: OK, the server wasn’t very happy about that, and we’ve had another 10 minutes of downtime. We’ve split over to two separate servers now, which should be better.

Slow server3 performance yesterday

Server3 seems to have been overloaded yesterday because some overnight processes did not complete in time, causing slow response times for a number of users. The issue should now be fixed.

Server3 being rebooted

Server3 needed to be rebooted suddenly. Estimated outage is 10 minutes.


edit: everything is running again now.

Problems with sieve rules and spam reporting for some users

It seems that while our switch to the replica server yesterday went well, there were two problems we hadn’t noticed:

  1. Spam reporting wasn’t working properly. It appears the replica server wasn’t correctly responding to spam reporting requests. This has now been fixed.
  2. For some users, their sieve scripts (filtering and forwarding rules) hadn’t been replicated, and thus no email was being filtered/forwarded. Unfortunately this affected a number of users that had recently had their accounts moved from another server. I’ve now fixed the rules scripts for all users, and will be emailing the affected users shortly. As an aside, anyone experiencing rules problems can always go to Options -> Define Rules and click Done. This will cause a user’s rules to be rebuilt from scratch and re-inserted into the server.

Our apologies to the inconvenienced users. Our replication system now has a regular test we run to check that everything is working correctly, and I’ll be adding tests shortly to check that sieve scripts are being replicated correctly so this doesn’t happen in the future.
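For the curious, a check that sieve scripts have been replicated could look roughly like the sketch below. It is only an illustration, not our actual test: the function names are made up, and it assumes the scripts from each server have already been read into a mapping of username to script text.

```python
import hashlib


def script_digest(script_text: str) -> str:
    """Return a SHA-256 hex digest of a sieve script's text."""
    return hashlib.sha256(script_text.encode("utf-8")).hexdigest()


def find_unreplicated(master: dict, replica: dict) -> list:
    """Return the users whose script is missing or different on the replica.

    Both arguments are hypothetical mappings of username -> script text,
    gathered from the master and the replica by some other means.
    """
    mismatched = []
    for user, script in master.items():
        replica_script = replica.get(user)
        # A missing script and a differing script are both replication failures.
        if replica_script is None or script_digest(replica_script) != script_digest(script):
            mismatched.append(user)
    return sorted(mismatched)
```

Any usernames it returns would be candidates for having their rules rebuilt and re-inserted, as described above.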

imap6 down

01:51 EST: One of the imap servers (imap6) is currently down. We’re in the process of rebooting it.

02:23 EST: The server didn’t want to come up properly, so we’ve switched to the replica, which should be fully up to date and running now.

ClamAV security update caused restart of some services

Some people may have seen a couple of error messages while an urgent security update was applied to one of our servers, but these would have lasted less than a minute.

The package in question was ClamAV, our antivirus package.

All updates have been done now.