IMAP server ‘imap5’ is rebooting

I’ve just had to reboot the server ‘imap5’ for an urgent firmware upgrade to the disk controller. It will take approximately 5 minutes.

Bron.

Outbound mail delayed

Some outbound mail has been delayed by a flood of messages to our outbound server, which caused it to process mail more slowly than normal. The cause of the flood has been stopped, and messages are still clearing the server. It appears some messages could have been delayed for a few hours.

We expect all queued emails to be sent within the next half hour, and all new emails to be sent immediately after that.

All services down

Currently it appears that all services are down. It appears the main database server has crashed. We’re investigating to find out why and to bring the machine back up.

Update: The database server was rebooted and is now running normally again.

Current outage

We just had a problem with one of our IMAP servers. It caused things to slowly pile up, such that eventually we ran out of database connections, which caused this outage. There are a number of options that we’re pursuing to stop this happening again.
Service should be restored to all users now, but we’re currently throttling incoming mail delivery to make sure we don’t run out of connections again, so you may experience a delay in mail delivery to your account. The queues should be empty (and thus the delay removed again) in an hour or so.

Update: All queued email has now been delivered, and new email is being delivered normally.

All users moved

All active users have now been moved off the damaged drive.

User moves update

All Full and Enhanced users have now been moved off the affected partition to one of the new replicated servers and should be working fine.

Moves of Guest and Member users are currently in progress and about 50% done. At the current rate, they will be completed in about 6 hours.

User moves in progress

The moving of accounts has been going well. We’re doing Full and Enhanced users first, and then will move on to Guest and Member users. As each user is moved to the new server, the account is then immediately available, so if it’s urgent, please check your email each hour or so to see if your account has been moved and re-activated. It will probably take another 12 hours to complete all Full/Enhanced accounts, and probably about 12 hours after that to complete all the other accounts.

It’s extremely annoying that this corruption occurred, and IBM haven’t been able to shed much light as yet on why it occurred. The good news is that the move to the new cabinets was so that we could move to our new mirrored and replicated setup. We now have new servers set up so that every email account will be replicated between two separate machines. Over the next couple of weeks we’ll be moving all users to these new machines.

server2 back up

server2 is now up and running again. A quick check shows that 80% of server2 users are not on the affected partition. Those users can now access their email and queued email is being delivered. I will begin moving the remaining 20% of users to our new replicated servers ASAP. After the move starts, I’ll be able to estimate how long it will take.

server2 status

We’ve now talked to IBM support and have tracked down the problem. It appears one of the RAID arrays on the machine has some “bad stripes” on it. They’re unsure of exactly why it would have happened. This means that part of the data on one array has been corrupted. Fortunately this machine has 3 arrays, and the majority of users are on the other 2 RAID arrays, and those are fine. Unfortunately the operating system files are on the corrupted array, so the machine wouldn’t boot.

We’re now trying to install an OS image on the partitions that are still OK so we can boot the machine. Once this is done, we’ll get the 2 working arrays online ASAP and move the users from the 3rd array onto our new replicated servers, which was the entire point of the move in the first place! For any corrupted data, we’ll restore from our nightly incremental backups.

server2 will be down for a while

One of the IMAP servers (server2) has some corruption on its main partition, which is stopping it from booting properly. We’re currently trying to fix this. It may take a couple of hours. All email for users on this server is being queued and will be delivered when the server has been restored.

All other IMAP servers are up and running.
