After the server problem about 3 weeks ago (see previous posts on this blog for more details), we switched all users over to the replica of the server that failed. Unfortunately, at the time we found that, due to two separate bugs:
- The replica was a few days behind the master
- The data in the replica couldn’t be “accessed” without a reconstruct of the mailbox
Since that change over, we have:
- Fixed the original server (the problem turned out to be a motherboard problem in the end, not a RAID controller problem after all)
- Found all the messages missing from the replication process and copied them across
- Found and fixed the bug that caused the messages not to be “accessible” in the replica
- Restarted replication back to the original server and brought everything up to date again
- Monitored the replication for a couple of days and did folder-by-folder compares to check that everything had been replicated back and was being kept up to date
Since that was all good, we have just swapped replication back so that the old master is once again the real master server. This all went pretty smoothly, which is nice to see, and is what should have happened in the first place!
I will shortly be setting up processes so that we continuously monitor the replication engine and are notified if it ever bails out for any reason. Every week we will also do a complete folder-to-folder compare between the master and replica to look for any discrepancies (and hopefully there should be none).
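To give a rough idea of what a folder-to-folder compare involves, here is a minimal sketch in Python. It is not the script we actually run. It assumes each server's state has already been dumped into a mapping of folder name to the set of message identifiers in that folder; the names `compare_folders` and `FolderDiff` and that data layout are made up for illustration.

```python
from typing import Dict, NamedTuple, Set


class FolderDiff(NamedTuple):
    """Discrepancies for one folder (hypothetical structure)."""
    missing_on_replica: Set[str]  # on the master but not the replica
    extra_on_replica: Set[str]    # on the replica but not the master


def compare_folders(
    master: Dict[str, Set[str]],
    replica: Dict[str, Set[str]],
) -> Dict[str, FolderDiff]:
    """Return only the folders whose message sets differ.

    Assumption: both arguments map folder name -> set of message IDs.
    A folder that exists on one side only is treated as empty on the other.
    """
    diffs: Dict[str, FolderDiff] = {}
    for folder in sorted(set(master) | set(replica)):
        on_master = master.get(folder, set())
        on_replica = replica.get(folder, set())
        if on_master != on_replica:
            diffs[folder] = FolderDiff(
                missing_on_replica=on_master - on_replica,
                extra_on_replica=on_replica - on_master,
            )
    return diffs


if __name__ == "__main__":
    # Made-up example data: one message has not reached the replica.
    master = {"INBOX": {"m1", "m2", "m3"}, "Sent": {"m10"}}
    replica = {"INBOX": {"m1", "m2"}, "Sent": {"m10"}}
    for folder, diff in compare_folders(master, replica).items():
        print(folder, "missing:", sorted(diff.missing_on_replica),
              "extra:", sorted(diff.extra_on_replica))
```

An empty result means the two sides match, which is what the weekly check should report. Anything else would be the trigger for a notification.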