Thursday, September 20, 2012

Automated MySQL Master Failover

After the GitHub MySQL failover incident, many people and blog posts have argued that fully automated failover might not be the best solution.

Fully automated failover is indeed dangerous and should be avoided if possible. But a completely manual failover is also dangerous. A fully automated, but manually triggered, failover is probably a better solution.
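
What "fully automated, but manually triggered" could look like in practice is sketched below: the script performs every promotion step itself, but does nothing until an operator runs it with --confirm. This is only a minimal sketch, assuming a single master/standby pair, Python with the PyMySQL driver, and made-up hostnames and credentials; a real setup also has to repoint the application (or a VIP) and deal with the old master.

    import argparse
    import sys

    import pymysql  # assumption: the PyMySQL driver is installed


    def promote_standby(standby_host, user, password, max_lag=5):
        """Run all promotion steps automatically; the only decision left to a human is to start."""
        conn = pymysql.connect(host=standby_host, user=user, password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cur:
            # 1. Refuse to promote a standby that is not replicating or is lagging.
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None:
                sys.exit("no replication running on %s, aborting" % standby_host)
            lag = status["Seconds_Behind_Master"]
            if lag is None or lag > max_lag:
                sys.exit("standby lag is %s seconds, aborting" % lag)

            # 2. Stop and discard replication, then open the standby for writes.
            cur.execute("STOP SLAVE")
            cur.execute("RESET SLAVE ALL")
            cur.execute("SET GLOBAL read_only = 0")
        conn.close()
        print("promoted %s; now repoint the application / VIP" % standby_host)


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="automated failover, human triggered")
        parser.add_argument("--standby", required=True)
        parser.add_argument("--user", default="failover")
        parser.add_argument("--password", required=True)
        parser.add_argument("--confirm", action="store_true",
                            help="the human trigger: nothing happens without it")
        args = parser.parse_args()
        if not args.confirm:
            sys.exit("dry run only; re-run with --confirm to actually fail over")
        promote_standby(args.standby, args.user, args.password)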

Synchronous replication is not a complete solution either. A split-brain situation is a good example of a failure that could still happen. Of course, most clusters have all kinds of safeguards to prevent that, but unfortunately safeguards can fail too.

Every failover/cluster should be considered broken unless:
  1. You've tested the failover scripts and procedures
  2. You've tested the failover scripts and procedures under normal load
  3. You've tested the failover scripts and procedures under high load
  4. You've tested it since the last change in the setup and/or application
  5. Someone else tested the failover scripts and procedures
  6. Everyone working with the cluster/failover setup is properly trained
Just like they do on the MythBusters show: first verify whether it's true, and then do everything you can to make it go BOOM!
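
Making it go BOOM can be rehearsed with something as small as the sketch below: hard-stop MySQL on the master and measure how long it takes until writes are accepted again on the standby (or, better, on whatever address the application actually uses). Everything here is an assumption about the setup: the hostnames, the credentials, the drill table, and that the master can be stopped over SSH with "sudo service mysql stop".

    import subprocess
    import time

    import pymysql  # assumption: the PyMySQL driver is installed

    MASTER = "db-master.example.com"     # hypothetical hosts
    STANDBY = "db-standby.example.com"
    CREDS = {"user": "drill", "password": "secret", "database": "test"}


    def writable(host):
        """Return True as soon as the host accepts a trivial write."""
        try:
            conn = pymysql.connect(host=host, connect_timeout=2, **CREDS)
            with conn.cursor() as cur:
                cur.execute("CREATE TABLE IF NOT EXISTS drill (ts DATETIME)")
                cur.execute("INSERT INTO drill VALUES (NOW())")
            conn.commit()
            conn.close()
            return True
        except pymysql.MySQLError:
            return False


    # Make it go BOOM: hard-stop MySQL on the master.
    subprocess.call(["ssh", MASTER, "sudo service mysql stop"])
    start = time.time()

    # Measure how long the failover (automated or manual) takes to restore writes.
    while not writable(STANDBY):
        time.sleep(1)
    print("writes restored on %s after %.0f seconds" % (STANDBY, time.time() - start))

Run it once with no load, once under normal load, and once under high load, and compare the numbers; the difference is usually educational.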

And a failover/cluster solution is neither a backup nor a complete disaster recovery solution.

Just to name a few of the failures I've seen with HA setups:
  • Master and Standby both failing with a full disk because of binlogs (see the sketch after this list)
  • Network hiccups causing failovers to a standby with a cold cache
  • Failovers triggered by human mistakes (quite often)
  • Failovers because of bugs in the cluster software
  • Failovers due to someone removing the table that was monitored by the load balancer
  • Failovers due to a runaway monitoring check from the load balancers
  • Failovers due to failover software on a lower level making 'smart' decisions (Veritas Cluster Server on VMware ESX with VMware HA)
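
For the first item, a check along the lines of the sketch below would have warned early on both machines: it compares the total size of the binary logs (from SHOW BINARY LOGS) with the free space left on the filesystem holding them. The datadir path, credentials, and the idea of just printing a warning are all assumptions; hook it into whatever monitoring you already have.

    import os

    import pymysql  # assumption: the PyMySQL driver is installed

    DATADIR = "/var/lib/mysql"   # hypothetical location of the binlogs


    def binlog_bytes(conn):
        """Total size of all binary logs the server still keeps around."""
        with conn.cursor() as cur:
            cur.execute("SHOW BINARY LOGS")
            return sum(row[1] for row in cur.fetchall())  # rows are (Log_name, File_size, ...)


    conn = pymysql.connect(host="localhost", user="monitor", password="secret")
    used = binlog_bytes(conn)
    stat = os.statvfs(DATADIR)
    free = stat.f_bavail * stat.f_frsize

    if used > free:
        # Run the same check on the standby: it keeps its own binlogs (and relay logs) too.
        print("WARNING: binlogs use %d MB but only %d MB is free; purge or expire them"
              % (used // 2 ** 20, free // 2 ** 20))

Setting expire_logs_days (or purging with PURGE BINARY LOGS) keeps the binlogs from growing without bound, but only if someone actually sets it on both the master and the standby.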