Thursday, September 20, 2012

Automated MySQL Master Failover

After the GitHub MySQL failover incident, many people and blog posts have argued that fully automated failover might not be the best solution.

Fully automated failover is indeed dangerous and should be avoided if possible. But a completely manual failover is also dangerous. A fully automated, but manually triggered, failover is probably a better solution.
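
What "fully automated, but manually triggered" could look like in practice is sketched below: the script performs every promotion step itself, but does nothing until an operator runs it with --confirm. This is only a minimal sketch, assuming a single master/standby pair, Python with the PyMySQL driver, and made-up hostnames and credentials; a real setup also has to repoint the application (or a VIP) and deal with the old master.

    import argparse
    import sys

    import pymysql  # assumption: the PyMySQL driver is installed


    def promote_standby(standby_host, user, password, max_lag=5):
        """Run all promotion steps automatically; the only decision left to a human is to start."""
        conn = pymysql.connect(host=standby_host, user=user, password=password,
                               cursorclass=pymysql.cursors.DictCursor)
        with conn.cursor() as cur:
            # 1. Refuse to promote a standby that is not replicating or is lagging.
            cur.execute("SHOW SLAVE STATUS")
            status = cur.fetchone()
            if status is None:
                sys.exit("no replication running on %s, aborting" % standby_host)
            lag = status["Seconds_Behind_Master"]
            if lag is None or lag > max_lag:
                sys.exit("standby lag is %s seconds, aborting" % lag)

            # 2. Stop and discard replication, then open the standby for writes.
            cur.execute("STOP SLAVE")
            cur.execute("RESET SLAVE ALL")
            cur.execute("SET GLOBAL read_only = 0")
        conn.close()
        print("promoted %s; now repoint the application / VIP" % standby_host)


    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="automated failover, human triggered")
        parser.add_argument("--standby", required=True)
        parser.add_argument("--user", default="failover")
        parser.add_argument("--password", required=True)
        parser.add_argument("--confirm", action="store_true",
                            help="the human trigger: nothing happens without it")
        args = parser.parse_args()
        if not args.confirm:
            sys.exit("dry run only; re-run with --confirm to actually fail over")
        promote_standby(args.standby, args.user, args.password)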

Synchronous replication is not a complete solution either. A split-brain situation is a good example of a failure that could still happen. Of course, most clusters have all kinds of safeguards to prevent that, but unfortunately safeguards can fail too.

Every failover/cluster should be considered broken unless:
  1. You've tested the failover scripts and procedures
  2. You've tested the failover scripts and procedures under normal load
  3. You've tested the failover scripts and procedures under high load
  4. You've tested it since the last change in the setup and/or application
  5. Someone else tested the failover scripts and procedures
  6. Everyone working with the cluster/failover setup is properly trained
Just like they do on the MythBusters show: first verify whether it's true, and then do everything you can to make it go BOOM!
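
Making it go BOOM can be rehearsed with something as small as the sketch below: hard-stop MySQL on the master and measure how long it takes until writes are accepted again on the standby (or, better, on whatever address the application actually uses). Everything here is an assumption about the setup: the hostnames, the credentials, the drill table, and that the master can be stopped over SSH with "sudo service mysql stop".

    import subprocess
    import time

    import pymysql  # assumption: the PyMySQL driver is installed

    MASTER = "db-master.example.com"     # hypothetical hosts
    STANDBY = "db-standby.example.com"
    CREDS = {"user": "drill", "password": "secret", "database": "test"}


    def writable(host):
        """Return True as soon as the host accepts a trivial write."""
        try:
            conn = pymysql.connect(host=host, connect_timeout=2, **CREDS)
            with conn.cursor() as cur:
                cur.execute("CREATE TABLE IF NOT EXISTS drill (ts DATETIME)")
                cur.execute("INSERT INTO drill VALUES (NOW())")
            conn.commit()
            conn.close()
            return True
        except pymysql.MySQLError:
            return False


    # Make it go BOOM: hard-stop MySQL on the master.
    subprocess.call(["ssh", MASTER, "sudo service mysql stop"])
    start = time.time()

    # Measure how long the failover (automated or manual) takes to restore writes.
    while not writable(STANDBY):
        time.sleep(1)
    print("writes restored on %s after %.0f seconds" % (STANDBY, time.time() - start))

Run it once with no load, once under normal load, and once under high load, and compare the numbers; the difference is usually educational.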

And a failover/cluster solution is neither a backup nor a complete disaster recovery solution.

Just to name a few of the failures I've seen with HA setups:
  • Master and Standby both failing with a full disk because of binlogs (see the sketch after this list)
  • Network hiccups causing failovers to a standby with a cold cache
  • Failovers triggered by human mistakes (quite often)
  • Failovers because of bugs in the cluster software
  • Failovers due to someone removing the table that was monitored by the load balancer
  • Failovers due to a runaway monitoring check from the load balancers
  • Failovers due to failover software on a lower level making 'smart' decisions (Veritas Cluster Server on VMware ESX with VMware HA)
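
For the first item, a check along the lines of the sketch below would have warned early on both machines: it compares the total size of the binary logs (from SHOW BINARY LOGS) with the free space left on the filesystem holding them. The datadir path, credentials, and the idea of just printing a warning are all assumptions; hook it into whatever monitoring you already have.

    import os

    import pymysql  # assumption: the PyMySQL driver is installed

    DATADIR = "/var/lib/mysql"   # hypothetical location of the binlogs


    def binlog_bytes(conn):
        """Total size of all binary logs the server still keeps around."""
        with conn.cursor() as cur:
            cur.execute("SHOW BINARY LOGS")
            return sum(row[1] for row in cur.fetchall())  # rows are (Log_name, File_size, ...)


    conn = pymysql.connect(host="localhost", user="monitor", password="secret")
    used = binlog_bytes(conn)
    stat = os.statvfs(DATADIR)
    free = stat.f_bavail * stat.f_frsize

    if used > free:
        # Run the same check on the standby: it keeps its own binlogs (and relay logs) too.
        print("WARNING: binlogs use %d MB but only %d MB is free; purge or expire them"
              % (used // 2 ** 20, free // 2 ** 20))

Setting expire_logs_days (or purging with PURGE BINARY LOGS) keeps the binlogs from growing without bound, but only if someone actually sets it on both the master and the standby.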