Fully automated failover is indeed dangerous and should be avoided if possible. But a completely manual failover is also dangerous. A fully automated, manually triggered failover is probably a better solution.
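To make that concrete, here is a minimal sketch of what "fully automated, manually triggered" can look like. Everything in it is a placeholder: the host names are made up and the helper functions only print what a real implementation would do. The point is the shape: a human gives one explicit confirmation, and every step after that is scripted, so it can be tested in advance instead of typed by hand under pressure.

```python
#!/usr/bin/env python3
"""Sketch of a fully automated, manually triggered failover (placeholders only)."""
import sys

MASTER = "db1.example.com"    # assumed old master
STANDBY = "db2.example.com"   # assumed standby to promote


def check_replication_lag(host: str) -> int:
    """Return replication lag in seconds; hook up to your monitoring here."""
    print(f"[dry-run] checking replication lag on {host}")
    return 0


def promote_standby(host: str) -> None:
    """Promote the standby to master; hook up to your failover tooling here."""
    print(f"[dry-run] would promote {host} to master")


def repoint_clients(new_master: str) -> None:
    """Point the load balancer / proxy at the new master."""
    print(f"[dry-run] would repoint clients to {new_master}")


def main() -> None:
    # The only manual part: an explicit, informed confirmation.
    answer = input(f"Promote {STANDBY} and demote {MASTER}? Type 'yes' to continue: ")
    if answer.strip().lower() != "yes":
        print("Aborted, nothing changed.")
        sys.exit(1)

    # From here on everything is automated, and therefore testable.
    lag = check_replication_lag(STANDBY)
    if lag > 5:
        sys.exit(f"Standby is {lag}s behind, refusing to fail over.")
    promote_standby(STANDBY)
    repoint_clients(STANDBY)
    print(f"Failover to {STANDBY} completed.")


if __name__ == "__main__":
    main()
```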
A synchronous replication solution is not a complete solution either. A split-brain situation is a good example of a failure that can still happen. Of course, most clusters have all kinds of safeguards to prevent that, but unfortunately safeguards can also fail.
Every failover/cluster setup should be considered broken unless:
- You've tested the failover scripts and procedures
- You've tested the failover scripts and procedures under normal load
- You've tested the failover scripts and procedures under high load
- You've tested it since the last change in the setup and/or application
- Someone else tested the failover scripts and procedures
- Everyone working with the cluster/failover setup is properly trained
And a failover/cluster solution is neither a backup nor a complete disaster recovery solution.
Just to name a few of the failures I've seen with HA setups:
- Master and Standby both failing with a full disk because of binlogs
- Network hiccups causing failovers to a standby with a cold cache
- Failovers triggered by human mistakes (quite often)
- Failovers because of bugs in the cluster software
- Failovers due to someone removing the table that was monitored by the loadbalancer (see the health-check sketch after this list)
- Failovers due to a runaway monitoring check from the loadbalancers
- Failovers due to failover software at a lower level making 'smart' decisions (Veritas Cluster Server on VMware ESX with VMware HA)
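Two of the items above (the dropped monitoring table and the runaway check) come down to the health check itself being the weakest link. As a rough sketch, assuming PyMySQL and made-up host names, credentials and a heartbeat table, a check can distinguish "the server is unreachable" from "my check query is broken" and only report the former as a reason to fail over:

```python
#!/usr/bin/env python3
"""Sketch of a load balancer health check that tries not to cause failovers.

Exit codes: 0 = healthy, 1 = server unreachable (a reason to fail over),
2 = the check itself looks broken (alert a human, do not fail over).
"""
import sys

import pymysql

HOST = "db1.example.com"                          # placeholder backend
CHECK_QUERY = "SELECT 1 FROM heartbeat LIMIT 1"   # placeholder monitored table


def main() -> int:
    try:
        conn = pymysql.connect(host=HOST, user="monitor", password="secret",
                               database="ops", connect_timeout=3)
    except pymysql.err.OperationalError:
        # Cannot even connect: the server or network is genuinely in trouble.
        return 1
    try:
        with conn.cursor() as cur:
            cur.execute(CHECK_QUERY)
            cur.fetchone()
    except pymysql.err.ProgrammingError:
        # e.g. the heartbeat table was dropped: the check is broken, the
        # server is not. This should page someone, not trigger a failover.
        return 2
    finally:
        conn.close()
    return 0


if __name__ == "__main__":
    sys.exit(main())
```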