STONITH/Fencing: Why You Need It
If you’re running a cluster environment with shared resources, you need to have STONITH or some sort of fencing running.
A cluster is a complicated beast. It’s a community of machines that makes decisions. The decisions can be simple and only affect the cluster (e.g. which service runs where and which node is responding to client requests), or they can be complicated and involve communicating with outside devices or humans.
Fencing means that the cluster has the ability to kick a node out of the cluster if it starts doing things that the cluster doesn’t like, possibly because of a hardware failure, software/disk corruption, or network/transport layer fault. Unfortunately, with the node still running but fenced out of the cluster, the node may still be able to pass bad information to other clients who aren’t integrated with the cluster manager.
STONITH stands for “Shoot The Other Node In The Head,” and is a Linux service implemented by stonithd. It uses a “definite” method to make sure a node is really down. This “definite” method is ideally a redundant part in the machine that it serves, or something outside of the machine. Examples are Dell’s DRAC (Dell Remote Access Controller) and HP/Compaq’s iLO (Integrated Lights-Out) plugin components. These cards typically have their own battery and their own network interface that can tell the cluster for sure if the machine’s power is on or off, and can change that status.
Not all STONITH devices are created equal. For instance, the DRAC/iLO cards or a blade chassis’s backplane are examples of good devices because they operate independently of the machine itself and are reliable. If you can’t reach one, you know the network transport layer is wonky between the part of the cluster that has quorum and the misbehaving node.
Another type of STONITH device is a managed PDU (like the APC 7900 series), which has a network interface that uses broadcast or multicast packets to share information and form sort of a cluster of its own. Ideally, you could use this to shut down or power cycle a misbehaving machine. In practice, no part of this method communicates with the server itself, so you don’t get a clear verification that it worked. If your PFY has moved the plugs around without updating the PDU software, if the PDU that has the machine’s second power supply on it is down, if the network is FUBAR and the second PDU doesn’t hear the broadcast, or if the PDU’s software itself (which has a deservedly bad reputation) goofs up … well, any of those things could leave the unclean/insane/zombie node online and serving out bad information to your clients. You need a way to verify that the power to the machine is off even if its network card is down, period, or you don’t KNOW it’s down.
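To make the "you don't KNOW it's down" point concrete, here is a minimal Python sketch of a fence-then-verify loop. It is not how stonithd or any particular fencing agent works; power_off and power_is_off are hypothetical callables standing in for whatever your device offers (a PDU outlet command, a management-card status query, and so on). The only idea it illustrates is that a node counts as fenced once an independent status check confirms the power is off, and not before.

```python
import time
from typing import Callable


def fence_and_verify(
    power_off: Callable[[], None],     # hypothetical: asks the device to cut power
    power_is_off: Callable[[], bool],  # hypothetical: independent power-status check
    attempts: int = 3,
    delay_s: float = 1.0,
) -> bool:
    """Return True only if the power-off was confirmed by the status check."""
    for _ in range(attempts):
        try:
            power_off()
            if power_is_off():
                return True
        except Exception:
            # An unreachable or failing device is treated as "not confirmed",
            # never as "probably off".
            pass
        time.sleep(delay_s)
    return False
```

If this returns False, the safe assumption is that the node may still be up and serving clients, so the cluster should not take over its resources on the strength of an unconfirmed power-off.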
I’m considering always running a suicide device on each individual node. That way, a node that has fallen out of touch will power itself down. I like the idea, but haven’t tried it (or seen discussion of it) in person.
Last but not least, run an odd number of machines in the cluster so that a split can never leave two equal halves with neither side holding a quorum. A quorum is defined as > 50% of the machines in the cluster. Not >= 50%, but > 50%. Two machines won’t do it in a four-node cluster. Three will. Make sure you test your failure modes before you go to production.
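The quorum arithmetic is easy to get wrong by one, so here is a small Python sketch of the rule as stated above. The function names are made up for illustration, and it assumes every node carries exactly one vote.

```python
def has_quorum(nodes_present: int, cluster_size: int) -> bool:
    """Strict majority: more than half of all configured nodes."""
    return 2 * nodes_present > cluster_size


def min_nodes_for_quorum(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half."""
    return cluster_size // 2 + 1


# The four-node example: two nodes are exactly 50%, which is not enough.
print(has_quorum(2, 4))  # False
print(has_quorum(3, 4))  # True
```

Note that min_nodes_for_quorum gives 3 for both a four-node and a five-node cluster, so the fourth machine buys you no extra failure tolerance over three, while the fifth lets you lose two nodes instead of one.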