With vSphere 5, VM HA gets a ground-up re-write – which, IMO, is long overdue.
I can remember the first time I was playing in a home lab with VMware Infrastructure – the VM HA configuration wouldn’t work, and it SUCKED to troubleshoot. 3 guesses on what my problem was… well, only 1 guess needed… DNS :-) The “rule of 4” around DNS was super-important for the oh-so-super-sensitive VM HA agents to play nicely. And OH, the error messages and logging were always a source of frustration.
Ironic of course – as this was built on the EMC Autostart code :-)
VMware has given it a ground-up re-write. On the surface you may not immediately realize just how different things are – but they are.
Read on for a detailed explanation.
BTW – I want to thank VMware – these diagrams are perfect, so I hope it’s OK that I’m using them to explain.
First of all – each ESX host still has an agent running in the vmkernel, but it’s now called a “Fault Domain Manager” (FDM) agent. Each cluster has only a single FDM Master at any point in time – the other nodes are now Slaves. There are a couple of important notes here: there is no more primary/secondary construct, so some of those prior “weird best practices” which were manifestations of that underlying primary/secondary design can be simplified. As examples, think of spanning clusters across blade chassis, or the VPLEX stretched-cluster support requirements for a total of 8 nodes, evenly balanced and configured “just so”.
An FDM master monitors ESX host and Virtual Machine availability, and manages two core lists – the list of hosts, and the list of protected VMs. Slave nodes maintain a direct point-to-point encrypted TCP session with the Master (another improvement over the old approach – both in terms of networking simplicity and security hardening), and are responsible for VM-level monitoring (which has no dependency on anything other than themselves).
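To make the “two lists” idea concrete, here’s a tiny Python sketch of my own (emphatically NOT VMware’s code – the class and method names are my invented illustrations) of a master tracking cluster hosts and protected VMs, and deciding which VMs need a restart when a host dies:

```python
# Illustrative toy model of an FDM master's two core lists.
# All names here (FdmMaster, protect_vm, host_failed) are my own
# assumptions for explanation purposes, not VMware's implementation.

class FdmMaster:
    def __init__(self):
        self.hosts = set()          # list 1: ESX hosts in the cluster
        self.protected_vms = set()  # list 2: VMs HA has committed to restart

    def add_host(self, host: str) -> None:
        self.hosts.add(host)

    def protect_vm(self, vm: str) -> None:
        # A VM only becomes "protected" once the master records it;
        # from then on, HA is on the hook to restart it after a failure.
        self.protected_vms.add(vm)

    def host_failed(self, host: str, vms_on_host: list[str]) -> list[str]:
        # On host failure, drop the host from list 1 and return the
        # protected VMs (list 2) that need a restart on a survivor.
        self.hosts.discard(host)
        return [vm for vm in vms_on_host if vm in self.protected_vms]


master = FdmMaster()
master.add_host("esx01")
master.add_host("esx02")
master.protect_vm("vm-a")
print(master.host_failed("esx01", ["vm-a", "vm-b"]))  # ['vm-a']
```

Note that “vm-b” is never restarted because it was never on the protected list – which is exactly why the protection state the master persists matters so much.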
The other major change is the use of BOTH networking AND storage as channels for communication and maintaining state. This is likely the first thing people will see that has them saying “huh?” – during the vSphere beta, it did for me, as you need to have 2 shared datastores to configure VM HA – and the first time I saw that, I knew something had changed.
The purpose of this is to help during a network partition, and also so that the master can maintain the state of the two lists it handles (the ESX hosts, and the VM protection state) on the heartbeat datastores (in the case of VMFS, in a file located in a hidden area used for all sorts of things; in the case of NFS, in a file).
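The partition-handling logic is easier to see with a sketch. Below is my own toy model (again, not VMware code – the names, and the simple three-way classification, are illustrative assumptions) of how a master can use the two heartbeat channels together: if the network heartbeat goes silent but the host is still updating its heartbeat datastore, the host is alive-but-unreachable, so its VMs should NOT be restarted elsewhere.

```python
# Toy sketch of combining network + datastore heartbeats to classify
# a silent host. Names and the classification labels are my own
# illustrative assumptions, not VMware's implementation.

from dataclasses import dataclass

@dataclass
class HostHeartbeats:
    network_hb_ok: bool    # master still receives the slave's TCP heartbeat
    datastore_hb_ok: bool  # slave still updates its heartbeat-datastore file

def classify_host(hb: HostHeartbeats) -> str:
    """Return the master's view of a slave host."""
    if hb.network_hb_ok:
        return "live"                 # normal case: nothing to do
    if hb.datastore_hb_ok:
        # Alive but unreachable over the network: a partition/isolation
        # event. The VMs are still running, so don't restart them.
        return "network-partitioned"
    # Silent on both channels: treat the host as failed and restart
    # its protected VMs on a surviving host.
    return "dead"

print(classify_host(HostHeartbeats(network_hb_ok=False, datastore_hb_ok=True)))
# network-partitioned
print(classify_host(HostHeartbeats(network_hb_ok=False, datastore_hb_ok=False)))
# dead
```

The design point: storage gives the master a second, independent witness, so a network partition no longer looks identical to a host failure – which is precisely what the old networking-only AAM approach couldn’t distinguish.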
This function is very useful for stretched clusters. As we approach the GA target and update the VPLEX VMware support stance and KB article, you’ll see that the solution has hardened and simplified (also due to the APD and PDL behavior).
Beyond these changes – the other things people will see right away are that VM HA configuration occurs much, MUCH faster, and that VMs boot faster (due to changes in how VM protection state is handled).
Long and short – I’m glad to see VM HA get a ground-up re-write. The older AAM approach was showing its age, and starting to create funky “weirdness” that wasn’t necessary as we built bigger scale, stretched clusters, and more.