UPDATE: Oct 21, 2009 – folks, this article was very popular, and the core principles are still used by VM HA, but the admittance algorithms have changed a LOT since it was written. Strongly suggest going to the always excellent Yellow-Bricks for a more current update on VM HA internals and admittance algorithm behavior: https://www.yellow-bricks.com/vmware-high-availability-deepdiv/
This has been a question I've gotten a couple of times in the last few weeks, and I've noticed it's a prominent question in the forums I check out as much as I can. What's the symptom? You can't create new VMs without violating the availability constraints - even when you think you have plenty of capacity left - or you upgraded from Virtual Center 2.0.x to 2.5 and can't shake the "Insufficient Resources to Satisfy" HA error. It looks like this for most customers who hit this after a VC 2.0.x upgrade (i.e. they can't enable the "don't start machines if they violate availability constraints" option)
or, like this if you're trying to power on a machine and DO have the "don't start" option ticked.
I can tell you that designing these mechanisms (reservations) is always harder than it looks at first glance - which makes them fun engineering problems, but can result in weird behaviors - almost universally with the same symptom: satisfying the reservation takes more than the end-user expects. This is particularly true when the system is composed of sub-elements that can all be different - i.e. there is no standard "quantum" element. Geez - sounds like VMware - no? You've got ESX servers in a cluster (which don't all have to be the same CPU and memory config) supporting VMs (each with unique amounts of vCPUs, RAM, and DRS reservations/limits).
So - how exactly does VM HA's admittance algorithm work? It's not a secret, not complex, and not a bug - but it's also not obvious. Read on...
First - here are some discussions on this topic if you want to immerse yourself (and get into the habit of frequenting some solid sites)...
This fellow has a lot of good things to say on the topic also...
A big thank you to Jack Lamirande (VMware) and Brian Whitman (EMC, VMware Specialist) for helping me head scratch through this once.
So - what is happening here, and why is it hitting you (perhaps) for the first time? Virtual Center is planning for the worst case to cover you.
There are three core things to understand.
Step 1: VC calculates a "Maximum slot size" for the cluster based on the biggest VM
Step 2: VC calculates the number of slots of that size available in the cluster (removing the biggest servers up to the number of host failures you specify)
Step 3: VC checks against the cluster availability constraints
Step 1 - Determining Max Slot Size: Ok - what is a "slot"? It means "worst case CPU/Memory". This is discussed in the VMware Resource Management Guide (it's funny, people read the HCL like it's holy gospel - and it kinda is - and sometimes good docs fall by the wayside) here: https://www.vmware.com/pdf/vi3_35/esx_3/r35/vi3_35_25_resource_mgmt.pdf. Check out page 75.
- Maximum Virtual CPUs is obvious
- Maximum Memory Overhead, I believe, is the bare minimum of memory it takes to boot a given guest if there's no reservation (I'm getting the official definition of this one)
- Maximum Reserved MHz and Maximum Reserved Memory are the DRS reservations you put in for VMs.
This is the key, so an example is useful:
- Virtual Machine 1 is configured with no CPU reservation, 4 vCPUs, and a 512MB memory reservation
- Virtual Machine 2 is configured with a 2GHz CPU reservation, 2 vCPUs, and a 1GB memory reservation
- Virtual Machine 3 is configured with a 1GHz CPU reservation, 1 vCPU, and a 2GB memory reservation
VC 2.0.x would calculate a Slot Size of 2GHz of CPU, 2GB of RAM (VM2 is the largest from a CPU standpoint, VM3 from a RAM standpoint)
VC 2.5 would calculate a Slot Size of 8GHz of CPU, 2GB of RAM. Whoa! What happened is that since VC 2.5 starts to think about vCPUs, it multiplies the max number of vCPUs (4, on VM1) by the maximum reserved MHz (2GHz, on VM2). VC 2.5 will also add a fair amount of RAM vs. VC 2.0.x - 256MB per guest.
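If you like seeing the math spelled out, here's a minimal sketch of that Step 1 calculation - my own reconstruction of the behavior described above, NOT VMware's actual code (the field names are made up):

```python
# Hypothetical sketch of the slot-size math as I understand it -
# my own reconstruction, NOT VMware's actual code.
vms = [
    {"vcpus": 4, "cpu_res_mhz": 0,    "mem_res_mb": 512},   # VM1
    {"vcpus": 2, "cpu_res_mhz": 2000, "mem_res_mb": 1024},  # VM2
    {"vcpus": 1, "cpu_res_mhz": 1000, "mem_res_mb": 2048},  # VM3
]

# VC 2.0.x: biggest CPU reservation, biggest memory reservation
slot_20x = (max(vm["cpu_res_mhz"] for vm in vms),
            max(vm["mem_res_mb"] for vm in vms))

# VC 2.5: biggest vCPU count TIMES biggest CPU reservation
slot_25 = (max(vm["vcpus"] for vm in vms) *
           max(vm["cpu_res_mhz"] for vm in vms),
           max(vm["mem_res_mb"] for vm in vms))

print(slot_20x)  # (2000, 2048) -> 2GHz / 2GB
print(slot_25)   # (8000, 2048) -> 8GHz / 2GB, before the per-guest RAM adder
```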
Step 2 - Calculating Host Available Slots: VC then calculates the number of slots available on each ESX server based on this "Maximum Slot Size" - basically dividing the ESX server's memory by the slot's memory requirement, and the ESX server's total CPU (CPU in MHz * number of CPUs = total MHz) by the slot's CPU requirement. Each cluster node is then assigned a "number of slots" it can support - the lower of its CPU slots and memory slots.
Let's use another example (assuming our VC 2.0.x and VC 2.5 slot sizes from Step 1):
- ESX Server 1 has 16GB of RAM and 4 CPUs running at 3GHz (12GHz total) - with VC 2.0.x = 8 memory slots & 6 CPU slots; with VC 2.5 = 8 memory slots & 1 CPU slot
- ESX Server 2 has 32GB of RAM and 8 CPUs running at 2GHz (16GHz total) - with VC 2.0.x = 16 memory slots & 8 CPU slots; with VC 2.5 = 16 memory slots & 2 CPU slots
- ESX Server 3 has 32GB of RAM and 8 CPUs running at 3GHz (24GHz total) - with VC 2.0.x = 16 memory slots & 12 CPU slots; with VC 2.5 = 16 memory slots & 3 CPU slots
Step 3 - Applying Cluster Policy: If you have set the failover capacity to "1" (which is the default) via the cluster settings (screenshot below), the host with the largest number of possible slots is dropped from the calculation, and you're left with the total number of slots available (the sum of the slots available on each remaining node)
So, using the examples we've been using, you have a total of 14 slots with VC 2.0.x (6 + 8, after ESX Server 3's 12 are dropped) and 3 slots with VC 2.5 (1 + 2, after ESX Server 3's 3 are dropped) - for the same configuration.
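Here are Steps 2 and 3 in the same hypothetical sketch - again, my own made-up reconstruction of the math, not how VC actually implements it:

```python
# Continuing the hypothetical sketch: per-host slots plus the failover
# policy. My reconstruction only - field names are made up.
hosts = [
    {"mem_mb": 16 * 1024, "cpus": 4, "mhz_per_cpu": 3000},  # ESX Server 1
    {"mem_mb": 32 * 1024, "cpus": 8, "mhz_per_cpu": 2000},  # ESX Server 2
    {"mem_mb": 32 * 1024, "cpus": 8, "mhz_per_cpu": 3000},  # ESX Server 3
]

def cluster_slots(slot_mhz, slot_mb, hosts, host_failures=1):
    # Each host supports the LOWER of its CPU slots and memory slots
    per_host = [min((h["cpus"] * h["mhz_per_cpu"]) // slot_mhz,
                    h["mem_mb"] // slot_mb) for h in hosts]
    # Worst case planning: assume you lose the biggest host(s)
    surviving = sorted(per_host)[:len(per_host) - host_failures]
    return sum(surviving)

print(cluster_slots(2000, 2048, hosts))  # 14 - the VC 2.0.x answer
print(cluster_slots(8000, 2048, hosts))  # 3  - the VC 2.5 answer
```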
You can see immediately why some people hit a problem when they upgrade. Virtual Center 2.5 is more conservative. This is an outcome that happens when you support more and more mission-critical customer use cases (and if you're wondering why EMC is so adamant about our e-lab and interop checks, it's driven by the same root cause).
There are a couple of workarounds if you're willing to play it a little looser.
- switch to "Allow Virtual Machines to be powered on even if they violate availability constraints", then make sure you specify important VMs (VM HA priority) one by one
- avoid using large reservations and multiple vCPUs.
MORAL OF THE STORY
No, the moral of the story isn't "don't upgrade" or "VMware's HA engine is busted" or "They're just trying to sell more _____" (the conspiracy nut favorite)
The moral of the story is: be smart, and know how the IT tools you use work. VMware changed the formula for a simple reason - to make the environment more available when you really need it.
Note how some of these best practices (not formalized enough to call them Best Practices with capital letters, so I'll call these "cbp"s or Chad's best practices :-) will dramatically improve this. The general rule is to avoid over-engineering. Over-engineering forces the VM HA algorithm into bad corners.
- Use DRS reservations and limits when you really need them, and are willing to invest the time to understand them. Notice that reservations will generally be materially larger than the Maximum Memory Overhead. Don't be afraid to have a DRS-enabled cluster where the majority of machines have the default (no reservation, no limit), but likewise, don't be afraid to make reservations on the VMs that are really important.
- Are you REALLY sure you need 2 or 4 vCPUs? Note the MASSIVE effect these have on the math here. A VM can still drive a fast physical core hard under the hood with one vCPU - but notice the HUGE effect extra vCPUs have on the slot size. If you really HAVE to have these, consider placing them on their own ESX cluster (see the NOTE below)
- Use a standard common HW platform for every node of an ESX cluster (a common maximum "host available slots")
- VMware is very efficient with large cluster sizes. I'm always surprised to see people with two 6-node clusters right beside one another. When you ask "why not one 12-node cluster?", it's usually "it made us uncomfortable". This is usually rooted in experience with the increasing complexity of MSCS (now WSFC) or Oracle RAC. VMware ESX Clusters and VM HA are architecturally different. VMware's cluster is really a loosely coupled set of servers, sharing only storage, with VC acting as orchestrator. This means cluster complexity stays essentially the same as you add nodes. What does get better is that the VM HA algorithm has more choices.
- NOTE: there are some good reasons to have a couple of smaller clusters: a) you keep "like" VMs on the same cluster - remember that the Maximum Slot Size is set by the BIGGEST VM, so if you keep small VMs on one cluster and big VMs on another, the algorithm is more efficient (see the sketch below); b) your end users demand it (sometimes this is the case in multi-tenancy "clouds").
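To make the "like VMs together" point concrete, here's one more pass with the same hypothetical math (my reconstruction, not VMware's code): twenty small single-vCPU VMs get a tidy slot on their own, and adding one big VM balloons the slot for everybody.

```python
# Hypothetical sketch (same made-up math as above) showing why mixing
# one oversized VM into a cluster of small VMs hurts everyone.

def slot_size(vms):
    # VC 2.5-style worst case: max vCPUs times max CPU reservation,
    # max memory reservation (per-guest overhead omitted)
    return (max(v["vcpus"] for v in vms) * max(v["cpu_res_mhz"] for v in vms),
            max(v["mem_res_mb"] for v in vms))

small_vms = [{"vcpus": 1, "cpu_res_mhz": 1000, "mem_res_mb": 512}] * 20
big_vm    = [{"vcpus": 4, "cpu_res_mhz": 2000, "mem_res_mb": 2048}]

print(slot_size(small_vms))           # (1000, 512)  - tidy little slots
print(slot_size(small_vms + big_vm))  # (8000, 2048) - one big VM, 8x the CPU slot
```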
Hope this is useful! Remember - it's current as of TODAY (June 16th, 2008 - ESX 3.5u1, VC 2.5). Make sure you get the latest info if you're looking at this months from now (though I will update the article...)
What are your experiences and thoughts on this topic?