UPDATE: Oct 21, 2009 – folks, this article was very popular, and the core principles are still used by VM HA, but the admittance algorithms have changed a LOT since it was written. I strongly suggest going to the always excellent Yellow-Bricks for a more current update on VM HA internals and admittance algorithm behavior: http://www.yellow-bricks.com/vmware-high-availability-deepdiv/
This has been a question I've gotten a couple of times in the last few weeks, and I've noticed it's a prominent question in the forums I check out as much as I can. What's the symptom? You can't create new VMs without violating the availability constraints - even when you think you have plenty of capacity left - or you upgraded from Virtual Center 2.0.x to 2.5 and can't shake the "Insufficient Resources to Satisfy" HA error. For most customers hitting this after a VC 2.0.x upgrade, it looks like this (i.e. they can't enable the "don't start machines if they violate availability constraints" option)
or, like this if you're trying to power on a machine and DO have the "don't start" option ticked.
I can tell you that designing these mechanisms (reservations) is always harder than it looks at first glance - which makes them fun engineering problems, but can result in weird behaviors - almost universally with the same symptom: satisfying the reservation takes more than the end-user expects. This is particularly true when the system is composed of sub-elements that can all be different - i.e. there is no standard "quantum" element. Geez - sounds like VMware - no? You've got ESX servers in a cluster (which don't have to be all the same CPU and memory config) supporting VMs (each with its own vCPU count, RAM, and DRS reservations/limits).
So - how exactly does VM HA's admittance algorithm work? It's not a secret, not complex, and not a bug - but it's also not obvious... Read on.....
First - here are some discussions on this topic if you want to immerse yourself (and get into the habit of frequenting some solid sites)...
http://blog.scottlowe.org/2008/01/07/vmware-ha-clarification/
http://communities.vmware.com/message/822784
http://communities.vmware.com/thread/126756
This fellow has a lot of good things to say on the topic also...
A big thank you to Jack Lamirande (VMware) and Brian Whitman (EMC, VMware Specialist) for helping me head scratch through this once.
So - what is happening here, and why is it hitting you (perhaps) for the first time? Virtual Center is planning for the worst case to cover you.
There are three core things to understand.
Step 1: VC calculates a "Maximum slot size" for the cluster based on the biggest VM
Step 2: VC calculates the number of slots of that size available in the cluster (removing the biggest servers up to the number of host failures you specify)
Step 3: VC checks against the cluster availability constraints
Step 1 - Determining Max Slot Size: Ok - what is a "slot"? It means "worst case CPU/Memory". This is discussed in the VMware Resource Management Guide (it's funny, people read the HCL like it's holy gospel - and it kinda is - and sometimes good docs fall by the wayside) here: http://www.vmware.com/pdf/vi3_35/esx_3/r35/vi3_35_25_resource_mgmt.pdf. Check out page 75.
So, how is slot size calculated? This table compares VC 2.0.x and 2.5:
- Maximum Virtual CPUs is obvious
- Maximum Memory Overhead, I believe, is the bare minimum of memory it takes to boot a given guest if there's no reservation (I'm still getting the official definition of this one)
- Maximum Reserved MHz and Maximum Reserved (memory) are the DRS CPU and memory reservations you put in for VMs.
This is the key, so an example is useful:
- Virtual Machine 1 is configured with no CPU reservation, 4 vCPUs and a 512MB memory reservation
- Virtual Machine 2 is configured with a 2GHz CPU reservation, 2 vCPUs and a 1GB memory reservation
- Virtual Machine 3 is configured with a 1GHz CPU reservation, 1 vCPU, and a 2GB memory reservation
VC 2.0.x would calculate a Slot Size of 2GHz of CPU, 2GB of RAM (VM2 is the largest from a CPU standpoint, VM3 from a RAM standpoint)
VC 2.5 would calculate a Slot Size of 8GHz of CPU, 2GB of RAM. Whoa! What happened is that since VC 2.5 starts to think about vCPUs, it multiplies the max number of vCPUs (4, from VM1) by the Maximum Reserved MHz (2GHz, from VM2). VC 2.5 will also add a fair amount of RAM vs. VC 2.0.x - 256MB per guest.
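If it helps to see the arithmetic written out, here's a tiny Python sketch of the slot-size rules as I've described them. This is purely illustrative - it is not VMware's code - and I've left out the ~256MB-per-guest memory padding so the numbers line up with the 2GB slot used in the rest of the example:

```python
# Purely illustrative sketch of the slot-size rules described above
vms = [
    # (vCPUs, CPU reservation in MHz, memory reservation in MB)
    (4, 0,    512),   # Virtual Machine 1
    (2, 2000, 1024),  # Virtual Machine 2
    (1, 1000, 2048),  # Virtual Machine 3
]

# VC 2.0.x: slot = (largest CPU reservation, largest memory reservation)
slot_20x = (max(v[1] for v in vms), max(v[2] for v in vms))

# VC 2.5: the CPU side becomes largest vCPU count * largest CPU reservation
# (the memory side would also get roughly 256MB of per-guest overhead on top)
slot_25 = (max(v[0] for v in vms) * max(v[1] for v in vms),
           max(v[2] for v in vms))

print(slot_20x)  # (2000, 2048) -> 2GHz / 2GB
print(slot_25)   # (8000, 2048) -> 8GHz / 2GB
```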
Step 2 - Calculating Host Available Slots: VC then calculates the number of slots available on each ESX server based on this "Maximum Slot Size" - basically dividing each ESX server's memory by the slot size's memory requirement, and each server's total CPU (MHz per CPU * number of CPUs = total MHz) by the slot size's CPU requirement. Each cluster node is assigned a "number of slots" it can support.
Let's use another example (assuming our VC 2.0.x and VC 2.5 slot sizes from Step 1) - there's a quick sketch of this math right after the list:
- ESX Server 1 has 16GB of RAM and 4 CPUs running at 3GHz - with VC 2.0.x = 8 memory slots & 6 CPU slots; with VC 2.5 = 8 memory slots & 1 CPU slot
- ESX Server 2 has 32GB of RAM and 8 CPUs running at 2GHz - with VC 2.0.x = 12 memory slots & 8 CPU slots; with VC 2.5 = 8 memory slots & 2 CPU slots
- ESX Server 3 has 32GB of RAM and 8 CPUs running at 3GHz - with VC 2.0.x = 12 memory slots & 12 CPU slots; with VC 2.5 = 8 memory slots & 3 CPU slots
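Here's the Step 2 division spelled out the same way (again, an illustrative sketch, not VC's code). One thing the straight division makes obvious: 32GB over a 2GB slot gives 16 memory slots for ESX Servers 2 and 3 under VC 2.0.x - as Alex points out in the comments below (and I acknowledge there), the "12" in the list above was a mistype - but it doesn't change the Step 3 totals, since CPU is the constraint in this example.

```python
# Illustrative Step 2 sketch: slots per host = total resource // slot size
def host_slots(num_cpus, mhz_per_cpu, mem_gb, slot_mhz, slot_mem_gb):
    cpu_slots = (num_cpus * mhz_per_cpu) // slot_mhz
    mem_slots = mem_gb // slot_mem_gb
    return cpu_slots, mem_slots

# (name, CPUs, MHz per CPU, GB of RAM)
hosts = [("ESX 1", 4, 3000, 16), ("ESX 2", 8, 2000, 32), ("ESX 3", 8, 3000, 32)]

for name, cpus, mhz, mem in hosts:
    print(name,
          "VC 2.0.x (CPU, mem):", host_slots(cpus, mhz, mem, 2000, 2),
          "VC 2.5 (CPU, mem):", host_slots(cpus, mhz, mem, 8000, 2))
```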
Step 3 - Applying Cluster Policy: If you have set the failover capacity to "1" (which is the default) via the cluster settings (screenshot below), the host with the largest number of possible slots is dropped from the calculation, and you're left with the total number of slots available (the sum of the slots available on each remaining node).
So, using the examples we've been using, you have a total of 14 slots with VC 2.0.x and 3 slots with VC 2.5 - for the same configuration.
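And Step 3 in the same sketch form: with a failover capacity of 1, drop the host with the most slots and sum the rest. One assumption in this sketch: I'm counting each host's usable slots as the smaller of its CPU-slot and memory-slot counts - summing each resource separately (as Jon does in the comments below) lands on the same totals for this example.

```python
# Illustrative Step 3 sketch: drop the biggest host(s), then sum what's left
def cluster_capacity(per_host_slots, host_failures=1):
    # usable slots per host = smaller of its CPU-slot and memory-slot counts
    counts = sorted(min(cpu, mem) for cpu, mem in per_host_slots)
    return sum(counts[:len(counts) - host_failures])

# (CPU slots, memory slots) per host, from the Step 2 sketch above
print(cluster_capacity([(6, 8), (8, 16), (12, 16)]))  # VC 2.0.x -> 14 slots
print(cluster_capacity([(1, 8), (2, 16), (3, 16)]))   # VC 2.5   -> 3 slots
```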
You can see immediately why some people hit a problem when they upgrade. Virtual Center 2.5 is more conservative. This is an outcome that happens when you support more and more mission-critical customer use cases (and if you're wondering why EMC is so adamant about our e-lab and interop checks, it's the same root behavior).
There are a couple of workarounds if you are willing to play it a little looser:
- switch to "Allow Virtual Machines to be powered on even if they violate availability constraints", then make sure you set the restart priority (VM HA priority) for your important VMs one by one
- avoid using large reservations and multiple vCPUs.
MORAL OF THE STORY
No, the moral of the story isn't "don't upgrade" or "VMware's HA engine is busted" or "They're just trying to sell more _____" (the conspiracy nut favorite)
The moral of the story is: be smart, and know how the IT tools you use work. VMware changed the formula for a simple reason - to make the environment more available when you really need it.
Note how some of these best practices (not formalized enough to call them Best Practices with capital letters, so I'll call them "cbp"s, or Chad's best practices :-) will make this dramatically better. The general rule is to avoid over-engineering. Over-engineering forces the VM HA algorithm into bad corners.
- Use DRS reservations and limits when you really need them, and are willing to invest the time to understand them. Notice that reservations will generally be materially larger than the Maximum Memory Overhead. Don't be afraid to have a DRS-enabled cluster where the majority of machines have the default (no reservation, no limit), but likewise, don't be afraid to set reservations on the VMs that are really important.
- Are you REALLY sure you need 2 or 4 vCPUs? Note the MASSIVE effect these have on the math here. VMs will use all the horsepower available under the hood even with one vCPU - but notice how it has a HUGE effect on the slot size. If you really HAVE to have these, consider placing them on their own ESX cluster (see the NOTE below)
- Use a standard common HW platform for every node of an ESX cluster (a common maximum "host available slots")
- VMware is very efficient with large cluster sizes. I'm always surprised to see people with two 6-node clusters right beside one another. When you ask "why not one 12-node cluster?", the answer is usually "it made us uncomfortable". This is usually rooted in experience with the increasing complexity of MSCS (now WSFC) or Oracle RAC as node counts grow. VMware ESX Clusters and VM HA are architecturally different. A VMware cluster is really a loosely coupled set of servers sharing only storage, with VC acting as orchestrator. This means cluster complexity doesn't really increase as you add nodes. What does get better is that the VM HA algorithm has more choices.
- NOTE: there are some good reasons to have a couple of smaller clusters: a) you keep "like" VMs on the same cluster - note that the Maximum Slot Size is driven by the BIGGEST VM, so if you keep small VMs on one cluster and big VMs on another, the algorithm is more efficient; b) your end users demand it (sometimes this is the case in multi-tenancy "clouds").
Hope this is useful! Remember - it's current as of TODAY (June 16th, 2008 - ESX 3.5u1, VC 2.5). Make sure you get the latest info if you're looking at this months from now (though I will update the article...)
What are your experiences and thoughts on this topic?
Hi
Very good article!!! My compliments :-)
I linked to your article from my blog, hope you don't mind:
http://www.gabesvirtualworld.com/?p=70
Gabrie
Posted by: Gabrie van Zanten | June 17, 2008 at 06:23 AM
Excellent work Chad, I'll have to stop blogging soon if you keep posting this level of quality content so frequently ;-)
Posted by: Stu | June 17, 2008 at 07:35 AM
Stu - you're not being fair to yourself... give yourself a plug, I frequent your blog often! Ok, I'll plug it - http://www.vinternals.com
Your SRM post is a good one - http://www.vinternals.com/2008/05/what-vmware-site-recovery-manager-isnt.html.
We have an EMC saying: Backup and DR are DIFFERENT. You need to do regular system backups for application recovery. The other thing is that (in general - each array replication technology is different) SRM is based on the idea of a "real time" remote copy (even if "real-time" is time-shifted with async replicas). The point is ONE COPY at the remote site. There's the persistent issue of errors being replicated to the remote site. Each of our replication technologies has an option to avoid this (snaps of the target for MV, SRDF, and Celerra Replicator - or in RecoverPoint's case, continuous data protection).
EMC's answers for local backup (not the only ones) are:
1) Replication Manager 5.1.2 (trying to get them to rename it to "Replication Manager for VMware") - this handles application-integrated backup, VMFS-level integration, and array snapshot mechanisms.
2) Avamar Virtual Edition (and regular) - source based dedupe, very cool, very popular with VMware. For Networker customers - Avamar can be integrated also.
Posted by: Chad Sakac | June 17, 2008 at 09:32 AM
Gabrie - thank you! You can absolutely link of course!.... Still trying to figure out how this trackback thing works. It's always fun to learn new things!
Posted by: Chad Sakac | June 17, 2008 at 09:33 AM
Good post! I wrote about it two months ago: http://kurrin.blogspot.com/2008/04/como-funciona-vmware-ha.html
(sorry, it's in Spanish).
A couple of things, one related with 14 ESX clusters:
1.- Sometimes people prefer not to exceed 8 ESX nodes per cluster because VMFS performs badly beyond that number of nodes (8) accessing the same VMFS-formatted LUNs. It's a storage/performance Best Practice.
2.- The other thing is: How can we calculate the number of VMs that we can power on in a certain Cluster? Adding the slots? In your example:
VC2.0 -> The smaller of [(8 slotmem + 12 slotmem) OR (6 slot CPU + 8 slot CPU)] = 14 Virtual Machines Max?
VC2.5 -> The smaller of [(8 slotmem + 8 slotmem) OR (1 slot CPU + 2 slot CPU)] = 3 Virtual Machines Max?
Am I right?
3.- My personal recommendation is that it's better to use Shares in a DRS cluster (because of its dynamic behaviour) instead of using Reservations.
Thnx!
Jon
Posted by: Jon | June 17, 2008 at 12:38 PM
Thanks Jon - I wish I could read Spanish, you would have saved me a lot of time!
Quick comments,
1) I haven't seen that VMFS issue (bad VMFS performance with more than 8 hosts) - do you mind providing the source?
It's certainly not an EMC storage best practice - I want to make sure it's not an old thing. Most VMFS "limits" are mythology - not saying this one is or isn't, but I would like to make sure.
2) You got it right.
3) Interesting idea - you can certainly use shares (which are relative to one another and don't factor into the math). My preference is to use shares, but where there are VMs that absolutely need reservations - use them. It's not the reservations that really hammer the math, it's the vCPU multiplier.
Posted by: Chad Sakac | June 17, 2008 at 04:40 PM
Thanks for your response Chad - sorry for the 3 identical comments, I think it was a typepad.com problem (after the captcha page, it stopped and never gave me the OK, so I tried 3 times... ;)
Some comments/questions:
1.- The VMFS issue: I think it's a Best Practice, I've heard several times about it, the last time here: http://vmetc.com/2008/06/10/vmfs-storage-sizing-for-maximum-performance/
I personally think that it VARIES from one environment to another, sometimes we have very heavy VMs and other times not.
I think that the DRS infrastructure is like building a wall with irregular bricks.
In most cases a 12-host cluster is appropriate because there are no (or only a few) CPU/Mem killer VMs. In other cases a 6-host cluster + 6-host cluster is better in terms of storage and in terms of DRS maximization.
Also, we can't forget that in most cases we use HA with DRS, so 16 ESX hosts is the limit (I think) (http://communities.vmware.com/thread/97465)
2.- OK, so the algorithm is really pretty conservative. I hope VMware will change this (if they can...)
We can always check the "start machines even if they violate availability constraints" option in the event of a host failure (once a year? once every two years?...)
3.- Completely agree. Only use Reservations if you REALLY need them. And if the VM needs high reservations, think about whether it has to be virtual at all, or whether it's worth putting it on another ESX cluster or not.
Thank you!
I'm going to press the Post button, we'll see if I post another 3 times... ;-)
Jon
Posted by: Jon | June 18, 2008 at 08:25 AM
A very useful post! I want to ask a quick question regarding the math involved in the calculation. We calculate the host memory slots based on the host memory divided by memory slot size. So in the examples below it should be 32/2 = 16 slots.
ESX Server 2 has 32GB of RAM and 8 CPU's running at 2GHz - with VC 2.0.x = 12 memory slots & 8 CPU slots; with VC 2.5 = 8 memory slots & 2 CPU slot
ESX Server 3 has 32GB of RAM and 8 CPU's running at 3GHz - with VC 2.0.x = 12 memory slots & 12 CPU slots; with VC 2.5 = 8 memory slots & 3 CPU slot
Is this correct or did I miss something in the article?
Alex
Posted by: Alex | June 23, 2008 at 03:54 PM
Great post, very informative. I wouldn't mind seeing how you manage to get 54 Outlook items open!!!!
Posted by: Daniel Eason | June 25, 2008 at 09:31 AM
Jon - you're correct, and I mistyped that. Thanks for the correction!
Posted by: Chad Sakac | June 26, 2008 at 06:38 PM
Great post! Thanks a million!
For everyone that has difficulties calculating this for their own specific environment, I have created a script that can do the hard work for you.
Download it here:
http://www.peetersonline.nl/index.php/vmware/helpful-script-of-the-day-ha-calculations/
Posted by: Hugo Peeters | July 25, 2008 at 08:39 AM
What if you have, say, 4 VMs (1 vCPU each) in a resource pool with 8000 MHz assigned to the resource pool... how does that affect the HA math?
Posted by: alex trip | April 14, 2009 at 03:23 AM
Chad -
I'm just trying to clarify slot size calculations and how to roll them up into figuring out the number of required servers in an HA cluster.
Let's say the largest VM is 4 vCPU and 16GB RAM (with a 16GB reservation). The ESX servers are all 2-socket, quad-core 3GHz with 32GB RAM.
With overhead (about 650MB), my RAM slot size would end up being around 17GB, which gives me less than two slots per ESX server. Is this correct?
Now, if I have a VM with only 1 CPU and 2GB RAM, it will still take up a slot. If I do not change the default slot size settings, the slot is roughly 75% wasted. Is this correct?
In this scenario, if I do not “tweak” the slot sizes, do I only get one VM per node since it works out to about 1.7 slots per node?
Dave
Posted by: Dave Convery | September 20, 2009 at 12:33 PM
Dave, for more recent info:
http://www.yellow-bricks.com/vmware-high-availability-deepdiv/
and like we discussed your assumptions are correct.
Posted by: Duncan | October 21, 2009 at 03:59 AM
it's really amazing :)
Posted by: amriendra | February 15, 2010 at 09:01 AM
Thanks for the tutorial, it's pretty helpful.
Posted by: Küchen Freiburg | June 08, 2010 at 10:17 AM
Greetings from Freiburg
Posted by: Küchen Freiburg | June 17, 2010 at 11:53 AM
Hi,
Please let me know how and when HA will recalculate slots after VMs are added to an HA cluster.
Thanks,
Khan
Posted by: khan | November 25, 2010 at 02:00 PM
Great post... and the comment "satisfying the reservation takes more than the end-user expects" made me laugh, because we had this error and it took me quite a bit of research and talking with VMware support to get us going. (And by the way, while I like VMware support - they are very talented - I think I got someone who wasn't too good; they just kept blowing us off, so I had to go through the manual.)
We were barely using 20% CPU and 30% RAM and couldn't power on another VM with a host failure cluster tolerates set to 1 (of 5 servers). When I changed to percentage, I was able to take it to 75% reservation and still power on VMs. (WHAT?). We fixed a lot of our VM reservation issues so we are good now, but I couldn't even imagine why the failure cluster tolerates wasn't working. So anyway, long story short (short story long?) this post about calculating the slot size, and how conservative that calculation is, was quite enlightening!! Thanks!!
My adventures in HA and Host Failure Cluster Tolerates here:
http://geekswing.com/geek/vmware-cpu-and-ram-reservations-fixing-insufficient-resources-to-satisfy-configured-failover-level-for-ha/
Posted by: Ben @ geekswing | June 04, 2013 at 05:38 PM