It’s a common refrain:
- “VMFS doesn’t scale because of locking”
- “2TB limit is a show-stopper”
- “Spanning VMFS across multiple extents is a really, really bad idea, and doesn’t buy you anything because they are just concatenated”
- “You can only have 32 VMs per VMFS”
OK – none, I repeat NONE of this is true.
I’m working on a chapter of a book to try to dispel some of this and clear the air (and also at the same time talk about all the great uses of NFS – THERE IS NO WAR – only certain people want there to be). It was also a recent topic amongst our EMC VMware Specialists.
To understand this better – and perhaps a balanced perspective from your friendly, technically accurate, (and NFS-lovin’) neighborhood VirtualGeek – please read on….
PS – I almost titled this “VMFS Manifesto” but then thought that sounded RIDICULOUSLY pretentious :-) (cloud aficionados, you get the joke)
Ok – this one tipped me over the edge:
I don’t mind so much when it comes from a “NFS is the ONLY WAY!!” bigot (at least you know it’s coming :-) , but Robin is a third party – so the FUD is in the mainstream (don’t get me wrong, it’s not the evil FUD that is intentionally incorrect like competitive team crapola, at least it’s innocent).
Ok – read this: http://www.vmware.com/resources/techresources/1059
Specifically – look at pages 2-5.
FALSE: “VMFS doesn’t scale because of locking”
I’ve talked about this before – SCSI locks are used when the VMFS metadata gets updated (VMs getting created/deleted, ESX snapshots, VM HA operations). For long-running operations (creating an eagerzeroedthick VM – like deploy from template/clone in 3.5) this can take minutes. After being FUDed, people seem to think that during this time, the other hosts are forced to sit idle and just chill from an I/O standpoint.
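For reference, here’s the kind of long-running operation we’re talking about, from the service console (a sketch only – the datastore path and size are made-up examples):
vmkfstools -c 20g -d eagerzeroedthick /vmfs/volumes/datastore1/bigvm/bigvm.vmdk
Every block gets zeroed up front – which is why it can take minutes – but the reservations themselves are only held during the metadata updates along the way, not for the whole operation.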
This table from the VMware whitepaper covers this.
The table isn’t totally clear – so let me see if I can explain it. In this test, there are two ESX hosts, and the goal is to determine the effect of SCSI reservations on I/O. The first test (“No virtual machine creation or deletion”) is the baseline. Then events that result in a SCSI reservation (creating and deleting a VM) were injected, and the impact was measured both on the adjacent host (Host 2) and on the host holding the reservation (Host 1).
This document shows that the impact of an active SCSI reservation on the adjacent host was about 7%, and on the host holding the SCSI reservation itself was about 2%. Look – it’s a best practice to schedule those operations, but you can ABSOLUTELY do them.
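By the way – if you want to check whether reservation conflicts are actually biting a host in YOUR environment, the vmkernel log on the service console is a reasonable place to look (a quick sketch – the exact message text varies a bit across ESX versions):
grep -i "reservation conflict" /var/log/vmkernel
The odd conflict here and there is normal and gets retried; a constant stream of them is the sign that you should stagger those metadata-heavy operations.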
FALSE: “2TB is a show-stopper”/“multi-VMFS extents are a really, really bad idea, and don’t buy you anything because they are just concatenated”
The “2TB limit of VMFS” isn’t actually a VMFS-3 limit, it’s a VMware LVM limit – the VMware volume manager uses MBR-style partition tables rather than GPT (and that will also be the case in vSphere). A VMFS volume can have a total of 32 extents, so up to 64TB all told. There’s also a lack of understanding (often by ourselves too!) about how the extents are used. In VMFS-3, VM placement rotates between the extents, so you start to get the benefit without needing to “fill up an extent” first.
Also – in VMFS-3 (this was not the case with VMFS-2 used in ESX 2.x), you can actually remove the LUN backing a VMFS extent in a multi-extent VMFS filesystem and the filesystem is still accessible (though not if you lose the first, original extent – but that’s no worse than a single VMFS volume/LUN/filesystem config) – so much of the “it will self-destruct if you breathe on it wrong” talk is way off base.
In other words – a lot of “DON’T USE EXTENTS OR THE WORLD WILL END!!!” is rooted in VMFS-2.
There’s a significant upside to multi-extent VMFS filesystems – you increase your host LUN queue count with every LUN you add (with an array-based technique like MetaLUN objects, you increase the LUN queues on the array side, but not on the ESX host side). Some of the highest performance VMFS volumes in the world use this technique – and customers LOVE THEM (see below for a bit more on why).
Don’t get me wrong – you need to be a brown-belt storage admin (not a yellow belt) to reasonably use this – and really should leverage array multi-LUN consistency technologies to manage the group of LUNs as a unit, particularly as it replicates. Glad to say we can happily help with that.
In general – 2TB represents a larger limit than most customers need – which is why “just use a single extent/single LUN” is the “happy/simple” rule.
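For what it’s worth – adding an extent is normally a VI Client operation, but there is a service console equivalent, shown here strictly as a sketch (the device paths are made-up examples; the new extent comes first and the head extent second – and note the partition being added gets wiped in the process, so handle with care):
vmkfstools -Z /vmfs/devices/disks/vmhba1:0:1:1 /vmfs/devices/disks/vmhba1:0:0:1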
FALSE: “You can only have 32 VMs per VMFS”
I think the origins of the very conservative numbers are from EMC ourselves – we wanted to make sure people didn’t shoot themselves in the foot. 16 VMs was our early recommendation, but that was classic over-engineering. 32 very I/O intensive VMs do just fine. Heck, we’ve shown 64 VMs using VMware View Composer and there wasn’t any issue. So the question is – what’s the limit? ANSWER: there is NO EFFECTIVE LIMIT. Depending on how you format your VMFS filesystem, you will have a varying maximum number of file objects – but that’s always very high relative to the number of VMs per VMFS volume. Ok – then there’s the other limit – UPDATE – thanks Stu – 192 vCPUs per ESX host (I don’t think this has changed in vSphere either, but will verify) – in cases with MANY VMs per datastore, that usually means many per ESX host, and then you hit that limit.
So what is the practical limit? The doc spells it out pretty well. First – a handy-dandy table to estimate.
and a specific recommendation on the ESX host:
“The sum of active SCSI commands from all virtual machines running on an ESX host sharing the same LUN should not consistently exceed the LUN queue depth configured on the ESX host, to a maximum of 64.” (NOTE: THIS DOESN’T TRANSLATE TO X VMs)
and on the Array:
“Determine the maximum number of outstanding I/O commands to the shared LUN (or VMFS volume). This value is specific to the storage array you are using, and you might need to consult your storage array vendor to get a good estimate for the maximum outstanding commands per LUN. A latency of 50 milliseconds is usually a reliable indication that the storage array either does not have enough resources or is not configured optimally for its current use.”
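To make that concrete with some napkin math (illustrative numbers, not a recommendation): with a LUN queue depth of 32, and busy VMs each averaging 4 outstanding SCSI commands, you get roughly 32 / 4 = 8 concurrently-busy VMs per host on that VMFS volume. But most real-world VMs average well under 1 outstanding command – which is exactly how dozens of typical VMs can share one datastore without ever filling the queue.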
Easy to look at the ESX side of it – the command’s right on page 2.
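No doc handy? esxtop on the service console gives you a live view (a sketch based on ESX 3.x – keystrokes and column names may vary slightly by version):
esxtop    (then press “d” for the disk view)
Watch the ACTV (active commands) and QUED (commands queued) columns – if QUED is consistently non-zero, you’re overrunning the queue depth.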
Now – the trick is that the queue depth consumption will depend on how busy the VMs are, and how quickly the array can service those requests. Faster array, faster storage configuration, more VMs per VMFS. For busier VMFS volumes, increasing the LUN and HBA queue depth can help (this is pretty corner-case, if you ask me – stick with the defaults unless you know enough to be sure that this is the right thing to do – deeper queues can increase latency), which is covered for QLogic and Emulex HBAs in the excellent VMware Fibre Channel SAN Configuration Guide.
Also, if you have a Spanned VMFS volume, you have many LUN queues (ergo more parallelism)
- To set the maximum queue depth for a QLogic HBA:
- Log in to the service console as root.
- Verify which QLogic HBA module is currently loaded: vmkload_mod -l | grep qla2300
- Depending on the model of the HBA, the module can be one of the following:
qla2300_707 (ESX Server 3.0.x)
qla2300_707_vmw (ESX Server 3.5)
- This example uses the qla2300_707 module.
- esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
- esxcfg-boot -b
- In this case, the HBAs driven by the qla2300_707 module will have their LUN queue depth set to 64 once the host is rebooted (ql2xmaxqdepth is the module parameter the driver reads).
- You also need to update the advanced settings parameter “Disk.SchedNumReqOutstanding”, which governs how many outstanding requests the VM guests in aggregate can push to a single LUN – it needs to rise to match the new queue depth. Here’s a screenshot from the vSphere RC bits – but it’s in the same place in ESX 3.5, and will be in the same place in the GA build.
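If you prefer the service console to the GUI, the same knob can be set there (a minimal sketch, assuming the ESX 3.x esxcfg-advcfg syntax):
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding
The first command sets it to match the new queue depth of 64; the second reads it back to confirm.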
(since I can’t help myself) – another “just leave it at the default, but that’s kinda neat” setting is the Disk.MaxLUN setting – if you make it less than 256, it can speed up cluster rescans – which btw DO seem faster in general on vSphere… In general, I don’t recommend changing this setting, as you’re optimizing for “argh – this 5 minute wait is killing me” vs. “omigod, I can’t figure out why LUN 56 isn’t showing up!!!”, and it’s too easy to forget to change it cluster-wide.
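If you do decide to tinker anyway (and know the highest LUN number actually in use), it’s the same esxcfg-advcfg pattern – again a sketch, and 64 is just an example value:
esxcfg-advcfg -s 64 /Disk/MaxLUN
That tells the host to stop scanning at LUN 63 – and is precisely the setting you’ll forget about on one host when LUN 64 shows up someday.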
Gang – VMFS and NFS are both filesystems (in NFS’s case, it’s an export of a filesystem on an NFS server). Both of them have these constructs (LUN queues, LUN counts/maximums, volume managers) – the only question is WHERE those limits occur. In one case (VMFS), they are on the VMware host; in the other (NFS), they are on the NFS server (where they are obscured and handled internally by the file server itself).
BTW – all this stuff has been out there for a while, and is covered in the excellent Symmetrix, CLARiiON and Celerra VMware Solutions Guides. They’re getting updated now for vSphere, so stay tuned for the updates.
Moral of the story? In both cases, the defaults are GOOD ENOUGH. People throw stuff around to put the fear of god into people to move them to their agenda. Don’t listen to things that look/smell/feel like FUD.
Think of it this way:
- Designing for “good enough for most cases” (which is where the “best practices” are generally rooted) has about the same degree of complexity for both (simple).
- VMFS and NFS have about the same degree of complexity (complex) when you are aiming for the maximum limits – the question is where the design gets hairy.
- In the VMFS case, it’s about LUN queues on the ESX host, the array and its back-end spindle design, and whether you choose to span VMFS across multiple extents to involve multiple LUN queues.
- In the NFS case, it’s about ESX optimization for NFS client memory/buffering, scaling out the number of datastores and VM distribution to get around the maximum number of TCP sessions per datastore (or moving to 10GbE), and careful back-end design of the filesystem – the FlexVol/Aggregate (NetApp) or AVM/dVol (EMC Celerra) configuration.