It’s a common refrain:
- “VMFS doesn’t scale because of locking”
- “2TB limit is a show-stopper”
- “Spanning VMFS across multiple extents is a really, really bad idea, and doesn’t buy you anything because they are just concatenated”
- “You can only have 32 VMs per VMFS”
OK – none, I repeat NONE of this is true.
I’m working on a chapter of a book to try to dispel some of this and clear the air (and also at the same time talk about all the great uses of NFS – THERE IS NO WAR – only certain people want there to be). It was also a recent topic amongst our EMC VMware Specialists.
To understand this better – and perhaps a balanced perspective from your friendly, technically accurate, (and NFS-lovin’) neighborhood VirtualGeek – please read on….
PS – I almost titled this “VMFS Manifesto” but then thought that sounded RIDICULOUSLY pretentious :-) (cloud aficionados, you get the joke)
Ok – this one tipped me over the edge:
http://storagemojo.com/2009/03/20/ciscot-bong-sized-cloud-telcos-only/#comments
I don’t mind so much when it comes from an “NFS is the ONLY WAY!!” bigot (at least you know it’s coming :-) ), but Robin is a third party – so the FUD is in the mainstream (don’t get me wrong, it’s not the evil FUD that is intentionally incorrect like competitive team crapola, at least it’s innocent).
Ok – read this: http://www.vmware.com/resources/techresources/1059
Specifically – look at pages 2-5.
FALSE: “VMFS doesn’t scale because of locking”
I’ve talked about this before – SCSI locks are used when the VMFS metadata gets updated (VMs getting created/deleted, ESX snapshots, VM HA operations). For operations that take a long time (creating an eagerzeroedthick VM – like deploy from template/clone in 3.5) this can take minutes. After being FUDed, people seem to think that during this time, the other host is forced to sit idle and just chill from an I/O standpoint.
This table from the VMware whitepaper covers this.
The table isn’t totally clear – so let me see if I can explain it. In this test, there are two ESX hosts, and the goal is to determine the effect of SCSI reservations on I/O. The first test (“No virtual machine creation or deletion”) is the baseline. Then events that result in a SCSI reserve (creating and deleting a VM) were injected, and the impact on the adjacent host (Host 2) and on the host holding the reservation (Host 1) was measured.
This document shows that the impact of an active SCSI reservation on adjacent hosts was about 7%, and on the host with the SCSI reservation itself was 2%. Look – it’s a best practice to schedule those operations, but you can ABSOLUTELY do them.
FALSE: “2TB is a show-stopper”/“multi-VMFS extents are a really, really bad idea, and don’t buy you anything because they are just concatenated”
The “2TB limit of VMFS” isn’t actually a VMFS-3 limit – it’s a VMware LVM limit. The VMware volume manager uses CHS rather than GPT partition mechanisms (this will also be the case in vSphere). A VMFS volume can have a total of 32 extents. There’s also a lack of understanding (often on our part too!) about how the extents are used. In VMFS-3, VM placement rotates between the extents, so you start to get the benefit without needing to “fill up an extent” first.
Also – in VMFS-3 (this was not the case with VMFS-2 used in ESX 2.x), you can actually remove the LUN backing an extent in a multi-extent VMFS filesystem and the filesystem is still accessible (though not if you lose the first, original extent – but that’s no worse than a single VMFS volume/LUN/filesystem config) – so much of the “it will self-destruct if you breathe wrong” talk is way off base.
In other words – a lot of “DON’T USE EXTENTS OR THE WORLD WILL END!!!” is rooted in VMFS-2.
There’s a significant upside to multi-extent VMFS filesystems – you increase your host LUN queue count for every LUN you add (with an array-based technique like MetaLUN objects, you increase the LUN queues on the array side, but not on the ESX host side). Some of the highest-performance VMFS volumes in the world use this technique – and customers LOVE THEM (see below for a bit more on why).
Don’t get me wrong – you need to be a brown-belt storage admin (not a yellow belt) to use this reasonably – and you really should leverage array multi-LUN consistency technologies to manage the group of LUNs as a unit, particularly when it’s replicated. Glad to say we can happily help with that.
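For the curious, here’s roughly what growing a VMFS-3 datastore onto an additional extent looks like from the service console. This is a hedged sketch – the device paths are hypothetical, vmkfstools will warn you that data on the partition being added is lost, and you should check the exact syntax against the documentation for your ESX version:

# List what ESX sees (the device names below are made up for illustration)
ls /vmfs/volumes
ls /vmfs/devices/disks

# Span an existing VMFS-3 volume onto an additional partition:
#   first argument  = the partition being added as a new extent
#   second argument = the head (first) partition of the existing VMFS volume
vmkfstools -Z /vmfs/devices/disks/vmhba1:0:11:1 /vmfs/devices/disks/vmhba1:0:10:1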
In general – 2TB represents a larger limit than most customers need – which is why “just use a single extent/single LUN” is the “happy/simple” rule.
FALSE: “You can only have 32 VMs per VMFS”
I think the origins of the very conservative numbers are from EMC ourselves – we wanted to make sure people didn’t shoot themselves in the foot. 16 VMs was our early recommendation, but that was classic over-engineering. 32 very I/O-intensive VMs do just fine. Heck, we’ve shown 64 VMs using VMware View Composer and there wasn’t any issue. So the question is – what’s the limit? ANSWER: there is NO EFFECTIVE LIMIT. Depending on how you format your VMFS filesystem, you will have a varying maximum number of file objects – but that’s always very high relative to the number of VMs per VMFS volume (there’s a quick sketch below of where that formatting choice gets made). OK – then there’s the other limit – UPDATE – thanks Stu – 192 vCPUs per ESX host (I don’t think this has changed in vSphere either, but I will verify). In cases with MANY VMs per datastore, that usually means many per ESX host, and then you hit that limit.
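As referenced above, the formatting choice is made when the datastore is created. Here’s roughly what that looks like from the service console – a sketch only, and the device path and volume label are hypothetical:

# Create a VMFS-3 datastore; -b picks the block size (1m/2m/4m/8m), which governs
# the filesystem maximums, and -S sets the volume label
vmkfstools -C vmfs3 -b 1m -S my_datastore /vmfs/devices/disks/vmhba1:0:10:1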
So what is the practical limit? The doc spells it out pretty well. First – a handy-dandy table to estimate.
and a specific recommendation on the ESX host:
“The sum of active SCSI commands from all virtual machines running on an ESX host sharing the same LUN should not consistently exceed the LUN queue depth configured on the ESX host, to a maximum of 64.” (NOTE: THIS DOESN’T TRANSLATE TO X VMs)
and on the Array:
“Determine the maximum number of outstanding I/O commands to the shared LUN (or VMFS volume). This value is specific to the storage array you are using, and you might need to consult your storage array vendor to get a good estimate for the maximum outstanding commands per LUN. A latency of 50 milliseconds is usually a reliable indication that the storage array either does not have enough resources or is not configured optimally for its current use.”
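To make those two recommendations concrete, here’s a back-of-the-envelope example – all the numbers below are hypothetical and purely for illustration (they are not from the whitepaper):

# Assume a LUN queue depth of 32 on the ESX host (a common default), and that a
# consistently busy VM keeps about 4 SCSI commands outstanding against that LUN.
echo $(( 32 / 4 ))   # => roughly 8 consistently busy VMs per host on that LUN
# Mostly idle VMs contribute almost nothing to the queue – which is why a flat
# "X VMs per VMFS" rule is the wrong way to frame it.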
Easy to look at the ESX side of it – the command’s right on page 2.
Now – the trick is that the queue depth you need will depend on how busy the VMs are, and how quickly the array can service those requests. Faster array, faster storage configuration, more VMs per VMFS. For busier VMFS volumes, increasing the LUN and HBA queue depth can help (this is pretty corner-case, if you ask me – stick with the defaults unless you know enough to be sure this is the right thing to do, as deeper queues can increase latency). This is covered for QLogic and Emulex HBAs in the excellent VMware Fibre Channel SAN Configuration Guide.
Also, if you have a spanned VMFS volume, you have many LUN queues (ergo more parallelism).
- To set the maximum queue depth for a QLogic HBA:
- Log in to the service console as root.
- Verify which QLogic HBA module is currently loaded: vmkload_mod -l | grep qla2300
- Depending on the model of the HBA, the module can be one of the following:
qla2300_707 (ESX Server 3.0.x)
qla2300_707_vmw (ESX Server 3.5)
- The example below uses the qla2300_707 module.
- Set the queue depth option for that module: esxcfg-module -s ql2xmaxqdepth=64 qla2300_707
- Rebuild the boot configuration so the setting persists: esxcfg-boot -b
- In this case, LUNs on the HBAs driven by that module will have their queue depth set to 64.
- Reboot the host.
- You also need to update the advanced setting “Disk.SchedNumReqOutstanding”, which governs how many outstanding requests can be pushed from the VM guests in aggregate – it needs to rise to match the new queue depth. Here’s a screenshot from the vSphere RC bits – but it’s in the same place in ESX 3.5, and will be in the same place in the GA build.
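If you prefer the service console to the GUI, this setting can also be read and changed with esxcfg-advcfg. A sketch only – double-check the syntax and the value against the VMware documentation for your ESX version before relying on it:

# Check the current value of the advanced setting
esxcfg-advcfg -g /Disk/SchedNumReqOutstanding

# Raise it to match the new LUN queue depth from the steps above
esxcfg-advcfg -s 64 /Disk/SchedNumReqOutstanding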
(since I can’t help myself) – another “just leave it at the default, but that’s kinda neat” setting is Disk.MaxLUN – if you make it less than 256, it can speed up cluster rescans – which, BTW, DO seem faster in general on vSphere… In general, I don’t recommend changing this setting: you’re optimizing for “argh – this 5-minute wait is killing me” vs. “omigod, I can’t figure out why LUN 56 isn’t showing up!!!”, and it’s too easy to forget to change it cluster-wide.
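If you do decide to change it anyway, it’s the same esxcfg-advcfg mechanism as the sketch above – and remember it has to be done on every host in the cluster:

# Only scan LUN IDs below 64 instead of the default 256
esxcfg-advcfg -s 64 /Disk/MaxLUN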
In conclusion
Gang – VMFS and NFS are both filesystems (in NFS’s case, it’s an export of a filesystem on an NFS server). Both of them have these constructs (LUN queues, LUN counts/maximums, volume managers) – the only question is WHERE those limits occur. In one case (VMFS), they are on the VMware host, and in the other (NFS) they are on the NFS server (where they are obscured and handled internally by the file server itself).
BTW – all this stuff has been out there for a while, and is covered in the excellent Symmetrix, CLARiiON and Celerra VMware Solutions Guides. They’re getting updated now for vSphere, so stay tuned for the updates.
Moral of the story? In both cases, the defaults are GOOD ENOUGH. People throw stuff around to put the fear of god into people to move them to their agenda. Don’t listen to things that look/smell/feel like FUD.
Think of it this way:
- Designing for “good enough for most cases” (which is where the “best practices” are generally rooted) has about the same degree of complexity for VMFS and NFS (simple).
- VMFS and NFS have about the same degree of complexity (complex) when you are aiming for maximum limits – the question is where does the design get hairy?
- In the VMFS case, it’s about LUN queues on the ESX host, the array and the back-end spindle design, and whether you choose to span the VMFS across multiple extents to involve multiple LUN queues.
- In the NFS case, it’s about ESX optimization for NFS client memory/buffering, scaling out the number of datastores and distributing VMs to get around the maximum number of TCP sessions per datastore (or moving to 10GbE), and careful back-end design of the filesystem – the FlexVol/Aggregate (NetApp) or AVM/dVol (EMC Celerra) configuration.
AMEN to that post brother. It's nice to have your weight behind it ... NFS, like VMFS, IS JUST A FILE SYSTEM PEOPLE!! (I know NFS is a protocol backed by a filesystem, but you get the point) ... the world has not changed because "someone" has willed it to. Queues are everywhere, FS locking due to metadata updates are everywhere, optimizing i/o dispersion is everywhere, scaling considerations are true across the board. It's about balancing simplicity with scalability. Sometimes the cookie cutter approach is "good enough" and you will be equally served regardless of the implementation you select, but it is always good to see through the fog of FUD and know when it is time to put on your "big boy" hat and get down to business building a truly scalable solution. It's also nice to have a solution that is both turn-key easy and which provides you with the tools that you need at your disposal when that time comes.
Posted by: Aaron | March 31, 2009 at 03:42 PM
My $0.02... when asked about VMFS volume "vm capacity" by our customers, I always aim to spread the VMs equally among all the available VMFS volumes. On a few occasions I recommended a single VM per VMFS volume (high IO-demanding apps) but the rule of thumb is always to spread them around to keep the IO down.
On another note... I am sure you've seen Intel releasing X5500 yesterday (the server flavor). What is your take on Nehalem's benefits in virtualization? I think it will be massive - I can't see many bloggers getting excited about this chip which I find a little surprising considering the tremendously positive improvements it has specifically in the realm of virtualization... I would be curious to find out to what extent vSphere will take advantage of the new virtualized hardware instruction set in Nehalem. Do you know if vSphere will take support of Nehalem further than what was just announced in ESX Upd.4?
Best,
Posted by: Paul Wegiel | March 31, 2009 at 03:53 PM
On the first comment - that's one design philosophy, and a good one Paul. KISS.
As customers get a bit bigger, I personally like to augment that by saying: 'have two standard VMFS container definitions – a "cheap n' dirty" and a "heavy". Virtually provision the backend storage to be efficient in both cases. Start efficient, and you will stay efficient. Use specialized VMFS containers or RDMs for focused use cases (and in the database use cases, consider Thick – Thin/VP buys you little there). Every cluster, IMHO, should have at LEAST one NFS datastore – as they're REALLY handy dandy.'
Paul - there are a LOT of Nehalem optimizations in vSphere - and other ones too (like the Intel I/O AT optimization in VMDirectPath).
EVC for example (on Intel) uses the Nehalem FlexMigration. As we (EMC) migrate to Nehalem over the next while, there are LOADS of optimizations we can make.
The day is coming soon!!!
Posted by: Chad Sakac | March 31, 2009 at 04:05 PM
another excellent post. Just on the current vCPU limits, it's actually 192 on 3.5 (but defaults to a lower value). Oddly enough the max number of VM's per host is "only" 170 however (mind you, we have hit that limit with non-prod 32 core / 128GB RAM boxes).
Next version of ESX will indeed be higher, but I dare not say the number in case I get in trouble ;)
Posted by: Stu | March 31, 2009 at 04:11 PM
Good post Chad, it is interesting to see someone defend block-level technologies. I usually have to defend NFS from all the FUD surrounding it! I even did a recent post on all the FUD that swirls around NFS and VMware. I'm with you though, both block-level and file-level protocols work great for VMware and each has its perks and advantages. Interesting bit about extents – I still cringe when I think about them, but you are right, that was in the VMFS-2 days. VMware documentation will have to catch up on its best practices for the number of VMs per VMFS datastore though, as most of the docs I see still refer to a smaller number of VMs per VMFS datastore. You are absolutely right though that with NFS you want to focus on the back-end disk, ensuring you have enough horsepower there.
Posted by: Keith Aasen | April 02, 2009 at 12:18 PM
Chad,
I have been digging around and cannot find any information about maximums in vSphere. I see you reference that the LUN limit will still be 2TB in vSphere. Do you know if the 2TB limit will also apply to RDM in vSphere? Thanks for the excellent post.
Posted by: Adam | April 06, 2009 at 10:57 AM
There are dependencies on changes that were made in 3.5 that are not present in 3.0. I am currently researching to determine when the changes were made to the algorithm for placement per extent. The statement about the brown belt is 100% true. After going into countless environments that utilize extents without proper SAN design and are experiencing massive problems, please do not underestimate the potential problem. If you manage 20 or fewer VMs per LUN with multiple extents per datastore, you're going to be fine. Just always remember that the basics apply. The response time of the array is always the most important factor. If things are moving fast from a reservation request to response, there will not be any problems from that perspective. If you do get into a reservation problem, remember that the problem expands exponentially -- it basically feeds on itself.
True parallelism will not occur until multiple paths can be used with "PowerPath", etc. We still have only one queue on the HBA for read/write requests. You can split I/O on active/active arrays across multiple HBAs with manual effort. If you get extent-happy, you will run out of LUN capacity per cluster. You can only have 256 LUNs per host, and that translates to the same limit per cluster. Once you get over 100+ LUNs on a host, overall performance can degrade due to management overhead. I have seen this many times.
Chad is correct if you have Yoda doing the SAN. If they just give you a RAID 5 with 14 drives in it, you have been warned.
If extents are used properly, they can be very powerful. I hope to complete a detailed white paper on this topic.
Able
Posted by: David Able | April 06, 2009 at 12:38 PM
First off, let me say great article. I agree 32 VMs per volume is a joke – it wasn't uncommon for me to have a hundred vmdks on a volume; it's all about understanding workloads.
I think the SCSI lock example actually presents a somewhat less-than-real-life point of view. I agree, with only 2 hosts attached to a datastore, and only 1 host creating a VM, the lock performance hit is small. However, one thing that causes a GREAT DEAL of lock activity is running snapshots (and backup tools that use them, we know the ones). Now imagine a "mid-level" ESX farm with 8 hosts, let's say 6 VM's each, all backed to a single VMFS. Turn on snapshots for 10% of those, so 5 or 6, and now create a VM that takes 15 mins. Bet you'll see more than 7% performance loss...
My point is, locks do matter. And if you keep good practice around things like snapshotting, you'll probably be fine. However, you still need to be conscious of it, and scale can be impeded without knowledge and attention during architecture.
Posted by: GP | April 08, 2009 at 05:02 PM
Can you please provide a reference for the statement that you no longer have to fill a LUN before additional LUNs are used when you add an extent? I have a client interested in this configuration, but we have not been able to track down any documentation of this behaviour for VMFS-3.
Posted by: Kurt Lamoreaux | April 24, 2009 at 07:36 PM
The VMFS-3 resource manager will use any and all extents that make up a spanned VMFS volume when it comes to allocating new space.
The resource manager bases its block allocation decisions on a variety of factors and I can't elaborate on the exact details. However, the net effect is that blocks from any LUN in a spanned volume may be allocated at any time; the exact sequence varies by volume, connectivity, sequence of events, etc. In other words, it is not true that one needs to fill up a particular LUN to force block allocation from the next LUN in a spanned volume. This holds for all versions of VMFS3 and all versions of ESX that supported VMFS3.
Seems like we should include this in our documentation for posterity, but for now, you'll have to take my word for it -- I wrote the code (with a few cronies).
PS: Nice post, Chad.
Posted by: Satyam Vaghani | May 13, 2009 at 05:38 PM
Chad, great post but just as you were incensed by Robin's comments, you stated "it's not the evil FUD that is intentionally incorrect like competitive team crapola, at least it's innocent". The EMC Competitive Team never puts out crapola. Geez Man.
Posted by: Andrew Linnell | June 11, 2009 at 12:36 PM
Sorry Andrew - of course I was thinking about the "other guy" (not any competitor in particular) as I said that sentence, but that's a fair critique - others would say the same in reverse I'm sure.
You guys do try hard, and I didn't intend to be hurtful - I apologize.
It's a hard gig, because inherently you have to work not only to find the strength in what we have but also to find the weaknesses of others and contrast them - which tends to become negative.
I have found that almost all competitive content ends up out of date, and all those "checkmark tables" don't stand up to tight scrutiny (because the author of the table picks the rows) - it's so hard to keep current.
That said - the recent work of the team on the ONE portal is very, very good, and I find more fresh and current.
Posted by: Chad Sakac | June 12, 2009 at 12:19 AM
Chad,
Duncan pointed out your take on VMware extents. I put up a question on VMware Communities, and I would love to hear your suggestion. THX!
http://communities.vmware.com/thread/224616?tstart=0
Posted by: Mike | August 06, 2009 at 07:44 AM
Thanks Mike - commented on the post. You should strongly consider multi-extent VMFS, I think it would help you in your case.
Posted by: Chad Sakac | August 06, 2009 at 07:14 PM
“Spanning VMFS across multiple extents is a really, really bad idea, and doesn’t buy you anything because they are just concatenated”
Well, I just ran into this issue: try re-signaturing, on an ESX4 host (post upgrade), a VMFS volume with multiple extents that was created on ESX3. It appears that it only re-signatures the first extent and throws an error during the re-signature.
http://communities.vmware.com/message/1403260
Posted by: Jon Bohlke | February 23, 2010 at 02:06 PM
Just a comment... a lot of your perspectives are fairly skewed, and some of your conclusions are incorrect. Let me elaborate.
1. "VMFS doesn’t scale because of locking"
A SCSI-2 Reservation is an exclusive lock of a LUN. When a LUN is reserved, only the host which currently has the reserve can write to the disk. All other I/O from all other hosts will receive a reservation conflict and will have to retry their I/O. These are facts, and for any technical person there is no debating the effects of this architecture. SCSI spec's can be found here if you want to learn more about how a reserve work: t10 org
Blocking of other hosts ability to perform I/O negatively impacts performance, and most importantly is the scalability considerations. Meaning, more VM's... more hosts... equates to more reservations... and greater impact. It's an exponential problem. The performance example you cite is only of a 2 host system with a single VM, which has no relevance to scale. Meaning, you cannot draw a scalability conclusion... with no scale in the measurement. Hence your conclusions are incorrect. In fact, I would have assumed that the SCSI reservation would have no impact to VM's running on the same host (as it's locking the LUN exclusively for that hosts accesses). The test data you cite actually exposes the opposite, so in my book the problem is actually worse than I had originally thought. So you have actually helped educate me of exactly the opposite of your argument. It's outlined that performance impact in the most simplistic scenario possible is measurable (and arguably significant), which will have considerable impact in scale.
Additionally, I find your statement that I’m not supposed to start a VM, stop a VM, create a VM, delete a VM, backup a VM, vMotion a VM, ever use a dynamic VMDK, or extend a VMFS volume... unless during a maintenance window to be ridicules at best.
2. "Spanning VMFS across multiple extents is a really, really bad idea, and don’t buy you anything because they are just concatenated"
I consider this to be a religious debate. Some people would argue that spanning volumes (VMFS or otherwise) is a terrible thought. And they could tell horror stories into the wee hours of the night... additionally I could build a very solid argument and cite numerous of examples in why it should never be done. However, it is a feature that does provide value and some people like it and are successful with. So I'm going to deem this only to be personal opinion, and you are entitled to your opinion.
3. "You can only have 32 VMs per VMFS"
For starters, your statement that there is no limit is incorrect. VMware has published maximums here: vsp_40_config_max.pdf
Most importantly, the reason there are strong recommendations in the ecosystem to limit the number of VM's placed on a VMFS volume is associated with the scalability issues resulting from SCSI reserve locking (item #1 above). So your conclusions are completely incorrect. You are using a completely separate set of data, to compare to why the scalability recommendations exist. The test data you reference is with a system that is up and running with zero metadata operations occurring (aka. no locking). Again, I would like to thank you... you have convinced me that things are actually worse than I thought. I thought that the SCSI reservation exponential scalability issues were the only concern, but now you highlight that core VMFS itself has scalability considerations I should be concerned with on top of that.
Frankly I find your comment of "good enough" to be just silly... I know very few of my customers who would accept that answer.
My reminder to others out there who are accepting this blog as simple fact, that it is (just like any blog) only opinion. YOU should test and evaluate what's right for your environment when deploying. As this blog is spreading just as much FUD as it is trying to defuse; and I personally find it quite hilarious that people are leveraging it in thesis work.
Posted by: John | March 12, 2010 at 07:18 PM
@John - thanks for your comments.
First of all - just to be clear, I ran the post (prior to posting) directly by the VMware engineering team to ensure technical accuracy, so I stand by everything in the post (some things have gotten a little dated, but more in the sense of ongoing progress/improvement, not incorrectness).
Next, let's get down to the arguments.
I never said that SCSI reservations aren't one of the scaling factors for VMFS, but that the FUD way of saying it is “VMFS doesn’t scale because of locking”.
My point was that incorrect understanding (and legacy views) have led many folks to overestimate this effect (the source of the "no more than ____ VMs per datastore" position that people passionately defend).
My point on the duration of the lock is that the lock is only held (affecting IO to adjacent hosts) while the hidden lock file is modified, not for the duration of the operation (for example creation of a VM). These are very small time periods.
There are many, many customers using VMFS-3 datastores with many, many VMs, undergoing many, many metadata updates (add to the original list extensions of thinly provisioned VMDKs).
The position held by many ("VMFS doesn't scale due to locking", often stated as "don't put more than 12/20/32 VMs in a datastore") is not an accurate statement.
I'll go one step further though - the SCSI locking mechanism isn't ideal. There ARE points where excessive reservations can cause problems.
One area of work here is the new T10 SCSI commands around Atomic Test and Set. http://www.t10.org/cgi-bin/ac.pl?t=d&f=09-100r0.pdf; http://www.t10.org/cgi-bin/ac.pl?t=d&f=09-100r1.pdf
(and yes, I follow all the standards bodies - please don't cast aspersions on me; I'm not casting any on you)
This changes the lock from covering the whole LUN to covering only the modified extents (if the SCSI target supports this). Early testing on this (which I showed at VMworld in the VMware/EMC supersession) has shown that scaling of VMs per datastore (and behavior during periods of very high congestion) was flat and linear into hundreds of VMs with hundreds of metadata updates in very short periods.
This will be supported in the next vSphere release under the vStorage APIs for Array Integration, and every storage vendor has the opportunity to support this improved model (it will be supported on CX4, modern NS arrays, and V-Max - possibly also DMX).
You also certainly took my final recommendation completely out of context.
I said "Look – it’s a best practice to schedule those operations, but you can ABSOLUTELY do them.".
You said I said:
"that I’m not supposed to start a VM, stop a VM, create a VM, delete a VM, backup a VM, vMotion a VM, ever use a dynamic VMDK, or extend a VMFS volume... unless during a maintenance window to be ridicules at best." (spelling errors yours, not mine, but hey, it's a blog comment, I make spelling errors all the time).
OK - I should have been more clear: "if you're doing an operation that you know is going to trigger an overwhelming number of metadata updates, you should schedule it, but you can absolutely do those tasks during normal operations".
On the extent question:
To me, you are making my point. The view that multi-extent VMFS volumes are crazy is mostly rooted in the fact that in ESX 2.x it WAS crazy, and there were horror stories.
The impact/risk of losing an extent in a multi-extent VMFS is in MOST cases the same as in a single-extent scenario.
- lose first extent - lose datastore (same)
- lose any other extent - lose access to the files on that extent, remaining extents are accessible.
- it is possible (though VMFS allocation works to avoid this - allocation has a preference for host affinity and sequentiality of files) for a single file to span an extent boundary. The probability of this causing file-level corruption is no greater or less than a file within a single extent being corrupted when the extent is lost/removed
I'm sorry, but I disagree with your conclusion - best practices change with the times, and the fact that someone has a horror story from the past is not a reason not to reexamine them as the underlying architecture changes.
Like the first topic (VMFS locking mechanics), there is more to be done here. Ideally, VMware will update the LVM to move from CHS to a GPT-like model, which would enable single larger devices, as most storage arrays can support LUN sizes larger than 2TB.
On to the last one - number of VMs per datastore.
You're right, this is governed by locking, but it doesn't correlate with "number of VMs" or "amount of IO" (which people immediately think of). Rather, it depends on the "frequency of metadata updates which trigger a lock against the hidden lock files", which is a much more difficult thing to grasp.
A VMFS volume being used by Lab Manager with very light IO workloads in the VMs will be under much more "reservation pressure" than 100 VMs which are driving a pile of IOps/MBps and are periodically snapshotted, created/deleted, or extended.
Until VAAI ATS is out, these cases of relatively low bandwidth (except if using 10GbE), lower latency requirements, but tons of metadata updates often work well on NFS datastores, something EMC certainly embraces (and man, I sure do).
The "Maximum" listed in the document you reference (256 VMs per datastore) is not a hard limit (will double check to verify this), the actual limit is much much higher (a ridiculously high number).
BTW - some storage targets handle these SCSI reservation conditions less gracefully than others, and don't release the reservation (this ends in a bad state). This usually occurs when the queues are also very deep.
This article discusses this topic in more detail.
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1005009&sliceId=1&docTypeID=DT_KB_1_1&dialogID=1034093&stateId=0 0 681678
For anyone who is doing thesis work, I would highly recommend Satyam Vaghani's session at VMworld TA3220 which discussed many of these specific topics.
Re my recommendation at the end: in my experience, customers tend to overthink this.
The most important design principle in VMware use cases (WARNING THIS IS MY OPINION!) is flexibility, and being able to understand and identify when you've hit an operational limit what to do about it.
You can change almost everything non-disruptively, but industry vets have been "trained" to try to nail down every config element, which invariably leads to over-engineered configurations.
Do limits exist - yup. Just like anything.
Lastly - advice (take it or leave it, I bet you'll leave it), the art of dialog and mutual learning (I have many things to learn from everyone, and in my experience, the same applies to every human) **depends on being able to disagree without being disagreeable.**
Perhaps a nice bottle of wine on a Friday, and chill out a little?
Oh - BTW - all that "custom thesis" set of comments are plain-jane spam, I have to delete them, but have my day job, so batch it up - Typepad's anti-spam catches a lot, but a lot still creeps in.
Posted by: Chad Sakac | March 12, 2010 at 09:24 PM
Thank you for retracting both that leveraging SCSI reservations is not ideal and in fact does impact scalability; and that the per VM limits are founded in the impact of reservations.
On topic #2, I won't dispute that some of the extreme fear of extents with VMFS is dated. As I said before, spanning volumes is a religious debate... and is not a debate unique to VMFS. Even if things are 'better' now, some people would still never do it for many valid reasons.
A comment on your point that the impact is minor... The fact that the entire LUN must be locked to a single host, blocking out all other systems in a distributed environment, has unquestionable impact. While the duration of the operation on the lock file may be small, don't forget that a Reserve and subsequent Release is not a lightweight operation. The impact is not negligible, as the data from your own blog's example in the most trivial scenario possible illustrates. In addition, there are broadly embraced burned-and-learned best practices from both admins and major vendors that have evolved from being impacted by these scalability inhibitors, and which have validity.
Reserve / Release was deprecated in favor of PR's in the most recent SCSI standards, so VMFS needs a new mechanism in the future. That's good news to hear that there is a better solution on the horizon. However, SBC-3 has not yet been ratified. So there will be some time before the new proposal is finalized. Additionally, after it is a formal industry standard; adoption and available firmware updates from storage vendors do not happen quickly. Sure I can see how it was easy for VMware to get EMC to do early adoption for prototyping... but I'm not sure that applies broadly. For the overall storage ecosystem, I suspect we are looking in the context of years here. Don't forget that vendors will not want to update legacy arrays. Which equates to interoperability / lack of support problems. So I question if VMware can take a hard dependency on something so new and getting broad support... but if they can, that's great. Maybe they consider only being compatible with EMC storage to be good enough.
However, the fact that there are plans to make this better in the future... does not change the situation for today that there is legitimacy to these issues to be taken under consideration when thinking about planning the scale of a deployment and being smart around trying to minimize impact. However, I do give you props for attempting to redirect to a positive note on this topic.
Posted by: John | March 15, 2010 at 03:55 AM
We've configured a multi-extent datastore sitting on top of an EqualLogic PS4000. Each of these extents is backed by an individual volume on a Dell EqualLogic box. Two of these volumes are utilized at 95%, and three at 85%, but the actual used space on the VM datastore is just around 50%.
EqualLogic will shut down a volume when it reaches 100% utilization. Will VMFS fill a volume to 100% while space is available elsewhere? Unfortunately, the direction we're getting from VMware support is:
"...extents are an older part of Vmware technology that we discourage customers from using. The reason for this is the management complexity that extents bring, plus the risk that extents add (if a single extent fails, it can impact the entire datastore)."
Help!
Posted by: Mike_cohen | March 02, 2011 at 02:35 PM