
March 31, 2009

Comments


Aaron

AMEN to that post brother. It's nice to have your weight behind it ... NFS, like VMFS, IS JUST A FILE SYSTEM PEOPLE!! (I know NFS is a protocol backed by a filesystem, but you get the point) ... the world has not changed because "someone" has willed it to. Queues are everywhere, FS locking due to metadata updates is everywhere, optimizing I/O dispersion is everywhere, scaling considerations are true across the board. It's about balancing simplicity with scalability. Sometimes the cookie-cutter approach is "good enough" and you will be equally served regardless of the implementation you select, but it is always good to see through the fog of FUD and know when it is time to put on your "big boy" hat and get down to business building a truly scalable solution. It's also nice to have a solution that is both turn-key easy and which provides you with the tools you need at your disposal when that time comes.

Paul Wegiel

My $0.02... when asked about VMFS volume "VM capacity" by our customers, I always aim to spread the VMs equally among all the available VMFS volumes. On a few occasions I have recommended a single VM per VMFS volume (for high-IO-demanding apps), but the rule of thumb is always to spread them around to keep the IO per volume down.

On another note... I am sure you've seen Intel releasing the Xeon 5500 series yesterday (the server flavor of Nehalem). What is your take on Nehalem's benefits in virtualization? I think they will be massive - I don't see many bloggers getting excited about this chip, which I find a little surprising considering the tremendously positive improvements it has specifically in the realm of virtualization... I would be curious to find out to what extent vSphere will take advantage of the new hardware virtualization features in Nehalem. Do you know if vSphere will take support for Nehalem further than what was just announced in ESX Update 4?

Best,

Chad Sakac

On the first comment - that's one design philosophy, and a good one Paul. KISS.

As customers get a bit bigger, I personally like to augment that by saying: have two standard VMFS container definitions - a "cheap 'n' dirty" and a "heavy". Virtually provision the backend storage to be efficient in both cases. Start efficient, and you will stay efficient. Use specialized VMFS containers or RDMs for focused use cases (and in the database use cases, consider Thick - Thin/VP buys you little there). Every cluster, IMHO, should have at LEAST one NFS datastore - as they're REALLY handy dandy.

Paul - there are a LOT of Nehalem optimizations in vSphere - and other ones too (like the Intel I/O AT optimization in VMDirectPath).

EVC for example (on Intel) uses the Nehalem FlexMigration. As we (EMC) migrate to Nehalem over the next while, there are LOADS of optimizations we can make.

The day is coming soon!!!

Stu

another excellent post. Just on the current vCPU limits: it's actually 192 on 3.5 (but it defaults to a lower value). Oddly enough, the max number of VMs per host is "only" 170, however (mind you, we have hit that limit with non-prod 32-core / 128GB RAM boxes).

Next version of ESX will indeed be higher, but I dare not say the number in case I get in trouble ;)

Keith Aasen

Good post Chad, it is interesting to see someone defend block-level technologies. I usually have to defend NFS from all the FUD surrounding it! I even did a recent post on all the FUD that swirls around NFS and VMware. I'm with you though: both block-level and file-level protocols work great for VMware, and each has its perks and advantages. Interesting bit about extents - I still cringe when I think about them, but you are right, that was in the VMFS-2 days. VMware documentation will have to catch up on its best practices for the number of VMs per VMFS datastore though, as most of the docs I see still refer to a smaller number of VMs per VMFS datastore. You are absolutely right though that with NFS you want to focus on the back-end disk, ensuring you have enough horsepower there.

Adam

Chad,

I have been digging around and cannot find any information about maximums in vSphere. I see you reference that the LUN limit will still be 2TB in vSphere. Do you know if the 2TB limit will also apply to RDM in vSphere? Thanks for the excellent post.

David Able

There are dependencies on changes that were made in 3.5 that are not present in 3.0. I am currently researching to determine when the changes were made to the per-extent placement algorithm. The statement about the brown belt is 100% true. Having gone into countless environments that use extents without proper SAN design and are experiencing massive problems, please do not underestimate the potential problem. If you manage 20 or fewer VMs per LUN with multiple extents per datastore, you're going to be fine. Just always remember that the basics apply. The response time of the array is always the most important factor. If things are moving fast from a reservation request to response, there will not be any problems from that perspective. If you do get into a reservation problem, remember that the problem expands exponentially -- it basically feeds on itself.

True parallelism will not occur until multiple paths can be used with PowerPath, etc. We still have only one queue on the HBA for read/write requests. You can split I/O on active/active arrays across multiple HBAs with manual effort. If you get extent happy, you will run out of LUN capacity per cluster. You can only have 256 LUNs per host, which effectively translates to 256 per cluster. Once you get over 100+ LUNs on a host, overall performance can degrade due to management overhead. I have seen this many times.
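To put rough numbers on that LUN-count point, here's a back-of-envelope sketch (my own illustration, not an official sizing formula - the only hard input is the 256-LUNs-per-host maximum):

```python
# Back-of-envelope sketch: how fast multi-extent datastores eat into the
# 256-LUNs-per-host cap (which, since every host in a cluster must see every
# datastore, effectively applies per cluster).

MAX_LUNS_PER_HOST = 256  # documented ESX 3.x maximum

def max_datastores(extents_per_datastore, reserved_luns=0):
    """How many datastores fit under the per-host LUN cap."""
    usable = MAX_LUNS_PER_HOST - reserved_luns
    return usable // extents_per_datastore

print(max_datastores(1))   # 256 single-extent datastores
print(max_datastores(8))   # only 32 datastores if each one spans 8 extents
```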

Chad is correct if you have Yoda doing the SAN. If they just give you a RAID 5 with 14 drives in it, you have been warned.

If extents are used properly, it can be very powerful. I hope to complete a detailed White Paper on this topic.

Able

GP

First off, let me say great article. I agree 32 VMs per volume is a joke - it wasn't uncommon for me to have a hundred VMDKs on a volume; it's all about understanding workloads.

I think the SCSI lock example actually presents a somewhat less-than-real-life point of view. I agree, with only 2 hosts attached to a datastore, and only 1 host creating a VM, the lock performance hit is small. However, one thing that causes a GREAT DEAL of lock activity is running snapshots (and the backup tools that use them - we know the ones). Now imagine a "mid-level" ESX farm with 8 hosts, let's say 6 VMs each, all backed by a single VMFS. Turn on snapshots for 10% of those, so 5 or 6 VMs, and now create a VM that takes 15 mins. Bet you'll see more than 7% performance loss...

My point is, locks do matter. If you keep good practices around things like snapshotting, you'll probably be fine. However, you still need to be conscious of it, and scale can be impeded without knowledge and attention during architecture.

Kurt Lamoreaux

Can you please provide a reference for the statement that you no longer have to fill a LUN before additional LUNs are used when you add an extent? I have a client interested in this configuration, but we have not been able to track down any documentation of this behaviour for VMFS-3.

Satyam Vaghani

The VMFS-3 resource manager will use any and all extents that make up a spanned VMFS volume when it comes to allocating new space.

The resource manager bases its block allocation decisions on a variety of factors and I can't elaborate on the exact details. However, the net effect is that blocks from any LUN in a spanned volume may be allocated at any time; the exact sequence varies by volume, connectivity, sequence of events, etc. In other words, it is not true that one needs to fill up a particular LUN to force block allocation from the next LUN in a spanned volume. This holds for all versions of VMFS3 and all versions of ESX that supported VMFS3.

Seems like we should include this in our documentation for posterity, but for now, you'll have to take my word for it -- I wrote the code (with a few cronies).
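For readers who want to picture the observable behavior, here is a toy model only (not the real allocator, whose exact policy isn't public): new blocks may come from any extent with free space, rather than "fill LUN 1, then spill to LUN 2".

```python
import random

class SpannedVolume:
    """Toy model of a spanned VMFS-3 volume: free space tracked per extent."""
    def __init__(self, extent_blocks):
        self.free = list(extent_blocks)          # free blocks per extent/LUN

    def allocate(self, blocks):
        candidates = [i for i, f in enumerate(self.free) if f >= blocks]
        if not candidates:
            raise IOError("volume full")
        choice = random.choice(candidates)       # stand-in for the real policy
        self.free[choice] -= blocks
        return choice                            # which extent served the request

vol = SpannedVolume([1000, 1000, 1000])          # three extents, all mostly free
extents_used = {vol.allocate(10) for _ in range(50)}
print(extents_used)                              # typically {0, 1, 2}, not just {0}
```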

PS: Nice post, Chad.

Andrew Linnell

Chad, great post, but just as you were incensed by Robin's comments, you stated "it’s not the evil FUD that is intentionally incorrect like competitive team crapola, at least it’s innocent)." The EMC Competitive Team never puts out crapola. Geez Man.

Chad Sakac

Sorry Andrew - of course I was thinking about the "other guy" (not any competitor in particular) as I said that sentence, but that's a fair critique - others would say the same in reverse I'm sure.

You guys do try hard, and I didn't intend to be hurtful - I apologize.

It's a hard gig, because inherently you have to work not only to find the strength in what we have but also to find the weaknesses of others and contrast them - which tends to become negative.

I have found that almost all competitive content ends up out of date, and all those "checkmark tables" don't stand up to tight scrutiny (because the author of the table picks the rows) - it's so hard to keep current.

That said - the recent work of the team on the ONE portal is very, very good, and I find it more fresh and current.

Mike

Chad,

Duncan pointed out your take on VMware extents. I put up a question on VMware Communities, and I would love to hear your suggestion. THX!

http://communities.vmware.com/thread/224616?tstart=0

Chad Sakac

Thanks Mike - commented on the post. You should strongly consider multi-extent VMFS, I think it would help you in your case.

Jon Bohlke

“Spanning VMFS across multiple extents is a really, really bad idea, and don’t buy you anything because they are just concatenated”

Well, I just ran into this issue: try re-signaturing, on an ESX4 host (post-upgrade), a VMFS volume with multiple extents that was created on ESX3. It appears that it only re-signatures the first extent, and it throws an error during re-signaturing.

http://communities.vmware.com/message/1403260

John

Just a comment... a lot of your perspectives are fairly skewed, and some of your conclusions are incorrect. Let me elaborate.

1. "VMFS doesn’t scale because of locking"
A SCSI-2 reservation is an exclusive lock of a LUN. When a LUN is reserved, only the host which currently holds the reservation can write to the disk. All other I/O from all other hosts will receive a reservation conflict and will have to retry their I/O. These are facts, and for any technical person there is no debating the effects of this architecture. The SCSI specs can be found at t10.org if you want to learn more about how a reservation works.
Blocking other hosts' ability to perform I/O negatively impacts performance, and most important are the scalability considerations. Meaning, more VMs... more hosts... equates to more reservations... and greater impact. It's an exponential problem. The performance example you cite is only of a 2-host system with a single VM, which has no relevance to scale. Meaning, you cannot draw a scalability conclusion... with no scale in the measurement. Hence your conclusions are incorrect. In fact, I would have assumed that the SCSI reservation would have no impact on VMs running on the same host (as it's locking the LUN exclusively for that host's accesses). The test data you cite actually exposes the opposite, so in my book the problem is actually worse than I had originally thought. So you have actually helped educate me of exactly the opposite of your argument. It shows that the performance impact in the most simplistic scenario possible is measurable (and arguably significant), which will have considerable impact at scale.
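In toy form, the cycle looks roughly like this (nothing here reflects actual ESX or array internals; it only illustrates the reserve/conflict/retry mechanism):

```python
class LUN:
    """Toy model of SCSI-2 Reserve/Release semantics on a single LUN."""
    def __init__(self):
        self.reserved_by = None

    def reserve(self, host):
        if self.reserved_by not in (None, host):
            return False                 # RESERVATION CONFLICT
        self.reserved_by = host
        return True

    def release(self, host):
        if self.reserved_by == host:
            self.reserved_by = None

    def io(self, host):
        # While reserved, only the reserving host's I/O succeeds.
        return self.reserved_by in (None, host)

lun = LUN()
lun.reserve("esx01")        # esx01 locks the LUN to update VMFS metadata
print(lun.io("esx02"))      # False: esx02 gets a conflict and must retry
lun.release("esx01")
print(lun.io("esx02"))      # True: I/O proceeds once the reservation is gone
```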

Additionally, I find your statement that I’m not supposed to start a VM, stop a VM, create a VM, delete a VM, backup a VM, vMotion a VM, ever use a dynamic VMDK, or extend a VMFS volume... unless during a maintenance window to be ridicules at best.

2. "Spanning VMFS across multiple extents is a really, really bad idea, and don’t buy you anything because they are just concatenated"
I consider this to be a religious debate. Some people would argue that spanning volumes (VMFS or otherwise) is a terrible thought, and they could tell horror stories into the wee hours of the night... additionally, I could build a very solid argument and cite numerous examples of why it should never be done. However, it is a feature that does provide value, and some people like it and are successful with it. So I'm going to deem this only to be personal opinion, and you are entitled to your opinion.

3. "You can only have 32 VMs per VMFS"
For starters, your statement that there is no limit is incorrect. VMware has published maximums here: vsp_40_config_max.pdf
Most importantly, the reason there are strong recommendations in the ecosystem to limit the number of VMs placed on a VMFS volume is the scalability issue resulting from SCSI reservation locking (item #1 above). So your conclusions are completely incorrect: you are using a completely separate set of data to explain why the scalability recommendations exist. The test data you reference is from a system that is up and running with zero metadata operations occurring (aka no locking). Again, I would like to thank you... you have convinced me that things are actually worse than I thought. I thought that the SCSI reservation exponential scalability issues were the only concern, but now you highlight that core VMFS itself has scalability considerations I should be concerned with on top of that.

Frankly I find your comment of "good enough" to be just silly... I know very few of my customers who would accept that answer.

My reminder to others out there who are accepting this blog as simple fact: it is (just like any blog) only opinion. YOU should test and evaluate what's right for your environment when deploying, as this blog is spreading just as much FUD as it is trying to defuse; and I personally find it quite hilarious that people are leveraging it in thesis work.

Chad Sakac

@John - thanks for your comments.

First of all - just to be clear, I ran the post (prior to posting) directly by the VMware engineering team to ensure technical accuracy, so I stand by everything in the post (some things have gotten a little dated, but more in the sense of ongoing progress/improvement, not incorrectness).

Next, let's get down to the arguments.

I never said that SCSI reservations aren't one of the scaling factors for VMFS, but that the FUD way of saying it is “VMFS doesn’t scale because of locking”.

My point was that incorrect understanding (and legacy views) has led many folks to overestimate this effect (the source of the "no more than ____ VMs per datastore" position that people passionately defend).

My point on the duration of the lock is that the lock is only held (affecting IO to adjacent hosts) while the hidden lock file is modified, not for the duration of the operation (for example creation of a VM). These are very small time periods.
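To sketch the distinction (illustrative only - the real on-disk locking is more involved than this): the reservation brackets only the metadata update, not the whole administrative operation.

```python
import time
from contextlib import contextmanager

@contextmanager
def scsi_reservation():
    # Stand-in for the LUN-level lock; it is only held inside this block.
    start = time.time()
    yield
    print(f"reservation held for ~{time.time() - start:.3f}s")

def create_vm():
    with scsi_reservation():    # short: touch the hidden lock file / metadata
        time.sleep(0.005)       # metadata update - milliseconds
    time.sleep(2.0)             # writing out the VMDK itself - no lock held here

create_vm()
```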

There are many, many customers using VMFS-3 datastores with many, many VMs, undergoing many, many metadata updates (add to the original list extensions of thinly provisioned VMDKs).

The position held by many ("VMFS doesn't scale due to locking", often stated as "don't put more than 12/20/32 VMs in a datastore") is not an accurate statement.

I'll go one step further though - the SCSI locking mechanism isn't ideal. There ARE points where excessive reservations can cause problems.

One area of work here is the new T10 SCSI commands around Atomic Test and Set. http://www.t10.org/cgi-bin/ac.pl?t=d&f=09-100r0.pdf; http://www.t10.org/cgi-bin/ac.pl?t=d&f=09-100r1.pdf

(and yes, I follow all the standards bodies; please don't cast aspersions on me - I'm not casting them on you)

This changes the scope of the lock from the whole LUN to only the modified on-disk regions (if the SCSI target supports this). Early testing on this (which I showed at VMworld in the VMware/EMC supersession) has shown that scaling of VMs per datastore (and behavior during periods of very high congestion) was flat and linear into hundreds of VMs with hundreds of metadata updates in very short periods.
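Conceptually (this is only a sketch of the idea, not the SCSI command semantics - see the T10 proposals linked above for those), ATS is a compare-and-swap on the specific on-disk record being modified, so unrelated updates from other hosts no longer collide the way a whole-LUN reservation forces them to:

```python
class OnDiskRecord:
    """Toy compare-and-swap against a single on-disk lock record."""
    def __init__(self, value=0):
        self.value = value                       # 0 = unlocked

    def atomic_test_and_set(self, expected, new):
        if self.value != expected:
            return False                         # lost the race: retry
        self.value = new
        return True

record = OnDiskRecord()
print(record.atomic_test_and_set(0, 41))           # host A grabs this record: True
print(record.atomic_test_and_set(0, 42))           # host B collides on the SAME record: False
print(OnDiskRecord().atomic_test_and_set(0, 42))   # a different record: no conflict at all
```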

This will be supported in the next vSphere release under the vStorage APIs for Array Integration, and every storage vendor has the opportunity to support this improved model (it will be supported on CX4, modern NS arrays, and V-Max - possibly also DMX).

You also certainly took my final recommendation completely out of context.

I said "Look – it’s a best practice to schedule those operations, but you can ABSOLUTELY do them.".

You said I said:

"that I’m not supposed to start a VM, stop a VM, create a VM, delete a VM, backup a VM, vMotion a VM, ever use a dynamic VMDK, or extend a VMFS volume... unless during a maintenance window to be ridicules at best." (spelling errors yours, not mine, but hey, it's a blog comment, I make spelling errors all the time).

OK - I should have been more clear: "if you're doing an operation that you know is going to trigger an overwhelming number of metadata updates, you should schedule it, but you can absolutely do those tasks during normal operations".

On the extent question:

To me, you are making my point. The view that multi-extent VMFS volumes are crazy is mostly rooted in the fact that in ESX 2.x it WAS crazy, and there were horror stories.

The impact/risk of losing an extent in a multi-extent VMFS is the same in MOST cases as in a single-extent scenario.
- lose first extent - lose datastore (same)
- lose any other extent - lose access to the files on that extent, remaining extents are accessible.
- it is possible (though VMFS allocation works to avoid this - allocation has a preference for host affinity and sequentiality of files) for a single file to span an extent boundary. The probability of this causing file-level corruption is no greater or less than that of a file within a single extent being corrupted when the extent is lost/removed.

I'm sorry, but I disagree with your conclusion - best practices change with the times, and the fact that someone has a horror story from the past is not a reason not to re-examine things as the underlying architecture changes.

Like the first topic (VMFS locking mechanics), there is more to be done here. Ideally, VMware will update the LVM to move from CHS to a GPT-like model, which would enable single larger devices, as most storage arrays can support LUN sizes larger than 2TB.

On to the last one - number of VMs per datastore.

You're right, this is governed by locking, but it doesn't correlate with "number of VMs" or "amount of IO" (which people immediately think of). Rather, it depends on the "frequency of metadata updates which trigger a lock against the hidden lock files", which is a much more difficult thing to grasp.

A VMFS volume being used by Lab Manager with very light IO workloads in the VMs will be under much more "reservation pressure" than 100 VMs which are driving a pile of IOps/MBps and are periodically snapshotted, created/deleted, or extended.
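To make that concrete with completely made-up numbers (the shape of the comparison is the point, not the figures):

```python
def metadata_ops_per_hour(vms, snaps_per_vm, thin_extends_per_vm):
    # Each snapshot create/delete or thin-VMDK extension triggers a metadata
    # update (and hence a reservation) - steady-state read/write I/O does not.
    return vms * (snaps_per_vm + thin_extends_per_vm)

lab_manager = metadata_ops_per_hour(vms=40,  snaps_per_vm=6,   thin_extends_per_vm=4)
heavy_io    = metadata_ops_per_hour(vms=100, snaps_per_vm=0.1, thin_extends_per_vm=0)

print(lab_manager)  # 400 ops/hour from 40 light-IO VMs
print(heavy_io)     # ~10 ops/hour from 100 heavy-IO VMs
```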

Until VAAI ATS is out, these cases of relatively low bandwidth (except if using 10GbE), lower latency requirements, but tons of metadata updates often work well on NFS datastores, something EMC certainly embraces (and man, I sure do).

The "maximum" listed in the document you reference (256 VMs per datastore) is not a hard limit (I will double-check to verify this); the actual limit is much, much higher (a ridiculously high number).

BTW - some storage targets handle these SCSI reservation conditions less gracefully than others and don't release the reservation (which ends in a bad state). This usually occurs when the queues are also very deep.

This article discusses this topic in more detail.
http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&docType=kc&externalId=1005009&sliceId=1&docTypeID=DT_KB_1_1&dialogID=1034093&stateId=0 0 681678

For anyone who is doing thesis work, I would highly recommend Satyam Vaghani's session at VMworld TA3220 which discussed many of these specific topics.

Re: my recommendation at the end - in my experience, customers tend to overthink this.

The most important design principle in VMware use cases (WARNING THIS IS MY OPINION!) is flexibility, and being able to understand and identify when you've hit an operational limit what to do about it.

You can change almost everything non-disruptively, but industry vets have been "trained" to try to nail down every config element, which invariably leads to over-engineered configurations.

Do limits exist - yup. Just like anything.

Lastly - advice (take it or leave it, I bet you'll leave it), the art of dialog and mutual learning (I have many things to learn from everyone, and in my experience, the same applies to every human) **depends on being able to disagree without being disagreeable.**

Perhaps a nice bottle of wine on a Friday, and chill out a little?

Oh - BTW - all those "custom thesis" comments are plain-jane spam. I have to delete them, but I have my day job, so I batch it up - Typepad's anti-spam catches a lot, but a lot still creeps in.

John

Thank you for conceding both that the SCSI reservation mechanism is not ideal and in fact does impact scalability, and that the VM-per-datastore limits are founded in the impact of reservations.
On topic #2, I won't dispute that some of the extreme fear of extents with VMFS is dated. As I said before, spanning volumes is a religious debate... and it is not a debate unique to VMFS. Even if things are 'better' now, some people would still never do it, for many valid reasons.


A comment on your point that the impact is minor... The fact that the entire LUN must be locked to a single host, blocking out all other systems in a distributed environment, has unquestionable impact. While the duration of the operation on the lock file may be small, don't forget that a Reserve and subsequent Release is not a lightweight operation. The impact is not negligible, as the data from your own blog's example in the most trivial scenario possible illustrates. In addition, there are broadly embraced, burned-and-learned best practices from both admins and major vendors that have evolved from being impacted by these scalability inhibitors, and which have validity.

Reserve/Release was deprecated in favor of persistent reservations in the most recent SCSI standards, so VMFS needs a new mechanism in the future. It's good news to hear that there is a better solution on the horizon. However, SBC-3 has not yet been ratified, so it will be some time before the new proposal is finalized. Additionally, after it becomes a formal industry standard, adoption and available firmware updates from storage vendors do not happen quickly. Sure, I can see how it was easy for VMware to get EMC to do early adoption for prototyping... but I'm not sure that applies broadly. For the overall storage ecosystem, I suspect we are looking in the context of years here. Don't forget that vendors will not want to update legacy arrays, which equates to interoperability and support problems. So I question whether VMware can take a hard dependency on something so new and get broad support... but if they can, that's great. Maybe they consider only being compatible with EMC storage to be good enough.

However, the fact that there are plans to make this better in the future does not change the situation today: these issues are legitimate, and they should be taken into consideration when planning the scale of a deployment and being smart about minimizing impact. However, I do give you props for attempting to redirect to a positive note on this topic.

Mike_cohen

We've configured a multi-extent datastore sitting on top of an EqualLogic PS4000. Each of these extents is backed by an individual volume on the Dell EqualLogic box. Two of these volumes are utilized at 95%, and three at 85%, but the actual used space on the VMFS datastore is just around 50%.

EqualLogic will shut down a volume when it reaches 100% utilization. Will VMFS fill one volume to 100% while space is available elsewhere? Unfortunately, the direction we're getting from VMware support is:
"...extents are an older part of Vmware technology that we discourage customers from using. The reason for this is the management complexity that extents bring, plus the risk that extents add (if a single extent fails, it can impact the entire datastore)."

Help!


