How’s that for a headline?
We’ve been working furiously with the VMware View team to support the goals of the release: 1) improved TCO; 2) improved user experience; 3) extending the operational benefits of client virtualization.
Today, we’ve published several joint EMC/VMware reference architectures for VMware View 4.5. You can get the first of which (which supports the headline) here:
Read on for more…
MARKETECTURE WARNING!
BTW – you’ll notice a series of joint documents – why? While everyone is in a race to the bottom (storage vendors fighting to illustrate the most cost effective storage for VDI use cases) – we wanted to illustrate how a SINGLE reference design could be nicely scaled across a broad set of workloads (no customer is the same) – and do it efficiently across that band of use cases. That ain’t marketecture – that’s real guidance.
The doc linked above is focused on hyper-cost centricity, and lighter user workloads. The very, VERY important reality is that the cost of storage varies dramatically – where the primary factor is the use case and the client workload. That’s why I think that client characterization is very important in any client virtualization project (we do a lot with Liquidware Labs). That’s a similar goal as the VMware doc here (though they eschewed shared storage, and ran SSDs in the ESX hosts – take their reference architecture, add $38/client, and you get all the upside of advanced VMware features).
But – while I’m an engineer type, I know (sadly) that headlines like the one I used here (hey, it worked, you’re reading, right?) are very important, and in our busy lives, “sound-bites matter”.
We decided that the high road was to BE TRANSPARENT in every way we could, and show the range of workloads and how the reference design could be easily dialed up or down in various configurations, and client use cases. In the use case above – the user workload is 4-6 IOPs, 2250 clients, and is an XP image. That’s a real workload, but is definitely geared towards call-center/kiosk workloads. The effect of leveraging EFDs for base replicas and FAST cache netted out to huge improvements – thousands of clients doing things at once, with 65%-88% of the IO being soaked up before it hit the backend spindles.
Now - when people throw out “x dollars per client”, there’s loads of questions about client, configuration, lifecycle and more. What we wanted to show DEFINITIVELY is that EMC can be uber-efficient across a spectrum of scales and workloads.
(there is an EMC VDI sizing tool that can be used to help design configurations with inputs that you can get from characterizing your clients)
Continuing down this path of transparency – we also published a couple of other reference architectures with heavier user workloads and Windows 7 clients at similar user scales.
In the one above, the workload is much more reasonable for a “knowledge worker” with persistent desktops – here’s the observed (measured) workload during the testing.
Notice how the IOPs is in the 9-37 IOps band. When sizing, we assumed a 45:55 read/write mix (we tend to find client workloads are more write dominated than people naturally think).
The configuration (NS-120, mix of disk types) isn’t different from the lighter workload, just a little more juice (a few more EFDs) to support the heavier workload, few more large SATA disks to support the persistent desktops, a little more use of dedupe/compression on user data.
We’ve used ever EMC storage efficiency technique at our disposal to squeeze everything we can out of a config, and got 2000 clients on a simple, efficient NS-120. If you want tens of thousands of clients, you can scale it up linearly. If you want more density per server, you can increase the server horsepower, using denser servers/more memory per server.
The $38/client metric isn’t a marketing stunt – that’s a real use case, and the fact that we took that same solution and showed how it would support other workloads just proves it. I’ve looked across the industry, and we’re still going toe-to-toe with anyone else on the storage front when it comes to cost/efficiency/experience, and I’m happy to do it publicly.
FYI, there will also be an updated VCE Vblock View reference architecture coming soon with both Vblock 0 and Vblock 1 (remember, both use EMC Unified storage) as the targets, leveraging the new Westmere based blades.
So – how did we drive down the cost so much? In all the cases – regardless of the user profile - the design:
- Leveraged the new VMware View 4.5 capability to tier base replicas, linked clones, and user data on different tiers. We used solid state for the base replica with heavy loads (and FC for the lighter profiles), FC for linked clones, and user data on massive low-cost SATA configs.
- Used all the new efficiency techniques GA now in FLARE 30/DART 6.0 – but found the biggest standout in the VMware View use case to be FAST Cache – a very efficient read/write cache which can scale up to 2TB through the addition of cost-effective solid-state disk. Did people notice, we GA’ed lower cost 100GB and 200GB solid-state disks, and they are not only cheaper, they continue to perform better and better – the march of progress…
- Used the VAAI hardware assisted locking in the VMFS use case to help VM density, particularly during periods that provide mass VMFS metadata updates.
Check out the effect of FAST cache in these “high density IO” periods in the life of a virtualized client using VMware View 4.5 below (you can see similar testing with View 4.0 here). BTW – if you want the powerpoint slides so you have them handy in high-resolution format, you can download them here. (BTW, as always, excellent work to the Aaron Patten and the RTP crew). In the charts below, there’s a baseline (an array configured “plain jane”, not using any of the new efficiency techniques), and then the “reference architecture” metrics.
Boot Storm – note the fact that FAST Cache absorbed 20x of the IOs before they hit the backend – the difference between crashing and burning, and humming along.
View Refresh – Note how important the write cache was – represented by the “red portion” of the lines in the bottom right chart. Also note that the IO workload was about 3x lower due to FAST Cache – that’s a lot more efficient.
Antivirus Scan – This is a pretty dramatic example – the end user would freak out as the ESX response time went up to 200-300ms, as their virtualized client would be a pig. Sprinkle on a little FAST Cache (remember, hot addable :-), whammo, you’re working. The disk IO was reduced by 5x.
View Recompose – At this point, you see the pattern. Reponse time (user experience) 10x better, disk IO reduced by 4x, overall time to complete the Recompose task 25% faster. In other words, goodness.
Guest Patching – yawn :-) yup, more “order of magnitude” improvements :-)
Client logon storm and steady state load – look at the write cache effect. People always underestimate the write workload in the client workload use case.
Take a gander at the docs, leverage them for your design work.
Great work by:
- the EMC Unified Storage Division – FAST Cache and FAST, along with VAAI support rocks.
- the EMC Hopkinton and RTP solutions team along with their friends in the VMware View team.
Thank you folks – I owe you a beer next time I see ya!
Hi Chad,
I have reviewed the documents above and I had a few questions I hoped that you can help me with as follows:
1. Where is the link to the EMC VDI sizing tool (I am hoping this will allow us to accurately size each tier using FAST Cache, EFDs, FC and SATA disks)?
2. The Windows 7 reference architecture seems a little over the top (i.e. 12GB linked clone capacity per desktop and 15 x 450GB FC and 7 x 1TB SATA) - what are the algorithms for sizing each component?
3. How would you scale the Windows 7 reference to support 1,000 or 2,000 users?
4. At what point does the 100GB FAST Cache on the NS-120 become a limiting factor?
5. The disk sizing suggests a peak IOPS of 875 - how is this possible from a desktop that would in the physical world have a single SATA drive that would support about 50 IOPS?
6. The disk sizing calcultations do not take into account the write penalty for RAID 1/0 - is this a mistake?
Many thanks
Mark
Posted by: Mark Burgess | September 01, 2010 at 08:57 AM
Do you have a link to the powerpoint slides for the charts? I'm very interested in the high resolution version of those.
Thanks!
Posted by: Dustan Terlson | September 01, 2010 at 10:30 AM
Chad, with VAAI assisted locking, is there any reason not to use maximum size 2TB (minus 512bytes) datastores only? So can I just put as many VMDKs per datastore that will fit in it and not to worry about locking.
Posted by: Tomi Hakala | September 02, 2010 at 01:18 PM
Hi Chad - a couple more comments on the 500 user Windows 7 reference architecture:
1. Why was it done using FC - is this not overkill for 500 users?
2. What sizing has been done around host bandwidth - at want point are you likely to need to move beyond GbE?
3. What about performance of iSCSI v NFS - which one is better for View?
4. Am I right in thinking that VAAI hardware assisted locking is not required for NFS as it has always had equivalent functionality?
Many thanks
Mark
Posted by: Mark Burgess | September 03, 2010 at 02:34 AM
I too would like the slides - the link isn't linked. Additionally, I'm trying to come up with a list of questions to best help architect the solution. Maybe the VDI tool referenced would help, but can't find it either..
Thanks
Tami Booth
Posted by: Tami Booth | September 10, 2010 at 02:10 PM
@Tami - sorry about that, was underwater with VMworld stuff, posted, knowing that I would go back and link the high-rez stuff and PPTs, should be there now.
The tool we use right now is internal only. Pushing like mad to get it posted externally. If you;re an EMCer or EMC Partner, please ping your local vSpecialist. Please bear with us.
Posted by: Chad Sakac | September 11, 2010 at 10:56 AM
@Mark - thanks for your questions!
1) FC is still the dominant protocol used in vSphere deployments (though iSCSI is the fastest growing, followed by NFS). As Vaughn and I covered in TA8133 @ VMworld, the real question of protocol is "leverage what you've got, leverage what you know, and if you're deploying greenfield, strongly consider 10GbE converged".
2) ESX host-storage bandwidth is really not the bottleneck in the VDI use case (either View or Xen on ESX) in the VAST majority of cases. In almost all cases, you are IOps constrained. the majority of the client virutalization IOs tend to be small (4-64K )unlike for example a backup or a guest doing datawarehousing (which tend to be in the 256K+ IO size range). If you do some quick math, assuming 20 IOps per user, and 8K IO size on average, each user will be driving about 160KBps. That means that about 500 users will saturate a 1GbE link assuming 80MBps unidirectionally (100% read or 100% write).
3) iSCSI vs. NFS - the battle is REALLY over. Pick what works best for you. Historically there was the question of locking - which while blown out of proportion in most cases - there are use case which drive periods of very busy metadata updates. As you note in question 4 - VAAI makes this similar across VMFS (all block protocols) to NFS.
NFS bigots say "VAAI does nothing but make VMFS catch up to what NFS has always had!". I say, that's partially correct, but of course, NFS needs to catch up with the more robust path scaling (NFS v4, v4.1 and pNFS support will bring this), and more robust failover behavior. It's a silly exercise to argue protocols when there are much, much larger and more important design decisions.
The question of protocol is RARELY the thing that makes client virtualization projects succeed or fail, rather it's the end-to-end system design, and finding the right use cases.
That said, we do have an analagous doc were finishing which is an all NFS based design, so customers can deploy what works best for them, as of course we support both.
Posted by: Chad Sakac | September 11, 2010 at 11:54 AM
one question - what LWL report did you run to get your IOPS?
Posted by: Robert Kadish | September 18, 2010 at 08:15 PM
Chad,
I recommend running the average peak IOPS report in LWL UX. I think it will change the sizing model.
Posted by: Robert Kadish | September 30, 2010 at 02:48 PM
If you are wondering why I'm asking which report read through this Citrix blog including comments. It highlights the dangers of sizing based on Average IOPS.
http://community.citrix.com/display/ocb/2010/08/06/Saving+IOPS+with+Provisioning+Services
Posted by: Robert Kadish | October 01, 2010 at 10:37 AM
Your numbers are very confusing. You say this is for XP yet the average IOPS says windows 7? Since most companies are moving to windows 7 and using VDI as a tool to move from XP to 7 why would you test XP?
Looking at the NS-120 array layout you say that it can handle 2250 desktops. Is that windows 7 or XP?
You mentioned the array handled 13,000 IOPS at peak when you were concurrently booting 500 desktops within 30 minutes. How many desktops did you boot per minute?
If this is a configuration to handle 2250 non-persistent desktops wouldn't you need to test a much greater number of desktops booting. The reason I say this is the user configuration your testing suggests a call center where all users would be booting at almost the same time. Probably 3 or 4 times during a 24 hour period. (Shift changes) So the number should be more like 1500 users booting within a 15 minute period.
I know you mention this is not a marketing document but using tools such as Login VSI are not indicative of the real world. For example what is the effect of a cached I.E. session vs. an I.E. session which is not cached. Not to mention that using VMwares new Data Disk could have a huge impact on IOPS.
Also the concern companies have when looking at Average IOPS is they have no way to control how often their users login and out, reboot, and run resource intensive applications.
I mention this because corporations are using reports like this one to size their windows 7 environments and assumptions will most likely lead to a lot of pain.
Posted by: Robert Kadish | October 03, 2010 at 12:12 PM
@Robert - thanks for the questions, and sorry for the delay in approving your comments. The deluge of spam made me kick-in comment approval (just to keep the filth off), but that means I need to be more diligent in rapidly approving anything from a human.
Like I noted (in big bold italics) - MARCHITECTURE WARNING! Using average IOPs is indeed dangerous. The purpose of the XP document was to see "how low can we go?". The use case is real, but VERY narrow. Non-persistent, kiosk-type use cases (called out throughout the doc). In those use cases, XP is still very much used, and the client IOps profile is very different that you or I.
If you look at the second document, it's much more around the design center youre describing. 500 users on a similar config (representing therefore a higher cost/client), Windows 7. There, we used the 95th percentile IOps which was 37 IOps. The efficiency technologies still applied, and the cost was 60% lower than it would be otherwise, but it's still in the $100-$120/client range as opposed to $38/client for the light kiosk-style worker.
If you look through the post and the docs again, I think you'll see I tried to be VERY explicit about the different user workloads.
I will say that the mass boot isn't the problem it used to be. With large caches you can get today at low cost-points on EMC and NetApp storage models, as soon as the first one boots, the remainder are largely handled by cache. The tricker periods are the patch/AV periods in the client. In those cases the caches (EMC FAST Cache as an example) still help, but less so. Patch impact can be usually mitigated through app virtualization but not eliminated, and AV can't practically be eliminated via any method, though NAS-based AV can mitigate a lot of the scan against user content - but only if you're able to give up check-in/out.
Ok - now onto load-generation tools...
I reached out to one of the primary folks who worked on the tests, and here's his commentary:
"
1) RAWC – developed in house by VMware - not really used too widely
2) LoginVSI – developed by LoginConsultants.com – this benchmark is very CPU intensive, but not disk resource intensive, so it makes for a great server benchmark, but not so great IO generator
3) Scapa VDI Benchmark – I have no hands on experience with it, but Cisco is using it for the View 4.5 RA they are doing
4) View Planner – Similar to RAWC, but supposedly less resource intensive to set up than RAWC (based on virtual appliance). (Chad's note: my comment is that this isn't out yet).
The View 4.5 PSG used LoginVSI for the load generation. All previous RAs used RAWC. "
Long and short, creating these sorts of workloads right now is REALLY HARD, and none of the tools are perfect.
I would agree with your feedback if the sizing guideline was based purely around the XP document and we said "apply this broadly!"
I disagree if you look at both docs, and for the use case you describe (Win7, non-kiosk use) leverage the second one, which is designed for that purpose. Used that way, they won't lead to "a lot of pain", but will (IMO) reduce it - as they are more explicit (most people don't even think about IOPS until the pain hits).
Thanks for the feedback!
Posted by: Chad Sakac | October 04, 2010 at 04:26 PM
I'll look - which doc are you calling the second one?
Posted by: Robert Kadish | October 04, 2010 at 05:04 PM