Recently I was working a customer case with my VMware colleagues where a customer was seeing that cloning operations were taking a lot longer on their V-Max than on their mid-range CXes. It turned out to be a tricky case, and it taught me something new.
This experience applies to more than just EMC arrays, so I thought I would share what we found, how we found it, and what we did about it – in the hopes of helping folks out there.
If you’re interested – read on….
I’ve got a mixed readership – some very VMware-centric, some storage-centric, and some brave, open-minded souls who span both. Storage-centric folks – bear with this for a bit; you likely know this already, but it’s important that the VMware-centric folks understand it.
There are as many different storage array architectures out there as there are storage vendors (and in some cases, single vendors have more than one architecture). I’m not talking about small differences either – these are really fundamental differences.
It’s these major differences that make direct platform to platform comparisons hard.
WARNING: I’m going to have to make some sweeping generalizations (that will cause some people to get bent out of shape) if I have any hope of making this less than a book – the intent here is to help customers, not to position anything.
One “trend” however that tends to apply is the idea of “mid-range” and “enterprise” design points. The characteristics of the two categories aren’t black and white, and within the two categories, there are wildly different architectural models. In spite of that, there are some consistent customer use cases that define the categories. Note that this is focused on block storage models using VMFS. With NAS models, these concepts exist, but are internal to the NAS platform.
Mid-Range Block Storage targets usually:
- expect “up to hundreds” of hosts to fan into them.
- expect “up to low thousands” of host LUN objects (and have local and remote replication)
- expect “hundreds” of disks, with low thousands at the current max scale
- support open systems, but not mainframes.
- don’t have to have linear performance in degraded conditions
- desire non-disruptive operations/maintenance, but some maintenance downtime is acceptable
Enterprise Block Storage targets usually:
- expect “up to thousands” of hosts to fan into them.
- expect “up to tens of thousands” of host LUN objects (and have local and remote replication)
- expect “thousands” of disks
- support open systems and mainframes.
- need to have linear performance in degraded conditions (this translates into “any given I/O needs to be potentially serviced across a lot of possible ways”). Note that this means more than “scale out”. In an EqualLogic, Lefthand Networks, or IBM XIV, while they are “scale out models”, the block of data exists behind one of the “brains/ports”, not “any brains/ports”, and the cache models are also local only to the given “brain”. Again – not bad/good, just different.
- require non-disruptive operations/maintenance in any case (partially due to scale, and also due to some of the host use cases where they tend to be applied)
There are tons of other features that also matter (Thin, Dedupe/Compression, non-disruptive reconfiguration, tiering mechanisms, consistency technology – the list is huge) – but the above are “broad swath” design goals.
“Mid-range” doesn’t mean better/worse than “Enterprise”, just different. For different use cases, and different customers, perhaps one might be more ideal than another. For some customers, it’s an “and”.
Mid-Range array examples are things like IBM DS5000s, HP EVAs, EMC Celerra and EMC CLARiiON, NetApp FAS, HDS AMS. These share some common hardware architectural elements (redundant “storage brains”). Others that use “scale out” models like Dell EqualLogic and HP Lefthand are architecturally different, but their target “use cases” (ergo target market) are not dissimilar.
Enterprise array examples are things like IBM DS8000, EMC Symmetrix, HDS USP.
There are also some that straddle (I personally would lump 3PAR in that grouping).
So… Why is this important, and what does it have to do with that customer case?
It’s not intrinsic by definition, but the design goals of Enterprise arrays (look at the list, particularly the “scale” and “any given I/O needs to be potentially serviced across a lot of possible ways” items) lead to different performance envelopes than the midrange.
This doesn’t always translate to “faster”.
With a given number of spindles…
- Enterprise arrays are typically able to drive more IOps (and therefore also MBps at a given I/O size) with a low number of I/O streams that are very random with small I/O sizes (4-64KB)
- Midrange arrays are typically able to drive more IOps (and therefore also MBps at a given I/O size) with a low number of I/O streams that are more sequential with larger I/O sizes (256KB and up)
- As the number of simultaneous I/O streams increases, the architectures of Enterprise arrays tend to pull away (a quick sketch of the arithmetic follows this list).
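To make the single-stream math concrete, here’s a minimal sketch of the Little’s Law arithmetic at queue depth 1. The latencies are hypothetical round numbers I picked for illustration – they aren’t measurements from any array:

```python
# Minimal sketch: Little's Law for one synchronous I/O stream (queue depth 1).
# Each I/O must complete before the next is issued, so the stream's bandwidth
# is capped at io_size / per_io_latency, no matter how fast the array is in
# aggregate. The 2ms latency below is a hypothetical value for illustration.

def single_stream_mbps(io_size_kb: float, per_io_latency_ms: float) -> float:
    ios_per_sec = 1000.0 / per_io_latency_ms
    return ios_per_sec * io_size_kb / 1024.0

# One stream of 256KB I/Os at a hypothetical 2ms each -> ~125 MB/s.
print(single_stream_mbps(256, 2.0))

# 32 independent streams at the same per-stream rate -> ~4 GB/s in aggregate,
# which is the regime where the enterprise designs pull away.
print(32 * single_stream_mbps(256, 2.0))
```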
If you apply this to the customer case…
Q: “why did the CX4-960 clone a large VM 5 times faster than their V-Max?”
A: “A VMware-level clone or deploy from template looks (at the array level) like a big sequential, large block, single I/O stream”
Furthermore, in going from ESX 3.0 to ESX 3.5, the I/O size used during large file copy operations increased dramatically – up to 32MB in some conditions. Some arrays do better with that, and some do worse. Typically mid-range array designs do better, and enterprise designs do worse.
Recognizing the symptom: this just shows up as “slow performance during clone (but not in steady state)”.
- vscsiStats looks OK during normal I/O operations (so it’s not just an “underprovisioned back end with insufficient spindles or meta objects”).
- esxtop shows a low QUED (so it’s not a “micro-bursting problem causing ESX queue back-off” – read up to understand that here).
- The back-end analysis of the array shows that the spindles aren’t working too hard, and neither is the front-end.
With this particular V-Max configuration, we measured the sequential I/O bandwidth with a series of I/O sizes:
- 1MB – 38MB/sec
- 512KB – 38MB/sec
- 256KB – 140MB/sec
- 128KB – 135MB/sec
- 64KB – 120MB/sec
This is with the same back-end storage configuration, just changing the I/O size. Also note that this is a single-threaded workload. In the “real world” the array is servicing many I/O streams from many hosts at once, and the upper total limits of a V-Max aren’t measured in MBps, but in many GBps. But for any given I/O stream to a single LUN from a single host, very large I/O sizes reduce the bandwidth that can be achieved – and that bandwidth determines the amount of time to clone the VM.
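To put those numbers in clone-time terms, here’s a quick back-of-envelope. The 100GB VM size is my hypothetical example, not the customer’s actual VM:

```python
# Back-of-envelope: single-stream copy time at the bandwidths measured above.
# The 100GB VM size is a hypothetical example.

vm_gb = 100
measured_mbps = {"1MB": 38, "512KB": 38, "256KB": 140, "128KB": 135, "64KB": 120}

for io_size, mbps in measured_mbps.items():
    minutes = vm_gb * 1024 / mbps / 60
    print(f"{io_size} I/Os: ~{minutes:.0f} minutes")

# 512KB-1MB I/Os: ~45 minutes; 256KB I/Os: ~12 minutes.
# Same spindles, same array -- roughly a 4x swing from the I/O size alone.
```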
On a mid-range array, the reverse tends to be true.
The customer did their own testing with VMware and EMC working hand in hand and found the following:
- 256KB with 4 jobs – 59 minutes (V-Max), 25 minutes (CX)
- 128KB with 4 jobs – 33 minutes (V-Max), 25 minutes (CX)
If the number of clone jobs increased further with the 128KB I/O size, I would expect the V-Max to pull ahead.
Now, before everyone starts to pile on….
- These were large virtual machines.
- The amount of time to clone (in both configurations) would drop by HUGE amounts if they used zeroed-thick (see post here).
- The amount of time to clone (in both configurations) would drop by LARGE amounts if they were able to drive the I/O to more front-end ports and directors (using round robin).
Ok – so, there is a good KB article on this here:
In this case, using this advanced property setting on VI3.5 dramatically improved their “time to clone” a VM. While this was an EMC Midrange to EMC Enterprise case, the same behavior would be generally true in most mid-range to Enterprise array models.
There’s an important epilogue here…
- I am not recommending that customers go out and make this change in general, but rather putting the info out there for education purposes. If you are not happy with performance and think something is wrong, call EMC or VMware support. We’ll open cases on both sides and resolve any issue.
- Decreasing the I/O size meant a faster clone – but there is a tradeoff: it means more IOps, which means more ESX host CPU consumption during the clone operation (the sketch after this list puts rough numbers on it).
- The EMC engineering teams are working to make it so that this setting change would not be required in any case.
- This setting is not available in ESX 4.0. In vSphere, VMware completely changed the underlying vmkernel code used for these mass data movement operations – a cool, more modular design (called a Data Mover).
- In the NEXT version of vSphere, arrays will be able to hardware-offload this data copy completely (this is ONE of the vStorage APIs for Array Integration) – in other words, the Data Mover recognizes a “VAAI compliant” array and then hardware-offloads that block operation to the array, which can batch together the moves and do it much, much faster. NetApp and EMC Celerra have the ability to do this now for file-level objects on NFS datastores (since we can do file-level snapshots). This is something different – this will literally be a “hardware offloaded vSphere operation” (and will apply to many operations – clone, template, Storage VMotion, etc.)
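A rough illustration of that IOps tradeoff (the 140MB/sec figure echoes the 256KB measurement earlier in the post; host CPU cost per I/O isn’t modeled here, just the I/O count it scales with):

```python
# Rough numbers on the tradeoff: at a fixed bandwidth, halving the I/O size
# doubles the IOps the host must issue and complete, and ESX host CPU work
# during the copy scales with that I/O count.

bandwidth_mbps = 140  # echoes the 256KB measurement earlier in the post

for io_kb in (1024, 256, 128, 64):
    iops = bandwidth_mbps * 1024 / io_kb
    print(f"{io_kb}KB I/Os: {iops:.0f} IOps to sustain {bandwidth_mbps}MB/sec")

# 64KB I/Os need 16x the IOps of 1MB I/Os to move the same MB/sec.
```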
Thanks – hopefully this helps some customers out there!
Tweaking and upgrading our Clariion array got our clone times from 40 minutes down to 2 minutes. Our engineers almost bought me a cake.
Primary tweaks: using a striped metaLUN to give our datastores more spindles (that got us to 20 minutes), and upgrading our SP to enable write caching (that brought it to 2 ;)
Posted by: Stu | February 25, 2010 at 11:35 AM
I have been battling this very issue since upgrading to ESX 4.0. My SAN is not exactly on the HCL, but with ESX 3.5, I was able to push cloning operations to 120 MB/s. With ESX 4.0, they run at about 5 MB/s. This sheds a lot of light on why that happens, so a big thank you for that. Now, can you think of any way to tweak the Data Mover in ESX 4.0?
Posted by: Matt Meyer | February 25, 2010 at 11:44 AM
So I was just at this customer the past two days installing their new Hitachi USP-V. We were walking through things doing some KT and they mentioned that their new litmus test for performance of their storage systems was a VMware clone operation due to performance issues they found on their V-Max boxes. They stated that their Clariion boxes were way faster than the V-Max at this operation and we put a little wager on whether the USP-V would be faster than the Clariion.
I took that bet and ended up winning. ;) The clone operation finishes in a consistent 10 minutes or less. The resulting time is what the customer called "untuned". We didn't mess with the VMware I/O size and left everything at the defaults. I'm sure we could improve things if we tuned things to the 512KB stripe boundary on the USP-V.
I just thought it was funny that you posted this today because I was getting ready to write something up the next couple days on it. Thanks for saving me the time! Great article!
Posted by: Ron Singler | February 25, 2010 at 05:16 PM
No matter how you try to spin it, EMC V-MAX should be smart enough to recognize a sequential stream and optimize for it, it's not exactly new technology :).
And you even call it V-MAX, the best virtualization storage...
I wonder how it performs with a MS SQL server doing full table scan due to a select query for a big report...
Posted by: Mihai | February 26, 2010 at 03:14 PM
Great thread. I think I have this problem with vSphere and a CX3-10c. My vRanger performance when doing iSCSI backup from CX3 is terrible. Half as good as doing non-iSCSI backup.
Conversely I have an Equallogic PS6000XV and MD3000i all in the same fabric and they don't have performance issues. In fact non-iSCSI backup are half as fast as iSCSI.
any ideas?
Posted by: kraigk | February 27, 2010 at 12:24 PM
I actually have the opposite problem on midrange virtualized array, vsphere is able to push enough IOPS during a clone that it can actually push up response time on the volume, can't wait for I/O DRS =)
Posted by: Andrew Fidel | March 01, 2010 at 01:58 PM
Has anyone been able to identify the reason for this behavior?
Posted by: Toudin | March 10, 2010 at 05:40 PM
Interesting post, overall I'd agree with your categorisation of Mid-Range vs Enterprise, though in the past I'd always thought that DMX was better suited to sequential workloads and single threaded large I/Os due to its mainframe heritage with the associated tendency to do batch processing, and that mid-range boxes typically had the advantage when it came to cache-hostile random I/O.
I based that assumption on a casual conversation almost a decade ago, where someone said that Exchange 2000 workloads were fairly toxic to early DMX implementations. I brought up that conversation with an EMC engineer who said that a revision of Enginuity addressed this. Even though this was based mostly on hearsay, it appears maybe this cloning situation is another example of a mismatch between the design center for the array and the particular workload.
I wonder, if this test had been done on an old-school hand-tuned DMX, whether the result would have been the same, as I suspect that the design center for the Vmax workload is more highly optimised for small block random I/O than the previous Sym.
For single threaded workloads Little's Law determines overall throughput, and the additional overheads associated with a scale-out architecture (even when measured in microseconds) can really drop sequential throughput, but I'm surprised that the impact is as large as it was.
As a matter of interest, did the customer try using PowerPath? Your mention that using round robin would have improved things dramatically implies that they didn't, but if they'd gone to the expense of installing a V-Max, why not complete the picture?
Posted by: John Martin | April 23, 2010 at 10:39 PM
John - thanks for the comment. The customer was using VI3.5, so use of Round Robin or PP/VE was out of the question, unfortunately. In the testing completed, we did show that going to vSphere 4 and using Round Robin or PP/VE did ultimately drive much higher bandwidth.
Posted by: Chad Sakac | April 23, 2010 at 10:48 PM