Recently I was working a customer case with my VMware colleagues where the customer was seeing that cloning operations took a lot longer on their V-Max than on their mid-range CXes. It turned out to be a tricky case, and it taught me something new.
This experience applies to more than just EMC arrays, so I thought I would share what we found, how we found it, and what we did about it – in the hopes of helping folks out there.
If you’re interested – read on….
I’ve got a mixed readership – some very VMware-centric, some storage-centric, and some brave, open-minded souls who span both. Storage-centric folks – bear with this for a bit, as you likely know it already, but it’s important that the VMware-centric folks understand it.
There are as many different storage array architectures out there as there are storage vendors (and in some cases, single vendors have more than one architecture). I’m not talking about small differences either – these are really fundamental differences.
It’s these major differences that make direct platform to platform comparisons hard.
WARNING: I’m going to have to make some sweeping generalizations (that will get some people bent out of shape) if I have any hope of keeping this shorter than a book – the intent here is to help customers, not to position anything.
One “trend” however that tends to apply is the idea of “mid-range” and “enterprise” design points. The characteristics of the two categories aren’t black and white, and within the two categories, there are wildly different architectural models. In spite of that, there are some consistent customer use cases that define the categories. Note that this is focused on block storage models using VMFS. With NAS models, these concepts exist, but are internal to the NAS platform.
Mid-Range Block Storage targets usually:
- expect “up to hundreds” of hosts to fan into them.
- expect “up to low thousands” of host LUN objects (and have local and remote replication)
- expect “hundreds” of disks, with low thousands at the current max scale
- support open systems, and not mainframes.
- don’t have to have linear performance in degraded conditions
- Desire non-disruptive operations/maintenance, but some maintenance downtime is acceptable
Enterprise Block Storage targets usually:
- expect “up to thousands” of hosts to fan into them.
- expect “up to tens of thousands” of host LUN objects (and have local and remote replication)
- expect “thousands” of disks
- support open systems and mainframes.
- need to have linear performance in degraded conditions (this translates into “any given I/O needs to be potentially serviced across a lot of possible ways”). Note that this means more than “scale out”. In an EqualLogic, Lefthand Networks, or IBM XIV, while they are “scale out models”, the block of data exists behind one of the “brains/ports”, not “any brains/ports”, and the cache models are also local only to the given “brain”. Again – not bad/good, just different.
- Require non-disruptive operations/maintenance in any case (partially due to scale, and also due to some of the host use cases where they tend to be applied)
There are tons of other features that also matter (Thin, Dedupe/Compression, non-disruptive reconfiguration, tiering mechanisms, consistency technology – the list is huge) – but the above are “broad swath” design goals.
“Mid-range” doesn’t mean better/worse than “Enterprise”, just different. For different use cases, and different customers, perhaps one might be more ideal than another. For some customers, it’s an “and”.
Mid-Range array examples are things like IBM DS5000s, HP EVAs, EMC Celerra and EMC CLARiiON, NetApp FAS, HDS AMS. These share some common architectural elements (redundant “storage brains”). Others that use “scale out” models like Dell EqualLogic and HP Lefthand are architecturally different, but their target “use cases” (ergo target market) are not dissimilar.
Enterprise array examples are things like IBM DS8000, EMC Symmetrix, HDS USP.
There are also some that straddle (I personally would lump 3PAR in that grouping).
So… Why is this important, and what does it have to do with that customer case?
It’s not intrinsic by definition, but the design goals of Enterprise arrays (look at the list above, particularly the “scale” and “any given I/O needs to be potentially serviced across a lot of possible ways” items) lead to different performance envelopes than the midrange.
This doesn’t always translate to “faster”.
With a given number of spindles…
- Enterprise arrays are typically able to drive more IOps (and therefore also MBps at a given I/O size) with a low number of I/O streams that are very random, with small I/O sizes (4-64KB)
- Midrange arrays are typically able to drive more IOps (and therefore also MBps at a given I/O size) with a low number of I/O streams that are more sequential, with larger I/O sizes
- As the number of simultaneous I/O streams increases, the architectures of Enterprise arrays tend to start to pull away.
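To make that last point concrete, here’s a toy model in Python. The numbers are purely hypothetical (neither `per_stream_mbps` nor `array_ceiling_mbps` is a measurement from any real array) – the point is just the shape of the curve: a single stream is capped by per-stream bandwidth, while the array’s aggregate ceiling is far higher, so the array only shows its strength as the stream count grows.

```python
# Toy model with HYPOTHETICAL numbers - illustrates why enterprise arrays
# tend to "pull away" as the number of simultaneous I/O streams increases.
def aggregate_mbps(streams, per_stream_mbps, array_ceiling_mbps):
    """Aggregate bandwidth: each stream adds its per-stream cap
    until the array's total ceiling saturates."""
    return min(streams * per_stream_mbps, array_ceiling_mbps)

# One clone job only ever sees the per-stream cap...
print(aggregate_mbps(1, 140, 4000))    # 140
# ...while many simultaneous streams approach the much larger array ceiling.
print(aggregate_mbps(40, 140, 4000))   # 4000
```

A single VMware clone is effectively `streams=1` in this model, which is why a “slower” mid-range array with a higher per-stream cap can win that particular race.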
If you apply this to the customer case…
Q: “why did the CX4-960 clone a large VM 5 times faster than their V-Max?”
A: “A VMware-level clone or deploy from template looks (at the array level) like a big sequential, large block, single I/O stream”
Furthermore, in going from ESX 3.0 to ESX 3.5, the I/O size used during large file copy operations increased dramatically – up to 32MB in some conditions. Now, some arrays do better with that, and some do worse. Typically mid-range array designs do better, and enterprise designs do worse.
Recognizing the symptom: this just shows up as “slow performance during clone (but not in steady state)”.
- vscsiStats looks OK during normal I/O operations (so it’s not just an “underprovisioned backend with insufficient spindles or meta objects”).
- ESXtop shows a low QUED (so it’s not a “micro-bursting problem causing ESX queue back-off” – read up on that here).
- The back-end analysis of the array shows that the spindles aren’t working too hard, and neither is the front-end.
With this particular V-Max configuration, we measured the sequential I/O bandwidth with a series of I/O sizes:
- 1MB – 38MB/sec
- 512KB – 38MB/sec
- 256KB – 140MB/sec
- 128KB – 135MB/sec
- 64KB – 120MB/sec
This is with the same back-end storage configuration, just changing the I/O size. Also note that this is a single-threaded workload. In the “real world” the array is servicing many I/O streams from many hosts at once, and the upper total limits of a V-Max aren’t measured in MBps, but in many GBps. But for any given I/O stream to a single LUN from a single host, you can see the bandwidth (which determines the “amount of time to clone the VM”): very large I/O sizes reduce the bandwidth that can be achieved.
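Another way to read the table above is to convert each measurement into I/Os per second (bandwidth divided by I/O size). A quick sketch, using only the numbers from the table:

```python
# Measured single-stream numbers from the table above: {I/O size in KB: MB/s}.
measurements = {1024: 38, 512: 38, 256: 140, 128: 135, 64: 120}

def iops(io_size_kb, mbps):
    """I/Os per second = bandwidth / I/O size (1 MB = 1024 KB)."""
    return mbps * 1024 / io_size_kb

for size_kb, mbps in sorted(measurements.items()):
    print(f"{size_kb:>4}KB: {mbps:>3}MB/s = {iops(size_kb, mbps):6.0f} IOps")
```

Note how few very large I/Os per second the single stream drives (only 38 per second at 1MB) – which is consistent with the back-end analysis above showing that neither the spindles nor the front-end were working hard.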
On a mid-range array, the reverse tends to be true.
The customer did their own testing with VMware and EMC working hand in hand and found the following:
- 256KB with 4 jobs – 59 minutes (V-Max), 25 minutes (CX)
- 128KB with 4 jobs – 33 minutes (V-Max), 25 minutes (CX)
If the number of clone jobs increased further with the 128KB I/O size, I would expect the V-Max to pull ahead.
Now, before everyone starts to pile on….
- These were large virtual machines.
- the amount of time to clone (in both configurations) would drop by a HUGE amount if they used zeroed-thick (see post here).
- the amount of time to clone (in both configurations) would drop by a LARGE amount if they were able to drive the I/O to more front-end ports and directors (using round robin)
Ok – so, there is a good KB article on this here:
In this case, using this advanced property setting on VI3.5 dramatically improved their “time to clone” a VM. While this was an EMC midrange to EMC enterprise case, the same behavior would generally hold across most mid-range to enterprise array models.
There’s an important epilogue here…
- I am not recommending that customers go out and make this change in general, but rather putting the info out there for education purposes. If you are not happy with performance and think something is wrong, call EMC or VMware support. We’ll open cases on both sides and resolve any issue.
- Decreasing the I/O size meant a faster clone – but there is a tradeoff. It means more IOps, which means more ESX host CPU consumption during the clone operation.
- The EMC engineering teams are working to make it so that this setting change would not be required in any case.
- This setting is not available in ESX 4.0. In vSphere, VMware completely changed the underlying vmkernel code used for these mass data movement operations – a cool, more modular design (called the Data Mover).
- In the NEXT version of vSphere, arrays will be able to hardware-offload this data copy completely (this is ONE of the vStorage APIs for Array Integration – VAAI) – in other words, the Data Mover recognizes a “VAAI compliant” array and then hardware-offloads that block operation to the array, which can batch together the moves and do it much, much faster. NetApp and EMC Celerra have the ability to do this now for file-level objects on NFS datastores (since we can do file-level snapshots). This is something different – this will literally be a “hardware offloaded vSphere operation” (and will apply to many operations – clone, template, Storage VMotion, etc.)
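As a footnote on the I/O size tradeoff mentioned above: halving the I/O size doubles the number of I/Os needed to move the same amount of data, and each I/O costs host CPU to issue and complete. A quick back-of-the-envelope sketch (the 100GB VM size here is hypothetical, just for illustration):

```python
def io_count(total_gb, io_size_kb):
    """Number of I/Os needed to move total_gb of data at a given I/O size."""
    return total_gb * 1024 * 1024 // io_size_kb

# Cloning a hypothetical 100GB VM:
print(io_count(100, 256))   # 409600 I/Os
print(io_count(100, 128))   # 819200 I/Os - twice as many, so more host CPU
```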
Thanks – hopefully this helps some customers out there!