This is the sister post to the one on a new bandwidth record – here.
At EMC’s mega-launch in January, we commented that 2011 would be the year of EMC breaking records. We weren’t kidding. See here, here, here, here, here, here, here, and so on.
So... with every major vSphere release, the EMC and VMware performance engineering teams get together and brainstorm: “what would be a ridiculous, over the top test to see where the performance envelope is today?”
This time – the gang at VMware (Chethan Kumar and others) and EMC Symmetrix Performance Engineering (Dan Ahroni and others) said “let’s see if we can break the 1,000,000 IOps barrier”. After all, you get to keep saying “one meeeellion” with the Dr. Evil pinky :-)
So they got cracking in the lab in Hopkinton. A few weeks later – here you have it:
The new world record benchmark for IOps through a reasonable vSphere 5 configuration is 1,000,000 IOps.
That’s around 4x the previous record. This is the story behind the story :-) Read on for more.
HOWTO: breaking 1,000,000 IOps through a single vSphere 5 host:
- Step 1: Get vSphere 5.
- Step 2: Get a monster server.
- Step 3: Make sure there’s no network bottleneck.
- Step 4: Get a transactional storage system that can scale out to leverage x86 and cache at scale with very low latency.
- Step 5: Put engineering teams in a room for 2 weeks – let sit.
Step 2 is a trick – you need a lot of cores, and a lot of oomph to drive this amount of IOps. Monster VMs run well on Monster Servers :-)
We reached out to Intel who nicely sent us a host up to the task. The test was completed on a server with these specs:
- vSphere 5 RTM code
- Four 2.40 GHz Ten-core Processors, each with: L2 Cache: 10X256 KB, L3 Cache: 30 MB
- System Bus Speed: 6.40 GT/s
- System Memory Size: 256.0 GB
- System Memory Speed: 1067 MHz
- Six dual port Emulex LPe12002-M8
- LightPulse X86 BIOS Version 2.02a1
If you do the math: that’s a total of 96 GHz, 10MB of L2 cache, 120MB of L3 Cache – in one server. Yikes.
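For anyone who wants to check that math, here is a minimal sketch that recomputes the totals from the spec list above. The constant and function names are made up for illustration, and "total GHz" is just cores times clock, a back-of-the-envelope figure rather than a real performance metric.

```python
# Per-socket figures copied from the spec list in the post.
SOCKETS = 4
CORES_PER_SOCKET = 10
CLOCK_MHZ = 2400          # 2.40 GHz, kept in MHz so the math stays in integers
L2_PER_CORE_KB = 256      # "10X256 KB" per processor
L3_PER_SOCKET_MB = 30


def aggregate_specs() -> dict:
    """Sum the per-socket figures into whole-server totals."""
    total_cores = SOCKETS * CORES_PER_SOCKET
    return {
        "total_cores": total_cores,
        "total_ghz": total_cores * CLOCK_MHZ / 1000,
        "total_l2_mb": total_cores * L2_PER_CORE_KB / 1024,
        "total_l3_mb": SOCKETS * L3_PER_SOCKET_MB,
    }


if __name__ == "__main__":
    for name, value in aggregate_specs().items():
        print(f"{name}: {value}")
```

The totals line up with the numbers quoted above: 40 cores, 96 GHz, 10MB of L2 and 120MB of L3.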
Step 3: To eliminate any possible network issue (this is an IOps and stress test, not so much a bandwidth test), we went with trusted and reliable gear from our friends at Emulex and Brocade. This is the fast and dirty cabling job (as you can see, this is a performance engineering lab, not a cosmetic lab):
Step 4: Get a transactional storage system that can scale out with x86 to more than cover the massive host and use a distributed cache model at scale with very low latency… Hey, sounds like a job for VMAX :-)
For those of you that don’t know VMAX – an 8 engine configuration is, in effect, a cluster of 16 “IO” servers (they come in pairs) attached to disks. This configuration had 1TB of cache – that’s a lot. Here’s the “heat map” of the VMAX under that load.
To say it wasn’t “getting its heart rate up” would be a stretch, but it wasn’t really breaking too much of a sweat. You can see headroom in the front-end and back-end CPUs, and the front-end and back-end ports. The disks were hot, but they were just keeping the cache warm. The average in-guest latency was about 1ms. Wow. Look – that’s a “from cache” number, but it also highlights the efficiency of vSphere 5’s IO stack (so little latency introduced) and the ability of the VMAX to scale its cache effectively.
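To put the 1ms number in perspective, here is a rough sketch using Little's Law (IOs in flight = throughput × latency). The post doesn't do this calculation, so treat it as an assumption-laden estimate: it assumes steady state and takes "about 1ms" and 1,000,000 IOps at face value. The function name is hypothetical.

```python
def in_flight_ios(iops: int, latency_us: int) -> float:
    """Little's Law: average IOs in flight = arrival rate * time in system.

    Latency is given in microseconds so the inputs stay integers.
    """
    return iops * latency_us / 1_000_000


if __name__ == "__main__":
    # 1,000,000 IOps at roughly 1ms (1000 microseconds) average latency.
    print(in_flight_ios(1_000_000, 1000))
```

Under those assumptions, the host keeps on the order of a thousand IOs outstanding at any instant.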
For the IOmeter workload, here’s how it was configured:
- 16 or 32 outstanding IOs
- The number of VMs running IOmeter was varied over the tests.
- 100% random and 100% read requests
- The size of the I/O requests varied between 512 bytes, 1KB, 2KB, 4KB, and 8KB depending on the experiment.
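To see why this is more of an IOps test than a bandwidth test, here is a small sketch of the bandwidth that 1,000,000 IOps would imply at each of the IO sizes listed above. The post doesn't say which IO size produced the record number, so the figures are purely illustrative; the helper name is made up, and MB here means decimal megabytes.

```python
# IO sizes from the IOmeter configuration above, in bytes.
IO_SIZES_BYTES = [512, 1024, 2048, 4096, 8192]


def bandwidth_mb_per_s(iops: int, io_size_bytes: int) -> float:
    """Implied throughput in decimal MB/s for a given IOps rate and IO size."""
    return iops * io_size_bytes / 1_000_000


if __name__ == "__main__":
    for size in IO_SIZES_BYTES:
        mb_s = bandwidth_mb_per_s(1_000_000, size)
        print(f"{size:>5} bytes -> {mb_s:,.0f} MB/s")
```

Even at the largest size in the list, a million IOps works out to a little over 8 GB/s, and at 512 bytes it is only about half a gigabyte per second.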
Net?
What’s 1,000,000 IOps? A LOT.
The gory details of the test are on the VMware Performance engineering blog (which is always posting great content – check them out often) here:
DISCLAIMER: Remember – this isn’t intended to be a “realistic workload”. It is intended to be a stress test. Realistically – there are few workloads that on a single host generate this kind of sustained IOps (yes, yes, they exist of course – but let’s be honest, they are rare). It’s not an irrational workload (for that we would have picked much smaller IO sizes). It’s the kind of workload that stresses vSphere 5 (including the IO stack), the network, and the EMC VMAX.
It’s also worth pointing out that:
- The VMAX wasn’t maxed: vSphere 5 drove it hard, but it still had headroom.
- This was done without weird guest, ESX host, network, or array tweaks. A good config, but not a tweaked config.
Why do we do this? Well, first of all, because it’s fun, and kinda cool :-) Second of all – what we’re highlighting is that when people say “I can’t virtualize workload ____ because of IO”, the actual limits are so far beyond the realm of mortal workloads that people should virtualize the things that matter with confidence.
Now – all kudos here go to Chethan and team, and Dan Ahroni and team!

It's really fun and amazing to see this "record break" on these technologies.
BTW, your link for PDF on http://www.vmware.com/files/pdf/techpaper/1M-iops-perf-vsphere5.pdf is broken
Posted by: Ammesiah | August 31, 2011 at 02:42 AM
It only required 960 disks in a VMAX to do this? Fusion-io can do 1,000,000 IOPS on a single card... in a single server. :)
Posted by: Simon Williams | August 31, 2011 at 09:07 PM
Waiting for the RAMSAN post about welcome. You know it's coming :). lol.
Good Job Chad. Awesome stuff once again. Bone-crushing performance. Was PowerPath / VE used? Would it have incurred more latency? Would it have helped the numbers? Just wondering :).
Posted by: Chappy | September 01, 2011 at 12:44 PM
How can one view the usage of backend CPU on a VMAX? I suspect we have some problems here on our systems.
Tore
Posted by: Tore | September 02, 2011 at 03:28 AM
Note - the link above for the study on VMware's site is broken... use the following link instead:
http://www.vmware.com/files/pdf/1M-iops-perf-vsphere5.pdf
Posted by: Tom Twyman | September 06, 2011 at 10:00 AM
@Tore - it's pretty easy. you can use Symmetrix Performance Analyzer. If you're stuck, please open a support case!!!
@Tom - thanks - link fixed.
@Chappy - you bet - and sure enough, right before you, FusionIO piped up :-) Hey, we're all predictable :-)
@Simon - thanks for the comment. The "single server" thing is the Achilles' heel in some use cases, isn't it? This is why I think any vendor who says flash in one way, in one place is missing it. When it comes to the simplest path to pure IOps and low latency, there is no substitute for host-based flash (commodity the Intel way, commodity the OCZ REVO way, the Fusion IO way, and the EMC Project Lightning way).
The downside is that if the info is "captive" to the host, you lose a lot. You lose a lot of VMware function, you lose common models for data protection. And, of course, in the real world, workloads tend to be all over the map in terms of performance needs - and even vary over time. These are the use cases where shared flash models make a lot of sense. Thanks for the comment!
Posted by: Chad Sakac | September 08, 2011 at 10:11 AM