The VMworld 2010 Hands-on-Labs (HoL) were EPIC this year. By now, everyone has heard the final stats:
- ~140,000 VMs created and destroyed over 4 days
- ~5000 labs done
- ~5 people completed all 30 labs – about 30 hours of lab time, meaning each of them spent roughly 8 hours in the lab on each of the 4 days.
It’s also known that the VMware HoL was supported in a “hybrid cloud” fashion – using an on-premises set of kit (at Moscone Center in San Francisco), plus public clouds from Verizon (in Reston, VA) and Terremark (in Tampa, FL).
You can read more on Duncan’s blog here.
Feedback I heard from most folks was that it was a very positive experience – better than last year.
Like any of these sorts of things, it wasn’t perfect. The batphones started ringing furiously on Sunday (the GETO crew calling all partners – and the EMC phones were ringing). There were weird performance issues throughout the week (never entirely pinpointed), and the WAN links were undersized for all the PCoIP traffic, so periodically they would get slammed. BUT, all that said, it was an amazingly positive experience for VMworld attendees – I think most didn’t even notice the things we all agonized over :-)
Anyone who has ever supported one of these knows that something ALWAYS goes wrong, so it’s more about: a) how you minimize risk; b) how you roll with the inevitable snafus; and c) how you work together to resolve them in the end.
Dan Anderson from VMware covers it all well here:
My hat’s off to VMware – specifically to Mornay, Dan, Tim and all the supporting crew, and to the folks who built the LabCloud tools used for all the scheduling and orchestration (Clair and Curtis).
Now – like every VMworld HoL – there was a massive contribution from VMware’s partners in support of the effort. We all step in knowing that we take a backseat. Our job is to do our best to make VMware look good, and to do it behind the scenes.
In this case, EMC’s primary contribution was three very powerful EMC Unified NS-960 arrays providing both NFS and VMFS storage, and leveraging everything we could possibly provide – all the FLARE 30/DART 6.0 goodness, including huge FAST Caches, FAST (automated tiering), boatloads of 10GbE interfaces, and solid-state out the yin-yang.
The exact configurations for each of the 3 locations:
- FLARE: 04.30.000.4.008
- DART: 6.0.36-2
- SPA Read Cache Size = 433 MB
- SPB Read Cache Size = 433 MB
- SPA Write Cache Size = 8,665 MB
- SPB Write Cache Size = 8,665 MB
- FAST Cache configured in Read/Write mode = 2,000,000 MB (2TB – 10 x 200GB solid state disks)
- 4 10GbE Interfaces (2 datamovers, each with 2 x 10GbE, configured in an LACP config)
- Bus 0/0: 15 x 450GB FC = 6.75TB RAW
- Bus 1/0: 15 x 450GB FC = 6.75TB RAW
- Bus 2/0: 15 x 2TB SATA = 30TB RAW
- Bus 2/1: 15 x 2TB SATA = 30TB RAW
- Bus 3/0: 15 x 200GB EFD (10 for FAST Cache) = 3TB RAW
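If you want to sanity-check the capacity math, here’s a quick sketch (my own arithmetic, just summing the bus list above – the dict labels are mine):

```python
# Per-site RAW capacity (TB) by bus/enclosure, taken from the list above
shelves_tb = {
    "Bus 0/0: 15 x 450GB FC": 6.75,
    "Bus 1/0: 15 x 450GB FC": 6.75,
    "Bus 2/0: 15 x 2TB SATA": 30.0,
    "Bus 2/1: 15 x 2TB SATA": 30.0,
    "Bus 3/0: 15 x 200GB EFD": 3.0,
}
per_site_tb = sum(shelves_tb.values())  # 76.5 TB RAW per site
total_tb = per_site_tb * 3              # 3 sites -> 229.5 TB RAW
print(per_site_tb, total_tb)            # 76.5 229.5
```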
That’s a total of ~229TB RAW capacity across the 3 locations – though, as anyone knows, it’s much more about performance than it is about capacity.
So – what exactly did the arrays do? Well – one of the guys on the vSpecialist team (Clint Kitson) whipped up a tool that captured all the stats (we also used Clint’s tool in the EMC HoL to keep a running public tally of what all the kit in the EMC booth was doing). Here is the capture as of 5pm on Thursday as everything was winding down (look at how balanced everything was across the 3 datacenters!).
- globalVol Write Requests 4,898,699,700
- NFS Op v3Write NFS Op Calls 2,993,986,770
- globalVol Read Requests 2,741,607,090
- NFS Op v3Read NFS Op Calls 4,680,130,890
- Total Network Traffic 150.412 TB
- Total Disk Traffic 88.942 TB (39.2 TB read, 49.74 TB write)
- Avg Read size 19.48KB
- Avg Write size 8.6KB
- Avg CPU usage 18% (not too much heavy lifting)
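As an aside on how figures like the average I/O sizes fall out of cumulative counters like these – this is just a sketch of the arithmetic, not Clint’s actual tool (the function name and sample values are mine):

```python
def avg_io_size_kb(total_bytes: float, requests: int) -> float:
    """Average I/O size in KB: cumulative bytes moved divided by request count."""
    return (total_bytes / requests) / 1024.0

# Hypothetical sample values, purely for illustration
print(avg_io_size_kb(2_048_000, 100))  # 20.0
```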
How much of an effect was the FAST Cache? MASSIVE.
Here’s a Heatmap (utilization of all the elements in the array) from one of the sites:
There were no forced flushes (where the write cache hits a high watermark and forces a flush to disk – at which point the host response time becomes the backend response time) – there’s NO WAY we could have supported that workload with that number of spindles without FAST Cache.
The only blemish was that the FC interfaces were maxed out for a good chunk of the time. Next time, we can balance the backend config a little more.
Lastly – while it was a broad team effort on behalf of EMC to support our partner VMware, my hat’s off to a few folks on the EMC team who were instrumental. Chris Horn – your work with the VMware GETO team was fantastic. Stephen Spellicy – your work coordinating everything, working on demos and pulling the whole thing off was incredible, super-human. Clint Kitson and Nick Weaver – you guys are both so good, you’re officially in the super-freak category. I’m incredibly proud of the whole team.