It’s kind of funny – in our small but vibrant VMware community, there is totally a certain “zeitgeist” where people simultaneously think about topics. Look at this:
- I have a tickler on my calendar that said “do a post on the new vMSC category”
- Today, I see that the always excellent Duncan Epping did a great post on vSphere 5.0 HA and metro/stretched cluster solutions.
- There’s also Lee Dilworth from VMware and me, presenting on this topic at VMworld 2011 (and updating it for VMworld 2011 in Copenhagen, Oct 18-20) here.
- The always excellent Scott Lowe did a post on his Updated Stretched Cluster presentation.
- I get an email internally where a customer is confused on this topic…
So – what’s driving this “synchronized zeitgeist”?
Answer – a lot of internal work that’s coming to a head, plus more and more stretched cluster deployments. Net result: these use cases are now becoming much more mainstream, and simpler.
1) There is a new VMware HCL category “vSphere Metro Storage Cluster”. This is the result of a LOT of work. I’ve created a link to it here. This is interesting – in the sense that prior to this, while there was a lot of vendor work, info, and KB articles for this, there was no formal test harness for the configurations and failure scenarios associated with stretched vSphere clusters (look at Duncan’s post for a view of some of the major failure scenarios). Together, we built a whole test harness that is now the standard for this, and will be very useful going forward.
2) The VMware KB articles around this use case are now a LOT simpler. This is due to a ton of the changes in vSphere 5 HA and DRS behavior that Duncan points out. You can see the updated KB article by clicking below. Remember, DRS host affinity rules are now supported in this use case when using vSphere 5, so go ahead and use ‘em.
3) Lots of work on hardening EMC VPLEX around this use case… (more combined zeitgeist – just in from the product team)
- GeoSynchrony 5.0 introduced the VPLEX Witness: it allows for discrimination between site failure and site partition, which means that I/Os are automatically enabled on whichever side survives a failure event.
- update: Note – VPLEX GeoSynchrony is the software that runs VPLEX. VPLEX has 3 major operating modes for every device it presents – Local (in one datacenter), Metro (stretched across synchronous distance) and Geo (stretched storage models across asynchronous distance). As of right now, vMSC is not supported with Geo – only with Metro. VMware/EMC continue to work on this, perhaps I’ll do a post on the technical challenges. It works while there is no partition, but the partition scenarios are very, very funky indeed, and EMC and VMware agreed that we want to keep it simple for now – Metro-distances only.
- VPLEX in GeoSynchrony 5.0 changed I/O Suspension behavior. vSphere 5.0 now recognizes this behavior as a PDL (Permanent Device Loss) condition. Net result: when a split brain condition does occur, vSphere can now determine which side of a VPLEX is alive and therefore restart VMs (and I/Os) on that side.
- Note - this still isn’t working 100% right, and will be updated in a future vSphere patch. Short version: the "vm kill on PDL" behavior doesn’t work perfectly, and still results in APD (All Paths Down) in some corner scenarios. We’re working on this very closely, but people can move forward with confidence without waiting. BTW, before you ask (a good question that was asked internally): does enabling VM Failure Monitoring in vSphere HA fix this behavior? No.
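To make the witness idea above a bit more concrete, here is a toy model of the decision a third-site witness enables: telling a site failure apart from a site partition. This is a minimal sketch and not VPLEX's actual algorithm. The function name `arbitrate`, its parameters, and the fallback to a static preferred site when the witness can't see anyone are all assumptions made for illustration.

```python
from enum import Enum


class Verdict(Enum):
    BOTH_CONTINUE = "both sites continue I/O"
    SITE_A_CONTINUES = "site A continues I/O, site B suspends"
    SITE_B_CONTINUES = "site B continues I/O, site A suspends"


def arbitrate(inter_site_link_up, witness_sees_a, witness_sees_b, preferred="A"):
    """Toy witness arbitration (hypothetical, for illustration only).

    inter_site_link_up: can the two sites talk to each other?
    witness_sees_a / witness_sees_b: can the witness reach each site?
    preferred: statically configured site bias, used as the tie-breaker.
    """
    by_preference = (
        Verdict.SITE_A_CONTINUES if preferred == "A" else Verdict.SITE_B_CONTINUES
    )

    if inter_site_link_up:
        # No partition and no site failure: nothing to arbitrate.
        return Verdict.BOTH_CONTINUE

    if witness_sees_a and witness_sees_b:
        # Both sites are alive but cannot see each other: a partition.
        # Only one side may keep writing, so fall back to the site bias.
        return by_preference

    if witness_sees_a:
        # Site B is gone (or fully isolated): A is the survivor.
        return Verdict.SITE_A_CONTINUES

    if witness_sees_b:
        # Site A is gone (or fully isolated): B is the survivor.
        return Verdict.SITE_B_CONTINUES

    # Witness sees neither site: assume static bias is all that is left.
    return by_preference


if __name__ == "__main__":
    # Partition: both sites alive, link down, bias decides.
    print(arbitrate(False, True, True, preferred="B").value)
    # Site failure: B is down, so A continues even though B was preferred.
    print(arbitrate(False, True, False, preferred="B").value)
```

The point of the sketch is the middle two branches: without a witness, a surviving non-preferred site cannot distinguish "the other side died" from "the link died", so it has to suspend. With a witness, the survivor can be enabled automatically.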
Short version – look at the “what way should you go forward” guidance in Scott’s presentation, or BCO2479 (Lee and I) to determine the right solution for disaster avoidance/disaster recovery (vMSC or SRM). If vMSC is the right answer, move forward with confidence – it’s getting better and better every day.
Also, know we’re not stopping here. Lee and I discuss some of these things in BCO2479… Together, EMC and VMware are looking at how we could “surface” site bias via VASA, and how we could make this automatically configure things like host affinity rules. We’re working to make multi-site scenarios (rather than 2-site) work better, including hybrid solutions of vMSC for two metro-separated sites and 3rd async sites with SRM, as well as future multi-site vMSC scenarios. Lots of great stuff!!!
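As a rough illustration of what "site bias driving host affinity rules" could look like, here is a small planning sketch. It does not call any VASA or vSphere API, and nothing here reflects how a real integration would be built. The function `plan_affinity_groups` and all the VM, datastore, host, and site names are hypothetical. It just groups VMs with the hosts at the site their datastore is biased toward, which is the grouping you would then express as a DRS "should run on" rule.

```python
from collections import defaultdict


def plan_affinity_groups(vm_datastore, datastore_bias, host_site):
    """Group VMs and hosts by site (hypothetical planning helper).

    vm_datastore:   {vm name: datastore name}
    datastore_bias: {datastore name: preferred site}
    host_site:      {host name: site}
    Returns {site: {"vms": [...], "hosts": [...]}}, one entry per site
    that has at least one VM biased toward it.
    """
    hosts_by_site = defaultdict(list)
    for host, site in host_site.items():
        hosts_by_site[site].append(host)

    vms_by_site = defaultdict(list)
    for vm, datastore in vm_datastore.items():
        # A VM follows the site bias of the datastore it lives on.
        vms_by_site[datastore_bias[datastore]].append(vm)

    return {
        site: {"vms": sorted(vms), "hosts": sorted(hosts_by_site.get(site, []))}
        for site, vms in vms_by_site.items()
    }


if __name__ == "__main__":
    plan = plan_affinity_groups(
        vm_datastore={"vm1": "ds-a", "vm2": "ds-b", "vm3": "ds-a"},
        datastore_bias={"ds-a": "site-a", "ds-b": "site-b"},
        host_site={"esx1": "site-a", "esx2": "site-a", "esx3": "site-b"},
    )
    for site, group in plan.items():
        print(site, "VMs:", group["vms"], "-> hosts:", group["hosts"])
```

Keeping VMs on the side their storage prefers means that, in a partition, the VMs are already running where I/O will continue.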
Comments and input welcome – are you using or contemplating stretched vSphere clusters?