
March 21, 2012




Sweet! We recently tried cutting the links between two sites in a stretched cluster while preparing demos for a fair, and had huge discussions about why it wouldn't work... since then we've been eagerly waiting for the update :-)

Mark Burgess

Hi Chad,

This is great news.

Are there any other HA stretched cluster issues remaining?

I was told that a complete failure of a VPLEX cluster (very unlikely) would result in an APD (All Paths Down) condition and HA would not kick in.

Do you have a time-scale for when the remaining issues will be resolved?

Would you now recommend that customers deploy VPLEX VMware HA stretched clusters? I know that in the past you felt vSphere was not quite ready and that most of the time customers would be better off with SRM.

Many thanks

Lee Dilworth

I did post some comments but they've been lost in space :(

anyway... to answer the last comment: if your two sites are within synchronous replication distance of each other (which they need to be for most stretched setups), then some basic questions can usually help you figure out which solution might work for you.

- are the two datacenters / sites a few metres apart? miles apart? on the same campus?
- if they are *really* close (metres/campus), would we still fail a DR audit after putting the DR solution in place, because the recovery location is too close?
- do we have a flat network between the sites/datacenters?
- do we need the ability to test failover scenarios non disruptively?
- do we need a solution that provides non disruptive mobility?
- can we tolerate planned outages for the times we need to move workloads?
- do we need a solution that can support different address spaces at either site?
- do we have the knowledge and skills to understand the effects of partitions / site down events and the various failure scenarios associated with a stretch setup? and do we have the capacity to POC these properly?
- is our network set up to handle the stretched configuration? do we understand traffic "trombone", and can we correctly handle network partitions that might affect the cluster?

if you are considering a stretch setup for any of the following reasons, these are usually loud warning bells to me in customer meetings, and they trigger me to ask more questions about the customer's understanding of what they are getting into. so do any of these sound familiar:

- we want to be able to vmotion between sites
- the stretch solution is cheaper
- the DR solution is more complex
- the stretch solution does not require the same ongoing monitoring and maintenance as the DR solution so it seems easier
- what are the chances of losing both links between the sites?
- HA is just like an instant failover like fault tolerance right? (yes i've heard that one)
- if the site goes down everything just vmotions over (yes i've heard that recently as well)
- sequencing of VM restart isn't that important; if applications don't start up or services fall over because dependent VMs weren't "up" yet, we'll just log in and start them, or script that.... make sure you do, as doing this for 10 VMs might be fine.... 100? 500? 1000?
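
To give a feel for what "script that" really involves, here is a minimal Python sketch of the ordering part only. It assumes a hand-maintained dependency map (the VM names and the `DEPENDS_ON` structure are made up for illustration) and works out restart "waves" with the standard library's `graphlib`. It does not talk to vCenter or power anything on, and it is not how HA or SRM do it.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each VM lists the VMs that must be up before it.
DEPENDS_ON = {
    "web01": {"app01"},
    "web02": {"app01"},
    "app01": {"db01", "dns01"},
    "db01": {"dns01"},
    "dns01": set(),
}

def restart_waves(depends_on):
    """Group VMs into waves; each VM only depends on VMs in earlier waves."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything startable right now
        waves.append(ready)
        sorter.done(*ready)  # pretend the whole wave came up cleanly
    return waves

waves = restart_waves(DEPENDS_ON)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

The ordering is the easy bit. Someone still has to keep the dependency map accurate as VMs are added and removed, and the script has to wait for each wave to be genuinely "up" before it starts the next one.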

the point is that a stretched or, as one of my peers in the EMC VPLEX team (Olly I'm officially stealing your term) calls it, a federated HA setup is VERY different to a DR solution based on SAN replication, where you rely on something like SRM to provide the orchestrated replay during a failover.

federated HA solutions, or "vSphere Metro Storage Clusters" as we now call them at VMware, are NOT just about the storage layer. they are just as much about the network layer and the vSphere layer, and the workings of vSphere HA are VERY important to understand.

at VMware we are currently working on some papers to help customers who do fall into the category of being suited for a stretched setup get a better understanding of how to manage it correctly. that includes how to design the objects in vCenter (datastores / datastore clusters / naming conventions) so that the inventory is more logical for a stretch setup, so that things like site locality are simpler to configure and enforce, and so that once configured you have an easier way to see "what is where" at any one time. it's not until you build one of these environments that these nuances become apparent. Trust me, we did it recently with a small 30 VM setup and it got confusing... imagine if there were 10000 VMs!!!
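
As a rough illustration of why naming conventions matter for seeing "what is where", here is a small Python sketch. Everything in it is an assumption for the example: the inventory rows are invented, and the convention (host and datastore names starting with a site code and a hyphen) is just one possible scheme, not something vCenter provides or enforces.

```python
from collections import defaultdict

# Hypothetical inventory rows: (VM name, host it runs on, datastore it lives on).
# Assumed naming convention (not from any VMware tool): host and datastore
# names start with a site code followed by a hyphen, e.g. "sitea-esx01".
INVENTORY = [
    ("erp-db01", "sitea-esx01", "sitea-ds-gold01"),
    ("erp-app01", "sitea-esx02", "sitea-ds-gold01"),
    ("mail01", "siteb-esx01", "siteb-ds-silver01"),
    ("web01", "siteb-esx02", "sitea-ds-gold02"),  # compute and storage split
]

def site_of(name):
    """Return the site code encoded at the start of a host or datastore name."""
    return name.split("-", 1)[0]

def what_is_where(inventory):
    """Return (VMs grouped by compute site, VMs whose storage is at the other site)."""
    by_site = defaultdict(list)
    split_vms = []
    for vm, host, datastore in inventory:
        compute_site = site_of(host)
        by_site[compute_site].append(vm)
        if site_of(datastore) != compute_site:
            split_vms.append(vm)
    return dict(by_site), split_vms

by_site, split_vms = what_is_where(INVENTORY)
for site in sorted(by_site):
    print(site, "->", ", ".join(by_site[site]))
print("running away from their storage:", ", ".join(split_vms) or "none")
```

A report like this only works for as long as everyone sticks to the convention, which is exactly the operational discipline point.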

so what else needs to be managed to ensure your cluster behaves correctly or has some kind of even site bias / balance:

- DRS affinity groups need to be set up and maintained as you add/remove VMs to/from the estate (for large setups you really need to automate that into the provisioning process for both hosts and VMs)
- datastore clusters are useful to help with locality
- heartbeat datastores should be increased from 2 to 4, selecting 2 per site
- HA restart priority settings should be configured and maintained on an ongoing basis IF you want to maintain any kind of control over restart sequencing. remember you do not have a recovery plan style runbook as you would in SRM, so you need to realise HA restarts are not that organised. in fact the restarts could be, and usually will be, different every time, as which VMs get restarted depends on what failed. this might be an issue if your apps have a specific startup dependency and you're dealing with 100s or 1000s of VMs.
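
Here is a minimal Python sketch of the kind of housekeeping check the first and third points imply. It is only an illustration under stated assumptions: the VM, group and datastore names are hypothetical, the `<site>-vms` group naming is invented for the example, and the data is hard-coded rather than pulled from vCenter.

```python
# Hypothetical source of truth: which site each VM should prefer.
DESIRED_SITE = {
    "vm-a1": "siteA",
    "vm-a2": "siteA",
    "vm-b1": "siteB",
    "vm-new": "siteB",  # provisioned recently, not yet in any group
}

# Hypothetical current DRS VM group membership, using "<site>-vms" group names.
DRS_VM_GROUPS = {
    "siteA-vms": {"vm-a1", "vm-a2", "vm-old"},  # vm-old was decommissioned
    "siteB-vms": {"vm-b1"},
}

# Hypothetical datastores available at each site.
DATASTORES_BY_SITE = {
    "siteA": ["siteA-ds01", "siteA-ds02", "siteA-ds03"],
    "siteB": ["siteB-ds01", "siteB-ds02"],
}

def group_drift(desired_site, drs_groups):
    """Return, per group, the VMs to add and the stale VMs to remove."""
    drift = {}
    for site in sorted(set(desired_site.values())):
        group = f"{site}-vms"
        wanted = {vm for vm, s in desired_site.items() if s == site}
        current = drs_groups.get(group, set())
        drift[group] = {
            "add": sorted(wanted - current),
            "remove": sorted(current - wanted),
        }
    return drift

def pick_heartbeat_datastores(by_site, per_site=2):
    """Pick the same number of heartbeat datastores at every site."""
    picked = []
    for site in sorted(by_site):
        candidates = sorted(by_site[site])
        if len(candidates) < per_site:
            raise ValueError(f"{site} has fewer than {per_site} datastores")
        picked.extend(candidates[:per_site])
    return picked

drift = group_drift(DESIRED_SITE, DRS_VM_GROUPS)
heartbeat = pick_heartbeat_datastores(DATASTORES_BY_SITE)
print(drift)
print(heartbeat)
```

Run regularly (or as part of provisioning), a check like this shows group membership drifting before a site failure exposes it.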

Bottom line. Stretched setups are not new. vSphere 5.0 includes features to improve the day to day use. Make sure you understand the differences between a stretched setup and a DR solution, and choose what's right for you and, most importantly of all, what's acceptable to the team running the system. the success of the solution and its effectiveness in protecting your virtual infrastructure will be directly affected by the ease of use and the "buy in" of the ops team. Without that, trust me, it'll fall apart in days. I always think back to when I used to write scripts for DR. Scripts worked great on day 1, customer was happy and nodded enthusiastically when I pointed out naming conventions etc that scripts looked for during recovery.... on day 2, naming conventions went out of the window, scripts stopped being useful and were broken, and recovery was now massively at risk. This can happen in exactly the same way with HA solutions and DR solutions. Only the customer truly knows which one they feel they can honestly cope with and commit to.



  • The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by Dell Technologies and does not necessarily reflect the views and opinions of Dell Technologies or any part of Dell Technologies. This is my blog, it is not a Dell Technologies blog.