This is part 2 of a 3-part blog post. The first one categorized various “disaster avoidance + disaster recovery” solutions. The 3rd post will talk about where R&D and product development are taking us in the world of Disaster Avoidance and Disaster Recovery.
BTW – thank you to the many folks on the EMC and VMware teams who have given great input on posts I and II.
I’d strongly suggest starting with “Post I” – which you can read here.
OK – are you back?
While my personal favorites from Post I lead you to decide you want either:
- Simplest solution, cleanest support model, best DR testing, and best RPO/RTO across the broadest set of failure conditions (it’s hard to argue with that list) – and therefore pick “solution 3” – the “classic Site Recovery Manager” solution.
OR
- If you HAVE to have non-disruptive workload mobility, but still want a flexible solution offering disaster recovery as well, you need to accept more complexity. SRM is OUT because non-disruptive vMotion demands a single vCenter domain (with ANYONE’S storage model).
REMEMBER – CHOOSE WISELY HERE. Yes, SRM by definition means that it’s disruptive, but many, many customers use SRM as a highly automated, highly testable (though disruptive) mass workload mobility tool – migrating many, many VMs between datacenters.
If you have given up on SRM because it’s got to be non-disruptive, that in turn demands that you are willing and able to sign up for the extra complexity of scripting and solution maintenance. If you’re in this category anyway, I would personally recommend “solution 2” – the “multiple separate VMware clusters” approach. That’s covered in this KB article.
… But, I can’t deny the interest people have in “solution 1” – the “single stretched VMware cluster” in spite of all the downsides I pointed out in Post I (strongly suggesting I’m wrong :-).
Let’s EXAMINE why people dig “solution 1 – Stretched HA clusters” (remember – all these have pros and cons):
- Sometimes, customers don’t have a second site, but rather a single campus. In that case, you’re often accepting that if it’s a total site disaster, you’re SOL – but “building” disasters are possible.
- While site partitions are certainly more frequent than site disasters, they are far less frequent than host or path failures. If that’s the case, this solution gets you something pretty good – host/path failures redistribute loads between sites as the VM HA response kicks in.
- Versus SRM, VM HA provides fully automated restart at host-level granularity (and, with VM-level monitoring, at per-VM granularity) – i.e. restart happens on a per-host/per-VM basis, not as a full recovery plan being executed. Note that while “fully automated” is OK at this level of granularity, it tends to be a bit more dangerous when you’re talking about a full site disaster – hence the “operator hits the ‘declare disaster’ button” model of SRM.
So – this post dives into the topic of Solution 1 – Stretched VM HA clusters more specifically. The topic, including tested and supported behavior in all sorts of failure modes (in this case with EMC VPLEX) is covered in this KB article.
The post will cover not only the VM HA topics, but will also deal specifically with EMC VPLEX, the partition state, and what it means for vSphere VM HA behavior. These are inextricably linked, and if you are not using EMC VPLEX, I would strongly suggest reaching out to your storage vendor and looking for similar info.
The cue that you’re looking at the right docs as opposed to marketing is that the discussion includes the “not so nice” parts, and is filled with technical stuff :-)
Then let’s dive into it – read on.
There are two core things to understand about stretched VM HA clusters:
- The worst state for any VM HA cluster (or any cluster for that matter) is a split brain – where both sides think they are authoritative and continue to operate. This is bad for the obvious reason you’d think of (“hey, the VM – or more generically, the service – would be running in two places – that’s not right!”), but the more important reason is that after the two sides have been running separately for any period of time, they have diverged. Once diverged, “re-merging” the service, the VM, or the information (the VMDK) is, for all intents and purposes, impossible.
Think of it this way: split brain is worse than an outage, because it creates an EXTENDED outage (as you try to unravel what is authoritative, or the “most correct/least incorrect”)
- You need to look at the storage model as a cluster underneath the VM HA cluster. The behavior of the storage cluster in its various partition states interacts with the behavior of the VMware cluster in its various partition states. Also (and this really is critical): you can have a split brain at the host level (the same VM running on two hosts at once), at the VMware cluster level, and/or at the storage level – or any combination thereof. For example, the storage cluster could be partitioned (the interconnect at the storage layer fails due to a fault or an administrative error) while the ESX hosts are all still able to reach each other over the management network (and therefore are all still connected from a VM HA perspective) – and then, DEPENDING on the storage model, you may have SOME of the VMs on SOME of the ESX hosts fail.
Think of it this way: IF there were two states of a VMware cluster (working/partitioned), and two states of the storage cluster (working/partitioned), you would have 4 possible “system states.” And… the reality is that each doesn’t have just 2 states, but rather MANY states.
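To make that concrete – even the over-simplified 2x2 view gives you four distinct “system states” to reason about (and these map directly to the scenarios we’ll walk through later):
- VMware cluster OK + storage cluster OK – normal operation.
- VMware cluster partitioned + storage cluster OK – the ESX hosts can’t see each other, but storage is still presented at both sides.
- VMware cluster OK + storage cluster partitioned – the ESX hosts are all talking, but the storage cluster has split, and depending on the storage model, some hosts lose some devices.
- VMware cluster partitioned + storage cluster partitioned – the full site partition (or “smoking hole”) case.
…and each of those multiplies again once you add partial failures, the inter-site network model, and so on.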
Check. Split brain = bad. VM HA cluster and storage clusters = inextricably linked.
Add to the equation, of course, the network model between the sites (for both the data network and the storage interconnect).
Roger. This is NOT a simple topic.
Ok – let’s dive a little deeper. Every type of cluster you’ll run into uses one or more of a variety of methods to ensure that split brain never happens, but that behavior must always be defined BEFORE the failure (as obviously, once a failure occurs, there may be no communication between the sides).
In the case of VM HA, this is a function of the default isolation behavior of VM HA (shutdown) and the modus operandi of VM HA primary and secondary nodes – on which there are many great sources… I would start here.
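(For reference – and this is just an illustrative sketch, NOT a stretched-cluster recipe – these are the kinds of VM HA advanced settings people typically look at when thinking about isolation behavior; the addresses are placeholders, and you should check the docs for your vSphere version:)

das.usedefaultisolationaddress = false     # don't rely only on the default gateway for the isolation check
das.isolationaddress1 = <IP pingable at site A>
das.isolationaddress2 = <IP pingable at site B>
das.failuredetectiontime = 60000           # milliseconds before a host declares itself isolated (default is lower)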
In the case of EMC VPLEX, you have a series of VPLEX nodes in a cluster, connected via an “intra-cluster communication mechanism”. This communication occurs over an FC connection, or FC/IP if you have an FC/IP converter (which can be had for only a few $K – check out this). In future VPLEX releases, this could be native IP.
Across the cluster, there is a distributed, coherent cache image, and it is used to present a virtual volume across all the ports, and all the nodes in the cluster (a scale-out model).
If you have two sites that are geographically dispersed and are using EMC VPLEX Metro (synchronous-class distances), you have TWO EMC VPLEX clusters in a “Metro-Plex”, and you create a distributed virtual volume that spans BOTH clusters (and depends on the intra- and inter-cluster communication noted above).
This is what a screenshot of the management UI looks like. You can see there are two VPLEX clusters (cluster-1 and cluster-2).
When you create a Distributed Virtual Volume, you define one of the sides as the “preferred site” and the other as the “non-preferred site”. The rule set (note: it’s a device/LUN-level thing, not a cluster-wide thing) determines, for that particular Distributed Virtual Volume, which cluster will detach from the Metro-Plex for that device and therefore keep servicing IO. As an example, in the screenshot below, you can see that for Distributed Virtual Volume DR1_device1, if VPLEX cluster-1 and VPLEX cluster-2 partition – cluster-2 will detach from the Metro-Plex and continue to support IO for device DR1_device1.
What happens to the “losing side”? The “non-preferred site” takes those particular volumes out of a read/writable state and into a read-only state (in the example above, that affects all the hosts using cluster-1), ensuring that there is NO possibility of split brain from an IO standpoint.
Ok, YES, I know the wording here is a little confusing. Let me put it this way: the “cluster that detaches” is the one that continues to service IO. The other one is the loser in a partition and will go into a read-only state.
Now, for people who don’t look at this for a living, there’s a subtlety that often gets missed. TODAY there’s no way for VPLEX (or VM HA for that matter) to differentiate a “failed inter-site communication link” (which happens disturbingly often) from “one site is a smoking hole” (which happens more rarely). Other options for this sort of thing are 3rd-party “voters” (which have some sort of “keepalive” signal to the two sites – though 3rd-party voters can add as much complexity as they eliminate).
That difference may not seem material, but it absolutely is. In the case of a “failed inter-site communication link” state, both sides have partially working infrastructure (with VMs up and running still), so the possibility of split-brain exists in that case, but not in the “smoking hole” state (where the ESX hosts and their VMs are all vaporized by the disaster).
BTW – HIGHLY recommended reading before going on is to understand VM HA and isolation response. Duncan’s posts on the topic are very good – so once again, if you haven’t… start with this one. Read well (with the exception of the non-supported options, which are good to know about, but VERY bad to use).
Let’s examine some scenarios:
- A scenario: you have an inter-site communication failure between the ESX hosts, but the VPLEX clusters are still happily talking. Then you have the VM HA isolation response start to kick in on the ESX hosts, while the storage continues to be presented at both sides.
- Another scenario: you have an inter-site communication failure between the VPLEX clusters, but the ESX hosts are all still happily talking. Then you have the VPLEX isolation response (called a partition) kick in, which changes the behavior of the VPLEX-presented storage on the non-preferred site.
So – what is the behavior of the VPLEX cluster when that partition state kicks in?
The non-preferred VPLEX cluster immediately places the Distributed Virtual Volume that it is presenting into a read-only mode. This is called a “suspended” state – which ensures there is no chance of split-brain.
The downside is that in the VMware use case, this is not ideal, because it has the same effect as a datastore target being “yanked” (a path failure or catastrophic array failure) as opposed to being actively communicated as “down”. The ESX host will NOT immediately shut down the VM. There is zero chance of split-brain from a DATA standpoint, as the VM cannot write any persistent changes, but in the IDEAL case (hint: not the real-world case), the VM would instead immediately go offline in a more predictable way.
So – what happens if a storage device just disappears from an ESX host? Rather than explaining, here’s a video (out-of-box ESX host and W2K8 behavior):
You can download it in high-rez WMV and MOV formats.
Summary:
When storage is “yanked” from an ESX host, the VM guest behavior is best described as “indeterminate”. It will likely become unresponsive at the console. It may continue to respond to pings, and some services on some ports may still be running in some strange form. If you do a bus rescan, the storage device disappears, but even that won’t kill the VM by definition. Shutting down the VM (via vCenter) seems to be possible only if the storage device is present – so it may not be possible at all. Wow. I bet that’s different than you expected.
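(If you want to poke at this yourself from the ESX 4.x service console, here’s a rough sketch – the adapter name is a placeholder, and exact command names and output vary by ESX version:)

esxcfg-rescan vmhba2              # rescan the HBA - a "yanked" device will drop out of the device list
esxcfg-mpath -l | grep -i dead    # look for paths that are now marked dead
esxcfg-scsidevs -m                # map the remaining devices to their VMFS volumes (datastores)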
What happens if a storage device becomes read-only (which is what happens today with a VPLEX non-preferred site on partition)? Rather than explaining, here’s a video:
You can download it in high-rez WMV and MOV formats.
Summary:
When a storage device goes read-only, the VM guest behavior is ALSO best described as “indeterminate”. It’s basically the same as “yanking” the LUN/target.
Ok, so – what’s the net?
If you are SURE that you have a smoking hole – ergo, the same circumstances that would make you hit the “run” button on a Site Recovery Manager recovery plan – you’re declaring a disaster. At that point, you want the remaining side to become read/write.
With VPLEX, you promote the remote copy to read/write by issuing the following command for a given device:
"device resume-link-down –r <device> -c <cluster to unsuspend on> --force"
You can then successfully boot the VMs on the surviving side manually (or via a script – see the sketch below). They will have an RPO of zero (it is a sync copy), and you’re no worse/better off from an RPO standpoint than in any of the solutions from Post 1. BUT – the solution complexity is much higher than the basic Site Recovery Manager solution, making for a worse RTO (not to mention the fact that you give up the simple test process).
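(To make the storage-side steps concrete, here’s a rough sketch – this is NOT a polished runbook, the device/cluster values are placeholders, and the VPLEX command is exactly the one quoted above:)

# On the surviving VPLEX cluster (VPLEX CLI):
device resume-link-down -r <device> -c <surviving cluster> --force

# Then, on each surviving ESX host (service console), pick up the now-writable datastore:
esxcfg-rescan vmhba2     # adapter name is a placeholder
vmkfstools -V            # refresh VMFS volume discovery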
Do you see now why I highly recommend that for most customers the straight-up SRM approach is the way to go?
Note that VPLEX today doesn’t have an SRA for Site Recovery Manager. We’re trying to get one done, but it’s hard to prioritize when the thing that people WANT VPLEX for (non-disruptive site mobility) is mutually exclusive with Site Recovery Manager.
Making it all simple (and these two below are a binary choice):
- If non-disruptive site mobility is your priority, and you are willing to accept the complexity in these blog posts – you owe it to yourself to check out EMC VPLEX and its peers, and how they work with VM HA or scripting.
- If your priority is a simple, feature-rich, straightforward solution that covers a broad range of Disaster Recovery/Restart scenarios, and you are OK with Disaster Avoidance (VM mobility between datacenters in advance of a disaster) being disruptive and “restart-oriented” – you owe it to yourself to check out VMware Site Recovery Manager and EMC array replication.
While Post 3 in this series deals with the longer-term “ideas we’re working on”, there’s nearer-term stuff that is material.
A future version of EMC VPLEX GeoSynchrony changes the default partition behavior in a vSphere environment, such that the ESX host sees the datastore marked as “unavailable” (all paths dead) as soon as any IO is issued, and therefore:
- making that ESX host not a candidate for VM HA restart. This just means that multiple VM HA restarts aren’t attempted, and it only matters in the case where the VPLEX inter-cluster communication is broken, but the ESX hosts can all still see each other.
- is expected to crash the VM more immediately. To be explicit, there is a corner case where the VM never issues an I/O, and so the path is not marked as dead. This is one example (of many) of why we call the EMC PowerPath/VE “proactive path death” behavior (a very poorly named feature) a “feature” :-) Continually testing path state, even when there isn’t an outstanding I/O, is a good idea.
If I was a smart man, I would simply stop here, and ask “any questions?”.
…
…
BUT, I’M NOT, SO… IF YOU ARE LOOKING AT VPLEX, I’M GOING TO TRY PUTTING IT ANOTHER WAY (this is an engineer’s way of stating the various failure states today – don’t be intimidated, I’m just being explicit):
Behavior today:
- On VPLEX total failure (VPLEX has no SPOF, so would require multiple element failures) at the preferred site, or COMPLETE site disaster at the preferred site:
- VMs running on ESX hosts on the non-preferred side will stop responding in “non-deterministic ways” after a period of time, as the device goes into the “All Paths Dead” mode (remember – being DOWN is the desired behavior here – worst scenario is split brain)
- VMs on the preferred site will immediately stop responding in “non-deterministic ways” (their storage disappeared) if there was a catastrophic storage issue – and of course, if the site is a smoking hole, the VMs are not running on the preferred site at all, as the ESX hosts are all dead.
- The VMs will try to restart on a given ESX node, and since the selected ESX host for VM restart will always be on the non-preferred side, VM HA will try 5 times on a variety of hosts in the cluster. Since all the hosts attempted are on the non-preferred side, it will fail.
- After ensuring that the preferred site is indeed completely failed or completely unavailable, the administrator can enable IO on the non-preferred side with the simple command: "device resume-link-down -r <device> -c <cluster to unsuspend on> --force"
- If the vCenter instance was running as a VM, and was running on an ESX host supported by the non-preferred side, it would need to be restarted on the surviving side (manually, using the vSphere client to connect to a surviving ESX host directly and restarting the vCenter VM). We often recommend considering running vCenter on non-distributed (i.e. local-only) storage and using vCenter Heartbeat to simplify this.
- For the remaining VMs (originally on the preferred side), the admin would then simply right click on the VMs and say “restart”.
- The whole process could be scripted (remember though, scripts are complex, and error prone – hence my general recommendation that for most customers, SRM is the best answer)
- On VPLEX partition + ESX cluster OK case:
- VMs running on ESX hosts on the non-preferred side will stop responding in “non-deterministic ways” after a period of time, as the device goes into the “All Paths Dead” mode
- If you have guest-level monitoring on, VMs will try to restart on a given ESX node, and if the selected ESX host for VM restart is on the non-preferred side, it will try 5 times on a variety of hosts in the cluster. If all the hosts attempted are on the non-preferred side, it will fail.
- VMs on Distributed Virtual Volumes on the preferred side continue to work fine.
- To formally disconnect the storage on the non-preferred side, the admin could go in and remove the virtual volume from the cluster’s storage view, then rescan across the cluster – which would mark all the VMs on that datastore as “unavailable”. The datastore not being present would stop any automated restart from attempting those ESX hosts in the future, until VPLEX inter-cluster communication is restored.
- If the vCenter instance was running as a VM, and was running on an ESX host supported by the non-preferred side, it would need to be restarted on the surviving side (manually, using the vSphere client to connect to a surviving ESX host directly and restarting the vCenter VM). We often recommend considering running vCenter on non-distributed (i.e. local-only) storage and using vCenter Heartbeat to simplify this.
- For the remaining VMs (originally on the non-preferred side), the admin would then simply right-click on the VMs and say “restart”.
- The whole process could be scripted (remember though, scripts are complex, and error prone – hence my general recommendation that for most customers, SRM is the best answer)
- On VPLEX partition + ESX cluster partition case:
- VMs running on ESX hosts on the non-preferred side will stop responding in “non-deterministic ways” after a period of time, as the device goes into the “All Paths Dead” mode
- VMs will try to restart on a given ESX node, and if the selected ESX host for VM restart is on the non-preferred side, VM HA will try 5 times on a variety of hosts in the cluster. If all the hosts attempted are on the non-preferred side, it will fail.
- VMs will ALSO try to restart on a given ESX node, and if the selected ESX host for VM restart is on the preferred side, the restart will succeed.
- VMs originally on Distributed Virtual Volumes on the preferred side continue to work fine, so long as the ESX hosts can communicate with the other nodes on their side (and therefore don’t trigger the host isolation response).
- To formally disconnect the storage on the non-preferred side, the admin could go in and remove the virtual volume from the cluster’s storage view, then rescan across the cluster twice (a vSphere thing) – which would mark all the datastores as “unavailable”, though this doesn’t really matter.
- If the ESX hosts can’t communicate with the remaining nodes on their side, the VM HA host isolation response kicks in, and one of the remaining nodes on that side will win and restart the VM.
- If the vCenter instance was running as a VM, and was running on an ESX host supported by the non-preferred side, it would need to be restarted on the surviving side (manually, using the vSphere client to connect to a surviving ESX host directly and restarting the vCenter VM). We often recommend considering running vCenter on non-distributed (i.e. local-only) storage and using vCenter Heartbeat to simplify this.
- For the remaining VMs (originally on the non-preferred side), the admin would then simply right-click on the VMs and say “restart”.
- The whole process could be scripted – a rough sketch follows below (remember though, scripts are complex and error-prone – hence my general recommendation that for most customers, SRM is the best answer)
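To give you a feel for what “scripted” really boils down to here (a very rough sketch only – paths, datastore and VM names are all placeholders, and I’m using the classic ESX 4.x service console commands), the core of it is connecting to a surviving ESX host and registering/powering on VMs directly, since vCenter itself may be one of the victims:

# On a surviving ESX host's service console (vCenter may be down at this point):
vmware-cmd -l                                                           # list the VMs already registered on this host
vmware-cmd -s register /vmfs/volumes/<datastore>/vcenter/vcenter.vmx    # re-register the vCenter VM if it isn't listed
vmware-cmd /vmfs/volumes/<datastore>/vcenter/vcenter.vmx start          # power it on
# ...repeat the register/start pair for the other affected VMs, then finish the cleanup from vCenter once it's back up.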
Phew… Thanks for sticking with me, and hope this is useful to folks.
As always, great article. I'd been working on this topic with the HP SVSP and you're right on. Everyone wants to talk about the magic you get, but what happens in the failure scenarios is much more interesting and less understood. Thanks for shining a light into the dark corners -- even though there may be monsters!
Just a note of a typo ...
"There are two core thing to understand about stretched VM HA clusters:"
should be 'things'
feel free to apply appropriate moderation to this comment :D
Posted by: Doug B | December 17, 2010 at 10:29 AM
spotting typos is not difficult ;)
* VMs on the preferred site will immediately will *
should be
VMs on the preferred site will immediately
regards.
very well written, btw!
Posted by: rKahn | April 16, 2014 at 09:57 AM
another typo ..
*VMs on Distributed Virtual Volume the preferred side continue to work fine.*
should be
VMs on Distributed Virtual Volumes on the preferred side continue to work fine.
regards
Posted by: rKahn | April 16, 2014 at 10:00 AM