DEC 15th, 2010 – updated a bit based on feedback.
Interest from customers on EMC VPLEX is very, very hot, and that’s keeping us very, very busy. Interest highlights that the core idea of active/active geographically dispersed transactional storage (“access anywhere”) is something people gravitate towards as they dream their next “active/active” virtualized datacenter dream.
The most common use case I see (admittedly, this will be biased towards things my team and I support) is stretched vSphere clusters.
Geographically stretched vSphere clusters are a complex topic, and sometimes the customer doesn’t understand what they are asking for specifically, and sometimes we (the vendor community) don’t do a good job explaining the complexities involved.
Inter-site vMotion also makes for a SUPER COOL demo - and everyone loves a cool demo :-) Couple that with the first paragraph of this post (the core idea resonating), and you have every sales person’s dream – customers who start reaching for their checkbooks…
Put that all together – stir in new technology and new ideas that people don’t fully understand, and you have a recipe for bad stuff to happen :-)
Ok – most of the stuff from my article WAY BACK WHEN (still worth a read – wow 2.5 years later) around geographically stretched clusters still holds true.
I keep repeating myself (look in the QnA from this first big VPLEX post) – but it bears repeating: stretched vSphere clusters are possible, but you REALLY need to know what you’re doing.
At VMworld Copenhagen, I met with the VMware High Availability team (responsible for VM HA, VDR, SRM) over dinner. We discussed at length the confusion that exists in the market between disaster recovery, disaster avoidance, and the common refrain of “I could just script that”. Then we moved on to a more productive discussion – realizing that we collectively (VMware + the storage vendor community) weren’t helping by not being explicit enough about the solution pros/cons, as well as the longer-term direction EMC is taking.
Here’s my first stab at trying to help and clarify – but it’s a long topic, so I’ve broken it into 3 posts.
In Part I of this post: I’m going to try to summarize the 4 major categories of Disaster Avoidance (“I know the hurricane is coming”) and Disaster Recovery (“oh oh, the hurricane hit, and my datacenter is dead”) and do it in a way that applies to multiple vendor solutions in this space.
This is important – I’ve found that customers mentally seem to just “blend these all together” (heck, people mentally blur VM HA and vMotion!), but they are VERY distinct choices. Like all design choices, these 4 options always involve some trade-off (doesn’t everything!). If you’re interested, I’m going to outline them in the simplest way I can, and I’ll highlight the pros/cons.
In Part II of this post: I’m going to explain what you need to know IF you pick the stretched vSphere cluster using VM HA using EMC VPLEX, with a particular focus on the “partition” state. Much of this will apply broadly to the category. I don’t claim to be an authority on other vendors’ storage platforms – but many of the considerations are intrinsic to vSphere.
In Part III of this post: I’m going to explain what the next couple of years look like in this space (both product development and R&D). As the VMware team and I discussed that dinner, we need to show people the path we’re working on if we want them to understand our advice today, tomorrow, and in the future.
It’s a nerdy (and long) journey – but if you’re in the market for vSphere in two or more sites, and looking for vMotion between sites AND disaster recovery, read on. I’ll try my best to keep it focused – it’s not a “sound bite” topic.
Ok – the 4 major “solution categories” around HA, DR, and workload mobility with vSphere…
Remember that this is a discussion around trying to achieve a COMBINATION of disaster avoidance and disaster recovery.
Let’s quickly start with some critical semantics:
Disaster avoidance is fundamentally about non-disruptive workload mobility. vMotion today helps you avoid a disaster where you know in advance that a server is going to need maintenance. Likewise, Storage vMotion does the same thing for the storage subsystem. If you can do these across datacenters, you can deal with “datacenter wide” disasters. This also enables (in theory) workload balancing across datacenters.
Disaster recovery is by definition a disruptive, unplanned event, which focuses on:
- How long it takes to restart, aka Recovery Time Objective (RTO)
- The state of the system post-restart and acceptable data loss. This can be a function of the application SLA, or a core limit of your infrastructure. Ergo, the amount of data loss can be zero if the replica is local (or remote but synchronous), or greater than zero based on speed-of-light limits where you are recovering at a distant remote site. This is also known as the Recovery Point Objective (RPO)
- Simplicity/repeatability of the disaster restart process. While “KISS” is always a good principle, when you’re talking about disaster, and the poop has hit the fan and you may not have all your brightest people right there when you need them – simplicity of disaster restart is VERY important.
- There are other factors: sometimes the RPO models or recoverability requirements entail complex multi-site scenarios; there’s the question of application-level restart behavior, and much more.
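The speed-of-light point above is easy to put numbers on. Light in fiber covers roughly 200,000 km/s (about 5 µs per km one way), and every synchronous write pays at least one full round trip. A rough back-of-envelope sketch (the route factor is my assumption, padding for non-straight-line paths and equipment hops):

```python
def fiber_rtt_ms(distance_km, route_factor=1.5):
    """Estimated round-trip latency over fiber between two sites.
    route_factor pads the straight-line distance (an assumption)."""
    one_way_seconds = (distance_km * route_factor) / 200_000
    return 2 * one_way_seconds * 1000

print(round(fiber_rtt_ms(100), 2))    # 1.5  -> workable for sync replication
print(round(fiber_rtt_ms(1000), 2))   # 15.0 -> sync is usually a non-starter
```

This is why “zero RPO” and “long distance” pull against each other: past a few hundred kilometers, the added per-write latency of synchronous replication becomes unacceptable for most transactional workloads.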
It is important to realize that VM HA is an example of a recovery technology. VM HA has been designed from the ground up for a very specific scenario - a SERVER failing. In a sense, if you consider a host failure as a very “small scale disaster”, you can think of VM HA as “in the class of” host-level disaster restart technologies. Site Recovery Manager is a disaster restart technology – but one designed from the ground up for what’s more widely considered a “disaster”– a SITE failing.
So – in MY opinion, and in my experience with customers, the variations on “geographically stretched vSphere use cases” are an attempt to have your cake and eat it too – site-to-site disaster avoidance (move a workload between datacenters) and simple disaster recovery.
It’s been noted that there are other aspects to this:
- SRM can be used for non-automatic (in the sense that an operator invokes a recovery plan, which is then very automated with more sophisticated workflow handling) mass VM mobility (though notably always with the VMs being restarted) whether servers have failed or not – consider the case of moving VMs en masse from one datacenter to another.
- With SRM, you can have all sorts of recovery plans; VM HA behavior, on the other hand, works one way, consistently
- VM HA on the other hand, is totally automated – which of course has pros and cons itself.
- There are MASSIVE network considerations (for example, if you can’t have L2 adjacency between the sites, much of this falls apart)
Here are the solution types I see – and I will try to express them in vendor-generic ways.
Solution type 1: Stretched Single cluster with “magic geo-dispersed storage” (aka the category VPLEX is in)
BTW – I don’t mean to imply with “magic” that it’s a cure-all; as you’ll see, there are as many cons as pros. Rather, this type of storage is a funky new category whose behavior is different than what most are used to.
- It looks like this:
- for disaster avoidance you gain workload mobility (vMotion between sites) – sweet! This is done by simply doing VMotion between ESX hosts in the cluster (“intra-cluster vMotion”)
- One upside of this approach is that intra-cluster vMotions can be highly parallelized – and more and more with each passing vSphere release. With vSphere 4.1 it’s up to 4 per host/128 per datastore if using 1GbE, and 8 per host/128 per datastore if using 10GbE (and that’s before you tweak settings for more, and shoot yourself in the foot :-)
- You need to meet the vMotion network requirements (this is a topic long enough for other posts – and there are others out there like this one – both on what the requirements are, interesting variations, and what works vs. what’s the support state)
- for disaster recovery VM HA offers a “poor man’s” disaster recovery capability with some scripting and human intervention:
- you must forgo the option for more robust disaster recovery solution (VMware Site Recovery Manager) because SRM currently depends on a given structure (two vCenter instances in a paired model) which in turn negates the value of “non-disruptive workload mobility” (since vMotion is inter and intra VMware cluster, but always in a single vCenter domain). This is an important idea to internalize: VMware SRM is mutually exclusive with inter-site vMotion.
- The VM HA (or scripting – but most customers that are looking at this option prefer it over option 2 because they think they can “just use VM HA!”) approach translates to the same Recovery Point Objective (RPO) as a sync replica, but a LONGER Recovery Time Objective (RTO) in disaster situations (I guarantee a script will never be as tight in practice and over time as SRM would be), and loss of simple/automated testing. People who take DR seriously know simplicity, automation and testing are very valuable.
- You MUST plan a recovery model for vCenter (unless you’re willing to lose a ton like DRS pool configs, dvSwitch info and more - ouch). Common options are to virtualize it and replicate along with the VMs or use vCenter Heartbeat. When using the “vCenter = VM on array” approach, you should consider using consistency tech at the array level to ensure that vCenter is consistent with your data copy (i.e. all the VMDKs).
- The storage vendor often points out that you don’t need to pay for SRM, but often neglects to mention that you have to spend the similar (or more) on the storage piece of the design. In my experience, it’s about a wash.
- You need to think about all sorts of VM HA internals. Remember – VM HA was not fundamentally designed with this use case in mind. Future VM HA versions may take this category of cluster into consideration, but for now the fact that it doesn’t manifests itself in all sorts of weird ways. As examples:
- This is why this category of solutions, when built right, always specify 8 ESX hosts as a maximum today. This is a number (4 on each side) where you are guaranteed to have the necessary VM HA primaries on each side in the case of a partition scenario. There is no VMware-supported way to control VM HA primary selection, and there are very good reasons to not use non-supported approaches (and yes, I know something specific here).
- vSphere has no idea about the “sidedness” of the storage, and therefore doesn’t treat initial VM placement or VM HA response differently based on “side”. In other words, if the behavior in various partition states is non-symmetrical (and it rarely is symmetrical), all those various failure modes result in different behavior (I’ll discuss the VPLEX example in the second post, but having looked at all types of solutions in this space, they all behave differently in different failure modes).
- The DRS Host Affinity rules can help, but again aren’t designed fundamentally for this use case. As an example (and Scott talked about this at VMworld 2010 in TA8101, which you can read here):
- DRS Host Affinity rules, when set to “Hard (Required)”, govern: a) initial VM placement on start by DRS; b) automated vMotions initiated by DRS; and c) VM HA failover. Sounds great, right? A foolish storage-centric vendor who doesn’t understand VMware and is just trying to sell you something (this could be us, or anyone) would say YES! Well, then consider….
- DRS Host Affinity rules do not govern: a) primary/secondary node selection (hence the weird “8 node” thing discussed earlier); b) VM HA admission control.

That second one is a doozy. It means that you either: i) turn admission control off – generally a bad idea – and “just make sure” that you run your cluster at 50% utilization (or at whatever level of oversubscription of the other site you are willing to accept in case of disaster); or ii) leave admission control running, but make sure that in a declared disaster you disable admission control (otherwise VM HA will correctly NOT restart VMs). Of course, at that point you’re manually starting the VMs anyway – because by the time a human changes the VM HA option, you have a whole whackload of failed VMs. Notice that this means the original assumption of “VM HA will take care of restart” is only true if you accept running at <50% utilization on your hosts. OUCH. This is the ultimate example of: just because it worked once in a carefully controlled set of circumstances (often how vendors construct PoCs) doesn’t mean it’s always going to work.

Update: It’s been pointed out that I’m in error here – VM HA admission control only governs starting a VM, not restarting one to recover from an ESX host failure (in other words, admission control can stop you from powering on a VM that exceeds the limits, but it won’t stop VMs from restarting if the cluster has more ESX failures). I still submit that VM HA isn’t a great site disaster restart solution (particularly when compared with SRM). The VM HA team shouldn’t feel bad about that – it’s a host failure restart solution, not a site restart solution. As examples (and there are more), the VM HA restart sequencing is very primitive relative to SRM (so dependencies aren’t handled).
- The big upside is an “active-active” datacenter design where workloads can move non-disruptively, but the price you pay is for a less robust (not due to any given tech, but how the solution strings together) disaster recovery solution. Frankly, I would only personally do this if the following statement is acceptable:
“I’m going to AIM to automate restart, but I’m OK with having the data in case of disaster and assuming totally manual restart”
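The “8 ESX hosts maximum” rule above is really just pigeonhole arithmetic: vSphere 4.x VM HA elects up to five primary nodes, and with four hosts per site no selection of five primaries can land entirely on one side. A brute-force sanity check (the host counts here are illustrations, not official support statements):

```python
from itertools import combinations

def partition_safe(hosts_per_site, primaries=5):
    """True if EVERY possible choice of `primaries` primary nodes
    leaves at least one primary on each side of a site partition."""
    total = 2 * hosts_per_site
    site_a = set(range(hosts_per_site))        # hosts 0..n-1 live on site A
    for chosen in combinations(range(total), primaries):
        on_a = sum(1 for h in chosen if h in site_a)
        if on_a == 0 or on_a == primaries:     # all primaries on one side
            return False
    return True

print(partition_safe(4))   # True  -> 8 hosts: any partition splits the primaries
print(partition_safe(5))   # False -> 10 hosts: all 5 primaries can share one site
```

At 10+ hosts there is at least one primary selection that strands every primary on one site, and since there is no supported way to control primary placement, the 8-host limit is the only configuration that is safe by construction.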
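Admission control aside, the capacity arithmetic behind the “<50% utilization” point is simple: if either site must be able to restart the other site’s entire load, neither site can run above half of its own capacity (or whatever oversubscription you’ll tolerate). A toy check, with made-up numbers in arbitrary capacity units:

```python
def survives_site_failure(site_a_load, site_b_load, site_capacity):
    """Can one site absorb BOTH sites' load after the other site dies?
    Loads and capacity are in any consistent unit (GHz, GB, ...)."""
    return site_a_load + site_b_load <= site_capacity

# Two identical 100-unit sites, each at 50% -> a failover fits either way
print(survives_site_failure(50, 50, 100))   # True
# Push one site to 70% and a full-site failover no longer fits
print(survives_site_failure(70, 50, 100))   # False
```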
Solution type 2: Multiple vSphere Clusters with “magic geo-dispersed storage” (aka the category VPLEX is in)
- It looks like this:
- for disaster avoidance you gain workload mobility (vMotion between sites) – sweet! This is done by simply doing a vMotion between vSphere clusters (“inter-cluster vMotion”). People often forget that this is fine – you can do inter-cluster vMotions – the “domain” of a vMotion is a single vCenter domain.
- One gotcha is that whereas intra-cluster vMotion can be parallelized, inter-cluster vMotions are still serial. It involves additional calls into vCenter during the process – so this is a hard limit (for now).
- Additional note (thank you VMware team): There’s also a loss of cluster properties for a VM when you vMotion between clusters – HA restart priority, DRS settings, etc.
- You need to meet the vMotion network requirements (this is a topic long enough for other posts – and there are others out there like this one – both on what the requirements are, interesting variations, and what works vs. what’s the support state)
- for disaster recovery your only choice is scripting.
- To reiterate – obviously VM HA is NOT an option (because VM HA is a cluster-level function), so scripting here is mandatory. Sounds like a knock against Solution 2 (multiple clusters) vs. Solution 1 (single cluster), until you think about the point made in the discussion about Solution 1 (single cluster) – VM HA is only really dependable as an automated restart mechanism if you force yourself to run at a really low utilization, and all the failure modes of your storage are totally symmetrical across all failure cases (which they never are). Ergo – in practice, you’re really likely in the same place with option 1 or 2 – manual or scripted restart.
- you must forgo the option for more robust disaster recovery solution (VMware Site Recovery Manager) because SRM currently depends on a given structure (two vCenter instances in a paired model) which in turn negates the value of “non-disruptive workload mobility” (since vMotion is inter and intra VMware cluster, but always in a single vCenter domain). This translates to the same Recovery Point Objective (RPO) as a sync replica, but a LONGER Recovery Time Objective (RTO) in disaster situations than SRM.
- Why do I say the RTO will be longer than SRM? After all – in both cases, the long pole in the restart process won’t be the script, right? It will be the VM restart (and any customization needed) – the same as SRM… Well, the reason I’ll say the RTO will be longer is I guarantee a script will never be as tight and accurate in practice. That’s particularly true over time as the configuration drifts. That’s putting aside the loss of simple/automated testing. I meet DR newbies all the time who say “I’ll just build a script!”. Yikes. Those same people then have disasters and run into an epic fail when they realize they haven’t updated the script in eons. People who take DR seriously know that automation and testing are very valuable. If you can commit to continuously updating your script and doing regular testing, then you’re serious enough to be legitimate when you say “I’ll just build a script”.
- You MUST plan a recovery model for vCenter (unless you’re willing to lose a ton like DRS pool configs, dvSwitch info and more - ouch). Common options are to virtualize it and replicate along with the VMs or use vCenter Heartbeat. When using the “vCenter = VM on array” approach, you should consider using consistency tech at the array level to ensure that vCenter is consistent with your data copy (i.e. all the VMDKs).
- The storage vendor often points out that you don’t need to pay for SRM, but often neglects to mention that you have to spend the similar (or more) on the storage piece of the design. In my experience, it’s about a wash.
- You need to think about all the startup dependencies of all your VMs.
- The big upside is an “active-active” datacenter design where workloads can move non-disruptively, but the price you pay is for a less robust (not due to any given tech, but how the solution strings together) disaster recovery solution. Frankly, I would only personally do this if the following statement is acceptable:
“I’m going to AIM to automate restart, but I’m OK with having the data in case of disaster and assuming totally manual restart”
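The “startup dependencies of all your VMs” point above is, at its core, an ordering problem – and it’s the part of a hand-rolled DR script that rots fastest. A minimal sketch (the VM names are made up for illustration) of deriving restart order from a declared dependency map, rather than hard-coding a sequence:

```python
from graphlib import TopologicalSorter  # stdlib in Python 3.9+

# Declared dependencies: VM -> set of VMs that must be up first.
# Keeping this map in one place is exactly what a DR script has to
# keep honest as the environment drifts over time.
deps = {
    "ad01":  set(),
    "sql01": {"ad01"},
    "app01": {"sql01", "ad01"},
    "web01": {"app01"},
}

# A valid restart order: each VM appears only after its dependencies.
restart_order = list(TopologicalSorter(deps).static_order())
print(restart_order)
```

This is the kind of sequencing SRM’s recovery plans give you out of the box (priority tiers, per-VM ordering); a script only stays equivalent if someone keeps the dependency map current and tests it regularly.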
Solution type 3: Two vCenter instances, each with their own cluster with “any ol’ replicated storage that supports VMware SRM”
- It looks like this:
- Note: While today, the replication is done exclusively at the array level (for VMFS, NFS and RDMs) when using SRM, VMware has previewed that in the future they may support native host-based replication models for customers who are not a fit for array-based approaches.
- Note: simple fan-in models are possible, but SRM still expects a basic paired “protected site/recovery site” model.
- for disaster avoidance you have disruptive VM mobility. Basically, copy the VM, deregister, and register (and update your SRM recovery plan). It can’t become non-disruptive until inter-vCenter vMotion is possible – which isn’t going to be for a while.
- for disaster recovery you can choose VMware Site Recovery Manager, whose RPO is as low as your storage can make it (sync, async or continuous), and whose RTO is as good as it gets (it will be gated by the restart sequencing and number of VMs). It also involves no complex scripting (which is never just a one-time task), and offers very rich testing capabilities.
- You don’t need to worry about a complex process of replicating vCenter in a manner that is consistent with the RPO of the VMDKs.
- You lose the coolness of active-active datacenters, but on the other hand, you get the best RPO/RTO across a broad SET of use cases.
- Perhaps the biggest upside is that SRM’s programmatic method for handling IP address changes means that the complex network dependencies that exist in the other scenarios largely go away.
Solution type 4: N sites, N vCenters, and complex multi-site RTO/RPO requirements
- These, well it’s impossible to say “what it looks like”. They come in every funky scenario under the sun – some being cascaded multisite, some being complex fan-in/out models. Here’s one example:
- Ok – these are hairy, and may seem “out there” but they come up all the time at larger customers. Often, the customer has a “bunker site”, where they don’t have the ability to support the datacenter and apps they are trying to protect, but there is a strict requirement for a zero data-loss copy of the data. So – they replicate the data to that site synchronously, and then have a second cascaded leg with a different RPO.
- for disaster avoidance:
- In GENERAL: you have disruptive VM mobility. Basically, take the VM down, copy the VM to a new target datastore (svmotion is not an option), deregister, and register (and update your DR script). Ouch. Zero chance (ergo don’t bother asking :-) of being non-disruptive until inter-vCenter vMotion is possible – which isn’t going to be for a while.
- A slight evolution of this is when VPLEX Global (multiple sites, sync/async) becomes available (public date not stated yet, but it will likely be well before inter-vCenter vMotion); it will be an option to do this operation much faster (though still disruptive, but only for a fraction of the time) – just take the VM down, deregister/register.
- One VARIANT today that can deliver non-disruptive VM mobility is if you deploy this as a variant of solution type 2 – where you use an EMC VPLEX distributed virtual volume (sync today, async tomorrow) over the sync distances, and a traditional array replica of one side of the distributed virtual volume to the cascaded site. We’re working on these with some customers, but they’re very rare.
- for disaster recovery – outside the basic 1:N and N:1 capabilities of SRM these days, SRM is out. As soon as the customer has a cascading replication/recovery model, forget force-fitting SRM. That doesn’t mean you can’t do DR, but you need to do it the old-fashioned scripting way (note that my advice above applies – I would ask any customer to really understand what maintaining and testing their script over time will entail). My advice – rather than trying to force-fit SRM into something it’s not, really test with yourself/the customer whether the cascading replication requirement is a REAL requirement, or a “nice to have”.
- You MUST plan a recovery model for vCenter (unless you’re willing to lose a ton like DRS pool configs, dvSwitch info and more - ouch). Common options are to virtualize it and replicate along with the VMs or use vCenter Heartbeat. When using the “vCenter = VM on array” approach, you should consider using consistency tech at the array level to ensure that vCenter is consistent with your data copy (i.e. all the VMDKs).
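For these cascaded topologies, a useful back-of-envelope model (my simplification, not a product formula) is that the worst-case data loss at the end of the chain is roughly the sum of each leg’s lag, since each leg can be behind independently:

```python
def worst_case_rpo_seconds(leg_rpos):
    """Cascaded replication: each leg can lag independently, so the
    far site's worst-case staleness is approximately the sum of the
    per-leg RPOs (a rough model, ignoring consistency-group details)."""
    return sum(leg_rpos)

# Bunker example: sync leg into the bunker (RPO 0), then an async
# leg cycling every 15 minutes out to the distant site.
print(worst_case_rpo_seconds([0, 15 * 60]))   # 900
```

This is also why it’s worth pressure-testing whether the cascade is a real requirement: every extra async leg adds its full cycle time to the far site’s worst-case RPO.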
Phew. I’m summarizing these four options in this table:
I think “Type 3” is generally the best choice for customers. If you ABSOLUTELY must have non-disruptive workload mobility between datacenters, you immediately start trading off for more complexity in DR. At that point, personally, I tend to recommend Type 2.
So – perhaps after all that, it might be more clear now why the VMware team and I cringe a little every time we see someone entertaining a stretched cluster config. In itself, the goal of a geographically stretched cluster is OK – but that same customer is always thinking “hey, this will be awesome and cheap, I’ll just use VM HA for restart and I’ll get non-disruptive datacenter mobility!” That’s a case of us as the vendor community oversimplifying things a little too much.
That ALL SAID…
We are getting many requests for stretched VMware clusters (Type 1 in the table) using EMC VPLEX – which suggests that non-disruptive datacenter workload mobility is more than sexy, it’s important for customers (will do a survey on this later).
I would argue (though I’m sure I’m biased) that of all the “geographically dispersed” storage solutions, VPLEX is head and shoulders above the rest, but this isn’t the place/time for that.
As I’ve recently pointed out, VPLEX is officially supported with a stretched vSphere 4.x cluster, using VM HA (solution type 1 in the list above). Solution type 2 is also explicitly supported. You can read the official KB on this solution here.
In the next article in this series, I’m going to outline EXACTLY what you need to understand about the period we call “partition” – where the two EMC VPLEX clusters that support a distributed virtual volume can’t talk to one another.
Then, I’ll describe the direction we’re aiming for over the next few years. VMware and EMC aren’t staying still on this front – lots of coolness coming.
Another brilliantly informative post Chad, giving great explanations. I have seen a huge increase in Business Continuity solutions against Disaster Recovery over the last 12-18 months as clients realize the flexibility virtualisation brings to the datacentre. VPlex GEO will be the next "sexy" option as you say and the term "Disaster Avoidance" will sit well with the enterprise.
Posted by: Paul | November 30, 2010 at 02:51 AM
Chad, I thank you for taking the time to drill into the detail on these options. You are right that a lot of the customers I talk to get these options confused, and start thinking they can mix vMotion and SRM for the same VMs between the same sites. I look forward to the 2 follow-on posts on this subject. (Disclosure - I work for EMC)
Posted by: Vince Westin | November 30, 2010 at 06:09 AM
Excellent article, sir! I appreciate the depth of your analysis and the vendor neutrality. You provide information and call out where the technology is still a bit ugly. :)
Posted by: Doug B | November 30, 2010 at 02:34 PM
Hi Chad,
you said and I agree : "VM HA will take care of restart” is only true if you accept running at <50% utilization on your hosts. "
but when using SRM, correct me if I'm wrong, you also need to keep enough resources (cpu, ram etc...) available in order to power up all your VMs on your secondary site? So with SRM you also need to run active-active with <50% utilization on your hosts on both sides? thanks
Posted by: Stanley | December 01, 2010 at 02:55 AM
Good article. Pretty much what I'm finding within the community.
People confusing what Disaster Recovery (DR) really means, what are the relevant and correct technologies, what really needs to happen when a disaster occurs ( bearing in mind your state of mind during a real disaster, you don't want to be doing too many manual steps here). The ability for a DR team to easily test this is also very important.
Then technologies Which can give you Disaster Avoidance (DA), and again which technologies and configurations work, and then the implications.
And a lot of the time, all the technologies you mentioned within your post are jumbled and confused, and people assume if you get DA, you also automatically get DR, and it's all automated, and no implications for their RPOs and RTOs.
Posted by: Faisal | December 02, 2010 at 07:50 AM
Good work Chad, am very much looking forward to part two.
I know that this is already an ambitious content-fest in three parts, but...
While you are talking about DR/DA/SRM/VPLEX and explaining the important "this with this yes, but this with that not a chance." relationships which exist between these products, Is there any chance you might be persuaded to throw vCD into the mix too?
It's a big ask, and you might risk wedging in so much content that the posts start to leak slightly at the sides, but I can't be the only one in reader land wondering if SRM and vCD are mutually exclusive and where VPLEX might sit in a cloud DR/DA stretchy clustered type thing.
Posted by: Ian Sutton | December 03, 2010 at 03:05 PM
First off, great post. This topic really deserves some discussion and will certainly be more and more of a focus over the next few years for the industry, customers, etc.
I know this could only fit certain use cases, but what about using VMware Fault Tolerance (FT) across sites (assuming L2 adjacency, extremely low latency, etc.)? There are obvious limitations today with FT such as single vCPU, no thin disks, etc. but I anticipate those will lessen with time as the technology matures in vSphere and Intel processor architecture. There are obvious pros/cons to what FT provides vs the other solutions but it sure seems like a good fit (if and when supported) for certain uses cases that would possibly be much less complex than others if willing to live with some trade offs.
Posted by: udubplate | December 27, 2010 at 10:08 PM
@Stanley
The requirement for SRM to have equal capacity available at the recovery side isn't a hard and fast necessity. It's common for many organizations to have a BC plan that accepts non-critical applications as either running in a degraded state or having a delayed recovery. When I say delayed, I mean weeks or even months. Planning to acquire hardware after a disaster for tier three apps can be an acceptable strategy... remember there's a huge difference between recovering the HR Benefits -> Discounted Movie Tickets 30 days after the disaster versus losing the application entirely. I've even had clients who plan to "cycle" their lower tier apps on and off until additional capacity can be implemented.
And yes... I work for EMC as well.
Posted by: Rob Nourse | January 10, 2011 at 05:11 PM
Chad - thanks for this whole series. I, and my peers, have referred to this about 20 times in the past two months for various customers. It has become one of our standard customer collateral posts as we enter into D/R and H/A discussions to set the framework. Personally I've reread the posts about 15 times, making sure I catch all the nuances.
I posted a related blog today around trends and customer requests I am seeing in this space. Thanks for helping make my customer conversations more fruitful. That post is at http://bit.ly/dPEGd1
(Disclaimer: I work for a EMC, VMware and Cisco partner, and formerly worked at EMC).
Posted by: twitter.com/jgcole01 | January 28, 2011 at 09:21 PM
Chad, what's new about Disaster Recovery for vSphere 5 with SRM 5, VPLEX and so on?
Posted by: Vlad | August 19, 2011 at 04:51 AM
Saved my night
Posted by: Nicolas Solop | January 11, 2012 at 10:28 PM