I've been working with 10 joint VMware/EMC customers this week in NY, NJ and Houston (phew!), and was in Australia the week before last where there were 2 more. Out of those 12, 4 asked me questions about the applicability of "stretching" their ESX clusters across geographic distances - that's 33%, and absolutely above the "man, I should write a blog on the topic" threshold.
So, what are we talking about?
A stretched cluster is the practice of having ESX member servers in a cluster that are geographically separated. The reason this is generally done is to provide the ability to dynamically move workloads from one datacenter to another. Often, the customer is also considering it for disaster recovery purposes ("I'll just VMotion in case of a disaster"). Can this be done? ABSOLUTELY - but it shouldn't be considered lightly.
I'll get back to this - but this is usually promoted by storage vendors schlepping their goods with little to no consideration of the VMware requirements, thinking only of their own subject domain ("hey, I can have storage at two sites, that must mean I could create a stretched ESX cluster - oh, the customer would like that!"). I hate it when EMC is the instigator (as much as I fight it, it does happen), but more often than not it's competitors trying to position something they view as unique in the VMware context. This is bad vendor behaviour (and incorrect - many of us can do this) - it doesn't think about the actual use case.
Ok, so what are the fundamental requirements?
- a stretched LAN/VLAN segment - the ESX servers need to be connected as if they were local. This should not be routed. VC can be routed, of course. Latency is actually pretty flexible - 100's of ms is OK.
- TIP: use the das.isolationaddress# (where # is 1-10) advanced properties for the ESX cluster to apply multiple isolation addresses and harden the config - in this case the default (the service console gateway IP) is definitely not good enough (it often isn't - this is a good idea in many cases). There's a quick scripting sketch of this after the list below.
- The stretched storage network should have synchronous replication characteristics. This immediately means campus LANs and MANs, not WANs (or the customer is a telco and long-distance DWDM is viewed as "free" - heck, it isn't even free for those folks!)
- The storage devices (if using block devices) need to have the EXACT same WWN (FC) or IQN (iSCSI), exactly as if they were local - otherwise the datastore will be viewed as a "snapshot" (think LVM.EnableResignature behaviour). This immediately excludes simply using array replication technologies (the replica is a different device with a different WWN) - you need one of the approaches below. NFS datastores don't have the WWN/IQN consideration, but absolutely have a similar latency and throughput requirement. NFS datastores supporting VMs (as opposed to ISOs, Templates and the like) should traverse IP networks designed like iSCSI networks, not LANs.
- Option A is to literally just stretch the storage fabric and keep the storage on one side - but man, you'd better have insane connectivity between sites, and this is at severe risk from a "smoking hole" site failure - say bye-bye to your array.
- Option B is to use something like EMC Invista, Yotta Yotta or the like, and have a distributed synchronous LUN across two storage arrays.
- Option C is the neat trick some vendors (Lefthand Networks comes to mind) have, where they are in essence doing software RAID across the x86 platforms their software runs on - so the RAID mirror node can be remote and "take over" if you lose a site.
- TIP: do the math. Regardless of which method is used, and even if you use compression, it's a huge amount of bandwidth. My favorite analogy is to ask a customer to stick a USB flash drive into their laptop, copy a big file, and look at the throughput. We all have VMDKs ideal for this purpose - yet another nice bonus of running Workstation or Fusion, right? :-) (BTW - man, I just love the whole Unity thing....) Doing 12MBps is usually very safe with a $20 USB key. DO THE MATH. 12MBps = 96Mbps = a completely saturated 100Mbps MAN - to match the storage throughput of an itty bitty USB flash drive. 12MBps is also, of course, a joke for a production storage array - which are nominally doing 100's to 1000's of MBps. Of course, a silly storage sales guy will tell you that "we only replicate the changes, and use unique magic replication and compression technology so you can do it over your 128Kbps frame relay connection". DON'T LISTEN, AND BACK AWAY SLOWLY - someone is freaking out because they aren't going to make their quota and would sell you their mother. (Everyone, BTW, replicates "just the changes", and some of us do compression and other tricks - but they don't make this problem go away, they just make it a bit better.) When you create a new VM, that 3GB thinly provisioned VMDK is still 3GB of changes that will cross the pipe. Imagine the case where it's the "extended geographic software RAID" (Lefthand case) - what's going to be the WAN requirement to do a rebuild? Doing this right - with anyone - needs big pipes - think dark fiber and DWDM. There's a quick back-of-envelope sketch of this math right after this list.
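To make that "do the math" tip concrete, here's a back-of-envelope sketch in Python. The throughput figures are just the illustrative numbers from the post (the USB key, the 100Mbps MAN, the 3GB VMDK) - treat it as a napkin calculation, not a sizing tool:

# Napkin math for stretched-storage bandwidth (illustrative numbers only).

def mbps_needed(megabytes_per_sec):
    """Convert storage throughput (MB/s) into network bandwidth (Mb/s)."""
    return megabytes_per_sec * 8

def hours_to_push(gigabytes, link_mbps):
    """Time to push a chunk of data over a link, ignoring overhead and compression."""
    megabits = gigabytes * 1000 * 8        # GB -> megabits (decimal units)
    return megabits / link_mbps / 3600

print(mbps_needed(12))                     # 96 Mb/s - a $20 USB key saturates a 100Mbps MAN
print(round(hours_to_push(3, 100) * 60))   # ~4 minutes for one new 3GB VMDK, with the MAN doing nothing else
print(mbps_needed(500))                    # 4000 Mb/s - a modest production array pushing 500 MB/s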
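And on the isolation-address TIP above - here's a minimal, hedged sketch of what scripting it might look like using pyVmomi. The VirtualCenter hostname, credentials, cluster name and addresses are all placeholders, and in practice most folks just set these keys in the cluster's Advanced Options dialog:

# Hedged sketch only: set multiple HA isolation addresses as cluster advanced
# options. Every name, credential and IP below is a placeholder.
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

# Depending on your pyVmomi version you may need to handle SSL verification here.
si = SmartConnect(host="vc.example.com", user="administrator", pwd="changeme")
content = si.RetrieveContent()

# Find the stretched cluster by its (placeholder) name.
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in view.view if c.name == "StretchedCluster")

# One pingable isolation address per site, and don't rely on the default
# (the service console gateway) alone.
options = [
    vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),
    vim.option.OptionValue(key="das.isolationaddress1", value="10.1.0.1"),
    vim.option.OptionValue(key="das.isolationaddress2", value="10.2.0.1"),
]
spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(option=options))

WaitForTask(cluster.ReconfigureComputeResource_Task(spec=spec, modify=True))
Disconnect(si)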
OK - so let's say that you meet all the above. I still think that, in general, this is a BAD idea. Let me tell you why:
BAD IDEA REASON #1: There is no general way to create "affinity" to one side for VM HA (I think there **might** be a way to do this, but I'm digging and can't find anything - anyone?) or DRS (no exceptions today). This is VERY IMPORTANT as it means that there is no way to ensure that VM HA or DRS don't simply move your VMs to the far side, hammering the WAN/MAN. This is often enough to kibosh this idea - and it is true of every single vendor's implementation (including EMC's).
The question to ask yourself to think about this clearly: "do I view the WAN/MAN/SAN link between the sites as no different, in terms of performance and cost, than local connectivity?" In other words - if a VM decided to move to an ESX server on the other side just because, would you shrug your shoulders, or would you start to get phone calls and big bills? If you would shrug your shoulders, congrats, you have a big, big budget - and are very, very rare.
BAD IDEA REASON #2: generally vendors position this to customers as a great way to get "VMotion across sites in case of a disaster" with no outage. Ah, storage vendor ignorance. Unfortunately for the customer, that's not how VMotion works (source and target ESX servers need to be up, running, and accessible to each other at the same time). When a desperate competitor raises this as "unique magic" (which it isn't) in campaigns - I love it, because the competitor just exposed colossal VMware ignorance. When an EMC person does it, I send them to VCP training :-) We're up to 300 VCPs now.
BAD IDEA REASON #3: What about VC? Which side do you put the Virtual Center host on? (You need it on both sides if you want to VMotion/VM HA after losing network connectivity to one side or the other.) You could create a stretched Microsoft Cluster config (which we support) - but UGH, this is crazy complex. A better alternative is to use native SQL Server Database Mirroring and then do a manual startup procedure. Another simple idea is to run VC as a VM itself, replicate the storage supporting the VC VM and the VC database, and then manually recover it at the remote site.
BAD IDEA REASON #4: In disasters, split brain is possible. A bigger deal is the "smoking hole" disaster. If you're using Option A above, you're hosed (the single array is toast). If you're using Option C, you face a long, painful and operationally complex rebuild.
In the future, BAD IDEA REASON #2 could be resolved by a VMware feature (Continuous Availability - demo'ed at VMworld 2007) - but this will involve the VM simultaneously running on both ESX servers (most customers will love this for select VMs, but won't use it for all VMs - since every VM then consumes 2x the resources). The downside here is that Continuous Availability (or whatever VMware eventually calls it) will mean that the stretched VMotion network configuration constraints go from "loose" to "tight". And BAD IDEA REASON #1 might be resolved in a future ESX/VC release (allowing you to create "affinity zones" in an ESX cluster for VM HA and DRS).
Does this all strike you as not passing the KISS sniff-test? It fails that test for me.
Here's my question - why not just use Site Recovery Manager? Yes, it's not instantaneous and transparent (which is oh-so sexy as an idea). But it is fast (recovery time measured in minutes), saves as much bandwidth as possible (because it can leverage all the array vendors' replication techniques - this is replication, not stretched storage), works in a much broader set of distances and configurations, solves the problem of the "stretched VLAN", and doesn't suffer from the smoking hole or split brain syndrome - all while helping coordinate and sequence VM restart in a real disaster or when moving a workload from one datacenter to another.
Most importantly - it's simple. The goal of information infrastructure isn't to be sexy - it's to work, and work not in a demo or a POC, but in messy operational environments, with busy, overworked IT staff, and turnover so the people who built something can't always be counted on as a "magic decoder ring" (the ol' "WTF did you do in this script here?" effect). The question to ask isn't "could it be done?" (the answer is almost always "yes") but to ask yourself "what could our IT operations actually execute on our worst day, with hellfire raining down around us?"
If you absolutely need instantaneous application failover, you would be wise to start by looking a bit further up the stack (think Dataguard, Database Mirroring, CCR, load balancers, app-layer failover and the like) - it's way, way easier at that layer - and they all work just fine on VMs :-) Leave the lower layers of the infrastructure to solve the problem where it can be solved across wide swaths of your infrastructure consistently and homogeneously (like VM HA does for local HA, and Site Recovery Manager does for DR).
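As a hedged illustration of how simple "failover up the stack" can be, here's a tiny Python sketch of a client that tries a primary endpoint and falls back to a standby at the other site. The hostnames and port are placeholders, and a real deployment would of course use Dataguard, Database Mirroring, CCR or a load balancer rather than hand-rolled code:

import socket

# Placeholder endpoints - primary at site A, standby at site B.
ENDPOINTS = [("app-siteA.example.com", 5432),
             ("app-siteB.example.com", 5432)]

def connect_with_failover(endpoints, timeout=2.0):
    """Return a socket to the first endpoint that answers."""
    last_error = None
    for host, port in endpoints:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError as err:        # refused, timed out, unresolvable...
            last_error = err
    raise ConnectionError("no endpoint reachable: %s" % last_error)

# sock = connect_with_failover(ENDPOINTS)   # uncomment with real endpoints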
I think when vendors get customers all excited without thinking about the solution as a whole - they're playing silly vendor games, and that's not good for the customer. Are there some customers for whom stretched clusters are a good fit?
Sure - but it ain't 33% of customers - heck, it's not even 1%.
Tell me - what do you think?
Chad, I just wanted to welcome you to the blogging world and say "wow"! You're putting up some great information and I for one intend to read every word of it. So thanks, man!
Posted by: Stephen Foskett | June 12, 2008 at 09:42 AM
Chad, another outstanding post. You have quickly made the short list of blogs that I check on a regular basis. I thoroughly enjoy your insight on the topics you have covered so far.
I went to EMC World with an interest in learning more about SRM and came away from there with SRM at the top of my list of solutions I am most excited about for our environment.
It can be a very dangerous game that some vendors play when they pitch products w/o thinking about, or just w/o caring about, how they fit into a customer's environment as a whole. It's a recipe for a solution sale with a short-lived customer relationship.
Thanks again for the great insight.
Posted by: RodG | June 12, 2008 at 11:00 AM
I have to agree with most of your reasons. However, the simplicity of having one cluster seems almost too good to pass up and might even make it worth it. We are a minor site and we will probably only have a dozen ESX boxes, and for us there is no problem with having a second site with dark fiber connectivity. It's not even expensive for us... The storage side actually might be doable using NetApp and their Metrocluster setup.
However I do really worry about a split brain!
SRM is not trivial either, and in case of a failure you then have to worry about doing a manual failback - and this step is not that easy to test in production!
Posted by: jar | June 12, 2008 at 02:06 PM
Janake, thanks for the post.
Metrocluster is another of the technologies on the market - like the other examples I mentioned - that can do this.
In that case, dark fiber is needed, because you cut the FAS filer in half and stretch the backend enclosures (in effect) between sites. It leverages the way that NetApp does failover (in effect, starting the failed filer as a virtual instance on the other head), and uses SyncMirror to keep a copy of the data (that's the simple stretched Metrocluster case - 500m is the target distance), or stretches via FC switches (up to 100Km).
Metrocluster is neat and does work - but it's not for the faint of heart (like most of these stretched storage configurations, including EMC's solutions). I'd invite any of my NetApp colleagues and readers to post their perspective.
In the case of a disaster you will not be able to VMotion (since the source server will be down); you will be doing a mass VM HA operation, which will take as long as SRM would - so there's no Recovery Time Objective (RTO) or Recovery Point Objective (RPO) advantage.
I think you'll find that the MUX gear needed to go over your dark fiber (in any case, unless you're literally just stretching the cables), the FC switches needed for the Fabric Metrocluster, and the Metrocluster licensing are likely FAR, FAR more expensive than the SRM direction with SnapMirror (if you HAVE to use NetApp :-)
The other thing that I wouldn't underestimate is that even with 12 ESX servers, you could be talking about 100's of VMs - SRM's automation of startup sequence, reporting and notification is not to be minimized. You will need to do it manually otherwise - just letting VM HA go won't work (startup sequence matters).
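As a hedged illustration of what "manually" really means, here's a stripped-down Python sketch of a tiered restart. The tier contents and the power_on_and_wait() helper are hypothetical stand-ins for whatever real scripting against VirtualCenter you'd end up maintaining - and which an SRM recovery plan gives you out of the box:

# Hypothetical sketch of a manual, tiered VM restart after a site failure.
RESTART_TIERS = [
    ["dc01", "dns01"],              # tier 1: directory / name services
    ["sql01", "sql02"],             # tier 2: databases
    ["app01", "app02", "app03"],    # tier 3: application servers
    ["web01", "web02"],             # tier 4: web front ends
]

def power_on_and_wait(vm_name):
    """Placeholder: power on a VM and block until its services respond."""
    print("powering on %s ... waiting for a service check" % vm_name)

for tier_number, tier in enumerate(RESTART_TIERS, start=1):
    print("-- restart tier %d --" % tier_number)
    for vm_name in tier:
        power_on_and_wait(vm_name)
    # In reality you'd also handle failures, notification and reporting here,
    # per tier - which is exactly what SRM recovery plans automate.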
This Metrocluster proposal is an example along the lines of what I was outlining in the post. While technically possible, I don't get it:
- No improved RPO/RTO (maybe fractional over SnapMirror, but with your infrastructure, sync replicas would be possible - EMC supports this, but NetApp doesn't support SyncMirror with SRM - perhaps this is why they are pushing you this way?)
- relatively complex storage layer configuration to create the stretched back end. I respect NetApp's simplicity - but Metrocluster is not simple. (http://www.netapp.com/us/library/technical-reports/tr-3548.html)
- Compared with just sync replication + Site Recovery Manager, the SOLUTION (not just getting data there) would involve complex manual scripting for failover and restart - not just at the storage (NetApp) layer (it's true that failover is simple ONCE you have Metrocluster working), but at the VMware layer (unless you're just going to trust VM HA to start everything at once and hope you don't get services failing all over the place). Of course, split brain is also a possibility.
So - who does this make sense for: you as the customer, or NetApp as the vendor? (I would argue neither, as it negates one of NetApp's great strengths, which is simplicity.)
I'd have to imagine that if I were NetApp, **AND** if you are committed to Metrocluster (perhaps it makes sense for other, non-VMware workloads), I would leverage NFS datastores - they will be the most proven solution with Metrocluster and NetApp. NetApp cluster failover works well with NAS, and works with block protocols as well, but with a few more asterisks.
I hear you loud and clear on failback. Failback isn't built into SRM v1.0 - but there are several failback mechanisms that are documented in the GA release. As well, EMC has worked above and beyond the basic SRM integration requirements to help automate failback with several of our replication technologies. It's also not that complicated even in the simplest scenario - you create a recovery plan in the opposite direction - not so much "failback" as "failover again". I will tell you this definitively - storage failback is easy (Metrocluster, or any of the array replication technologies used by SRM). What's hard is VM failback (without SRM).
So - is this a case where you are driving the requirement - or the vendor is saying "hey this would be cool!"? (perhaps because SRM doesn't support NFS yet, or because you have a synchronous requirement?)
With a bow to my respected colleagues at NetApp - whenever I've seen this Metrocluster positioning, it's the latter, not the former. Just because it's technically possible doesn't make it a good idea.
Posted by: Chad Sakac | June 12, 2008 at 02:51 PM
Hello again, and thanks very much for your insightful reply to my post and for clarifying a few points. This is one of the great things about blogs, and yours is on my permanent reading list.
We're currently investigating different storage vendors, and NetApp is by no means the only one - we currently run the EMC Clariion. Every storage vendor has their gotchas, and you clearly pointed out one of the gotchas with Metrocluster.
I might be chasing the perfect dream, but I do hope that we some day can have a simple stretched datacenter between different sites. But alas, not today.
When it comes to IT, my philosophy is "seeing is believing"... ;)
Posted by: jar | June 12, 2008 at 03:41 PM
Janake - thanks - BTW, I cringe if I came across as "Metrocluster is complex" - it should have come across as "all geographically distributed storage is complex" (Invista, Yotta Yotta, Lefthand's distributed clusters - everyone). I'm throwing them all under the bus together.
That doesn't mean they are bad. It means - BE VERY CAREFUL. Some are complex in setup, simple in operation (Metrocluster, Invista, YY), others are simple in setup, complex in operation (LH).
You need to be careful, because in general (not always - sometimes requirements are strict and must be met), the complexity trade-off is the wrong way to go.
What happens, though, is that the vendors don't share "OK, this is what it's really going to take" and "OK, and this is how it's actually going to work".
Everyone is quick to point out the flaws in others. Trust the vendor who points out flaws in **themselves**, and guides you as a customer on how the SOLUTION would work (flaws and all).
Thank you for being an EMC customer, and if you want to try MirrorView/S across that link with SRM, I'm happy to help you! Regardless - share your experience!
Posted by: Chad Sakac | June 12, 2008 at 04:37 PM
Hey Chad, great post. Only gotcha right now is that MVA isn't supported with SRM. I've sent some mails to a program manager who works with the Clariion developers but she said she was 99% sure it wouldn't be done before Q4. Anything you could do to help speed this up would be greatly appreciated. Most of our customers want some serious distance between sites and don't want to have to buy a Celerra to do it.
Posted by: Ed | June 15, 2008 at 01:05 PM
Ed - thanks for the comment.
The MV/A support in the CX SRA isn't a lot of work - but right now it's stacked behind a couple things that will be obvious in a little while. We'll get it done as fast as we can.
The rationale in the mothership was that with RecoverPoint now supporting the integrated splitter, RP (which is MUCH more feature-rich, supports long-distance async configs, adds continuous data protection and very good WAN compression) isn't much more expensive than MV/A. You don't HAVE to have a Cisco MDS switch or Brocade intelligent fabric anymore - the CX can split the I/O internally.
Still - I hear you, and it's been a loud drumbeat. Partners and customers steer our direction - we'll speed up straight up MV/A support.
Posted by: Chad Sakac | June 15, 2008 at 06:29 PM
Hey, quick update here - MV/A is coming in the Sept update - and for EMC customers who need it now, contact your local EMC team or me, happy to help you immediately!
Posted by: Chad Sakac | August 27, 2008 at 09:59 PM
A very insightful article on a very complex topic, thanks Chad. With VMware's direction pointing to introducing more and more VCENTERS in the environment (e.g. servers, VDI, maybe some Lab Manager, all falling under an SRM framework) ... it does make you wonder about all the implications of stretching as you've pointed out.
Which for me boils down to - at least with cluster+VCENTER being bound to a site, you always know where things are at. Let's face it, sometimes HA does not completely live up to expectations, so if you don't have a VCENTER cluster in place, your environment will be invisible until you get VCENTER up.
I really like your idea of adding more site awareness into VCENTER. To extend upon that, I'd also say yes to giving it Active Directory-style multi-master redundancy. So in day-to-day operation each site's VCENTER is responsible for looking after its local hosts, but in the event of a failover the role can be run from another VCENTER in the forest. This of course would be dependent on networking, but for simplicity let's presume all VCENTERS have connectivity to all ESX hosts in the whole environment. It would also give the ability for cross-vCenter clusters to be set up. Perhaps these things could be called Meta-Clusters.
Perhaps VMware will facilitate this, and then rely on VSTORAGE and the storage vendors to write more VCENTER plugins. E.g. giving VCENTER "Metro-Cluster" site-awareness and also the ability to sync this awareness across multiple VCENTERS. How cool would it be to be able to evacuate a building by evacuating that half of the cluster, putting all those hosts into maintenance mode, then cutting over the storage, and then cutting over to the other VCENTER.
This would all be a major change to the way VCENTER works, but that's also my utopia.
Posted by: Kimono | August 19, 2009 at 08:27 AM