I've been working with 10 joint VMware/EMC customers this week in NY, NJ and Houston (phew!), and was in Australia the week before last where there were 2 more. Out of those 12, 4 asked me questions about the applicability of "stretching" their ESX clusters across geographic distances - that's 33%, and absolutely above the "man, I should write a blog on the topic" threshold.
So, what are we talking about?
A stretched cluster is the practice of having ESX member servers in a cluster that are geographically separated. The reason this is generally done is to provide the ability to dynamically move workloads from one datacenter to another. Often, the customer is also considering it for disaster recovery purposes ("I'll just VMotion in case of a disaster"). Can this be done? ABSOLUTELY - but it shouldn't be undertaken lightly.
I'll get back to this - but this is usually promoted by storage vendors schlepping their goods with little to no consideration of the VMware requirements, thinking only of their own subject domain ("hey, I can have storage at two sites, that must mean I could create a stretched ESX cluster - oh, the customer would like that!"). I hate it when EMC is the instigator (as much as I fight it, it does happen), but more often than not it's competitors trying to position something they view as unique in the VMware context. This is bad vendor behaviour (and incorrect - many of us can do this) - it ignores the actual use case.
Ok, so what are the fundamental requirements?
- a stretched LAN/VLAN segment - the ESX servers need to be connected as if they were local. This should not be routed. (VC traffic can be routed, of course.) Latency is actually pretty flexible - 100's of ms is OK.
- TIP: use the das.isolationaddress# (where # is 1-10) advanced properties for the ESX cluster to apply multiple isolation addresses to harden the config - in this case the default (the Service Console gateway IP) is definitely not good enough (it often isn't - this is a good idea in many cases). I've put a quick sanity-check sketch for this right after this list.
- The stretched storage network should have synchronous replication characteristics. This immediately means campus LANs and MANs, not WANs (unless the customer is a telco and long-distance DWDM is viewed as "free" - heck, it's not even free for those folks!)
- The storage devices (if using block devices) need to have the EXACT same WWN (FC) or IQN (iSCSI), exactly as if they were local - otherwise they will be viewed as a "snapshot" (think LVM.EnableResignature behaviour). This immediately excludes traditional array replication technologies - even synchronous ones - where the remote copy is a separate device with its own identity. NFS datastores don't have the WWN/IQN consideration, but absolutely have similar latency and throughput requirements. NFS datastores supporting VMs (as opposed to ISOs, templates and the like) should traverse IP networks designed like iSCSI networks, not LANs.
- Option A: literally just stretch the storage fabric and keep the storage on one side - but man, you'd better have insane connectivity between sites, and this is severely exposed to a "smoking hole" site failure - say bye-bye to your array.
- Option B: use something like EMC Invista, YottaYotta or the like, and have a distributed synchronous LUN with two storage arrays.
- Option C: some vendors (LeftHand Networks comes to mind) have a neat trick where they are in essence doing software RAID across the x86 platforms on which their software runs - so the RAID mirror node can be remote, and can "take over" if you lose a site.
- TIP: do the math (I've sketched it out in code right after this list). Regardless of which method is used, and even if you use compression, it's a huge amount of bandwidth. My favorite analogy is to ask a customer to stick a USB flash drive into their laptop, copy a big file, and look at the throughput. We all have VMDKs ideal for this purpose - yet another nice bonus of running Workstation or Fusion, right? :-) (BTW - man, I just love the whole Unity thing....). Doing 12MBps is usually very safe with a $20 USB key. DO THE MATH. 12MBps = 96Mbps = a completely saturated 100Mbps MAN - just to match the storage throughput of an itty bitty USB flash drive. 12MBps is also, of course, a joke for a production storage array - which is nominally doing 100's to 1000's of MBps. Of course, a silly storage sales guy will tell you that "we only replicate the changes, and use unique magic replication and compression technology so you can do it over your 128Kbps frame relay connection". DON'T LISTEN, AND BACK AWAY SLOWLY - someone is freaking out because they aren't going to make their quota and would sell you their mother. (Everyone, BTW, replicates "just the changes", and some of us do compression and other tricks - but they don't make this problem go away, just make it a bit better. When you create a new VM, that 3GB thinly provisioned VMDK is still 3GB of changes that will cross the pipe.) Imagine the case where it's the "extended geographic software RAID" (the LeftHand case) - what's going to be the WAN requirement to do a rebuild? Doing this right - with anyone - needs big pipes - think dark fiber and DWDM.
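Since I keep saying "do the math", here's a minimal sketch of it in Python, using only the numbers from above. One assumption of mine: I picked 1000 MBps as the array figure (it's the low end of the "100's to 1000's of MBps" range), and I'm ignoring protocol overhead, which only makes reality worse:

```python
# Back-of-the-envelope math for stretched-storage bandwidth, using the
# numbers from the post. 1 MBps = 8 Mbps; protocol overhead ignored.

USB_KEY_MBPS = 12        # MB/s - throughput of a cheap $20 USB flash drive
MAN_LINK_MBPS = 100      # Mbps - a typical 100Mbps MAN circuit
ARRAY_MBPS = 1000        # MB/s - low end of "100's to 1000's of MBps"
NEW_VMDK_GB = 3          # GB - the freshly provisioned thin VMDK example

def to_mbps(megabytes_per_sec: float) -> float:
    """Convert MB/s to megabits per second."""
    return megabytes_per_sec * 8

def minutes_to_push(gigabytes: float, link_mbps: float) -> float:
    """Minutes to move `gigabytes` GB over a `link_mbps` Mbps link."""
    megabits = gigabytes * 1024 * 8
    return megabits / link_mbps / 60

print(f"USB key: {USB_KEY_MBPS} MBps = {to_mbps(USB_KEY_MBPS):.0f} Mbps "
      f"-> saturates a {MAN_LINK_MBPS}Mbps MAN on its own")
print(f"New {NEW_VMDK_GB}GB VMDK across that MAN: "
      f"{minutes_to_push(NEW_VMDK_GB, MAN_LINK_MBPS):.1f} minutes of total saturation")
print(f"Matching an array at {ARRAY_MBPS} MBps needs "
      f"{to_mbps(ARRAY_MBPS) / 1000:.0f} Gbps -> dark fiber / DWDM territory")
```

The rebuild question in the LeftHand case is the same math with terabytes in the numerator - run it before you sign anything.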
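And on the das.isolationaddress tip - here's the kind of quick sanity check I'd run before trusting extra isolation addresses in a stretched config. A hedged sketch only: the IPs are made-up examples (you'd pick highly available addresses reachable from both sites), and it assumes Linux-style ping:

```python
import subprocess

# Hypothetical candidate isolation addresses for das.isolationaddress1..N.
# Example IPs only - substitute addresses that are highly available at BOTH sites.
CANDIDATES = ["10.1.0.1", "10.2.0.1", "10.1.0.254"]

def reachable(ip: str, timeout_s: int = 2) -> bool:
    """True if `ip` answers a single ICMP ping within the timeout (Linux ping flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

for i, ip in enumerate(CANDIDATES, start=1):
    status = "OK" if reachable(ip) else "UNREACHABLE - pick another address"
    print(f"das.isolationaddress{i} = {ip}: {status}")
```

Nothing VMware-specific in the script itself - the point is simply that each das.isolationaddress# entry should be something actually pingable from the Service Console at both sites.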
OK - so let's say that you meet all the above. I still think that, in general, this is a BAD idea. Let me tell you why:
BAD IDEA REASON #1: There is no general way to create "affinity" to one side for VM HA (I think there **might** be a way to do this, but I've been digging and can't find anything - anyone?) or for DRS (no exceptions today). This is VERY IMPORTANT, as it means there is no way to ensure that VM HA or DRS don't simply move your VMs to the far side, hammering the WAN/MAN. This is often enough to kibosh the idea - and it's true of every single vendor's implementation (including EMC's).
The question to ask yourself to think about this clearly: "do I view the WAN/MAN/SAN across the sites as no different, in terms of performance and cost, from local connectivity?" In other words - if a VM moved to an ESX server on the other side for whatever reason, would you shrug your shoulders, or would you start to get phone calls and big bills? If you would shrug your shoulders, congrats - you have a big, big budget, and are very, very rare.
BAD IDEA REASON #2: generally, vendors position this to customers as a great way to get "VMotion across sites in case of a disaster" with no outage. Ah, storage vendor ignorance. Unfortunately for the customer, that's not how VMotion works (source and target ESX servers need to be up, running, and accessible to each other at the same time). When a desperate competitor raises this as "unique magic" (which it isn't) in campaigns - I love it, because the competitor just exposed colossal VMware ignorance. When an EMC person does it, I send them to VCP training :-) We're up to 300 VCPs now.
BAD IDEA REASON #3: What about VC? Which side do you put the VirtualCenter host on? (You need it on both if you want VMotion/VM HA when you lose network connectivity to one side or the other.) You could create a stretched Microsoft Cluster config (which we support) - but UGH, that's crazy complex. A better alternative is to use native SQL Server Database Mirroring and then do a manual startup procedure. Another simple idea is to run VC as a VM itself, replicate the storage supporting the VC VM and the VC database, and then manually recover at the remote site.
BAD IDEA REASON #4: In disasters, split brain is possible. A bigger deal is the "smoking hole" disaster. If you're using Option A above, you're hosed (the single array is toast). If you're using Option C, you face a long, painful, and operationally complex rebuild.
In the future, BAD IDEA REASON #2 could be resolved by a VMware feature (Continuous Availability - demo'ed at VMworld 2007) - but this will involve the VM simultaneously running on both ESX servers (most customers will love this for select VMs, but won't use it for all VMs, since every VM then consumes 2x the resources). The downside here is that Continuous Availability (or whatever VMware eventually calls it) will mean the stretched VMotion network configuration constraints go from "loose" to "tight". And BAD IDEA REASON #1 might be resolved in a future ESX/VC release (allowing you to create "affinity zones" in an ESX cluster for VM HA and DRS).
Does all this strike you as not passing the KISS sniff-test? It fails that test for me.
Here's my question - why not just use Site Recovery Manager? Yes, it's not instantaneous and transparent (which is oh-so-sexy as an idea). But it is fast (recovery time measured in minutes), saves as much bandwidth as possible (it can leverage all the array vendors' replication techniques, since this is replication, not stretched storage), works across a much broader set of distances and configurations, solves the "stretched VLAN" problem, and doesn't suffer from smoking-hole or split-brain syndrome - all while helping coordinate and sequence VM restart in a real disaster, or when moving a workload from one datacenter to another.
Most importantly - it's simple. The goal of information infrastructure isn't to be sexy - it's to work, and to work not in a demo or a POC, but in messy operational environments, with busy, overworked IT staff, and enough turnover that the people who built something can't always be counted on as the "magic decoder ring" (the ol' "WTF did you do in this script here?" effect). The question to ask isn't "could it be done?" (the answer is almost always "yes") but "what could our IT operations actually execute on our worst day, with hellfire raining down around us?"
If you absolutely need instantaneous application failover, you'd be wise to start by looking a bit further up the stack (think Data Guard, Database Mirroring, CCR, load balancers, app-layer failover and the like) - it's way, way easier at that layer, and all of it works just fine on VMs :-) Leave the lower layers of the infrastructure to solve the problem where it can be solved across wide swaths of your infrastructure consistently and homogeneously (like VM HA does for local HA, and Site Recovery Manager does for DR).
I think when vendors get customers all excited without thinking about the solution as a whole - they're playing silly vendor games, and that's not good for the customer. Are there some customers for whom stretched clusters are a good fit?
Sure - but it ain't 33% of customers - heck, it's not even 1%.
Tell me - what do you think?