OK - from one day of insider EMC threads on VMware topics - three posts.
3) SRM failback - what's the scoop?
A common refrain goes like this:
Question: "Can you failback with SRM? I heard that you can't, but that it's coming in the future?"
Answer: You absolutely CAN failback with Site Recovery Manager v1.0.
What is true is that SRM v1.0 doesn't currently automate the reverse of the failover workflow.
I suggest stopping right now and getting this doc, if you're thinking about SRM: http://www.vmware.com/pdf/srm_10_admin.pdf
Ok, so I was with a customer yesterday giving them a best practices update - they just bought 6 CX4-960s for their remote offices, and use an IBM ESS 8000 at their primary site, replicating to another DS8000 in an off-site bunker. The ESS 8000, aka the "Shark", is IBM's high-end storage platform. While I never want to be like Chuck and Barry in blog tone, I have to say, we do blow IBM away at the high end (except where IBM takes the whole "outsource and single source everything with Big Blue" position). When customers get EMC stuff, I always think a good knowledge transfer is a good idea. I asked them about their plans for VMware DR and their comment was:
"We have that figured out. We've even tested our DR plan. We have replicated the LUNs using PPRC (Chad - this is IBM remote replication on the ESS 8000 platforms that competes with SRDF) and FlashCopy (ed. think point-in-time copy that competes with TimeFinder) to create a read/write remote image we present to the remote ESX cluster. We then set LVM.DisAllowSnapshot=1. Then we register all the VMs, boot them, and we're up and running. We love that VMware makes DR so much simpler, but we don't quite get the value of Site Recovery Manager, which just automates the above steps, particularly since there's no failback."
Ok, I've heard comments like the above before. Don't get me wrong, SRM isn't a panacea, and it's certainly not the second coming, and it's the first release.
But let me quickly hit why I think that they're missing the value (and what I shared with the customer):
- SRM automates exactly those steps. Automation in a DR situation is everything. Buildings will be burning or sprinklers running, and cellphones will be ringing. It's not the time for complex manual operations.
- Could it be manually scripted? Sure. Who will maintain that script? Traditionally, DR was generally reserved for mainframes and other things deemed "mission critical" enough for expensive Disaster Recovery. In those cases, the environments are VERY static - so the idea of creating a DR plan, refreshing it and testing it once a year at a multi-million dollar cost was reasonable. VMware is different, and SRM brings DR to a whole new use case. This same week, I talked to a customer who is adding 100 VMs a week to their infrastructure. Heck, even if you're doing 1 a week, will you update that script constantly?
- They tested a single VM booting. Yeah! They have 400 VMs today. First of all, who's going to manually register all those VMs? More importantly - what are the DEPENDENCIES between the VMs? There is a specific start sequence needed, or your entire DR plan will not work. I'm always interested in how IT projects needing cross-domain expertise are hard, because everyone trivializes everyone else's work or complexity. AD and DNS, then Exchange/SQL Server, then Sharepoint - and somewhere in there, your hundreds of other VMs - in a specific start sequence. Who will figure out the specific start dependencies the first time, and how will that be maintained in this uber-script? SRM helps, and come to AD3500 at VMworld to find out what EMC is doing to make this easy.
- They tested booting the VM on an isolated vswitch. The IP addressing scheme at the remote site was totally different. What will update all the IPs? Update DNS? Do any hosts hardcode IPs rather than use DNS names... anywhere?
- The test (including the one they did) is a useless test unless it is an END-TO-END test. Otherwise, you have told your management that you are ready, and when the unthinkable happens you have failed them, you've failed your business, and you've failed yourself. In other words, a successful "pseudo test" that leads to "we have it figured out", without a REAL test, GUARANTEES FAILURE.
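To make the start-sequence point concrete, here's a minimal sketch of what that "uber-script" would at least need to compute: a boot order derived from declared dependencies. The VM names and the dependency map are made up for illustration, and this is not how SRM does it internally - it's just the standard-library topological sort in Python 3.9+.

```python
from graphlib import TopologicalSorter

# Hypothetical VM names and dependencies, for illustration only.
# Each key maps to the set of VMs that must be up before it is powered on.
depends_on = {
    "ad-dns-01": set(),
    "sql-01": {"ad-dns-01"},
    "exchange-01": {"ad-dns-01"},
    "sharepoint-01": {"sql-01", "ad-dns-01"},
}

def boot_order(deps):
    """Return a start sequence in which every VM comes after its dependencies."""
    # static_order() raises graphlib.CycleError if the dependencies form a loop.
    return list(TopologicalSorter(deps).static_order())

order = boot_order(depends_on)
print(order)
```

Computing the order is the easy part. The hard part is keeping that dependency map accurate when VMs are being added every week, which is exactly the maintenance problem described above.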
SRM is designed for all the things above. The funny thing is that some customers, particularly the largest, have what they call a "Business Continuity Team" that is a steady-state IT function, like "the server team". DR is NOT about getting the data from one side to another (although that IS an important part) - it's about all 5 things I listed above and more. When you talk to a customer who has been through a disaster or someone on their BC team, they understand. When that's not the case, I see that "cross domain trivialization" thing.
Ok, what about failback?
Ok, before you proceed:
- If you haven't seen SRM in action - check it out here, or see it demo'ed live on stage in front of thousands here (scroll to the VMware CTO video, demo is 30 mins in)
- download the evaluation SRM bits here, or just buy it here :-)
- get this document: http://www.vmware.com/pdf/srm_10_eval_guide.pdf
- If you're an EMC CLARiiON MirrorView Customer, read this
- If you're an EMC Celerra Replicator Customer (and this will apply for people playing around with SRM using the Celerra Virtual Machine), read this
- If you're an EMC SRDF Customer, read this
- If you're an EMC RecoverPoint Customer, read this
The EMC docs are on Powerlink (for EMC, EMC Partners including VMware, and customers) and not the Internet proper, but I'm getting that fixed - will update when they get posted on EMC.com.
Take a look at the VMware SRM Evaluator Guide here (Mornay, Lee, you are stars - not in spite of Chapter 11, but because of it), Chapter 6: Failback from the Recovery Site to the Protected Site. It outlines the 20 steps you need to do to failback.
Heck, EMC worked to have our Storage Replication Adapters (SRA - think "Array Vendor plugin for SRM") automatically do step 7 (storage personality swap).
Ok - so if steps 1-20 (minus step 7 if you're an EMC customer) are manually configured, how does SRM help?
The answer is that failback is literally failover in reverse. If you understand why testing, IP address changes, and workflow rather than scripts are important for failover, you understand why SRM helps with failback.
Trust that we're feverishly beavering away on SRM 1.0 updates (formal addition of MV/A, RecoverPoint 3.01, consistency group handling improvements), working with VMware to get NFS support (as I'm sure my NetApp peers are also doing - NFS on the new NX4 would lock in EMC with the "lowest cost" SRM solutions along with the AX4 with iSCSI/FC today), and then eventually major feature additions (automated failback workflow, multi-site DR and other goodies).
Moral of that story: I can't think of a good reason not to evaluate SRM today. Anyone have one?
Oh, and "my array doesn't support it" doesn't cut it. BTW - in the case above, I'm sure one of the sources of the feedback from the customer was IBM. They would be quick to pooh-pooh SRM - because the ESS 8000 doesn't currently support SRM; only their DS4000 arrays that they OEM from LSI do.