OK - from one day of insider EMC threads on VMware topics - three posts.
3) SRM failback - what's the scoop?
A common refrain goes like this:
Question: "Can you failback with SRM? I heard that you can't, but that it's coming in the future?"
Answer: You absolutely CAN failback with Site Recovery Manager v1.0
What is true is that currently SRM v1.0 doesn't automate the reverse workflow to the failover operation.
I suggest stopping right now and getting this doc, if you're thinking about SRM: http://www.vmware.com/pdf/srm_10_admin.pdf
Ok, so I was with a customer yesterday giving them a best practices update - they just bought 6 CX4-960s for their remote offices, and use an IBM ESS 8000 at their primary site, replicating to another DS8000 in an off-site bunker. The ESS 8000 is aka the "Shark" - IBM's high end storage platform. While I never want to be like Chuck and Barry in blog tone, I have to say, we do blow IBM away at the high end (except where IBM takes the whole "outsource and single source everything with Big Blue" position). When customers get EMC stuff, I always think a good knowledge transfer is a good idea. I asked them about their plans for VMware DR and their comment was:
"We have that figured out. We've even tested our DR plan. We have replicated the LUNs using PPRC (Chad - this is IBM remote replication on the ESS 8000 platforms that competes with SRDF) and Flashcopy (ed. think point in time copy that competes with Timefinder) to create a read/write remote image we present to the remote ESX cluster. We then set LVM.DisAllowSnapshot=1. Then we register all the VMs, boot them, and we're up and running. We love that VMware makes DR so much simpler, but we don't quite get the value of Site Recovery Manager which just automates the above steps, particularly since there's no failback"
Ok, I've heard the comments above before. Don't get me wrong, SRM isn't a panacea, and it's certainly not the second coming, and it's the first release.
But let me quickly hit why I think that they're missing the value (and what I shared with the customer):
- SRM automates exactly those steps. Automation in a DR situation is everything. Buildings will be burning or sprinklers running, and cellphones will be ringing. It's not the time for complex manual operations.
- Could it be manually scripted? Sure. Who will maintain that script? Traditionally - DR was reserved generally for mainframes and other things deemed "mission critical" enough for expensive Disaster Recovery. In those cases, the environments are VERY static - so the idea of creating a DR plan, refreshing it and testing it once a year at a multi-million dollar cost was reasonable. VMware is different, and SRM brings DR to a whole new use case. This same week, I talked to a customer who is adding 100 VMs a week on their infrastructure. Heck, even if you're doing 1 a week, will you update that script constantly?
- They tested a single VM booting. Yeah! They have 400 VMs today. First of all, who's going to manually register all those VMs? More importantly - what are the DEPENDENCIES between the VMs? There is a specific start sequence needed, or your entire DR plan will not work. I'm always interested in how IT projects needing cross-domain expertise are hard, because everyone trivializes everyone else's work or complexity. AD and DNS, then Exchange/SQL Server, then Sharepoint - and somewhere in there, your hundreds of other VMs - in a specific start sequence. Who will figure out the specific start dependencies the first time, and how will that be maintained in this uber-script? SRM helps, and come to AD3500 at VMworld to find out what EMC is doing to make this easy.
- They tested booting the VM on an isolated vswitch. The IP addressing scheme at the remote site was totally different. What will update all the IPs? Update DNS? Do any hosts hardcode IPs rather than use DNS names... anywhere?
- The test (including the one they did) is a useless test unless it is an END-TO-END test. Otherwise, you have told your management that you are ready, and when the unthinkable happens you have failed them, you've failed your business, and you've failed yourself. In other words, a successful "pseudo test" that leads to "we have it figured out" GUARANTEES FAILURE unless you REALLY test.
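To make the start-sequence point concrete, here is a minimal sketch (in Python) of what the "uber-script" would have to get right. It is not how SRM stores or runs recovery plans. The VM names and the dependency map are hypothetical, and it assumes Python 3.9+ for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each VM lists the VMs that must be up before it.
depends_on = {
    "ad-dns": [],
    "sql": ["ad-dns"],
    "exchange": ["ad-dns"],
    "sharepoint": ["sql", "ad-dns"],
}


def boot_waves(deps):
    """Group VMs into waves; every VM's dependencies sit in an earlier wave."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = boot_waves(depends_on)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Even in this toy form, the map is the part someone has to keep current every time a VM is added. That is the maintenance burden the bullets above are about.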
SRM is designed for all the things above. The funny thing is that some customers, particularly the largest, have what they call a "Business Continuity Team" that is a steady-state IT function, like "the server team". DR is NOT about getting the data from one side to another (although that IS an important part) - it's about all 5 things I listed above and more. When you talk to a customer who has been through a disaster or someone on their BC team, they understand. When that's not the case, I see that "cross domain trivialization" thing.
Ok, what about failback?
Ok, before you proceed,
- If you haven't seen SRM in action - check it out here, or it being demo'ed live on stage in front of thousands here (scroll to the VMware CTO video, demo is 30 mins in):
- download the evaluation SRM bits here:
or, just buy it here :-)
- get this document: http://www.vmware.com/pdf/srm_10_eval_guide.pdf
- If you're an EMC CLARiiON MirrorView Customer, read this
- If you're an EMC Celerra Replicator Customer (and this will apply for people playing around with SRM using the Celerra Virtual Machine), read this
- If you're an EMC SRDF Customer, read this
- If you're an EMC Recoverpoint Customer, read this
The EMC docs are on Powerlink (for EMC, EMC Partners including VMware, and customers) and not the Internet proper, but I'm getting that fixed - will update when they get posted on EMC.com
Take a look at the VMware SRM Evaluator Guide here (Mornay, Lee, you are stars - not in spite of Chapter 11, but because of it), Chapter 6: Failback from the Recovery Site to the Protected Site. It outlines the 20 steps you need to do to failback.
Heck, EMC worked to have our Storage Replication Adapters (SRA - think "Array Vendor plugin for SRM") automatically do step 7 (storage personality swap).
Ok - so if steps 1-20 (minus step 7 if you're an EMC customer) are manually configured, how does SRM help?
The answer is that failback is literally failover in reverse. If you understand why testing, IP address changes, and workflow rather than scripts are important for failover, you understand why SRM helps with failback.
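One way to picture "failover in reverse" is the rough Python sketch below. It is not SRM's data model. The Plan class, the site names, and the subnets are all made up for illustration, and it assumes both sites use subnets of the same size so a host keeps its offset when it moves.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """Hypothetical recovery plan: which site fails to which, and how IPs map."""
    source_site: str
    target_site: str
    source_net: ipaddress.IPv4Network
    target_net: ipaddress.IPv4Network

    def __post_init__(self):
        # Assumption: same-size subnets, so host offsets carry over unchanged.
        if self.source_net.prefixlen != self.target_net.prefixlen:
            raise ValueError("source and target subnets must be the same size")

    def remap(self, ip: str) -> str:
        """Move an address to the same host offset in the target subnet."""
        addr = ipaddress.ip_address(ip)
        if addr not in self.source_net:
            raise ValueError(f"{ip} is not in {self.source_net}")
        offset = int(addr) - int(self.source_net.network_address)
        return str(self.target_net.network_address + offset)

    def reverse(self) -> "Plan":
        """Failback is the same plan with source and target swapped."""
        return Plan(self.target_site, self.source_site,
                    self.target_net, self.source_net)


failover = Plan("primary", "bunker",
                ipaddress.ip_network("10.1.0.0/24"),
                ipaddress.ip_network("192.168.50.0/24"))
failback = failover.reverse()

recovered_ip = failover.remap("10.1.0.25")
restored_ip = failback.remap(recovered_ip)
print(recovered_ip, restored_ip)
```

The reverse plan is cheap to describe. What still has to be done by hand in SRM 1.0 is everything around it: re-protecting, re-pairing the arrays, and testing before you cut back.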
Trust that we're feverishly beavering away on SRM 1.0 updates (formal addition of MV/A, Recoverpoint 3.01, consistency group handling improvements), working with VMware to get NFS support (as I'm sure my NetApp peers are also doing - NFS on the new NX4 would lock-in EMC with the "lowest cost" SRM solutions along with the AX4 with iSCSI/FC today), and then eventually major feature additions (automated failback workflow, multi-site DR and other goodies).
Moral of that story: I can't think of a good reason not to evaluate SRM today. Anyone have one?
Oh and "my array doesn't support it" doesn't cut it. BTW - in the case above, I'm sure one of the sources of the feedback from the customer was IBM. They would be quick to pooh-pooh SRM - because the ESS8000 doesn't currently support SRM, only their DS4000 arrays that they OEM from LSI do.
Good post Chad.
I have done a similar analysis (although probably less detailed) months ago with the only difference that I outlined the limitations of the product rather than the advantages (that I recognize). You know... I prefer to "challenge" my partners rather than please them too much ... otherwise they would sit still .... ;-)
http://it20.info/blogs/main/archive/2007/12/29/86.aspx
Good stuff.... keep posting.
See you in Sin City @ VMworld.
Massimo.
P.S.: we will enable the DS8000... when it makes sense... ;-)
Posted by: Massimo Re Ferre' | August 30, 2008 at 03:46 PM
Thanks for the post Massimo - your comments echo other feedback I've heard from our largest customers that I didn't mention in the original post. Your post is good, but of course, all DR diagrams have looked like that for years (replace Shark with your pick of DMX/CX/Celerra and PPRC with your pick of SRDF/MirrorView/Replicator/Recoverpoint).
So, let's look at the downsides:
1) What about failover from the array perspective:
a) will it affect the entire config, or just the right things
b) what about where there are two teams and the storage team doesn't WANT to failover when the VMware team says "go?"
2) What about non VM applications?
on 1a) - No, SRM will only failover the LUNs it is handling, and where consistency technology is used, they failover in a consistency group/RDF group. So, no worries there.
on 1b) - this is actually a larger problem in my experience. SRM has no good mechanism to do this. What we've done for some of our larger customers for whom this is a problem (in smaller shops, since they are usually one team - this is actually a feature!) is use the test feature to build and test the DR plan, but then export it using the export feature. So, in other words, SRM doesn't automate failover for those customers, but actually automates the process of CREATING a runbook.
On 2) - SRM can (and again, we've done this for some of the customers through the beta and post-GA) act as the "coordinator" for multi-platform DR. For example, with one customer, they had a couple of Oracle 10g EE RAC databases on Solaris (Sparc, so not a VMware candidate) and some mainframe apps using SRDF. We showed how you could insert pauses, callouts, and script actions, but use SRM for overall coordination and storing the runbook. Clearly not something "out of the box", but multi-platform doesn't invalidate SRM, just means some extra help is needed.
For application sensitivity within VMs, you're right - an obvious thing to add, but you can do that now - every VM's start sequencing has a pre-poweron and post-poweron event where you can do whatever you want.
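As a rough illustration of the "coordinator" idea (a sketch only, not SRM's API; every name here, including Step, run_plan and the VM names, is hypothetical), a plan step can carry hooks that fire before and after a VM powers on, which is where the callouts to non-VMware platforms or application checks would go:

```python
from dataclasses import dataclass, field
from typing import Callable, List

Hook = Callable[[str], None]


@dataclass
class Step:
    """One VM in the plan, with optional callouts around its power-on."""
    vm: str
    pre_power_on: List[Hook] = field(default_factory=list)
    post_power_on: List[Hook] = field(default_factory=list)


def run_plan(steps: List[Step], power_on: Hook) -> None:
    """Run each step in order: pre hooks, power on, then post hooks."""
    for step in steps:
        for hook in step.pre_power_on:
            hook(step.vm)
        power_on(step.vm)
        for hook in step.post_power_on:
            hook(step.vm)


log: List[str] = []


def note(message: str) -> Hook:
    """Stand-in for a real callout; it only records what would have run."""
    def hook(vm: str) -> None:
        log.append(f"{message}: {vm}")
    return hook


steps = [
    Step("sql01",
         pre_power_on=[note("wait for Oracle RAC on Solaris")],
         post_power_on=[note("check SQL service")]),
    Step("sharepoint01", post_power_on=[note("update DNS record")]),
]
run_plan(steps, power_on=lambda vm: log.append(f"power on: {vm}"))
print("\n".join(log))
```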
Re: enabling the DS8000 - I leave it up to IBM, of course. I hope you take your time - we're going like gangbusters with SRM at the high end on DMX and SRDF - I think we're the only enterprise array with a shipping SRA (in the Beta since January, and shipping since May) :-)
Posted by: Chad Sakac | September 01, 2008 at 12:35 PM
Any chance you could provide a non-protected powerlink URL to the following document? It's frustrating when we can't/don't share the love:
If you're a EMC CLARiiON MirrorView Customer, read this
H5583-VMware_Site_Recovery_Manager_with_EMC_CLARiiON_CX3_and_MirrorViewS_Implementation_Guide.pdf
Posted by: Craig Waters | September 02, 2008 at 01:16 AM
I am not going to argue.... this is your blog and you get the last word! ;-)
Massimo.
Posted by: Massimo Re Ferre' | September 02, 2008 at 06:05 AM
Craig - I'm working to get all the docs reassigned to the "public" category, and will link ASAP.
In the meantime - if you are an EMC customer, or EMC Partner (including VMware), you can just login to http://powerlink.emc.com and search for H5583, and find it.
Posted by: Chad Sakac | September 07, 2008 at 10:39 AM
Thanks for the reply Chad. I have tried to obtain this document through various channels:
1. I am an EMC Customer with access to PowerLink. When I try to browse this URL I am informed that this information is restricted.
2. I contacted my EMC rep who stated this document was '...information that can be used to implement the SRM solution and is therefore customer billable...'
3. I contacted a colleague who works at EMC who after speaking with his boss echoed the above statement.
Hence my comment around information sharing and the lack of it from EMC. I look forward to your reply.
Many Thanks!
Posted by: Craig Waters | October 07, 2008 at 08:22 PM
Craig - consider me on the case.
You will have a document in your inbox tomorrow. I will get the documents reclassified.
It's a stupid mistake, and one I'll fix.
Posted by: Chad Sakac | October 07, 2008 at 09:49 PM
Craig - you have the email with the doc. Again, my apologies it was hard to find, and I'm working to make these (and all) documents open to the public. Period.
In the meantime if you or anyone else is looking for something - don't hesitate to ask.
Posted by: Chad Sakac | October 08, 2008 at 09:27 PM
Chad,
As always, thanks for your great insight.
I have been very interested in SRM since I saw the original product announcements. The only downside I see today is the cost of SRM. I am about to finalize a deal for a couple of CX4-120's. I was able to get the CX's approved but just have not been able to get the green light on the cost for SRM (the 1 license for each processor is killing me).
I will be giving it another shot towards the end of 2009 when I expect to have all of my quad socket dual core hosts replaced with dual socket quad (or more) core hosts. At that point in time I will be buying half the licenses and may be able to do it. It would have been nice to fit it in with the initial CX4 implementation next month but so goes life....
Posted by: RodG | December 17, 2008 at 04:19 PM