Sigh. This remains one of those frustrating things for me. Inevitably, someone goes binary and says “SRM!”, or “vMSC!”. Often (not always), it happens to coincide with:
- the only thing they sell.
- the only thing they know.
I don’t want to blow it out of proportion – but there were a couple things that popped up that made me want to poke at this again.
The answer is “it depends”. Each solution has key things to know. While I would agree that increasingly I’m finding customers wanting “active-active datacenter” models using stretched clusters (VPLEX and vMSC), simple DR using SRM remains the most common answer – for MOST (though it has downsides too), and some customers use BOTH. EACH requires the architect of the solution to REALLY think things through.
Read on for two internal emails I’ll censor only lightly, and the answer (at least my opinion).
Here was the first trigger:
“I was at a customer meeting last week with Cisco, EMC and a partner around DR / BC. As the meeting went on and it was time to talk about SRM as the solution for private cloud DR, an EMC SE XXXXXX told the customer that "SRM is legacy...and that 75%-85% of SRM customers' scripts fail during recovery". Obviously I jumped in to correct his statement. EMC was pushing Active-Active DC with VPLEX. ( http://kb.vmware.com/selfservi... ) .
I was a little surprised that EMC would make the SRM statement with VMware in the room. Then, I find out that the EMC account teams had this TC going to many of my other customers making the exact statement about SRM.” (I’ve closed this loop – and rest assured, it’s an exception, not the rule).
Here was the second trigger:
“Team,
Has anyone heard of or seen this issue? I don’t have a lot of details at this point but if anyone has been through setting up and troubleshooting issues with VPLEX, vSphere 5, PPVE, stretched HA cluster… and could engage, let me know. The CIO on this account has given a “get it working by Friday (tomorrow) or you’re fired” ultimatum.
- Some portions of the VMware environment are working correctly (VDI instance) but most aren’t (using VMware HA to transfer workloads)
- From what I understand some VMware HA instances won’t start at SITE-B
- EMC SA has been working with VMware support for several hours and at this point it seems to be a VMware HA issue (they are looking at logs, but they seem a little lost)
- It seems that we are caught in this middle ground where neither VMware nor EMC VPLEX support has good integrated solutions capability to solve this
- The EMC SR # is XXXXXXXXX, the VMware SR # is XXXXX
Two likely possibilities have been suggested:
1. PowerPath VE is not passing PDL messages to the ESX kernel. This is a known issue with PowerPath VE through 5.7 patch2, which was fixed in PowerPath 5.8.2.
2. The required changes to the ESX properties have not been made per the VMware HA&FT Whitepaper written by Olly Shorey.
If the PDL messages are being seen by ESX then VPLEX is doing its job and HA should restart the VM (as shown in the video: http://www.youtube.com/watch?v=xNl8yV131Pg&feature=youtu.be) as advertised.”
Turns out that the core issue was that Host Affinity rules were not set up correctly. Craig Chapman, a great EMCer and vSpecialist (@VirtualChappy), has a write-up on the case and the resolution here.
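To make the host affinity point a bit more concrete, here is a minimal sketch of the kind of sanity check that would have flagged this early. It is a toy model in plain Python, not the vSphere API: the `AffinityRule` class, the `find_affinity_gaps` function and every host, VM and rule name are hypothetical, and it assumes each VM has one "home" site (the site whose hosts it should normally run on).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AffinityRule:
    """Toy stand-in for a VM-to-host 'should run on' rule (hypothetical)."""
    name: str
    vms: frozenset
    hosts: frozenset


def find_affinity_gaps(vm_site, host_site, rules):
    """Return {vm: reason} for VMs that are not cleanly tied to their home site."""
    gaps = {}
    for vm, site in vm_site.items():
        covering = [r for r in rules if vm in r.vms]
        if not covering:
            gaps[vm] = "no affinity rule"
            continue
        # Hosts named by the VM's rules that sit at a different site.
        wrong = sorted({h for r in covering for h in r.hosts
                        if host_site.get(h) != site})
        if wrong:
            gaps[vm] = "rule includes hosts outside " + site + ": " + ", ".join(wrong)
    return gaps


# Made-up inventory for a two-site stretched cluster.
host_site = {"esx-a1": "SITE-A", "esx-a2": "SITE-A", "esx-b1": "SITE-B"}
vm_site = {"vdi-01": "SITE-A", "app-01": "SITE-B", "db-01": "SITE-B"}
rules = [
    AffinityRule("site-a-vms", frozenset({"vdi-01"}),
                 frozenset({"esx-a1", "esx-a2"})),
    # Misconfigured on purpose: a SITE-A host has crept into the SITE-B rule.
    AffinityRule("site-b-vms", frozenset({"app-01"}),
                 frozenset({"esx-a1", "esx-b1"})),
]

gaps = find_affinity_gaps(vm_site, host_site, rules)
for vm, reason in sorted(gaps.items()):
    print(vm, "->", reason)
```

In a real environment the inventory and rules would come from vCenter rather than hand-written dictionaries; the point is only that "every VM has a rule, and every rule points at the right site" is something you can check mechanically instead of discovering it during a failover.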
My advice?
If you’re debating this (stretched cluster, SRM or both), check out the session Vaughn Stewart and I did at VMworld last year here. Almost all of the guidance we gave in there is still perfectly correct, and captures the “things to know” either way. Look to your technology partner to help you solutioneer it with you. As always – be skeptical when people make things sound TOO binary or TOO easy :-)
I agree with this. Most customers are interested in Active-Active solutions. I always tread carefully here, as there are surrounding considerations you have to be aware of, such as layer 2 networking between sites and the mindset that both sites are active-active and the whole concept behind that. This is demonstrated by the host affinity groups you mention above, which are best practice but not always implemented, and that can cause issues and confusion. With VPLEX becoming more cost-attractive and the technology progressing at the rate it is, it is hard not to mention active-active. My customers who have this technology in swear by it and will not consider anything less for any of their data centers. If an SRM vs VPLEX document is out there I would like a look; if not, I need to get creating one......:-)
Posted by: D Swift | February 08, 2013 at 05:00 AM
Disclaimer - Sys Integrator (EMC, NetApp, HP, IBM).
When I search the VMware HCL for supported "FC Metro Cluster Storage" solutions, I see plenty of Fujitsu, NetApp/N Series and some HP and HDS, but no EMC.
When I just use "VPLEX" as a keyword - there it is.
Just sayin'.
Posted by: Hamish | March 10, 2013 at 07:42 PM
Hamish
Disclaimer - I am a product manager on the VPLEX team.
I wanted to respond to your comment above. VPLEX gets categorized by VMware under the FC-SVD category ('Storage Virtualization Device'). If you search under the 'FC-SVD Metro Cluster Storage' category, all the devices qualified under this category show up including VPLEX.
Posted by: Ashish | March 17, 2013 at 09:14 AM
I think the correct answer here is "both". I have been looking for over a year now for a 3-site solution which would let us do both - have a split-cluster at two metro sites share a common replication relationship with a distant third site for SRM. No one can do this yet but I know a lot of effort is being directed at this problem from several different vendors. Ultimately it's about providing the highest levels of availability and performance to the business.
Can't wait to see what EMC comes to market with....
Posted by: Cdodson | March 19, 2013 at 08:05 PM
@Cdodson - thanks for the feedback, and certainly if you can do both, that is a sweet solution :-) You CAN do this now. A stretched cluster using VPLEX can add a third site (async or sync) using RecoverPoint and SRM. This is supported, and there is lots of positive feedback on this solution - check it out!
Posted by: Chad Sakac | March 25, 2013 at 05:48 PM