Well, it’s here. What, you ask? The secret I was speaking about in code language at VMworld 2009 in TA3105 (Long Distance vMotion), and whose code name Steve Herrod blurted out in the VMware/EMC keynote :-)
I’m not kidding when I say “VM Teleportation”. vMotion moves the VM from one host to another while the VMDKs stay still. Storage VMotion has the VMDKs being copied from one place to another while the VM stays on a single host.
VM Teleportation means (to me) a vMotion where the VMDKs being acted on are immediately available in the other side – moving everything, and moving it instantly over geographic distance.
At EMC World today we vMotioned 100 VMs between clusters in 4 minutes and 500 in 20 minutes. That’s 2.4 seconds per VM. Across the equivalent of 100KM. Want to see what that looks like?
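A quick sanity check on that arithmetic (my own back-of-the-envelope, not from the demo team):

```python
# Back-of-the-envelope check of the vMotion-over-distance numbers above.
runs = [(100, 4 * 60), (500, 20 * 60)]  # (VM count, elapsed seconds)
for vms, seconds in runs:
    # Both runs work out to the same per-VM rate: 2.4 seconds.
    print(f"{vms} VMs in {seconds // 60} minutes -> {seconds / vms:.1f} seconds per VM")
```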
And that speed of mobility is before the vMotion improvements right around the corner in vSphere 4.ahem…
Pause for a second and think about that.
We didn’t “failover” a cluster or a storage object. We didn’t move a datastore or a group of datastores. As the VMs disappeared from one ESX host with storage at one side and appeared on the ESX host on the other side – it started immediately accessing storage on the other side. Access Anywhere.
So, in early April, EMC quietly released VPLEX, a product which has been in development for some time, and in beta with many customers since late 2009. Today, the announcement is public. Customers are using it with PB in use, so we’re not talking about science fiction but science fact.
It’s something new. Comparisons are a bit tricky – though I’m sure we’re going to get people flaming all over the place. I don’t know of anything that does this.
Sure there are things (from several places) that present a storage device over to hosts via a SAN/NAS and then can failover which device supports that storage object. That’s not VMotion over distance if you ask me. That’s “stretching an active/passive storage cluster”. Those approaches can of course be used in this use case – and I invite the comparisons people will inevitably make.
VPLEX enables a storage device to be simultaneously presented at both the local and remote sides in a read/writeable state in an active/active model – any IO can be served by any VPLEX node, by any port – at any time. We’re calling this “Access Anywhere”. The most important idea is that the core tech enables N-node scaling.
This means that it’s possible to have a LUN/VMFS (and on top of a LUN you could have another filesystem, of course) in a high-performance transactional use case exist simultaneously locally and in multiple places at once.
- It is the technology that I demonstrated moving an individual SQL Server VM under load across 200km in 10 seconds non-disruptively back in Sept 2009 at the bottom of this post here.
- It is a technology AND a product. That means that while the first product to leverage this technology (geo-dispersed write cache coherency) is VPLEX (which can also be used with non-EMC arrays), customers should expect to see it also in existing EMC products over time.
- And this is where it gets mind-blowing… while the initial release is focused on synchronous distance use cases, we have direct line of sight to async-class and even “many site” use cases.
So what IS it? How does it work? What doesn’t it do? How does it fit with VMware? Did we solve the speed of light problem? What about the classic teleporter problem of “splinching”? What are our friends at Cisco working on along with VMware and EMC around this idea?
This is a whole new category of storage technology. We think VPLEX is a “category creator” (something we’ve done time and time again, and will continue to do).
Since it is new, if you want to have your mind blown – see details and more demonstrations - hang on to your hat and read on for more :-)
Why is this important? Well, one of the key tenets of the “journey to the private cloud” is not only being able to consume things differently (via all sorts of self-provisioning models amongst many things), but also being able to break the barriers of the physical datacenter – being able to do things across geographic boundaries.
VPLEX has a ton of use cases, but one of the biggest ones is “vMotion over distance”.
It’s kinda fun at this point to point back to TA3105 in Sept 2009 – since at the time, the VMware, Cisco and EMC teams that co-presented were working already using the technology everyone knows about now.
Look at the slide I presented that showed what “Option Two” (code language for the “VPLEX” tested config) looked like:
I don’t know how many times I need to tell people – come to the EMC Keynote at VMworld (along with the VMware/Cisco/EMC keynote) – I’ll ALWAYS give you a 6-12 month window into the things we’re working on…
Look at the list of “what we hear customers asking for” from that deck:
EMC VPLEX delivers on ALL that.
Now that I can be explicit, let me expand with details on the list. In doing that, I’m going to do something I’m loathe to do in general, which is to beg direct comparisons. The list below might highlight why I think VPLEX is a technology “category creator”. Like anything new, direct comparison is hard. Apply this list to anything you think of as analogous. You’ll find things that match some of the architectural models, but I think you won’t find any that match all.
- Support active/active IO on both sides, with "Access Anywhere" meaning ANY IOs are served ANYWHERE, at ANY TIME, as opposed to stretching IO across the WAN/MAN until some sort of "cluster failover". Why does this matter? Well – in the model where storage is presented by one “brain” and its ports and then fails over to the other “brain” and its ports, the IOs cross the WAN until that failover occurs. Does the comparison tech do this? VPLEX does.
- Granular vMotion at the VM level with the information (can be a forward, or a locally served IO) immediately available at the remote side, not some sort of datastore or even worse, "group of datastores" level. Note my comments in the “additional services” section above. Does the comparison tech do this? VPLEX does.
- The solution can't reduce local HA or performance. This implies a bunch of things.
- It means you need to have local HA in addition to the geo dispersion. The smallest VPLEX (and the entry configuration and price is ridiculously low) is itself HA. If you don’t have local HA while also stretching across distance you’re trading off local HA for the ability to failover to a remote site.
- It means you need to scale out. If you don’t (for example if you are a 2 node cluster), the scaling model is inherently going to be limited, and intrinsically means you can’t drive past 50% utilization without being in an oversubscribed state. If you do that, it’s OK – but you need to plan for it (see this post on that topic). VPLEX starts in a single HA “Storage Engine” design, and grows up to 4 Storage Engines (8 if you have 4 on each side). That means scaling to a total of 8 members of the local cluster (16 if you have 8 on each side). Each engine is a multicore Xeon-based system.
- You would need to be able to have this capability without necessarily having all the same storage at each site – in other words heterogeneity.
- It means you need to have geo-dispersed cache coherency – with the ability to scale out to n nodes. There are a series of technologies out there (HP LeftHand is the one that is closest to VPLEX in some ways) that come close to the list, but without this core element, performance in these use cases is limited to smaller configurations, since read/write caching models are critical for every storage array. Also, without this at a very low level, you have to depend on higher-level stack coherency ideas. This is the core reason why this technology, the intellectual property and the patents are so cool. Heck – ideally, the solution wouldn’t reduce performance (or necessitate running at 50% utilization to deal well with failure scenarios), it would improve performance! Does the comparison tech do this? VPLEX does.
- Customers told us the technology is fine if metro-distance today, but needs a path to >100km distance tomorrow, and even multiple sites. While today's GA of VPLEX is Local and Metro, the fact that we're talking about timeframes for Geo and Global means we have line of sight to solving those. This is another important part of the intellectual property behind VPLEX.
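On the “can’t drive past 50% utilization” point in the list above: the math is just N-1 survivability, and it shows why scale-out helps. A toy illustration (my own, not EMC sizing guidance):

```python
def survivable_utilization(nodes: int) -> float:
    """Highest steady-state utilization (as a fraction of total capacity)
    at which the surviving nodes can still absorb one node's failure."""
    return (nodes - 1) / nodes

# A 2-node cluster caps at 50%; an 8-node cluster can run much hotter.
for n in (2, 4, 8, 16):
    print(f"{n:2d} nodes: safe up to {survivable_utilization(n):.0%} utilization")
```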
Why not just use Storage vMotion? If you’re a small use case, or you just want to move a single VM – you absolutely can. It will take a lot longer, but the upside is that you can do it independent of any storage-layer technology. How much longer? Well – in the testing we did way back in September 2009, it took 10 seconds vs. 14 minutes and 10 seconds. That means with that particular VM, you’re talking about 100 VMs in 16 minutes (VPLEX + vMotion) or 24 hours (Storage VMotion + vMotion).
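Those totals follow directly from the per-VM figures if you assume the moves run one after another (my arithmetic from the numbers above, not a new benchmark):

```python
# Serial-migration math from the Sept 2009 test figures quoted above.
vplex_per_vm = 10                # seconds per VM: VPLEX + vMotion
svmotion_per_vm = 14 * 60 + 10   # seconds per VM: Storage VMotion + vMotion (14m10s)
vms = 100

print(f"VPLEX + vMotion:           ~{vms * vplex_per_vm / 60:.1f} minutes")
print(f"Storage VMotion + vMotion: ~{vms * svmotion_per_vm / 3600:.1f} hours")
```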
Beyond that test, Brian Gallagher will present more in his keynote – including a whole whackload of performance data.
Ok, taking the questions I expect (and people under NDA have been asking – so these are the questions I tend to get):
Q: Is it a product or a technology?
A: BTW – that’s what I was trying to answer after TA3105 at VMworld 2009 in the picture below.
After the session, people mobbed me and asked “is this VMAX, RecoverPoint or something new?” – my answer was a squishy “both, and I can’t talk about it anymore”.
The answer is both. The technology can and will be applied in other EMC products over time (for “all EMC” customers). VPLEX is the first application of this core technology and is a shipping, generally available product, and can be used with ANY array.
Q: Chad, stop dancing… So what IS it?
A: VPLEX, at its most fundamental, is a scale-out, active/active, geographically dispersed, cache-coherent, in-band storage federation device. There, easy :-)
This is what the non-marketing (this is the actual kit that powered the keynote demo – needed a flashlight as they were prepping for the concert) picture looks like:
From the back:
From the front:
Look familiar? Of course – it’s the same core storage engine that powers all our stuff. The product teams are all software teams, and there is a kick a$$ hardware team that supports them all. This gets us boatloads of efficiencies, and also means we can match Intel’s tick/tock cadence.
Ok, putting aside the semi-facetious techno-marketing babble, it actually is ridiculously easy and also ridiculously cool. Basically, you attach any storage to it, and like in-band virtualization devices, it virtualizes and abstracts them. Yeah, this has use cases for non-disruptive migration, but that’s so “old school” storage virtualization :-) Frankly, everyone using VMware gets that locally for free using Storage VMotion. We needed to do something cool :-) The keys are:
- it is active-active – which means the logical volume is simultaneously read/writeable on all the nodes shown in the diagram above (two were drawn, but it can be many). This is something new.
- You can scale it out – add more nodes for more performance. This is EMC’s 4th scale-out storage platform that uses commodity hardware (but designed like a tank) – others are VMAX, Atmos, Data Domain, Avamar. Where it makes sense, and where we can without sacrificing other customer priorities, we’re big on “scale-out” and “ride Intel’s R&D curve”. Explicitly – VPLEX uses the same “Storage Engine” hardware we use in all EMC storage products. Out of the gate, VPLEX Local can start with a single Storage Engine (2 “brains”/cluster nodes) and scale up to 4 Storage Engines (8 “brains”/cluster nodes). The “Storage Engines” are the same Intel multi-core commodity hardware we use in all our platforms. Expounding on the “ride Intel’s R&D curve” comment earlier – Pat Gelsinger (EMC COO, and former key exec at Intel, one of the fathers of the 386, 486, Pentium, Xeon and Core families) is adamant about leveraging this common “Storage Engine” to enable us to ride Intel’s platform ticks and tocks as fast as possible (which is very good!). Of course, while we use commodity hardware, it’s engineered to a high, high standard (much higher than a normal server). In a VPLEX Metro use case, the number of storage engines can double (to a total of 16 nodes in the cluster between the two sites).
- if you forward the IO over the WAN (like an “active/passive” model does until you fail it over), the IOs have to cross the WAN, which has an impact. But VPLEX serves the IO locally on BOTH SIDES without waiting for the IO commit on the remote side (this is the magic of distributed write cache coherency).
Q: How does it work?
A: The key is the distributed cache coherency model. To understand this better, look at the diagram below.
When you think about this – it means that ANY I/O can be served up at ANY moment by any VPLEX node or port, regardless of site. This is the key to “Access Anywhere”. This is also why we’re able to deal with ANY LEVEL of use-case granularity. In the VMware use case, the natural level of granularity to move an “object” is a VM. In other use cases (think stretched Oracle RAC environments), it could be something different.
Note that VPLEX doesn’t solve the “what if two updates to the same block occur at the same moment?” problem (this is the core of the “did you solve the speed of light problem” question). The key is that MANY use cases ensure a single “writer” (a single host writing to a block at any given time), so we don’t have to solve the “speed of light” problem (working on it :-). In the VMware use case, for example, a single ESX host is the “writer” for a VM at any given moment, and a vMotion is an atomic operation – at one moment, one host is writing to a set of blocks, and at the next moment, another host is writing.
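The single-writer invariant is easy to sketch in code. This is purely my illustration of the concept (not VPLEX internals): each volume has exactly one writer at any moment, and the vMotion cutover transfers that role atomically.

```python
class SingleWriterVolume:
    """Toy model of the single-writer invariant: reads can be served
    locally on both sides; writes are only accepted from the one host
    that currently holds the writer role."""

    def __init__(self, initial_writer: str):
        self.writer = initial_writer

    def write(self, host: str, block: int, data: bytes) -> None:
        if host != self.writer:
            raise PermissionError(f"{host} is not the current writer")
        # ...commit into the distributed write cache...

    def cutover(self, new_writer: str) -> None:
        # The atomic moment at the end of a vMotion: one instant the
        # old host is writing, the next instant the new host is.
        self.writer = new_writer


vol = SingleWriterVolume("esx-nyc")
vol.write("esx-nyc", 1, b"a")      # fine: NYC owns the VM
vol.cutover("esx-paris")           # vMotion completes
vol.write("esx-paris", 1, b"b")    # fine now; a write from "esx-nyc" would be rejected
```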
Does this mean that this competes with things like Global Namespace/Scale out NAS? No. It’s adjacent and complementary. These mechanisms generally use file semantics and distributed lock management. You can easily imagine how they could be accelerated by VPLEX for “Access Anywhere” to the underlying blocks supporting a distributed filesystem use case.
Q: Wait a second – doesn’t Atmos federate storage across geographic distance?
A: Absolutely! EMC Atmos does federate across geographic distance, and in fact does it incredibly well, and over pretty well any reasonable latency (if latency is too high, you get loads of TCP/IP retransmits). The fact that Atmos does this, and customers are demanding this sort of native Web 2.0 object storage that scales to internet scale was recently validated by our respected competitor NetApp’s acquisition of Bycast, which was an EMC Atmos competitor.
The key to understanding what’s new here is that things that deal with “geographic dispersion” of information for non-transactional use cases (such as EMC Atmos) are VERY different. The idea of low-latency and I/O scaling isn’t the same. The challenges in the “Web 2.0 App storage” use case are around SOAP/REST presentation, metadata handling and uber-scale. They are NOT “can you respond in single-digit ms to that IO request?”. That kind of I/O response demands some sort of write caching, which gets us back to why that magic of distributed write cache coherency is so important.
BTW - That’s why behind VMware-based IaaS clouds we need transactional storage models (SAN/NAS) and don’t use Atmos “under” VMware, though we do “alongside”.
By “alongside”, I meant that using VMs running on IaaS (on a transactional storage model) to power next-generation applications that in turn leverage Atmos or Amazon S3 is very popular.
Q: Ok, based on what I’ve understood, I “get” synchronous class distances, but I don’t get how this could eventually work over async class distances. How would that work?
A: Ok, first thing – to level set, as of April 2010, VPLEX Local (in a datacenter) and VPLEX Metro (up to 100km dispersed) are generally available. VPLEX Geo (async distances) and VPLEX Global (multiple sites) are future capabilities that are on the product roadmap. EMC’s trying to get into the pattern of being more open with where we’re going, then hitting our deliverables consistently (like we did with 10GbE Ultraflex IO module support, have done with LUN-level FAST – and wait just a little bit for sub-LUN FAST and more :-)
So – this is what we put out there at EMC world.
So – how would async work? Well, suffice it to say that not only is there a geographically dispersed write cache (with an ingenious destage capability without using spinning disk, but that’s another story for another day) but we have the ability to periodically sync the cache map in a coherent way along with the underlying storage. Again, the key is to understand the single-writer use case. I’m going to repeat myself here in case you didn’t read the earlier Q&A that talked about this.
Look at the vMotion use case as an example of a “single writer” (and there are many others including some distributed filesystems, Oracle RAC and others) – there is an atomic event at the end of the vMotion where the writer to a set of blocks transitions. There is NEVER a moment where host A writes to block 1 and host B writes to block 1 at the same time (a key to it working in the first place).
If we work with VMware to leverage several things we have at our disposal – in future vSphere releases - to insert a momentary “pause until VPLEX says it’s good to go”, the VMFS volume would exist in two places at once, slightly different in steady state, but always EXACTLY THE SAME at the critical moment of cutover for any given virtual machine.
It’s important to understand that this latency impact of the “pause” would be apparent at that moment, not during steady-state. So – if you could vMotion a VM from NYC to Paris in a few seconds, and at the tail end of the vMotion, the VM was stunned for a few extra hundreds of milliseconds – how cool is that to move that VM live, rather than offline?
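That hypothetical “pause until VPLEX says it’s good to go” handshake could look something like the sketch below. To be clear: this is entirely my illustration – the objects and method names are made up, not a real VMware or VPLEX API.

```python
class FakeVM:
    """Stand-in for a VM at the tail end of a vMotion (illustrative only)."""
    def __init__(self):
        self.state = "running-local"

    def stun(self):
        self.state = "stunned"            # the momentary extra pause

    def resume_on_remote_host(self):
        self.state = "running-remote"     # writer role moves atomically


class FakeVPLEX:
    """Stand-in for the federation layer (illustrative only)."""
    def __init__(self):
        self.legs_identical = False

    def sync_cache_and_storage(self):
        self.legs_identical = True        # both sides now EXACTLY the same


def async_cutover(vm, vplex):
    # Stretch the stun window just long enough for VPLEX to converge,
    # so the VMFS volume is identical on both sides at the cutover moment.
    vm.stun()
    vplex.sync_cache_and_storage()
    vm.resume_on_remote_host()


vm, vplex = FakeVM(), FakeVPLEX()
async_cutover(vm, vplex)   # a few extra hundreds of ms, then running remotely
```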
Q: What else is possible?
A: There are a whole whackload of use cases beyond VMware – stretched Oracle RAC, stretched high-performance filesystems. Like any new thing, all sorts of new possibilities open up. I’m excited to see what other ideas/use cases surface over time. That all said, how cool is it to be able to teleport 1 VM or thousands of VMs across a hundred kilometers now, and in the future, 1000’s of kilometers?
Q: How hard is it to configure?
A: Actually, it’s cake. Here’s us setting up a stretched VMFS volume on the actual config in the picture above (which is what we did the live demo on).
Q: Is this what you mean when Paul Maritz, John Chambers/Padmasree Warrior and Joe Tucci draw “federation” between the “internal cloud” and the “external cloud” to make “hybrid/private clouds”?
A: The answer is: yes, in part – but that little line also means a lot more (network, storage, application scale-out and federation). Federation also doesn’t mean that the movement HAS to be live – but remember, the initial attempts by VMware competitors to claim that “you don’t need VMotion, cold migration or ‘kinda sorta warm migration’ are fine” didn’t exactly make people sing. Ultimately customers WANT VMs to move between their datacenters and into service providers live.
We think that the intellectual property in VPLEX (and remember, it is software after all – just manifested first as a platform) is a core ingredient to enable “vMotion anywhere”. Obviously there are other critical parts (the network questions), and VCE powered enterprises will be able to connect to VCE powered service providers (even if they put VPLEX in front of non-EMC Storage!) and if they both have VPLEX can move VMs around willy nilly.
Q: So what about the network part?
A: Well – vMotion requires that the source and target ESX host be on the same layer 2 segment today, and the VM’s IP stays constant throughout a vMotion – which means “stretched VLANs” and dark fiber for most folks.
Cisco has been working on a family of technologies called “Data Center Interconnect” (or DCI) – and there are some critical ingredients in that stack. Overlay Transport Virtualization (OTV), which is supported today on the Nexus 7000, allows you to have an IP address disappear from one place and appear in another, decoupled from the normal IP addressing scheme. Crazy cool. About as crazy cool as VPLEX :-) If you want to know more about DCI – you can connect to a joint webcast Cisco is holding later in May.
Q: Is it available in Vblocks? What about mission-critical applications?
A: Like all technology from each of the members of the VCE coalition, it always takes a quarter or two before something new appears in the Vblock validated architectures. This is a good thing. It represents the amount of time it takes to revalidate the entire stack. We will be using VPLEX as an option (but not the only way) to federate multiple Vblocks into one big pool.
In fact, this reference architecture document discusses not only VPLEX testing with a Vblock 1, but also:
- Sharepoint (several million documents, 111,600 users)
- SQL Server 2008 (40,000 clients with OLTP workload)
- SAP ERP 6.0 & BW 7.0 (test included long-distance BW extraction)
- Oracle 11g with E-Business Suite 12.1
….all being vMotioned over 100km and all running on vSphere 4 of course.
Q: I want to hear from some other people – you’re clearly biased.
A: I **am** clearly biased. Here are some customers’ and 3rd-party analysts’ thoughts:
- AOL Customer Video: http://www.emc.com/collateral/demos/microsites/mediaplayer-video/aol-emc-vplex.htm
- Melbourne IT Customer Video: http://www.emc.com/collateral/demos/microsites/mediaplayer-video/melbourne-it-emc-vplex.htm
- AOL Case Study: http://www.emc.com/collateral/customer-profiles/h7150-aol-virtual-storage-cloud-computing-vplex.pdf
- ESG Lab Report: http://www.emc.com/collateral/analyst-reports/esg-vplex-metro-vmware-esx.pdf
Q: What doesn’t it do? How does it fit with VMware?
A: VPLEX doesn’t deal with “multiple writers” (ala what distributed lock managers do in distributed NAS use cases). It also doesn’t support Async use cases today, ditto with more than 2 sites. There is currently no VMware Site Recovery Manager adapter (working on it).
The more important question is “how does it fit with VMware?”.
The answer is basic: It enables high performance, high availability geographically dispersed storage that’s active/active – read/writeable on both sides. This can be used for two fundamental use cases (separate from the question of whether it’s front-ending EMC storage or other storage vendors’ stuff).
- Use Case #1: vMotion between different vSphere ESX clusters being managed by a single vCenter.
- Use Case #2: stretching a vSphere ESX cluster across geographic distance.
Both are supported (the VMware KB article for VPLEX is here). Note that it doesn’t explicitly call out Use Case #2, but we’re just finishing crossing the T’s and dotting the I’s.
But, I REALLY want to highlight why my recommendation is Use Case #1.
Why? Well – sometimes blog posts are “oldies but goodies”. The one I wrote back here entitled “The Case For And Against Stretched ESX Clusters” has stood the test of time. Note that the points I was trying to hammer home HOLD TRUE FOR VPLEX. For example, one thing that applies to both Use Case #1 (vMotion between clusters) and Use Case #2 (stretched clusters) is the amount of bandwidth. VPLEX needs a lot – even though it is VERY efficient, and transfers only the changes, and can even be creative (since it can forward IO like a cache to the remote site). If you don’t have a lot, whether it’s dark fiber, or IP (used via FC/IP) – you needn’t apply. Just like I said back in that post, if the EMC rep says “yeah, it will work on your 128K frame relay service” – back away slowly.
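As a rough rule of thumb for the bandwidth point (my own sketch, not EMC sizing guidance): the link has to at least carry the aggregate write change rate of the stretched volumes, padded for protocol overhead.

```python
def min_wan_mbps(write_mb_per_s: float, overhead: float = 1.3) -> float:
    """Megabits/s a WAN link needs to carry a given write rate (in MB/s),
    padded by an assumed protocol-overhead factor (1.3 is my guess)."""
    return write_mb_per_s * 8 * overhead

# 50 MB/s of writes needs on the order of 520 Mbps -- a 128 Kbps frame
# relay service is off by more than three orders of magnitude.
print(f"{min_wan_mbps(50):.0f} Mbps")
```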
So, why not just stretch the cluster itself over distance?
Generally people go for Use Case #2 where someone (usually a foolhardy storage vendor unwise in the ways of VMware) says “just use VM HA for disaster recovery!”. Remember (and these are all true, regardless of whether it’s VPLEX or anything):
- DRS (until vSphere 4.ahem) has no way to create VM-level “affinity” to groups of ESX hosts – there are only the affinity and anti-affinity rules, with which it would be nearly impossible to restrict a VM to “one half of the cluster”. This means that sometimes the VMs will be on “side a” and sometimes on “side b”. Actually, this is a case where the “active/active” nature of VPLEX shines. Without active/active models, after a VM moves to the side without the storage (until it’s “failed over”), it’s being accessed over the WAN, incurring any WAN latency. With VPLEX, the IOs start being served by the local copy.
- VM HA (even in vSphere 4.ahem) depends on primary and secondary VM HA nodes. It’s not intuitive and is error-prone to manually determine/set VM HA primary node state – which means if you’re not very careful, you can get into a state where if you lose a site, VM HA does nothing.
- VM HA (even in vSphere 4.ahem) lacks any of the sophisticated restart sequencing and control that Site Recovery Manager has (yet).
On that last one, I’m befuddled by people who think that these geo-dispersed VM HA restart models are the bomb (“hey, it’s cheaper than buying Site Recovery Manager!”). When you say “don’t you need restart sequencing?” they say “NO!”, but when you ask “does RTO matter to you?” they say “YES!”. Then you ask, “you’ll need to control the start sequencing VM by VM, right?”, and they say “yes, but WE CAN SCRIPT IT!”. Then you ask “So, you’re going to maintain a script to cover your mission-critical start/restart including callout integration with other things outside VMware?”, and they say “YES!”.
I’ve never found a customer that answers YES to all the points in the dialog that then actually follows through and maintains and tests the DR process. Personally I think they are running around with scissors.
It’s almost always a storage vendor that pushes them into that position and feeds them the koolaid one jug-full at a time.
Look, can it work? Sure. You REALLY better know what you’re doing.
That’s why I was so emphatic in TA3105: “Think about Disaster Recovery/Restart and Disaster Avoidance as two different, but related ideas”.
The easiest way to understand is this:
“VM HA is good right? Damn straight :-) It’s a solution for ‘local server disaster restart’. Always involves an outage, but after the server is dead, helps recovery to be fast and automated. vMotion is good right? You betcha. It’s local server ‘disaster avoidance’ and ‘dynamic pooled load balancing’. If you know a server IS going to need to go down (but isn’t down), you can put it in maintenance mode, VMotion vacates the VMs, and there’s no disruption. You also have DRS use vMotion to turn your servers into one big pool, and it optimizes.”
Replace “VM HA” with “storage replication and SRM” and VMotion with “VPLEX and vMotion” and “servers” into “sites” and you get the idea:
“Storage Replication and SRM is good right? Damn straight :-) It’s a solution for ‘site disaster restart’. Always involves an outage, but after the site is dead, helps recovery to be fast and automated. VPLEX is good right? You betcha. It’s site ‘disaster avoidance’ and ‘dynamic pooled load balancing’. If you know a site IS going to need to go down, you can VMotion all the VMs, and there’s no disruption. You also use VPLEX and vMotion to turn your sites into one big pool, and it optimizes.”
Q: Obviously I can vMotion a VM right from the vSphere client… but I want that “vTeleport” plugin you showed at EMC World!!! Will there be a vCenter plugin for VPLEX?
A: Oh the goodness – there’s almost too much :-) First of all, a massive vCenter plugin update is incoming across ALL of EMC…. EMC World has only just begun :-) It’s worth noting that the “vTeleport” extension of the vCenter UI in the demo is something we whipped up for EMC World (upside to “on team” killer expertise – thank you Nick Weaver :-). The real vCenter vTeleport plugin is much, much cooler than that, but still under construction. Andrew Kutz (of Storage VMotion plugin on VI3.5 and SimDK fame) is at EMC working on something here that will make every customer smile – stay tuned on that one :-)
Phew. Long post. Thanks for sticking with me. I can’t wait to hear what you think, and what questions you have.
The vSpecialist squad and I have been fortunate to be exposed to this for a LONG time, and have been playing with it like mad. Stephen Spellicy in particular has been burning the midnight oil to pull off the live demo on stage (with a lot of help from many – thank you!).
VPLEX is GA. It’s very real. It works. It’s something new – will be interesting to see customer/market reaction. Based on the NDA briefings I’ve been doing for a while – most people have been ridiculously excited!
Pssst…. another mind-blower? This is something we’ve been working on quietly for years. Think about what we have in the R&D labs going on right now for 2011 and 2012 :-) I love my job :-)