Today, the EMC XtremIO system became GA. There will be a lot of back-and-forth in the industry on this topic. There will be us singing high praises (with customers of course), and then a lot of mud slinging from many fronts :-)
There are three common questions I’m seeing come up – in social media, from competitors, from customers:
- Why buy vs build? More specifically – why not build it, why not just run with all-flash VNX/VMAX (perhaps with some focus on differentiating on the media in some way)
- Isn’t this going to cannibalize some VNX/VMAX business?
- Why so long? More specifically – the acquisition was back in May 2012, and we originally targeted a Q2 2013 general availability – how should people read into it, and read into others who have been in market longer?
Everyone will come out swinging here (and I’m sure EMC will as well). Everyone will trot out their “battlecards” (and trust me, everyone has them).
I’m (as always) going to try to be as transparent as possible. This transparency and detailed technical discussion is “sunlight”.
I think that sunlight and explaining engineering rationale is always good.
I think I have a somewhat unique/fortunate perspective as the leader of the systems engineering team at EMC - the largest player in this space. That gives some degree of “insider access” through early stages of thinking (including M&A) through the phases of bringing technologies and solutions to market at scale… And frankly, how we approach that at scale, is, well… pretty cool.
So, without further ado, here’s the story of how we got here, what are the unique architectural things that I think are really cool about XtremIO and what I think is going to happen next…
So – 5 years ago, the whole company embarked on a “flash is going to radically disrupt the business of persisting information”, and set upon a broad “apollo mission”. We rapidly came to the conclusion that the disruption of flash was going to be broad and pervasive – affecting existing architectures & enabling new ones + affecting existing workloads & enabling new ones.
On the architectural front, a few things (even 5 years ago!) jumped out:
- Assuming that everyone would have access to the same core NAND technology from a narrow set of fabs (there are really only a few biggies: Micron, Samsung, Hynix, Intel, Toshiba) – flash tech (SLC, eMLC, cMLC, and TLC) and packaging (SSDs, PCIe) would be pretty common across the industry – not the place to try to innovate, as this would intrinsically commoditize quickly. One large early player is feeling this harshly now – and I suspect you know who I mean. This is a really, really hard problem. It DOESN’T MEAN you can’t innovate around the hardware layer, but that it’s very tricky – and you tend to get outrun over short timeframes.
- Hybrids would continue to serve, for the foreseeable future (and remember, in IT land, anything past a 3 year horizon is pretty unknowable), the bulk of the IT workloads. This was intrinsic economics (of magnetic and flash media) and the fact that a huge swath of the market has blended workloads and puts them in a common place.
- Hybrids would be pretty heavily impacted – mostly because their IO paths and code were not designed for ultra low latency (messes with caching), mature RAID would cause write amplification (bad with flash), and perhaps most fundamentally, the architectures were not designed for hundreds of thousands to millions of IOps being commonplace.
- That the potential biggest disruption (even more than All Flash Arrays or “AFA”) is in the world of server flash – not so much on the hardware side (see point #1), but more on the opportunity for new “mashups” (sometimes called “hyper-converged”) architectures where internal storage in the server is shared/distributed. This architectural model is fundamentally enabled by server-based flash (for low latency transactional workloads as a cache/buffer/tier) and 10GbE. Think of VMware VSAN, EMC ScaleIO, Nutanix, Simplivity and others as early examples of more and more to come.
- All-Flash Arrays (“AFA”) would disrupt the Hybrids, in two particular places: 1) where the dataset coupled with inline dedupe (and importantly RELIABLE, CONSTANT inline dedupe) drives a different economic curve for flash media; and 2) workloads where the dataset may not be dedupable, but is relatively small – and focused more on max IOps, per IO latency, IOps density, and $IOps. Those two spots “define” the AFA sweet spot for now – and will expand as flash media cost of all types continues to drop. BTW – people shouldn’t over-rotate on this topic. Disruption != displacement. It’s more like overlapping Venn diagrams, and also there is an aspect of “time”. Think of all of the AFA players – and you can see these workloads are their target. We did note that this was going to be a brutal game. What’s interesting to me is that when people talk about AFA startups, they talk about revenues, but those startups are rarely profitable. Their burn rates are often enormous, and the flame-out is huge. One large early player is feeling this harshly now – and I suspect you know who I mean.
These 5 observations lead to a macro conclusion: a real flash strategy is more than a product. It’s Hybrids, Server Flash, distributed server storage stacks, and all flash arrays – and the goal is to be the best in all of these areas.
One thing I just love about EMC is that we do believe that we need to continuously innovate, do it organically and inorganically, disrupt ourselves – and try like crazy to stay focused on the customers. If we do – while we do disrupt ourselves – we continuously gain marketshare as customers vote with their dollars. This, BTW, is the answer to “will AFA cannibalize some of the Hybrid market” – the answer is “sure!”, but to be clear – when people are hyperbolic and say “Hybrids are dead!” – they are wrong (and I suspect that when you look, they are an AFA pure play :-) We don’t worry too much about cannibalization. As you’ll see from some of the below, people will keep selecting Hybrids (in our view) for some workloads. In some cases, they’ll even use all-flash configurations of those hybrid architectures (for specific use cases, or for specific data services or host types). But – for the right workloads (which will expand over time), AFA designs that are built from the ground up for flash will indeed cannibalize other architectures to some degree.
Now, while we embrace self-cannibalization, AND believe that it’s a net positive (for the customer and for us), we’re not perfect, and we make mistakes – but damn the torpedoes, we will fight, and fight to win.
Ok, let’s look at these 5 strategic observations about the disruptive effect of flash and their strategic implications, one by one:
- This first observation led to: don’t follow FusionIO (and we didn’t), instead focus on the software around server flash (we did), and partner with PCIe flash vendors. It also said to not assume that drive/PCIe will be a differentiator for anyone (and increasingly it isn’t), but to partner as closely as we can with the flash manufacturers and leverage every volume advantage we can (and we did). I don’t think we’ve nailed this yet – but as we continue to do things that add software value to server flash (like ScaleIO), I’m pretty confident that this is the right way to go.
- This second observation led to: put the pedal to the metal on the hybrids – and continue to invest in innovation there (and we did – VNX MCx and Rockies, VMAX Enginuity 5876 – and things to come), as they will remain (again, for the foreseeable future) the vast majority of the market – while other things will have more buzz. Think of it as steak and sizzle. A good meal needs both :-)
- This third observation led to: start to re-architect Hybrids like crazy for tiering as a fundamental feature (which we did), look at ways to augment the Hybrids using Flash as a cache (which we did), and plan for the rare cases across the spectrum of “any hybrid”, from “dense GB” all-SAS configs all the way to “dense IOps” all-flash use cases (narrowly focused use cases where either you want no data services, or conversely specific data services like consistency groups/SRDF at scale). It led to the huge engineering work in VNX land to start to view configurations with 10-20% flash and hundreds of thousands of IOps as “common” (which we did with MCx). You can fully expect that similar work is going on in VMAX-land.
- This fourth observation led to: VMware going to town on VSAN, and EMC ScaleIO focusing on “larger scale” use – many nodes, and blended hypervisors/physical. If you want to get a sense of why I say this is potentially MORE disruptive than AFA – all you need to do is look at this. 200 ScaleIO nodes running in AWS – driving huge bandwidth and low latency. Interestingly, this is a much, much more compelling performance envelope than just using EBS itself :-)
- … and then the fifth (AFA) observation – which is the critical piece for today, and which led to these conclusions:
- EMC won’t be competitive in the cases where AFA really plays by simply doing 2 + 3 (in other words, just using Hybrid array intellectual property for all flash is a losing strategy WHEN THE WORKLOAD demands AFA characteristics and data services like inline dedupe). It will be interesting to see whether (and how) other folks came to the same or different conclusions. You can look across the industry and SOME are going hard down the “our previous technology loaded with Flash is an ‘all flash array’”. Others are taking a more similar “AFA needs a ground up re-architecture”. Look at HP and HDS, and then on the other side look at NetApp. In the NetApp example, it’s interesting to look at NetApp currently positioning E-Series today as an “All Flash Array” (I would argue it’s more like an “all flash VNX”), but seemingly (?) working on Flashray as their real AFA strategy. It’s not to say that all-flash variations of hybrids aren’t valid, but IMO they aren’t sufficiently “architected differently”.
- Inline dedupe is critical, and not just the checkbox but HOW it works – because it needs to be a “basic artifact” that is always on, because it’s SO critical to the economics of AFA. Expect a lot of competitive “I HAVE/WILL HAVE INLINE DEDUPE!”… and I would encourage customers to look deeply at how people architect their variation of this – it’s central to AFAs.
- Scale-out models are very important – and here I’m talking about real scale-out (inherently and automatically balanced), not “managing/automating multiple units” (like we can with Unisphere and in a much more sophisticated way with pooling/abstraction/automation via the ViPR controller), or “federated” models (think of NetApp cluster mode, which still has files inherently behind a single “brain” – but can move data and the virtual brains to continuously work to rebalance). Expect a lot of competitive “I WILL EVENTUALLY DO SCALE OUT!”. Fundamental scale out is not a feature, it is an architectural choice. It’s harder to build up front, but impossible to do well after the fact. And, when I say that scale-out is important in this use case, I mean not that it is a “little important”, but a lot. Why? Because the AFAs will be “islands” that will start small, but tend to grow. I’ll reiterate the “why” in my view… Look back at the quote in the observation on what workloads are the target: “...focused more on max IOps, per IO latency, IOps density, and $IOps.” If you have these workloads, and the array doesn’t have linearity under normal load, dynamic load, lifecycle load, and frankly growing load… it doesn’t deliver what customers are targeting AFAs for in the first place.
EMC makes a huge amount of investments – and looked at almost all of the all flash startups – and XtremIO stood out head and shoulders above the rest. This is an XtremIO X-brick:
Here are the four fundamental, unique architectural reasons WHY XtremIO is so different, why EMC acquired them, why we think it was worth the time to harden and get right, and why we think it’s the best AFA on the market:
- First: Content based data placement:
- Every IO on ingest gets a multi-stage hash value. People will poop on the possibility of hash collisions. IMO, this is competitive noise. Yes, all hash calculations (inherently, as they are a more dense representation of a set of data) involve some insanely remote probability of hash collision – but these are astronomical. If you’re worried about this, you should really be worried about more likely issues. What are astronomically remote scenarios? Well… think of the probability of a comet impact on your city, or the startup company you bought your product from going out of business, or being acquired. I’m not saying that you should worry about startups going out of business or being acquired – but if you worry about statistically astronomically remote scenarios, or someone is encouraging you to worry about that – those sort of business scenarios are WAY, WAY more likely :-)
- This means that inherently, all data is balanced across a cluster (because the hash function determines the balance on a single X-brick and across X-bricks). This (to me) is one of the architectural definitions of “true” scale-out designs (it’s certainly true of ScaleIO, Isilon, and well-configured VMAX… and more true of VMAX going forward). The video below shows this clearly – note how the distribution of “lights blinking” with a random mixed load is very evenly distributed:
- Second: Dual-Stage (and DISTRIBUTED) Metadata Engine
- This is a very important architectural point. The metadata (the hash value/location) is a critical element in these architectures. If it’s not in memory, inevitably there is performance non-linearity. If you destage to SSDs (or god forbid magnetic media), there is an apparent “downshift” in performance. BTW – this isn’t just in AFA land. In VNX, for example, the metadata/memory relationship is the thing that ultimately determines the performance envelope when using pools that use FAST, that use Thin, that use Snaps, that use post-process dedupe. If the metadata (whose size is proportional to the size of the pool) destages from memory to SSDs to magnetic media in a pool – the performance of the system takes a dump. I don’t feel bad about saying this – because VNX in Rockies does this pretty well relative to the platforms it competes with. Interestingly, all the AFA players I can think of except XtremIO have this design principle (very “VNX like” in implementation). There’s a reason each XtremIO node has 256GB of RAM. There’s also a reason that we’re supporting 10TB X-bricks and 20TB X-bricks at launch – it’s linked to this metadata topic. For XtremIO (for the sake of performance linearity) the metadata has got to be in memory.
- NOW THIS IS IMPORTANT (as there will be a lot of “how does this work” from competitors) – of course this information needs to be reliably persisted, because if you don’t nail it, you have a bad day. The way one does this matters. In the XtremIO case, metadata is stored in DRAM, and the journal is mirrored across the two brains in each X-brick over a redundant IB interconnect. The journal is also stored on local SSDs in each brain in an X-brick in the case of an IB interconnect failure. The scenario of “total loss of power” or “isolation of an X-brick” is covered through batteries to destage. As you can see – this is a very robust model with no SPOF. Why design it this way? Our view was that the linearity of low latency performance (why people look at AFA) meant that keeping metadata in DRAM at all times was the right architectural approach. BTW – it’s not an “unproven” model – think of IB interconnected NVRAM on NetApp as analogous (but clearly that wouldn’t work here, because you could never get enough NVRAM), or the mirrored CMI write cache of VNXes that use batteries to destage. The key is that the system memory needs to be bigger, with the metadata journal mirroring and destage to SSDs as additional protection mechanisms.
- Third: A unique data protection model – XDP (Xtreme Data Protection)
- Classic RAID creates an unnecessary wear load on flash: the extra read (and more importantly write) operations for parity and mirroring accelerate the wear process on the flash media. This is one of the reasons that most of the media in things like VNX and VMAX arrays started with SLC, and added eMLC to withstand those write cycles over the life of the drive.
- XDP is the protection model used – and has only an 8% overhead (important because customers pay a lot for the media in AFA), and a 1.22x IO load on reads and writes.
- The upside also includes that there are no “hotspares” and that there’s a very fast rebuild (with no performance impact)
- There’s a huge difference here that manifests itself in “linear behavior always” that is worth exploring a little more. Most (?) “traditional storage models” (common in both Hybrids and most other AFA designs) have some sort of data layout/log journal (something that looks like the “low level parts of a filesystem”) that optimizes around finding “stripes” to dump down a bunch of data. This means that inevitably, as fewer and fewer pools of space are available, there’s some background “cleaning” process, sometimes called “garbage collection”. This process, when it pops up, breaks the theme of “linear low latency, always”, and is a challenge for every other all-flash array in the market. The video below shows this clearly: the XtremIO performance (latency and IOps) is the SAME when the array is empty and when it’s full:
- Fourth: Shared (and DISTRIBUTED) in-memory metadata
This is a funky, and intrinsic architectural thing. For this architecture to work, each of the nodes must share and distribute their metadata. The diagram below highlights how these work. Each XtremIO storage processor shares its metadata (and the two access unique user data), but uses an IB interconnect as the inter-node RDMA fabric. Without this sort of model (I’m not saying the specific implementation), each node’s metadata (and indirection model) becomes a scaling bottleneck, and you don’t have a “symmetrical scaling model”.
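To make the content-based placement and inline dedupe ideas above a bit more concrete, here is a minimal Python sketch. It is NOT the XtremIO implementation: the SHA-1 fingerprint, the simple modulo placement, and every name in it (`ContentStore`, `owner_node`, `collision_probability`) are assumptions made purely for illustration. It shows a block's fingerprint deciding which node owns it, identical blocks being stored once, and a birthday-bound estimate of how remote a hash collision is for a given hash width.

```python
import hashlib


def fingerprint(block: bytes) -> bytes:
    """Content fingerprint of a block (SHA-1 is an illustrative choice only)."""
    return hashlib.sha1(block).digest()


def owner_node(fp: bytes, node_count: int) -> int:
    """Map a fingerprint to a node; uniform hash output spreads blocks evenly."""
    return int.from_bytes(fp[:8], "big") % node_count


class ContentStore:
    """Toy content-addressed store: one physical copy per unique fingerprint."""

    def __init__(self, node_count: int) -> None:
        self.node_count = node_count
        self.lba_to_fp = {}  # logical block address -> fingerprint
        # One dict per node: fingerprint -> [data, reference count]
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, fp: bytes) -> dict:
        return self.nodes[owner_node(fp, self.node_count)]

    def _release(self, fp: bytes) -> None:
        """Drop one reference; free the physical block when none remain."""
        node = self._node_for(fp)
        node[fp][1] -= 1
        if node[fp][1] == 0:
            del node[fp]

    def write(self, lba: int, block: bytes) -> bool:
        """Store a block; return True only if new physical data was written."""
        fp = fingerprint(block)
        old = self.lba_to_fp.get(lba)
        if old == fp:
            return False  # same content rewritten to the same address
        if old is not None:
            self._release(old)
        self.lba_to_fp[lba] = fp
        node = self._node_for(fp)
        if fp in node:
            node[fp][1] += 1  # duplicate content: metadata update only
            return False
        node[fp] = [block, 1]
        return True

    def read(self, lba: int) -> bytes:
        fp = self.lba_to_fp[lba]
        return self._node_for(fp)[fp][0]

    def physical_blocks_per_node(self) -> list:
        return [len(node) for node in self.nodes]


def collision_probability(unique_blocks: int, hash_bits: int) -> float:
    """Birthday-bound approximation: p ~ n * (n - 1) / 2^(bits + 1)."""
    return unique_blocks * (unique_blocks - 1) / 2.0 ** (hash_bits + 1)
```

Writing the same 4 KB block to two different addresses in a `ContentStore(4)` leaves a single physical copy, and writing a few thousand distinct blocks spreads them close to evenly across the four nodes without any explicit balancing step. For scale on the collision point: 20 TB of unique 4 KB blocks is about 4.9 billion fingerprints, and with an assumed 160-bit hash `collision_probability(4_882_812_500, 160)` comes out below 1e-28.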
Now, I’ve noticed that some are scratching their heads about “why 4 nodes vs. 8 or more”. This, as you can see from all of the above, is what you would call a “tightly coupled” scale-out model. The more tightly coupled a scale-out mechanic, the lower the small IO latency (think of the shared memory model of VMAX as another example of this). The inverse is also true: the more tightly coupled a scale-out architecture, the harder it is to scale the number of nodes. Think of Isilon and ScaleIO as being in the “middle” of the architectural continuum (“loosely coupled”), and things like the ViPR object/Atmos models (“eventually consistent”) as being at the far end.
With tightly coupled architectures – testing, qualifying, and coding for “more nodes” is non-trivial. It’s for this reason that at GA, up to 4 nodes are supported, but that will increase rapidly in 2014 to 8.
It’s also the reason that out of the gate, there is one important thing that customers need to know. On the GA code (v2.2) of XtremIO, the code for redistribution of data is not on (targeted also for early 2014) – so customers should be thinking about their particular needs in terms of total system IOps and capacity, and look at one, two, three, or four X-Bricks.
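Since the configuration has to be picked up front, the sizing exercise boils down to simple arithmetic. The sketch below is a hypothetical helper, not an EMC sizing tool: `xbricks_needed` and its parameters are made-up names, and the per-X-Brick capacity and IOps figures are inputs you would have to supply from real specifications and your own testing. The only number taken from the text is the four X-Brick ceiling at GA.

```python
import math

MAX_XBRICKS_AT_GA = 4  # from the text: one to four X-Bricks at GA


def xbricks_needed(required_tb: float,
                   required_iops: float,
                   tb_per_brick: float,
                   iops_per_brick: float,
                   max_bricks: int = MAX_XBRICKS_AT_GA) -> int:
    """Return the smallest X-Brick count that covers both capacity and IOps.

    tb_per_brick and iops_per_brick are caller-supplied assumptions.
    Raises ValueError if the requirement exceeds max_bricks.
    """
    if tb_per_brick <= 0 or iops_per_brick <= 0:
        raise ValueError("per-brick capacity and IOps must be positive")
    by_capacity = math.ceil(required_tb / tb_per_brick)
    by_iops = math.ceil(required_iops / iops_per_brick)
    # Whichever dimension is the tighter constraint decides the count.
    needed = max(1, by_capacity, by_iops)
    if needed > max_bricks:
        raise ValueError(
            f"needs {needed} X-Bricks, more than the {max_bricks} supported"
        )
    return needed
```

For example, with a placeholder 150,000 IOps per X-Brick (an invented figure, not a published one) and 10 TB per X-Brick, a 25 TB / 200,000 IOps requirement is capacity-bound at three X-Bricks. Because data is not redistributed on the GA code, the requirement fed in here should be the projected need, not just today's.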
These four architectural characteristics are ones that I believe NO OTHER ALL-FLASH ARRAY on the market does. If you look back at the criteria that (in our minds) drives the AFA use cases, XtremIO was far and away the best technology we saw out there – and that’s the answer to the first question of “why buy vs. build”, and why XtremIO was the right choice for EMC, and for our customers.
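As a footnote to the XDP point above, the capacity overhead of any parity scheme is just parity units divided by total units in a stripe, which makes the 8% figure easy to put in context. The sketch below is generic arithmetic, not a description of how XDP works internally. The 23+2 stripe geometry in it is an assumption, picked because 2/25 reproduces the 8% quoted earlier; the text does not state the geometry. The other layouts are common textbook examples. The 1.22x IO figure is not derived here.

```python
from fractions import Fraction

# name -> (data units, parity/mirror units) per stripe.
# The 23+2 entry is an ASSUMED geometry that happens to give 8%.
LAYOUTS = {
    "mirror (1+1)": (1, 1),
    "RAID 5 (4+1)": (4, 1),
    "RAID 6 (6+2)": (6, 2),
    "wide double parity (23+2)": (23, 2),
}


def parity_overhead(data_units: int, parity_units: int) -> Fraction:
    """Fraction of raw capacity consumed by protection in a k+m stripe."""
    if data_units <= 0 or parity_units < 0:
        raise ValueError("need data_units > 0 and parity_units >= 0")
    return Fraction(parity_units, data_units + parity_units)


def usable_capacity(raw: float, data_units: int, parity_units: int) -> float:
    """Capacity left for user data after protection overhead."""
    return raw * float(1 - parity_overhead(data_units, parity_units))


def overhead_table() -> dict:
    """Protection overhead for each layout, as a percentage of raw capacity."""
    return {
        name: float(parity_overhead(k, m) * 100)
        for name, (k, m) in LAYOUTS.items()
    }
```

Under those assumptions, a wide 23+2 stripe gives up 8% of raw capacity to protection, against 20% for a 4+1 RAID 5 group and 50% for mirroring, which is why the overhead number matters so much when every GB is flash.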
Beyond that – it gets, day one, a bunch of great EMC family goodness:
- VPLEX and Recoverpoint support for stretched active-active use cases and perhaps the most powerful remote replication capabilities on the market.
- Integration with the EMC vCenter plugins day one for strong VMware-integration (including the most facemelting XCOPY implementation on the market due to the inline dedupe/hashing model), and ultimately via ViPR, rich integration with orchestration frameworks.
- Vblock Specialized Systems focused on VDI use cases – and to be sure, VDI is the most sweet of the sweet spots. This Vblock has the awesomeness of UCS (IMO, the best server platform for VDI), XtremIO for boot images, and Isilon for user data. This is what one of these specialized Vblock systems looks like:
So – if it’s great – why did it take time to bring it to GA?
Is it that it’s not ready? Nope. Remember – 4 XtremIO X-bricks supported almost the entire VMware HoL load this year: https://www.xtremio.com/vmworld-2013-cool-facts-about-xtremio-powering-the-hands-on-labs/
The reason why it took time was simple – we needed to get it right.
- We ran a directed availability period where customers were using it, trying it, putting it through its paces. Many customers surprised us by insisting (seriously) that they wanted to buy it pre-general availability.
- Throughout the directed availability period, we discovered things that needed fixing in the HA functionality – i.e. “pull wires out” scenarios. We learnt a lot in the first half of the directed availability, and fixed a lot of code. I fully expect (and see already) that there will be a lot of FUD from competitors on this note. We do encourage customers to put us through the same paces (both planned and unplanned) that our late stage directed availability customers did.
- NDU code – the NDU code is in the GA product. Like anything – you can expect us to be conservative, as we pushed really, really hard on the HA code and scenarios, and less on a relative basis on NDU scenarios. We’ll set expectations for a disruptive upgrade for the next upgrade, but it’s not because it’s not NDU, but because we’re exhibiting the same conservative approach we’ve taken throughout the whole directed availability process.
- Throughout the directed availability period, we discovered things in the DA hardware that needed to be fixed before shipping the GA hardware. The first wave of hardware was SuperMicro based, and had lots of issues. LOTS. The GA hardware is EMC manufactured.
- We used the time to build up the services and support organization to get ready for mass market demand.
DA customers loved it, BTW (even ones that hit the HA issues – I think there was one exception that I know about). We got a lot of quotes like these:
“There are few things in history that have a significant impact on advancing technology – I see XtremIO as one of those technologies” – Craig Englund, Principal Virtualization Architect & Sean Collier, Sr. Administrator, Boston Scientific.
Here’s one more thing to think about. When you’re a startup, your ability to get customers is “naturally gated” by your “startup-ness”, and your ramp in the marketplace. If you look at some of the startups that have the largest “volume” to date – getting to 1000 customers is a big achievement (one to be proud of), and takes more than a year (sometimes two or three) of being in the marketplace. You learn a ton at the 10 customer scale, 50 customer scale, 100 customer scale, etc. Each step up is a period where you can harden and mature your product (both the hardware and the software). Note that as some of the players are learning, even at that point, sometimes profitability can be elusive.
Now ask yourself this: “how long will it take for EMC to get 1000 XtremIO customers?”. The answer is: not long. As a huge company, not only do we need to think about our customers first, and our brand – but also the basic premise that there is no “let’s go small” gear in the gearbox. Once it’s GA – rapidly there are a ton of customers with the software/hardware, and there is NO TURNING BACK. That’s why we’ve pounded on this so hard.
I was in Israel a couple weeks ago with the engineering and product team, and have been working with many of the DA customers. The product is ready, and the customers love it.
So… XtremIO is here, it’s global! Xpect more performance. Xpect more scale. Xpect more efficiency. Xpect more endurance. Xpect the unxpected. If you have workloads that are an all-flash array fit - push on EMC XtremIO and our partners – you will be amazed. Comments ALWAYS welcome!