Over the weekend, I saw this blog post about the disruptive XtremIO 2.4 –> 3.0 upgrade.
First of all, yes – it is accurate to call the XtremIO 2.4 -> 3.0 upgrade a disruptive operation. When customers on XtremIO 2.4 move to 3.0, there are big changes and big improvements. Think 2x better performance. Think 2-4x higher utilization due to compression.
We continue to support 2.4, so if customers want to sit tight and avoid the upgrade, they are entitled to do exactly that – they will continue to enjoy all the XtremIO awesome they are loving. To get all the new goodies above (at no extra cost!) they will need to pull the data off, upgrade, and then bring the data back – and we and our partners are always ready to help them do it.
All of the above is why so many customers are picking XtremIO, why Gartner put it here, and why it has become the fastest growing revenue storage product EVER.
But why is this particular upgrade disruptive? Why do disruptive storage events ever happen anymore?
Storage is persistent. This is patently obvious – but in general people don’t think through what this means (and why should they – they aren’t engineers!).
Anytime you touch either of the two core parts of any persistence architecture – 1) the layout structure layer or 2) the metadata mapping layer – it means a disruptive migration to some degree.
BTW – it’s funny looking at some of the people who commented critically on the blog… who, as vendors, are themselves going through a huge disruptive event of their own! All the more reason not to listen to people who go negative, and to trust those who disclose warts and partner with you to work through them.
Disruptive upgrades affect ALL persistence architectures, all vendors at times. If you’re curious about the engineering reasons why (helpful to predict whether any future upgrade of any stack will likely be NDU or DU), as well as more on this particular XtremIO upgrade (and some more roadmap) read on!
The “Layout Structure” layer of a persistence stack is at the bottom – structurally, how information gets persisted to some non-volatile media. Sometimes this is called the “on disk structures” – but I think this is a misnomer, because what if there is no disk? What if it’s all SSD, or something else (some strange future NVRAM thing)? Regardless of HOW (electronically or electromechanically) the information is persisted, there is some sort of “structure” to how the data is stored. This can be as objects with hash addresses (XtremIO as an example). It can be as a journalled filesystem (think of ZFS as an example) or one with a random or sequential layout pattern. Maybe it’s not a full filesystem, but only the block layout parts of one (think of the new VNX block layout, or the bottom parts of ONTAP). It can be a layout focused on SSDs as a write cache (think of Nimble CASL as an example). It can be focused on inline dedupe with backup use-case centricity (think of Data Domain SISL as an example). It can incorporate additional “soft” error-checking mechanisms to enhance hardware fault detection. It can incorporate RAID, geo-redundancy, or all sorts of erasure coding. This is the “foundation” of a persistence stack – aka any “storage thing”. Hard to change, heavy to lift due to “data gravity”. Mess this up, and your customers have a bad day.
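To make the “layout structure” idea concrete, here’s a minimal sketch (Python, purely illustrative – not how XtremIO or anyone else actually implements it) of a content-addressed layout in the spirit of the hash-address example above: the “address” of a block is a fingerprint of its contents, which is why dedupe falls naturally out of the layout itself.

```python
import hashlib

class ContentAddressedStore:
    """Toy content-addressed block layout: each block is stored under a
    fingerprint of its contents, so identical blocks collapse to one
    physical copy. Changing this scheme = changing the layout structure."""

    def __init__(self):
        self.blocks = {}     # fingerprint -> raw block bytes ("on media")
        self.refcounts = {}  # fingerprint -> number of logical references

    def write_block(self, data: bytes) -> str:
        fp = hashlib.sha1(data).hexdigest()
        if fp not in self.blocks:
            self.blocks[fp] = data  # first copy actually lands on media
        self.refcounts[fp] = self.refcounts.get(fp, 0) + 1
        return fp  # the block's "address" is its hash

    def read_block(self, fp: str) -> bytes:
        return self.blocks[fp]

store = ContentAddressedStore()
a = store.write_block(b"x" * 4096)
b = store.write_block(b"x" * 4096)  # duplicate block: no new physical copy
assert a == b and len(store.blocks) == 1
```

Change the fingerprint scheme or the block size, and every existing “address” changes – which is exactly why a layout change means moving the data.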
When you change the underlying “layout structure” – there’s no nice way around it. Getting the new goodies means pulling the data off, updating, then putting the data back on.
The “Metadata Mapping” layer of a persistence stack is sometimes called the “indirection” layer. This is the layer which is central to most of the data services (snapshot, thin, tiering, caching, dedupe, compression, replication, information intelligence) in modern persistence stacks. This is the layer that takes an incoming request and “transmogrifies” it into where the actual information lives. This is hard to change and heavy to lift because this is complex code. Mess this up, and your customers have a bad day.
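Again, a minimal illustrative sketch (the names and structure are mine, not any vendor’s): the indirection layer is at heart a logical-to-physical map, and most data services are manipulations of that map rather than of the data itself.

```python
class IndirectionLayer:
    """Toy metadata mapper: translates the logical address a host asks
    for into the physical location where the data actually lives."""

    def __init__(self):
        self.lmap = {}  # logical block address -> physical location

    def write(self, lba: int, physical_location: str):
        self.lmap[lba] = physical_location  # thin: entries exist only for written blocks

    def read(self, lba: int):
        # an unmapped address is a "thin hole" - return zeroes
        return self.lmap.get(lba, "zero-fill")

    def snapshot(self) -> dict:
        # a snapshot is (conceptually) a copy of the MAP, not the data
        return dict(self.lmap)

vol = IndirectionLayer()
vol.write(0, "ssd3:offset:81920")
snap = vol.snapshot()             # instant: no data was copied
vol.write(0, "ssd1:offset:4096")  # the volume diverges from the snapshot
assert snap[0] == "ssd3:offset:81920"
```

When the shape of those map entries changes materially between releases, every entry has to be rebuilt – which is the heart of the NDU-vs-DU question below.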
When you change the underlying “Metadata Mapping” layer – there’s no nice way around it. Getting the new goodies almost always involves pulling the data off, updating, then putting the data back on. (Sometimes, if the changes are light, you can do a “data in place” upgrade – only moderately disruptive – or an NDU via a failover process – both without data movement.) Once the changes in the metadata layer get material – NDU becomes DU.
Some storage “things” are purely metadata mappers (in-band storage virtualization) and defer the layout structure to other things – but this doesn’t change the truth of the above – they too need to occasionally do a metadata structure update.
Disruptive upgrades come in a continuum from: “small period of host/VM disconnection measured in minutes/seconds”; to “data in place upgrade with extended period host/VM disconnection”; to the “data will need to be removed from the storage target, upgraded, and then copied back”.
They are all dreaded by both storage vendors and customers – but they are unavoidable to some extent.
[UPDATE: 9/15/14, 6:10pm ET – one person in the twittersphere, Roland Dreier @rolandreier, commented that versioning can be used to handle backwards compatibility for both the layout structure and metadata. This is, of course, technically true – and in the persistence domain it’s generally a little more feasible for the layout structure. A couple of comments… a) I’ve never seen this maintained successfully – has anyone else? b) obviously if you do it for metadata, you’re cutting down system resources, and for the layout structure, the old behaviors and new behaviors would then exist as two “pools”. Not bad, but again, I’ve never seen this sustained. I’m open to all feedback, and curious for any examples where an engineering team has done this – good or bad]
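For what it’s worth, here’s what Roland’s suggestion looks like in a toy sketch (Python, all names hypothetical): version-tag every on-disk record and keep a reader for every historical format. It works – but note that both code paths, and both behaviors, live forever, which is the maintenance burden I’ve never seen sustained.

```python
CURRENT_VERSION = 2

def load_record(raw: dict) -> dict:
    """Read a record written in any historical on-disk format and
    normalize it to the current in-memory form. Old records can be
    rewritten in the new format lazily, on their next write."""
    version = raw.get("version", 1)
    if version == 1:
        # hypothetical v1 format: one flat payload field, no checksum
        return {"version": CURRENT_VERSION,
                "data": raw["payload"],
                "checksum": None}  # computed on the next write
    if version == 2:
        return raw
    raise ValueError(f"unknown on-disk format version: {version}")

old = load_record({"payload": "hello"})  # implicit v1 record
new = load_record({"version": 2, "data": "hi", "checksum": 42})
assert old["version"] == new["version"] == CURRENT_VERSION
```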
There is of course a huge new tool in the toolbox: VMware Storage vMotion. So long as your use case is vSphere, this can be a huge mitigating element – upgrading components, with Storage vMotions between steps. This is commonly the technique used in highly virtualized or hyper-converged infrastructure that uses vSphere (EVO:RAIL for example).
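As a concrete example of scripting that mitigation, here’s a minimal sketch using VMware’s pyVmomi Python SDK – the hostnames, credentials, and VM/datastore names are all hypothetical, and error handling is omitted:

```python
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

si = SmartConnect(host="vcenter.example.com", user="admin", pwd="secret")
content = si.RetrieveContent()

def find(vimtype, name):
    """Look up a managed object by name (simplified inventory search)."""
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vimtype], True)
    try:
        return next(obj for obj in view.view if obj.name == name)
    finally:
        view.DestroyView()

vm = find(vim.VirtualMachine, "app-vm-01")
target = find(vim.Datastore, "upgraded-array-ds")  # datastore on the upgraded array

# Storage vMotion: relocate the VM's disks while the VM keeps running
spec = vim.vm.RelocateSpec(datastore=target)
WaitForTask(vm.RelocateVM_Task(spec))

Disconnect(si)
```

Loop that over the VMs sitting on the old array, do the upgrade, then svmotion everything back – that’s the “swing” pattern in a nutshell.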
So – why do vendors ever change their layout structure and metadata mapping models if this is painful? Well, all the awesome new persistence-stack capabilities – the scale, performance, and feature envelopes – are “wrapped up” in the data layout/distribution and the metadata maps. In other words, if you don’t do it – your customers leave you.
Every architecture periodically does these big disruptive changes. Everybody. And mitigating via Storage vMotion (whilst awesome!) doesn’t change whether a persistence-layer upgrade is disruptive at the lower level. You can tell when this is going on: someone tells you it’s non-disruptive – but only for VMs on vSphere, and only if you have a small amount of “swing capacity” :-)
Here are some examples across the vendor ecosystem where, now knowing this “why”, you can look back and apply the knowledge:
- EMC has a history around upgrades that goes back a long way. Some good, some bad. Look at CLARiiON and VNX – on block, a long history of trying like crazy to do NDU (and mostly delivering), but periodically needing to do a data-in-place upgrade (new HW, moderate metadata mapping changes) and, less frequently, full “data off/on” migrations (fundamental changes to on-disk structures aka layout structure + big changes to the metadata mapper layer) like the Rockies update. We’ve been going through a multi-step big architectural shift for the last year, and still have some more to do. On the file side, it’s been very stable for a LOONG time (the base of UxFS32 has been there for a long time, so no on-disk layout structure changes, and relatively light metadata changes), but we’re coming up to a big “step function” leap (VNX file is great – but in need of a big upgrade) – and expect that to touch both layout and metadata (and therefore be disruptive).
- NetApp has a history here too, and a long history of trying to make upgrades as non-disruptive as possible. But – when layout and metadata models shift, well – look at the huge ONTAP 7-mode to c-mode migration. Ask anyone how non-disruptive that one is! I’m sorry for going a little “negative”, but I’m a little grumpy at the NetApp employee on the blog post that triggered this one slinging “NDU is bad, EMC sux!” when they are only fractionally through the knothole of a huge one themselves – and for their entire customer base. If I were them (or anyone), I would think twice before going negative. It could be you (and in his case, it IS).
- VMware has a history here – look at the VMFS-3 to VMFS-5 upgrade. There’s another one coming – where VSAN 1.0 (which has few data services) will go to VSAN 2.0, which was briefly talked about at VMworld, and whose beta will be starting right about now. This will involve a huge change to the on-disk layout structure and the metadata layer, but will enable a ton of new stuff. Of course, anything sitting on VMware can benefit from Storage vMotion (a little more on that in a sec).
- If the startup folks want to fling poop saying “we have never done a disruptive upgrade!” or “our architecture is impervious to that!” – they too have ALREADY been through multiple disruptive upgrades :-) They happen early in the development cycle, before you get to a stable release, and then you go to market for a couple years with a relatively stable codebase. Then, you start to get long in the tooth – and WHAMMMO. Yup you guessed it – layout and metadata structure changes.
Again – right now, there are only two ways to avoid this that I know of:
- Storage vMotion (a great tool!), but of course that has limitations too (obviously vSphere only – but less obviously, it takes time and resources to do it at scale).
- Object storage models are different in the sense that so long as you support the API, you can redirect the client on reconnection to a new copy of the object on a new storage stack – but you still have a “Storage vMotion-like” downside of a lot of time to move mountains of data. We are doing this with large customers using both Atmos and ViPR, for example – but this doesn’t work with transactional block and file models (a toy sketch of the redirect idea follows below).
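Here’s that object-model trick as a toy sketch – everything here is hypothetical naming, not the Atmos or ViPR API. Because the client speaks an API rather than a block/file protocol, the front end can simply answer a reconnecting client with a redirect to the object’s new home once it has been copied:

```python
# objects already copied to the new storage stack -> their new URLs
MIGRATED = {
    "bucket1/report.pdf": "https://new-stack.example.com/bucket1/report.pdf",
}
OLD_STACK = {"bucket1/report.pdf": b"%PDF...", "bucket1/notes.txt": b"hi"}

def handle_get(object_key: str):
    """Object API front end during a migration."""
    new_home = MIGRATED.get(object_key)
    if new_home:
        # HTTP 301: the client transparently re-fetches from the new stack
        return 301, {"Location": new_home}, b""
    # not yet migrated: serve from the old stack as usual
    return 200, {}, OLD_STACK[object_key]

status, headers, body = handle_get("bucket1/report.pdf")
assert status == 301 and "new-stack" in headers["Location"]
```

Transactional block and file protocols have no equivalent “reconnect and follow a redirect” semantic – hence the caveat above.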
Beyond those two, there is an “Apollo moon mission” project inside EMC engineering to deliver a path for block NDU upgrades all the time – but this isn’t easy. It involves copying all the data, then (if possible) all the replicas, and then moving the WWN/IQN. We started with “Federated Live Migration” in VMAX – but the idea is more broadly extensible. Doing this across vendors would require standards efforts (which we’re trying to start).
We’re also looking to see if we can apply the Virtual Data Mover migration technique not only within VNX/VMAX, but also VNX/VMAX <-> Isilon.
Neither is an easy project – but they’re just the sort of thing smart engineers like to tackle!
So – back to XtremIO specifically…
- The 2.4 –> 3.0 upgrade touches both the layout structure and the metadata indirection layer. It’s why we’re able to add an inline compression capability and big performance improvements on the same hardware.
- …That means it’s disruptive to the host. This is not a required upgrade, and we continue to support 2.4. If customers want the new goodies, the disruptiveness can be mitigated using Storage vMotion. EMC and EMC partners are offering to help customers through the migration.
- … The XtremIO roadmap is very rich (and amazingly, is actually pulling IN timewise – adding things like compression – versus slipping out; a testament to a great engineering team). Without committing to dates, we’ve publicly hinted a lot, and I’ll hint some more. None of these are things that we expect will require a disruptive upgrade. Like all roadmaps it can change, but the team and I believe in transparency:
- On the roadmap: soon we will be able to add nodes to an XtremIO cluster and the data will all re-distribute. We don’t anticipate this will require a disruptive upgrade.
- We’ve also discussed embedding the RecoverPoint splitter to deliver very robust remote replication without requiring VPLEX in front of the XtremIO cluster. We don’t anticipate this will require a disruptive upgrade.
- Much further out, we know we’ll need a new hardware platform for three reasons. We don’t anticipate this will require a disruptive upgrade.
- The first is obvious: there will be awesome new hardware available that is bigger/faster/stronger – the current XtremIO hardware is awesome, but there’s an inevitable march of progress.
- The second is less obvious: one of XtremIO’s architectural advantages is that all metadata is always in DRAM, and shared across the cluster – this is what delivers its always-linear behavior plus always-on/always-inline data services. The downside is that there is a relationship between the DRAM in an X-Brick and the maximum effective capacity (see the back-of-envelope sketch after this list). As larger and larger NAND media becomes available, there will need to be X-Bricks with more RAM (we overbuilt the current generation to give us headroom, but eventually there will need to be more).
- The third is also less obvious: there are things we can do to shrink the physical footprint and further harden the architecture (again, XtremIO has some of the highest customer satisfaction stats in our portfolio – so this is about going from “great” to “even better”) – withstand more failures, with fewer system components and less complexity (making installs even easier).
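Here’s the back-of-envelope version of that second point. Every number below is an assumption I picked for illustration – not an actual XtremIO figure:

```python
# Why keeping ALL metadata in DRAM ties effective capacity to RAM.
block_size = 8 * 1024             # assumed bytes per deduped block
metadata_per_block = 64           # assumed bytes of mapping metadata per block
dram_for_metadata = 256 * 2**30   # assume 256 GiB of DRAM usable for metadata

addressable_blocks = dram_for_metadata // metadata_per_block
max_effective_capacity = addressable_blocks * block_size

print(f"{max_effective_capacity / 2**40:.0f} TiB effective capacity")  # -> 32 TiB
# Double the NAND behind the X-Brick and, with the same per-block
# overhead, you need roughly double the DRAM.
```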
So – all those goodies (in 3.0 and in 3.0+ over the foreseeable future) are expected to require only this one disruptive (DU) step. We could get that wrong, but that’s how it’s looking right now. We aren’t forcing customers who are happy with 2.4 to upgrade.
I hope this helps you understand – not just this particular circumstance, but the general case – and gives some context. Are you an XtremIO customer? How’s it going? What are your thoughts on this topic?
"All of the above is why so many customers are picking XtremIO, why Gartner put it here"
You seriously want people to believe that both customers and Gartner knew about this and simply suspended their disbelief in order to assist EMC with its flash revenue push... You couldn't make this up. Maybe you should take a leaf out of Cisco's Whiptail book and suspend shipping such an immature product into production environments.
Posted by: Paul_P | September 15, 2014 at 04:31 PM
Chad, I just wanted to clarify something. Going from Data ONTAP 7-mode to cluster-mode (or cDOT) is disruptive, and does require an evacuation/migration, because it is a different flavor of the OS and essentially a different product with a different feature set.
Going from Data ONTAP cluster-mode 8.0 to 8.1 to 8.2 (and to 8.3...) is non-disruptive and does not require a migration or evacuation.
Just want to make sure we're comparing apples to apples here. Disclaimer: I am a NetApp employee.
Posted by: Christopher Waltham | September 15, 2014 at 05:05 PM
I don't think I can edit my comments, oops! I wanted to add this: going from Data ONTAP 7-mode version 7.1 to 7.2 to 7.3 to 8.0 to 8.1 to 8.2 has never required a migration/evacuation. Just your typical HA failover and giveback.
Posted by: Christopher Waltham | September 15, 2014 at 05:06 PM
I guess that if EMC waited 2-3 years with this release and supported it only on new hardware, nobody would complain - nobody expects NDU across hardware platforms. Deliver these features earlier and without requiring new hardware, and suddenly the blogosphere is fuming.
Or maybe it's just the sound of the competition seeing their market share melting?
Posted by: David666M | September 15, 2014 at 05:17 PM
@Paul_P - thanks for the feedback. I've been transparent with every customer, and everyone I know has been as well (for as long as we knew) that 2.4->3.0 would be disruptive. I tend to not believe in conspiracy theories... to me, they fail the Occam's Razor test. There is absolutely no way we would do a stop sell on XtremIO – frankly, customers are telling us they are ecstatic, and voting with their dollars. I of course have no idea why Cisco has done the stop ship, but I'm sure it's for some reason with the customer in mind. Would you be willing to disclose your employer?
@Christopher Waltham - Christopher, thank you, and I want to be clear: I have a lot of respect for NetApp and their tech (the comments on the blog were... less polished). Your point is clear – the 7.x - 8.x upgrade is NDU and uses an HA cluster failover. I **THINK** I was clear that it was the 7-mode to c-mode transition (which involved changes to layout structures and metadata) vs. staying in the 7-mode behavior. I hear nothing but positive from NetApp customers in 7-mode. The point was that the 7-mode to c-mode move is a disruptive event (not good/bad – but rather reinforcing that major system behavior/feature change comes with this transition baggage – for everyone).
@David666M - I think (personally) you might be on to something. The fuming I'm hearing tends to come from competitors and partners that resell competitors, and far, far less from customers (though a small number are unhappy as you would expect - and we're rolling out everything we can to mitigate difficulty for them).
Posted by: Chad Sakac | September 15, 2014 at 06:29 PM
Thanks for the response Chad. You've always been upfront and open with the community, and I respect that. Very few vendors can claim that.
Even the vSphere web client flies on the unit. No complaints there.
Posted by: Andrew Dauncey | September 15, 2014 at 07:03 PM
Disclosure: I work for Pure Storage and my opinions are my own
Chad,
I think that if customers ask directly, most EMC employees would not have hidden the fact that the upgrade is destructive. Unfortunately, your sales teams and your website don't seem to lead with full transparency – they only respond when pressed. For example, your website still makes the blanket statement that upgrades are fully non-disruptive. It gives no indication that this doesn't hold in certain scenarios. If you were a customer and read the below, would you get the impression that major upgrades are probably data-destructive?
"XtremIO eliminates the need for planned downtime by providing non-disruptive software and firmware upgrades to ensure 7×24 continuous operations."
The problem with transparency spreads beyond this one issue. EMC is quick to list off all the benefits of scale-out and position XtremIO as a scale-out leader, but nowhere do they caveat that scaling out is also data-destructive.
I realize there is a roadmap to fix all things, but your website and marketing materials look like they were written for the year 2016.
XtremIO Capacity, Performance and Software scaling are all DATA DESTRUCTIVE today. And, no matter how much customers love 2.x, those items will remain destructive until they upgrade to some version in 3.x. Staying on 2.x is not an option for anyone.
Pure Storage and XtremIO were both founded in 2009 and Pure Storage has been providing non-disruptive code upgrades since GA (2011), even with significant data structure changes as new features were added. Purity's Non-Disruptive Everything architecture IS possible, if you start with a solid, flexible foundation.
Architecture does matter.
Posted by: Mike Richardson | September 15, 2014 at 08:19 PM
@Mike - thank you for your comment, and thank you for your transparency!
I have a lot of respect for Pure and what they have accomplished (and continue to accomplish). We clearly compete furiously in the marketplace, and ultimately the market and customers decide.
Personally (and publicly) I've been working overtime to always be transparent, and push for it everywhere. I will tackle some of the web content.
I've always been transparent that scale-out on XtremIO today is absolutely real, but DYNAMIC scale-out is not. As I call out in this post, the next release that adds rebalancing across a cluster (aka dynamic scale-out) is indeed coming (but not here yet). I also call out publicly (there's always risk when one does this) that we expect that update to not require metadata mapping or layout structure changes (and we expect it to be NDU).
I've called this out publicly - here http://virtualgeek.typepad.com/virtual_geek/2014/07/some-really-cool-xtremio-facts.html and here http://virtualgeek.typepad.com/virtual_geek/2013/11/xtremio-taking-the-time-to-do-it-right.html. Note in the second reference (the first chronologically) that dynamic scale-out slipped out from the original thinking, while other things (compression) pulled in.
Again, thanks for the feedback. I will work on things on my side.
I would humbly suggest that there's perhaps a bit of "sample bias" - both for you, and for me.
- Customers are telling me directly that they are selecting XtremIO based on their own conclusions about linear performance, and consistent low latency under very broad circumstances.
- I can see the mix of "single X-brick" and "multiple X-brick" purchases. Customers are buying scale-out (static) NOW, because many of them have workloads (including replicas) that make non-scale-out architectures tip over (otherwise they would need to use some sort of host-based thing - or stick something in the middle which often they don't want to)
I'm sure that there are architectural things that Pure thinks are superior, and I'm sure there are architectural things that EMC thinks are superior – ultimately the market and the customers choose.
I've always found that customers like transparency, and HATE when vendors go negative on each other. I try to stop that wherever I can.
Again, thanks for the transparent dialog!
Posted by: Chad Sakac | September 16, 2014 at 10:30 AM
Mike at NetApp here. Nice blog Chad, I enjoy reading it. You put yourself out there, hence nothing but respect from me.
I think that one point is being missed - moving from 7-mode to Clustered ONTAP is not merely an 'upgrade' in any simple sense of the word. It is a complete migration to an entirely new software platform, one with 10-plus years of development behind it and tons of new capabilities.
We worked hard to give it much of the feel of 7-mode: many commands and object names remain the same or very similar. However, this was done to maintain ease of transition for administrators who invested a ton of effort over many years to become expert. We worked hard to put WAFL in it because of the value it allows us to provide.
My take: transition from 7-mode to cDOT is much more akin to comparing transition from VNX to VMAX.
Cheers mate, keep putting it out there.
Posted by: Mike Shea | September 16, 2014 at 10:33 AM
@Mike - good to hear from you, and hope all is well!
I think that's a VERY true point.
While I don't claim to be a NetApp expert, it seems that 7-mode to c-mode is: a) underlying layout structure changes that are huge (enabling what I think you guys call "infinite volumes"); b) metadata changes (enabling many new things – like the whole virtual server thing, I would wager); c) and deep, deep architectural changes that enable IOs to be shuttled across the pairs in the cluster.
I think over the years NetApp has done a great, admirable job of back-porting value across software releases, and that 7-mode to c-mode is an "Apollo Moon mission" for NetApp. My frustration (and I should just let it flow over me) was people suggesting that others never have DU upgrades. It happens. We all should (and do) try to minimize them at all costs.
Thanks for the contribution to the dialog!
Posted by: Chad Sakac | September 16, 2014 at 12:16 PM
Full disclosure: I work for neither a storage company nor a VMware partner. Our shop is built entirely on EMC gear--millions of dollars worth.
First, thanks for your post, Chad. I think you make a lot of good points and are open about benefits and drawbacks of various scenarios as well as being informative on historical trends.
However, I think I need to articulate where much of the rancor stems from customers and partners, and I say this as a dedicated EMC customer myself.
Disruptive upgrades are one thing, but destructive ones are quite another. I agree with what you've said regarding the necessity for both of these given massive architectural changes that can bring with them great benefits and add value to an existing purchase. But let's frame your mention of those with the fact that in considering the totality of storage systems and their history of firmware upgrades, a destructive upgrade remains a rare thing, much more so than merely a disruptive one. Now consider that:
1.) EMC's XtremIO is towards the top of the market.
2.) EMC has publicly stated (as has been discussed in the comments) that NDUs are the norm.
3.) Vis-à-vis #2, this is at least the second destructive update with XtremIO in the last year.
I think you can see why this more than mildly irks some of us who run hand-in-hand with EMC. For a company that large to re-introduce this process is, in the opinion of many, unacceptable and frankly astonishing given EMC's clout and experience in storage. We expect better than this from EMC. At best, these factors combined appear negligent; at worst, duplicitous. But the real factor that matters at the end of the day--all technicalities and marketing collateral aside--is how this is felt by customers. I can tell you because of these practices with the unit, we once were but are now no longer considering going with XtremIO.
Posted by: ChipZ | September 16, 2014 at 01:14 PM
@Chip Z - thank you for being a customer, and thank you for your feedback. I understand where you are coming from. We always aim (as much as possible) to make everything NDU, always - but, as noted, there are times where one (or both) of those elements of an architecture need a change/tweak.
I genuinely appreciate the feedback, and I hope that you trust me that there is no duplicitousness at the root. Earlier, it looked like we could add the required capabilities without requiring the change – and ultimately an engineering choice had to be made.
Ultimately, we look forward to winning back your trust!
Posted by: Chad Sakac | September 16, 2014 at 02:48 PM
@Chad. Thanks for your transparency within the blog.
I'm not an XtremIO customer, but a current EMC customer looking at XtremIO and others. The prospect of having to go through destructive upgrades – be it for firmware or for adding additional storage – bothers me with this platform. Our sales people recently mentioned the disruptive upgrade 3.0 will bring, but claim it won't happen again after 3.0. As has been pointed out already, some EMC marketing materials would indicate that upgrades are not disruptive. Since reality and marketing don't match up right now, it's difficult for me to assume that future upgrades will not be disruptive. As you point out for the 2.4 customers, my company could stay on 3.0 if a 3.0+ upgrade were disruptive; however, there are currently some features on the EMC roadmap – which competitors have and we want – that would drive us to upgrade quickly once the features are released.
Posted by: Wayne | September 16, 2014 at 03:26 PM
Hello Chad,
Thanks for your post. Could you tell me if this upgrade would decrease the dedupe ratio in a VDI case (full-clones scenario with XenDesktop 7.5)? With the new 8K block size, it should, no?
Thank you,
R
Posted by: Romain | September 16, 2014 at 03:47 PM
@Wayne thanks for being a customer, and thanks for considering our offer in this space. You won't be disappointed.
I understand the uncertainty that this upgrade may bring. The brutal truth is that no engineering team can ever fully know what the future may bring - but good engineering teams are pretty good, and I have a lot of confidence in the XtremIO team.
The middle section of the post, where I outline the near-term deliverables and note that we don't anticipate them needing a disruptive upgrade, wasn't a guess :-)
I personally walked through the roadmap over the next 2 years with the engineering and product team, and left feeling quite confident that this was the last DU required to set the stage for all those elements (dynamic scale-out, remote replication, and even the future Haswell and post-Haswell hardware refreshes).
Hope that helps, and let me know if you want to talk to the engineering and product team yourself (so it's not just a dialog with the field team)!
Posted by: Chad Sakac | September 16, 2014 at 03:52 PM
Hi Chad,
Andre here (Nutanix employee). I have to agree with many here that 'disruptive upgrades are one thing, but destructive ones are quite another'. If it needs to be done for the benefit of the customers and the platform, it needs to be done, and I understand it. This type of situation can happen to any vendor, and having been in engineering I know that some decisions are very tough.
However, one should not try to sell it as if disruptive upgrades, especially destructive ones, are something common nowadays, nor something that should be tolerated. By doing this you are saying that NDU is non-existent and has no value for any platform or business.
I hope you guys can cross the chasm.
Posted by: Andre Leibovici | September 16, 2014 at 05:14 PM
Another EMC customer here and I think Chip Z was spot on. It's tough to see this destructive upgrade as business as usual for EMC when considering history with its other product lines. To me it clearly seems like a trade-off to get the new functionality quickly to market on existing hardware without additional engineering effort to make it non-destructive or non-disruptive. Either that or some significant architectural challenges that one would think could have been uncovered after acquisition but prior to GA (that did take a while).
For customers that don't require compression or can't suffer the destructive upgrade there should be continued 2.x updates to provide other enhancements added to 3.0. I doubt EMC will take this on and would likely encourage the vMotion solution by making it financially feasible (no firsthand knowledge of this actually happening). Problem is that it's still a painful exception and your customers are going to remember this about XtremIO and EMC even without your competitors reminding them.
Kudos for trying to get out in front of this and the excellent technical explanation of underlying challenges but I think damage has been done.
UE
Posted by: UncleEliot | September 16, 2014 at 06:01 PM
Great article Chad! I would just add that those concerned with disruptive storage upgrades have not properly architected their infrastructure and should look at adding EMC VPLEX. VPLEX enables active/active data centers with zero downtime, allowing any kind of upgrade to the storage array without an impact to production workloads.
Posted by: RobSteele | September 16, 2014 at 06:47 PM
Hi Romain, Itzik Reich (XtremIO field CTO) here.
Version 3.0 will actually allow you to consolidate even MORE VDI VMs on an X-Brick. This is because we will now support even larger logical capacity, plus the benefits that compression brings to the table. Prepare to be amazed!
Posted by: itzik reich | September 16, 2014 at 10:25 PM
Hello Chad,
I had the pleasure of evaluating XtremIO along with several other all-flash arrays earlier this year. I personally am a fan of the XtremIO architecture and can confirm that XtremIO can deliver rock-solid latency across a large variety of workloads. But I have always felt as if the product was rushed out the door in order to compete with the likes of Pure Storage. I don't blame EMC leadership for doing so, but it doesn't change the fact that perhaps XIOS could have used some more development time before being released into production. To clarify, XIOS is rock solid from a stability standpoint. But knowing the product roadmap, couldn't some of the necessary structures have been put in place prior to going GA to enable these feature adds in a non-disruptive manner?
Posted by: Shaymus | September 16, 2014 at 10:44 PM
@Itzik - thanks for the comment/response for Romain. I've seen the data you're referring to (overall efficiency that INCREASES with the 3.0 changes); I suggest you share it publicly via your blog (and I will link to it).
@Shaymus - thank you for being a customer, and a fan. Short version - we thought we could get to what was needed in the roadmap (biggest one is dynamic scale-out, but also compression) without changing the on-disk structures. Versioning metadata is possible (but not problem-free), but on-disk structures are what trigger this sort of upgrade. We weighed it HEAVILY.
Customers are onboarding at an enormous rate (for all the reasons you like). We are bringing on more customers in a quarter than the startups do in a year – and in some cases, in their full life. XtremIO is amazingly popular, and this difficult decision will let it be even better.
It's for that reason that we decided to bite the bullet sooner rather than later. BTW - we built into our plan EMC services dollars to help customers (including swing hardware if needed), to make sure we were thinking of "customer first".
I (and I think I can say the larger "WE") agree - this needs to be IT when it comes to DU :-)
Posted by: Chad Sakac | September 17, 2014 at 02:23 PM
@Andre - thanks for the comment, and hope you are well!
I don't mean to imply that disruptive upgrades are commonplace. We ALL strive for NDU - always. My comment was more along the lines that there are two things that tend to be difficult (in many cases impossible) to change without some disruption: metadata (and yes, you can use versioning, but this introduces other considerations), and - the trickiest - layout/on-disk structures.
When your architecture enables you to move the workload non-disruptively (think EVO:RAIL, or other common-modular building-block appliance models - like Nutanix and SimpliVity), there is the option of automating a "swing" upgrade style - but if the workloads aren't VMs, you don't have that choice.
You are DEAD right that this happens to everyone - and I appreciate your humility (which I try to maintain). All day long I've been getting poop slung at me by vendors attacking (far less from customers - but some there too - see thread). Nutanix themselves aren't immune (and frankly I think you guys do a good job!):
http://www.reddit.com/r/sysadmin/comments/1rkxoj/anyone_using_nutanix_virtual_computing_platform
(not focusing on Nutanix - this happens to us all - so humility matters!)
Look - this stuff happens to all of us. Our duty is to the customer, and that duty means treating availability as mission-critical and DU as terrible - something that needs to be offset by some pretty big considerations (in this case: do we bring compression and better performance to all customers, or wait for a hardware refresh with more metadata space, associated with a change of on-disk structures - which would suck for customers who bought the current generation of HW).
I'll say one thing more: we planned a ton of free services (and swing hardware!) to help customers and partners that need it - at NO COST TO THEM. The customer gets a better, faster solution - for free. This is "customer first", from top to bottom.
Plus - at the point where we are growing at an INCREDIBLE clip (think 2x the other players in this space), the longer we held off - the more we would be hurting the customer. We are absolutely across the chasm from a tech adoption standpoint - but clearly need to cross this upgrade chasm...
Thank you - hope to see you at VMworld Barcelona?
Posted by: Chad Sakac | September 17, 2014 at 06:45 PM
I can speak from my experience as an EMC customer using XtremIO: my sales team was very direct about the reality of upgrades currently being destructive prior to my purchase. When it was time to upgrade, it was as easy as a Storage vMotion during a low-usage time and a quick XtremIO upgrade. It was a non-event in our environment.
That being said, my sales guys offered loaner gear to assist with this if needed - which it wasn't. It does suck that it isn't an online upgrade, which we've all gotten accustomed to with most EMC array upgrades - but for me the benefits outweigh this negative.
Posted by: Cincystorage | September 17, 2014 at 09:49 PM
Hey Chad,
I have always respected your views and posts.
Can I ask - given how well respected you are in the industry - that you use your significant clout inside of EMC to change your slides and datasheets so they no longer mislead customers and prospects? If it's a disruptive upgrade then don't pitch it as non-disruptive.
It is all well and good to write blogs pleading transparency, but making changes to the website and collateral does more than a blog that most people don't have an opportunity to read.
Posted by: Paul Sorgiovanni | September 17, 2014 at 11:59 PM
Hi Chad, thank you for the informative (as usual) post. Also, thanks to all the posters whose comments were thoughtful and respectful. It actually made the discussion useful, rather than annoying net noise.
Posted by: Aaron Lewis | September 18, 2014 at 11:57 AM
@Aaron - thank you!
@Sorg - working on it with the product marketing teams. BTW - how would you propose we do this if, indeed as we fully expect, this is it for the foreseeable future when it comes to DU?
Here's more detail (I also commented on El Reg to the same effect):
Our 100% focus has been the customer through this. Again, I'm not going to convince any haters, but this is the essence:
a) We have a rapidly (!) growing user base of XtremIO.
b) We had features ready to go that we knew would require more metadata than we can support at the current capacities in the current generation of hardware. These features provide real customer benefit: compression, which increases net data reduction, and performance improvements.
c) The internal debate was long, and hard. Should we deliver the new capabilities to the existing install base, or wait for the next hardware rev, and make them only available to customers who buy future hardware with more RAM?
Sidebar: XtremIO's architecture of always storing metadata in DRAM (vs. paging to disk or storing on SSDs - both of which make system behavior tend towards non-linearity) is an important part of its always-linear behavior with all data services cranking (a platform strength). Conversely, it does mean that total X-Brick capacity and features are directly related to the DRAM capacity + the on-disk structure (which relates to the amount of metadata).
People can (and are absolutely entitled to!) second-guess our decision. We decided the right call was to:
1) make the capability available to all (existing customers and those in the future) - which requires a persistence layout change.
2) to do it quickly, as this is a very (!) rapidly growing installed base.
3) to ensure that this change would carry us through all upcoming roadmapped releases.
4) to build into the plan (seriously - we have done this) budget for field swing units (capacity to be deployed to assist with migrations), as well as for EMC to absorb the services cost + wherever possible help via non-disruptive Storage vMotion at scale (where the workloads are vSphere VMs).
5) to commit to supporting the happy 2.4 customers (which are legion) for years to come if they want to stay there.
This is the first disruptive upgrade of GA code (in spite of some other people's comments - the other "remove data and reload" was during the directed availability period).
I agree with all who say we should have changed the on-disk structures prior to releasing the GA code - that's 100% on us.
That all said - I'm proud of how the company that I work for is dealing with this: actively, quickly, and with the customer front and center in the thinking. The customers (for the most part) have been proactively talked to by their field teams and partners. I'll also say that the XtremIO platform itself is a rock, and customer and partner feedback is VERY positive.
People will claim that "oh, Chad's just spinning this" - nope. I sat at the Business Readiness Review Board (the exec forum for looking at all major upcoming platform events/releases), and this was exactly the internal discussion.
We should have updated the marketing collateral at that point. This wasn't carefully planned deceit - but rather something more mundane - a miss.
The question then becomes: when CAN we put up NDU as a platform feature? After all, after 3.0 is released, that is expected to be it, DU-wise, for the foreseeable future. I suppose it's up to us to prove it - by hitting the 4.0 and 5.0 releases over the coming years and delivering other critical features (like dynamic scale-out, integrated RP splitter, VVOLs, etc.) using NDU going forward.
Posted by: Chad Sakac | September 18, 2014 at 11:13 PM
Hey Chad,
Thanks for the detailed response. I'm not going to write an essay in my response. Personally, you don't need to justify to me that XtremIO has customers and that it has awesome koolaid.
The answer is pretty clear. If it's disruptive, change the slides until such time that it's not. The fact that EMC is actively hiding this fact in its product slides and collateral is completely unacceptable, and I think that is why there is so much discussion.
Like I said, I appreciate you calling it out on your blog, but actions speak louder than words - and that goes for all vendors too.
cheers
Posted by: Paul Sorgiovanni | September 19, 2014 at 01:58 AM
Another EMC customer here, looking at XtremIO and comparing against the VNX and VMAX we have in house. While XtremIO seems to have better uptime than VNX, we are on the edge of whether XtremIO has the uptime that we are striving for and have with the VMAX - especially seeing that a node failure causes a brief stop in IO (I haven't seen whether 3.0 helps this at all). Like everyone else, I definitely appreciate the blog post on this, since it could save storage admins a lot of work by waiting for 3.0 to come out instead of implementing a few weeks before. While EMC will provide resources for times like this, there is still lots of work involved, and production applications will still need maintenance windows for a change like this even if there is swing hardware. In conclusion, we were hoping the uptime of XtremIO was at the level of the VMAX, and it doesn't look like it's making the cut yet.
Again thanks for the blog post on this. If it weren't for you we very well may have had a big surprise when reading the release notes a month from now.
Posted by: Garret Black | September 22, 2014 at 11:50 AM
We own an XtremIO array but have kept it for testing only until the 3.0 update comes out. Once it is in production we will really put it through its paces, but for now it is just helping keep the rack warm.
Posted by: Alan Wren | September 22, 2014 at 12:09 PM
Hi Chad,
There sure are a lot of haters out there! As I mentioned on a couple of the other blogs on this topic, there is no vendor that can claim 100% perfection when it comes to firmware updates and/or datasheet stretches (stretches being a kind word). That is why it is so important for customers to do their own due diligence – trust, but verify.
In the spirit of transparency, I work for Load DynamiX, a storage testing company that actually partners with EMC and many other vendors. Prior to working at LDX, I worked in the storage industry for several vendors and saw my fair share of both screwy FW glitches and datasheet creep (two completely separate issues, but very interrelated).
I think some folks just don’t think through how enormous either of these issues can be. While a few instances might be chalked up to negligent or unscrupulous individuals, the majority of the problems stem from the sheer size of the moving parts. From R&D -> QA -> Manufacturing -> Interoperability. And that is just the FW part of it. Then the datasheet aspect involves now going through marketing. Every part of this process is run by people, who are all fallible, spread across companies and the globe.
So I completely agree with your statement that no vendor is without fault to some degree at some point. What is important is how each vendor handles their problems. I think you guys are doing a great job of trying to get in front of this and that EMC was especially kind in deciding to put huge feature upgrades into a FW update. I imagine there are at least one or two people back at HQ sitting with a smug grin on their face, saying, “See – told you we should have just put in the next gen product release!” I am certain many of your customers are going to be very glad though that EMC didn’t go that route.
In fact, I am already getting calls from many customers asking us to help test the upcoming XtremIO 3.0 release. It seems the performance boosts are intriguing, but it really sounds like the compression benefits are gaining the most interest. Good thing is – we can help test both. Like I said - trust, but verify!
Posted by: Kalenx2 | September 22, 2014 at 01:53 PM
Hi Chad,
Long time visitor of the blog, saw the post and figured it would be a good time to contribute.
From a pure business standpoint disruptive upgrades are painful - there is no way around that - but you also have to weigh the benefits. We have been testing a 20TB XtremIO brick for about two months; knowing that the 3.0 upgrade was disruptive, we went ahead and got on the 3.0 beta program. After initially doing some tests on the 2.4 code, the upgrade to 3.0 gave us some pretty awesome compression and deduplication numbers. We were able to take a ~16TB Oracle database, slap it on XtremIO, and it only physically took up 8TB. That to me is a good enough reason to move to 3.0, disruptive or not.
We also learned that scaling your XtremIO cluster (at least for now) is also disruptive. Knowing that we would outgrow the single brick, we struck a deal with EMC whereby they would essentially loan us a brick so we could start out in production with a 2-brick cluster - I think it is called OpenScale. All in all, I can live with the trade-off.
Posted by: DTrain | September 23, 2014 at 04:48 PM
Disclosure. EMC employee, former NetApp employee
The objections mentioned above about Clustered OnTap being a different OS are not invalid. Clustered OnTap is a very different beast than OnTap. So was classic OnTap completely without disruption as it relates to upgrades? I will provide two examples where it was not.
1. Before the Aggregate and the FlexVol there was the Traditional Volume. Now, if one wanted to keep one's data in the Traditional Volume after an upgrade, I guess the upgrade was non-disruptive. BUT if one wanted all of the features associated with FlexVols and Aggregates - and they were numerous as well as compelling - one had to migrate data from the tradvol to the flexvol using a selection of several copy methods. That was disruptive, and there was never a conversion process. (circa 2005)
2. To be fair, this next example was remedied in a later release. But when OnTap first changed from 32-bit to 64-bit, and one wanted to take advantage of the larger aggregate and volume sizes for a host of reasons, initially that too was disruptive. One had to use one of several logical copy methods to migrate the data from 32-bit volumes into 64-bit volumes. Again, with time, an in-place conversion was released as part of a later OnTap release. But that was not initially the case for, if memory serves me correctly, well over a year's time if not longer. (circa 2009)
The path to Clustered OnTap as a viable option for most users began last year (circa 2013).
So back to the point of the post. Changes to layout, metadata or otherwise, generally cause disruption if you want to take advantage of the new features said layout enables.
Posted by: Brian | September 24, 2014 at 02:23 PM
@Rob - Putting VPLEX Local in front of any array for the purpose of protecting yourself from an array outage is a fantastic idea. I've done it. It adds a huge layer of complexity, however, that many people are not aware of. It is also a very expensive initiative.
Posted by: AaronEatsButter | November 08, 2014 at 10:20 AM