-----------------------------
[UPDATE: 3/9/2014, 7:19pm ET] Phew – this post has triggered lots of feedback. I’m a big believer in the power of being open, transparent, always. It’s the best way I know to share, to learn, to communicate at scale. A part of that is that inevitably, people read into your words and project their own viewpoint.
Sometimes being public means readers ascribe an intent to you that you don’t have… which sucks.
Some have felt that this post was in some way patronizing to VMware (“Father EMC talking down to VMware”), which was absolutely not in my heart as I wrote it. In fact, throughout, I tried to capture the fact that VSAN (and software-only stacks from within EMC and more broadly in the industry) are disrupting the technologies and business models of incumbents (EMC included!) – and those who don’t adapt (in spite of risk to established business models) will be in trouble. I think this is one of the most powerful things in the “Federation” business model of VMware/EMC/Pivotal/RSA – each being free to innovate and disrupt, in spite of how the other parties may feel about it, and also free to collaborate as partners. The biggest risk in high-tech is not disrupting yourself, and the Federation model is a “risk absorber” for this (though sometimes frustrating).
To VMware folks who read any “patronizing” message into my words: look into your hearts, clear your minds, and re-read. You might just be “projecting” something onto my words that isn’t there. You are the 800lb gorilla now (with all the good and bad that comes with it – many hate the 800lb gorilla no matter what). Fear (and fight) any arrogance you find in yourselves and within your ranks. EMC still pays a steep price for the arrogance we exhibited in the early 90’s, when we had a market position similar to the one VMware has now. When we still exhibit that arrogance (it’s far in the past and now rare – but occasionally it still manifests) – it is bad and inevitably hurts us. Beware – down that path lies the dark side. That’s not you.
More often, and more importantly, being public means readers correct you, and add to your understanding, which ROCKS, and more than makes up for any bad from the above. In particular, Duncan, thanks for your comments, Frank also.
I’m constantly learning – and feedback on the topics of kernel space vs. user/guest space storage stacks was very enlightening. People also gave me feedback on VSAN specifics, correcting some things I didn’t have right. Some people gave feedback that I still disagree with, and will note.
I’ve updated the blog to the best of my ability (it may hurt “readability”, but it will be a more accurate picture) – and IMO, that’s the power of “be open, be transparent”. Please enjoy!
Also – I’m technically on vacation this week, and my beautiful wife and children will rip my head off if I don’t put down my laptop and surf :-) Comments always welcome – but don’t be angry if I don’t post/comment for a few days :-)
Chad
-----------------------------
As always – this blog (and the thoughts and opinions expressed) are mine, not an EMC blog. I reiterate this because there will likely be some things in here that some of my colleagues will disagree with – and that I’m sure will conflict with some marketing/positioning. Consider yourself warned :-)
Today brings the official launch of VMware Virtual SAN (VSAN) – with the GA date following extremely shortly. IMO – this is going to end up being a “big thing” relatively quickly.
I really want to say in the strongest way I can – I want to offer my personal congratulations to the VSAN team that has brought this to market. For those on the inside or close by – it’s been a long journey (but not abnormal for a distributed persistence layer – one of the harder things in the world to engineer – and I’ve done startups that do that – it ain’t easy). By my reckoning, it’s been about four, maybe five years of hard work from first ideas to reality (the idea of VMware having a native persistence layer dates back to the Diane era). In the last year, the team has been working their tail off – and they should be very proud. They’ve made something that will disrupt the industry, and delight customers.
As with NSX (which made VMware a networking vendor by any way you want to measure it), VSAN makes VMware a storage vendor. I would argue that frankly they’ve been a storage vendor for a while (VMFS is a filesystem that aggregates and abstracts block storage devices; vSphere Replication plus data services like thin provisioning, snapshots, and linked clones provide value not dissimilar to storage dedupe; and they’ve been in the policy business with VASA and SPBM for a while). Ditto with networking (distributed vSwitches, increasing vSwitch functionality matching some of what customers expect from L2 switches, and frankly less successful past attempts to deliver L4-L7 services).
But – as with NSX, VSAN crosses some sort of “Rubicon”. VMware now participates with an offer in the actual persistence layer itself. In a very real sense, VMware is now in competition with the storage ecosystem. More correctly, VMware is now in “coopetition” (partnering and competing at the same time) with the storage ecosystem. I think VSAN will disrupt the whole ecosystem. VSAN will disproportionately impact ecosystem players that are: storage-only vendors, hyperconverged storage stacks, players at the low end of the market (not because VSAN doesn’t scale, but because of economics and adoption), and VMware-only storage stacks.
Rest assured, you’ll start to hear a lot of people talk about what’s bad about VSAN – always a hint that something is going on :-).
So, with what will surely be a lot of positioning by the “traditional external storage vendors” (I’m sure there will be some from EMC in the field – though the team is working to make sure we support VSAN), and positioning (probably some OVER-positioning) from VMware (you can see some of it here in Chuck’s blog) – where’s my head?
Here’s a quick summary of my personal opinion:
- It IS NOT an accurate statement to say that VSAN is “better” or “performs better” because it’s embedded in the kernel.
- It IS NOT an accurate statement to say (as a general statement) that VSAN is lower CAPEX than external storage – though it IS accurate that it offers a compelling CAPEX picture in many use cases.
- It IS an accurate statement that VSAN is a quantum leap in simplicity, integrated management and VM-level operations/management.
- It IS an accurate statement that VSAN is an example of a “hyper-convergence” architecture – and these architectures can be compelling in certain use cases.
- It IS an accurate statement that VSAN is a great new option for a lot of customers.
Read on for my 2 cents, and some explanation of my (I’m sure incendiary to some – not my intent to offend) statements!
OK – some of my comments in this post will be techno-gobbledygook without reading (or at least scanning) this blog post on storage software architectural models (Type 1 = scale-up clustered; Type 2 = loosely coupled distributed; Type 3 = tightly coupled distributed; Type 4 = shared-nothing distributed). Read and come back…
Ok – ready? Let’s tackle those personal opinions one at a time…
“Embedded into the Kernel is the only way to get good performance”
As a reminder, my “summary” opinion on this topic: “It IS NOT an accurate statement to say that VSAN is ‘better’ or “performs better” because it’s embedded in the kernel.”
This one is frustrating for me for three basic reasons: 1) it’s wrong (and hurts VMware – though not sure if they see it); 2) VMware should really open up the kernel a bit more; 3) it’s not necessary to make the claim to make people feel good about VSAN performance. Let’s tackle these one at a time.
[UPDATE: Reader – this section has some topics on which I was technically incorrect – and some where further detail/exploration is necessary, I will elaborate. Also read Duncan’s comments attached to the post]
Baseline: what are we talking about? All other storage stacks used as software-only data planes in vSphere (whether “Type 1” like Nexenta or VMware’s first VSA attempt, or “Type 2” like ScaleIO, LeftHand or Nutanix) generally run in the guest world – ergo as VMs, not in the most privileged kernel space of the vmkernel itself. A lot of hay gets made over the “convoluted path” that an IO needs to traverse to get in and out of the VM that hosts the storage persistence layer, which then in turn presents the target back to the vmkernel via either the block stack (and gets used as a VMFS device) or the NFS client.
This resonates on some basis because it’s obviously unnecessarily complex (down, down – through vSCSI and the vmkernel – up, up, down, down, then up again – ultimately presented as NFS or VMFS). Well – it **IS** unnecessarily complex, but VMware doesn’t allow kernel-loadable modules except in very narrowly prescribed ways (the pluggable storage architecture for multipathing or the Nexus 1000V are examples).
This is the very basic reason why ScaleIO has a kernel-loadable module for Linux kernels (used with KVM, Xen) and Windows (used with Hyper-V), but not vSphere (where it requires a virtual appliance model – with the corresponding “convoluted” IO path).
Frankly, I’m SURE that many in the industry would like to create kernel modules (I know the ScaleIO team would) – but unless you do it via unsupported means (and there are examples of this – which I’ll leave nameless to avoid speaking ill of others) – you’re SOL from a support standpoint, and VMware could change this at any given time (as they are undocumented interfaces). VMware has shown no interest in accelerating kernel-level extensibility.
The counter-argument that gets trotted out to opening up the vmkernel is that doing so will make the vmkernel less stable or secure (which would be “B.A.D.” to be sure) – but I’d point out that linux distributions have shown that pulling this off well is fundamentally about good engineering. The VMware approach (keep kernel closed) makes innovation in the ecosystem in kernel space for 3rd parties (and we’re a pretty close partner!) a non-starter or at best ridiculously hard. I can’t help but feel that there’s an opportunity for a better balance here – and if VMware doesn’t embrace this, their ecosystem will struggle (and so will they as innovation will shift away from VMware).
[UPDATE: Some have pointed out that there are 3rd-party kernel-loadable modules, like PernixData FVP, and there is a relatively new program that provides a path for partner-support models (Partner Verified and Supported Products, which you can see here: http://www.vmware.com/resources/compatibility/vcl/partnersupport.php). This is good, but IMO insufficient. I wonder how much progress Pernix (which is awesome) could have made without the intimate connections that Satyam (great guy and brilliant) had into the bowels of the vmkernel. The other cases are mostly around kernel-loadable driver modules – which are pretty rigid and narrow. There is just not the degree of “kernel extensibility” (even basic IO stack filter models) that could enable the ecosystem as a whole to innovate. This means that you get more “hacking” at things, and weird support models (and the risk of deprecated extensibility hitting customers/partners). I think the world of my brothers and sisters at VMware – but plead with them to look at the ecosystem as a “help” rather than a “burden” on this topic, and to extend the extensibility of the kernel.]
…So what about the performance claim? I don’t know if anyone has pointed this out to the marketing folks – but they really shouldn’t tear the “storage IO in guest worlds” models down… Why? Because in doing so they are suggesting that vSphere has poor IO path efficiency in general (a misconception people have struggled for years to correct). This would apply to any IO traversing the IO stack – which would be VERY problematic for all kinds of high-load IO workloads in VMs. If I were VMware, I would not have people question the vSphere IO stack efficiency – which, frankly, is pretty stupendous and works great for high-bandwidth, low-latency IO loads.
Furthermore, I know a lot of customers using ScaleIO, using Nutanix, using other “guest world” IO stacks (Recoverpoint virtual appliances), and in general there are things they like (flexibility, capabilities) and don’t like (management complexity) that have NOTHING to do with performance.
[UPDATE: Lots of very good dialog here, and lots of learning on my part (which is awesome and the best part of my job – learning new things). This is a more NUANCED topic.
- One error on my part was to look at “performance” through the lens I usually use in the “external” world – where performance is mostly about latency and bandwidth. There’s another critical dimension of kernel mode vs. guest world here – host resources. People pointed out correctly that guest-world storage virtual appliances tend to be ESX host resource hogs (4 vCPUs and gobs of RAM are not uncommon). As I looked at this across the EMC IP portfolio currently available as virtual appliances (like a RecoverPoint vRPA) – this is true. Looking at the VMAX and VNX VMs behind vLab (our vCAC/vSphere/Vblock powered “cloud lab”), it’s also true. ScaleIO’s kernel footprint and resources (when installed on Linux and Windows) are indeed smaller than when it is used as a VM on vSphere. This is ONE of the reasons why the EMC (and VMware) position is summarized as “VSAN for vSphere-only, ScaleIO for heterogeneous”.
- People also pointed out that things that do a lot of really low-latency work (like the PernixData FVP caching example), and things that have larger memory footprints (10s to 100s of MB) vs. “basic pass-through IO” (which looks more like a “normal guest IO workload”), have a marked dependence on kernel-mode models. So perhaps the conclusion is that the vmkernel vs. guest world debate has an “it depends” aspect, driven by the specifics of what the IO stack does.
- I **SUSPECT** that this might be tied to context switches caused by IO, particularly when memory objects are large. Context switches aren’t always triggered by user-space/kernel-space mode changes (they haven’t been for a while), and hypervisors and modern Intel VT and AMD-V chips handle more and more without needing a context switch – BUT there’s no escaping it for certain things – in fact, in IO land, pretty often. Context switches take eons in compute time (microseconds vs. nanoseconds). Humans are bad at numbers and math. If something operates on nanoseconds and you need to take a 100 microsecond context switch, it’s analogous to a process that operates on minute-level timescales needing to wait 69 DAYS to come back to activity while the context switch occurs (a small worked version of this arithmetic follows this update). If you want to learn more about context switching, I found these articles to be FASCINATING learning for me on Saturday: http://blog.tsunanet.net/2010/11/how-long-does-it-take-to-make-context.html; and http://www.cs.rochester.edu/u/cli/research/switch.pdf.
- I’ll be playing with ScaleIO (which I’ve played with natively as a Linux kernel-loadable module, not as a VM) and VSAN more in the lab (beyond the beta), but since these days I don’t get much time to play (except over the Christmas holidays), I would love to see people’s testing data:
- If you’re a customer of one of these “guest world” storage stacks (Nutanix as an example) – how many resources does it take? Are you happy with performance (latency, bandwidth, IOps) AND resources required? If you have the VSAN bits (and why not give them a whirl) – would love to see how your testing compares, particularly on this topic of resource footprint, as well as IO latency.
- If you’re in a position to take some measurements of the context switches/syscalls of guest-world storage stacks in vSphere 5.5 – I would love people’s input, please comment (a minimal Linux-side sampling sketch also follows this update). I’ll get to it at Christmas, and share what I find :-)
- If you have a technical perspective, please comment!
- BTW – this (again, my opinion) only STRENGTHENS the argument that VMware needs to find a way to enable 3rd-party innovation in the vmkernel.]
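Here’s the worked version of the context-switch analogy from the update above – a back-of-envelope sketch using illustrative numbers (a ~1 ns operation and a ~100 µs context switch), not measurements from any specific system:

```python
# Back-of-envelope: how "long" a 100 microsecond context switch feels to a
# nanosecond-scale operation. The 1 ns / 100 us figures are illustrative
# assumptions, not measurements.
op_time_s = 1e-9          # a nanosecond-scale operation (e.g., a cached memory access)
ctx_switch_s = 100e-6     # an assumed 100 microsecond context switch (direct + indirect cost)

slowdown = ctx_switch_s / op_time_s            # how many "operations" fit inside one switch
print(f"one context switch = {slowdown:,.0f}x the operation time")

# Rescale: pretend the operation takes one human minute instead of one nanosecond.
human_op_s = 60.0
wait_days = slowdown * human_op_s / 86400
print(f"at that scale, the switch is a wait of about {wait_days:.0f} days")   # ~69 days
```

And if you want to eyeball context-switch rates yourself (from inside a Linux-based storage VM, or on a Linux host running a native kernel client), here’s a minimal, generic sampling sketch – it just reads the standard Linux counter, nothing vSphere- or VSAN-specific:

```python
# Minimal sketch: sample the system-wide context-switch counter from /proc/stat
# on a Linux node while an IO workload runs. Generic Linux instrumentation only.
import time

def read_ctxt():
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    raise RuntimeError("ctxt counter not found in /proc/stat")

INTERVAL_S = 5
prev = read_ctxt()
for _ in range(12):                 # ~1 minute of samples
    time.sleep(INTERVAL_S)
    now = read_ctxt()
    print(f"{(now - prev) / INTERVAL_S:,.0f} context switches/sec")
    prev = now
```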
So – let’s stop that “only in the kernel=performance” FUD, shall we?
[UPDATE: This isn’t FUD, and I shouldn’t have used that word – which is like spitting in someone’s eye in the tech industry, because it suggests that the counter-arguments don’t have technical merit – and there is great technical merit in this discussion.]
OK, so there you have the answer to why it’s wrong, and why VMware should open it up. Why do I describe this “only kernel = performance” as unnecessary?
The first reason I say that is because VSAN is a great performing IO stack – regardless of how it’s architected. It’s pretty easy to benchmark, and many already have. It’s a very low-cost way to get pretty low latency, high IOps VM storage!
[UPDATE: the point above still is the punchline :-) ]
The other reason is more important. The thing that is truly different about VSAN is that it has a VM-level awareness that is linked to VM HA behavior, and today that is REALLY hard (frankly, impossible) to do without vmkernel-level integration (this will change when vVols are released – one way to look at the VM-level policy behavior of VSAN is as a prelude to a similar option for the open storage ecosystem).
This is important to understand, so let’s dig in a little more.
In general, “Type 2” loosely coupled scale-out storage stacks spread the data around across nodes for performance **AND** protection principles. I don’t know the Nutanix NDFS internals, so I’ll use an example closer to home – ScaleIO uses this model (I bet Nutanix uses a similar “go wide” model – would welcome input/comments from people who know more about NDFS internals). In the case of ScaleIO, data of any given device/LUN is segmented into 1MB “slices” widely distributed across as many nodes as configured (with a mirror on a different node). This is inherently widely parallel, so delivers crazy aggregate performance in terms of bandwidth.
When you have this “distributed data model” you inherently have some “internode” communications – but the key is to keep it pretty minimal, otherwise latency climbs.
In the ScaleIO case, each client has its own local map, so reads go directly to the node holding the “slice” (or its protection peer if the first node is not alive). Writes only incur an extra mirror write (“whole nodes” don’t mirror – slices do – so all the data and protection copies are “randomly distributed”, or “shotgunned” or “blammed” across all nodes) and then a confirmation of the committed write. Other than that, there is only each node’s periodic “keepalive” token sync – but that’s not in the IO path itself. Pretty lightweight, quite low latency (a few hundred microseconds). Not low-latency enough for ALL workloads – but certainly for many!
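To make that “shotgun” placement idea concrete, here’s a toy sketch – my own illustration of the general “blam it all over” approach, NOT ScaleIO’s (or NDFS’s) actual algorithm:

```python
# Toy illustration of "blam it all over" placement (not any vendor's real algorithm):
# a volume is chopped into 1MB slices; each slice gets a primary and a mirror on
# two different nodes, spread pseudo-randomly across the whole cluster.
import random

NODES = [f"node-{i:02d}" for i in range(8)]
SLICE_MB = 1

def place_volume(volume_gb, seed=42):
    rng = random.Random(seed)
    placement = {}
    for slice_id in range(volume_gb * 1024 // SLICE_MB):
        primary, mirror = rng.sample(NODES, 2)   # two distinct nodes per slice
        placement[slice_id] = (primary, mirror)
    return placement

placement = place_volume(volume_gb=1)

# Every node ends up holding roughly the same share of slices + mirrors, so
# reads/writes for a single volume fan out across the entire cluster.
per_node = {n: 0 for n in NODES}
for primary, mirror in placement.values():
    per_node[primary] += 1
    per_node[mirror] += 1
for node, count in sorted(per_node.items()):
    print(node, count)
```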
Distributed stacks use some degree of caching to avoid unnecessary node traversals as much as possible.
Suffice it to say that every distributed storage stack does the particulars differently – but note the “design center”: be distributed for any given “device” (LUN/filesystem, whatever). You can cache things for “proximity”, but the data is distributed (and that has ups and downs).
VSAN does this very differently.
[UPDATE: There are some things below that are also not entirely correct, corrections noted]
The design center of VSAN leverages the fact that it has awareness of VM objects as persistence structures, and works to keep VMs running and using a persistence layer that is LOCAL to the compute instance. Further, it interacts with VM HA behavior to work to failover VMs to nodes that happen to have protection copies of data. [UPDATE: the VM storage policy has an idea of “mirrors” and “stripes”. A VM can have up to 3 mirrors. Reads are serviced from multiple mirrors, and you CAN have multi-host stripes, so there is some parallelism (though notably less than in the “blam it all over” technique).
It’s notable that “performance” means different things: bandwidth (which matters on rebuilds) favors “blam it all over”, maximum total IOps against a given device favors “blam it all over”, and latency favors “keep it together”.
Stripes can be across disks in a host, or across hosts – up to the maximum stripe count, which I believe is 12 (unlike the ScaleIO example, where the pieces of any given data are “shotgunned” across all the hosts).]
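To make the policy math concrete, here’s a hedged little sketch (my own illustrative arithmetic, not VSAN’s actual placement engine) of how a “failures to tolerate” setting and a stripe width translate into raw capacity consumed and component counts:

```python
# Hedged sketch of how VM storage policy knobs translate into raw capacity and
# component counts. Illustrative arithmetic only -- not VSAN's placement engine,
# and it ignores witness components entirely.

MAX_STRIPE_WIDTH = 12   # per the discussion above: stripes can span up to ~12 disks/hosts

def policy_footprint(vmdk_gb, failures_to_tolerate=1, stripe_width=1):
    assert 1 <= stripe_width <= MAX_STRIPE_WIDTH
    mirrors = failures_to_tolerate + 1       # FTT=1 -> 2 full copies of the data
    raw_gb = vmdk_gb * mirrors               # raw capacity consumed by the mirrors
    components = mirrors * stripe_width      # data components spread across disks/hosts
    return raw_gb, components

for ftt, stripes in [(1, 1), (1, 4), (2, 2)]:
    raw, comps = policy_footprint(vmdk_gb=100, failures_to_tolerate=ftt, stripe_width=stripes)
    print(f"100GB VMDK, FTT={ftt}, stripe width={stripes}: ~{raw}GB raw, {comps} data components")
```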
The flipside is that this is one of the reasons that VSAN’s scale target is “vSphere cluster scale”. The first reason is that, as a VMware-only storage stack, having a scaling target larger than a vSphere cluster makes little sense. The other is that there’s a lot of internal logic about placement, linkage to VM HA, and data redistribution – all of which is linked to vSphere cluster concepts.
Conversely, the ScaleIO way (and I would suspect NDFS as well) is a bit more “basic” in a sense (distribute wide, keep copies, cache locally) – but this model scales upwards into much larger node counts (we’ve shown ScaleIO clusters with hundreds and even thousands of nodes). Yet, conversely, it will be much harder to engineer “VM-level policy” into ScaleIO because of this – it means that the policy would need to be widely distributed. [UPDATE: see the note above about keeping data “together” (but not localized to a single server) in the VSAN example – and imagine how that makes certain VM-level behavior more possible]
I want to be clear – I’m NOT saying that VSAN intrinsically doesn’t scale – in fact, I think it scales great (32 nodes, PBs of data, millions of IOps is very impressive, and covers a huge swath of use cases) for its design purpose – but the design center is not “hundreds of nodes” (where the maintenance of distributed metadata and lots of IO mirroring and redirection would get hairy, and where rebuild performance would start to become more and more important).
The other factor here is that the performance of a given VM using VSAN is the performance of the local node (which, given SSD “write absorption”, can be very good). But in the case of ScaleIO (and, I would wager, Nutanix, SimpliVity and other models), the performance is that of every contributing node in aggregate (so the bandwidth envelope and total IOps for a given accessing workload can exceed that of a single node [UPDATE: beyond the reading across mirrors that VSAN does]).
Which of these is better?
Frankly – people like us (you, if you’re still reading :-) love to understand and debate this level of detail. For the market that VSAN is targeting, it doesn’t matter. VSAN’s strength is its simplicity, its VM-level behavior, and its more than good enough [UPDATE: I want to emphasize that VSAN performance is exceptional, but it’s not its main point] performance behavior. While everyone has been working hard to plug into vCenter and integrate the arrays into the APIs (and work towards vVols), VSAN has (IMO) far and away the simplest management model for storing VMs for large chunks of the market (where there isn’t a storage administrator – the person running the vSphere layer is the only infrastructure role). It’s got performance, and availability. The economics, particularly at small scales, are very compelling. In fact, let’s talk about economics…
“VSAN has a different economic envelope/is X% cheaper”
As a reminder, my opinion on this: “It IS NOT an accurate statement to say (as a general statement) that VSAN is lower CAPEX than external storage – though it IS accurate that it offers a compelling CAPEX picture in many use cases.”
Ok, this is potentially radioactive, and I’m sure I will be getting calls from various product teams for the following. If I offend all equally, it means I’m probably close to reality :-)
[UPDATE: while most of the feedback was on the topics above, I got lots on the below as well. While no one knows the final VSAN economic model, it will be GA soon, and I encourage readers to evaluate my conclusions. I’m still quite convinced of the data below, and the conclusions I came to:
- VSAN’s CAPEX sweet spot is in the space below the sweet spot of “classic” Type 1 arrays, which have a “saw-tooth” shape. As they add capacity and IOps, they asymptotically approach any given unit’s maximum capacity/IOps and minimum $/GB and $/IOps.
- Type 2 storage models (software + commodity hardware OR software/hardware appliances) are linear – so at various scaling points can be more or less expensive (CAPEX-wise).
- Data services matter in the economic calculus. I’ll go one step further than I did before on this topic. VSAN will have a very compelling $/IOps curve, which is one of the reasons VDI will be a sweet spot for VSAN. I would argue (and customers will decide) that at moderate (high hundreds) to large (thousands to many tens of thousands) VDI image counts, PARTICULARLY where full clones dominate and linked clones are not used (for many reasons), the data services around inline deduplication, and architectures that bias towards very consistent “full dataset low latency”, will favor all-flash arrays.]
Below is a graph – with the Y-axis removed – that looks at the CAPEX economics ($/GB, though the $/IOps curves are similarly shaped) and compares a “Type 1” hardware appliance (in this case I used a VNX – but trust me, it would be similar with NetApp, Tintri, Nexenta, and other “Type 1” architectures) with “Type 2” software + commodity 2-socket 2U server hardware with some local HDD and SSD (in this case using either the VSAN or ScaleIO software stacks). In the VSAN case, there’s an additional line that models a “less normal” server as the ESX host – one optimized for capacity (BTW, ScaleIO would be less “low” than VSAN there because it’s licensed by GB vs. by socket).
Take a look…
Ok – so what do you see?
Well – as you can see, the Type 1 “hardware appliance” has a $/GB that is a curve. It’s a curve because the incremental cost of capacity (or IOps) decreases as you add capacity, but there is a “step in” for the brains. Even in cases where you have the ability to add storage in the form of servers (think Invicta), it’s the same. If this continued out to the right (larger capacity or IOps), it would eventually take on a “sawtooth” shape, representing the point where you cross the limits of the scale-up brains (stepping up into larger brains, or scaling by adding another clustered appliance). Ergo: “classic scale-up economics.”
Conversely, the lines of the VSAN/ScaleIO model are “flat”, meaning that adding capacity (and BTW – similar behavior on a $/IOps graph in both cases) is linear. BTW – this is true of Type 2 models (loosely coupled scale-out architectures) whether they are packaged as software (VSAN, ScaleIO) or appliances (think XtremIO, Isilon, Nutanix).
FWIW, some details of the model (because people will be curious) – and a tiny sketch of the two curve shapes follows these notes:
- The models of VSAN and ScaleIO assume that server hardware is included, but only the storage component is accounted for (the compute/network is used for compute workloads, and the CPU utilization of VSAN/ScaleIO is light).
- The model for the VNX curves was a VNX5400 and assumed a smaller discount from “list” than we see in average street prices (ergo I was being hard on the VNX, so as not to have any internal bias).
- The VNX model assumes use of iSCSI and NFS, otherwise the whole VNX line “steps up” based on the cost of FC/FCoE infrastructure.
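For the curious, here’s roughly the shape of those two curves in a few lines of Python – with completely made-up prices (not EMC, VMware, or street pricing), purely to show the “step-in brains + cheap incremental capacity” shape vs. the linear shape:

```python
# Rough shape of the two CAPEX curves with completely made-up prices --
# "step-in brains + cheap incremental capacity" (Type 1 appliance) vs. linear
# scaling (Type 2 software + commodity servers). Not real pricing for anything.

def type1_cost(tb, brain_cost=60_000, tb_per_brain=200, cost_per_tb=800):
    # Each set of "brains" (controller pair) is a fixed step; capacity is cheap after that.
    brains = -(-tb // tb_per_brain)            # ceiling division -> the "sawtooth"
    return brains * brain_cost + tb * cost_per_tb

def type2_cost(tb, tb_per_node=10, node_cost=9_000, sw_license_per_node=3_000):
    # Every node brings its own CPU/flash/HDD and licensing, so cost scales linearly.
    nodes = -(-tb // tb_per_node)
    return nodes * (node_cost + sw_license_per_node)

for tb in (10, 40, 80, 160, 320):
    t1, t2 = type1_cost(tb), type2_cost(tb)
    print(f"{tb:4d} TB   Type 1: ${t1 / tb:,.0f}/TB   Type 2: ${t2 / tb:,.0f}/TB")
```

With these particular made-up numbers the crossover lands somewhere between 80TB and 160TB – move any assumption and the crossover moves, which is exactly the point.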
Which of these curves are “better”?
Umm – wise reader, you’re smarter than that. Answer is that it depends on workload, it depends on scale.
I actually was VERY hard on the VNX here (harder than on ScaleIO/VSAN), assuming very poor discounting, ergo higher than “street” prices – but still, at scales greater than 80TB, the VNX’s $/GB (pure CAPEX) is lower than VSAN on an “average” server.
Do you see the other thing that breaks down in this comparison?
- There is no correction for data services – whose value/economic impact will vary wildly based on workload and dataset. For example, how should things like deduplication or compression be factored in? It is also more correct to call the use of SSD/PCIe flash in VSAN a write cache rather than a tier per se (in VNX land, analogous to FAST Cache vs. a FAST VP tier) – so a workload that needed to be pinned to an SSD tier might be disadvantaged. The “extreme” example of this might be something like XtremIO. If you have hundreds, perhaps 1000 VDI instances using View or Citrix XenDesktop – and some or all of them are using “full clone” mechanics – the inline dedupe of XtremIO and “always on flash” economics are overwhelmingly compelling, overcoming the CAPEX step-in of XtremIO (which is higher than VSAN on server hardware). VSAN may have “all SSD” or inline dedupe capability in the future, but it certainly does not right now.
- Outside of a comparison with something like Tintri or Nutanix/SimpliVity (which are also “stores only virtual machines” plays), we are comparing two fairly different things. Very rarely does a customer with a NetApp, EMC VNX, Nexenta, XtremIO, or (a long list) use that platform 100% for storing VMDKs. This means that the cost of storing the VMDKs themselves is “shifted” along the curve above. Example: if a customer has 20TB of VMDKs and 80TB of general-purpose NAS (not uncommon – VMDKs are generally not the biggest consumer of storage at customers), then the cost of the 20TB of VMDKs is actually at the 100TB point on the curve (two tiny back-of-envelope illustrations follow this list).
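Here are those two back-of-envelope illustrations – again with made-up numbers, and reusing the Type 1 curve shape from the earlier pricing sketch:

```python
# Back-of-envelope illustrations of the two corrections above (made-up numbers).

# 1) Data services: inline dedupe changes the effective $/GB for full-clone VDI.
raw_cost_per_gb = 8.00    # assumed raw $/GB for an all-flash array (illustrative)
dedupe_ratio = 6.0        # assumed ratio for largely identical full clones (illustrative)
print(f"effective $/GB with dedupe: {raw_cost_per_gb / dedupe_ratio:.2f}")

# 2) Shifted curve: 20TB of VMDKs landing on an array that also holds 80TB of NAS.
def type1_cost(tb, brain_cost=60_000, tb_per_brain=200, cost_per_tb=800):
    brains = -(-tb // tb_per_brain)
    return brains * brain_cost + tb * cost_per_tb

standalone = type1_cost(20) / (20 * 1024)                       # 20TB bought on its own
incremental = (type1_cost(100) - type1_cost(80)) / (20 * 1024)  # 20TB added at the 80-100TB point
print(f"$/GB for 20TB standalone: {standalone:.2f} vs. as the 80-100TB increment: {incremental:.2f}")
```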
More important than all of that – the graph above only looks at CAPEX. It’s much harder to quantify the OPEX benefits of a hyper-converged model like VSAN – but they’re there to be sure (and in my opinion much more important). It’s simply much easier to manage for the VMware admin (no multipathing/LACP, no queuing, no network design, no consideration of datastores, etc.). However, as with the CAPEX example – if the external storage is being used for other use cases, then the OPEX benefit in those other use cases perhaps offsets the higher OPEX/complexity for the VMware use case.
So – a couple takeaways:
- VSAN IS compelling economically.
- VSAN is most compelling for smaller environments – not because it doesn’t scale, or isn’t economically viable at larger scale, but rather because VSAN starts small (relative to Type 1 arrays), and its costs are linear vs. the curves of “classic” external Type 1 storage.
- VSAN is most compelling for customers where almost all of their workloads are VMDKs – and where they can get away with using Linux/Windows VMs for simple NAS (an approach that breaks down fast beyond simple use cases).
- For customers at larger scales, the CAPEX picture is a little more complex, and should be considered against a broader set of workloads. The alternate consideration is a “VSAN Island” which may be the right approach for some.
- For customers at larger scales, it’s likely that there are some virtual machine workloads (imagine a customer with many VMs, some of which are a 4 node Oracle RAC cluster running as VMs, with expectations of replication, consistency groups and hard hardware resilience) that even if a customer is using VSAN, will also use external storage. It’s for this reason that “abstraction and automation” of storage is so important – customers will have blends of architectures. Remember that workloads drive architectures – not vice versa.
This last point is important, and it’s why from an EMC perspective (and aligned with VMware – this is a “Federation”-level view), we think this captures the current and future landscape (and is a single datacenter persistence architecture for current and future workloads) – the only question is how much of each type of architecture you MAY need, depending on your workloads (and for some customers, the answer for some of the architectures will be “none”!):
SIDEBAR #1: I really want to highlight an important note here for people who follow my blog or whom I see at VMUGs/VMworld/EMCWorld. Periodically I’ll say something that “hangs out there”, like: “the All Flash Array appliance architectures are disrupting the Hybrid appliance architectures – but software-defined data planes will disrupt the storage industry far more”.
Often when I say one of those vague-ish forward looking statements, I know something I can’t talk about, or it’s based on lots of customer NDA interaction. That statement I made above and have said repeatedly for the last year is one of them. I’ll explain a little further now that the cat is out of the bag and VSAN and ViPR data services are GA.
As technologists, we tend to look at how technology disrupts other technology. Think of the debates about “Type 2” architectures disrupting “Type 1”. Another great and very current example is mature hybrid software/hardware appliances (think NetApp or VNX) supporting mixed consolidated workloads getting disrupted by new purpose-built all-flash arrays (think XtremIO) that target specific workloads like VDI and low-latency OLTP with a core new technology capability (and then expand over time). That is indeed a disruption, and usually what people focus on. Interestingly to me, that’s not the big picture.
The big picture we often miss as technologists is that the big technology disruptions are the ones that disrupt industries and reshape revenue and profit pools, particularly if the incumbents are unwilling or unable to adapt.
Q: Do SDS data planes disrupt industries and reshape revenue & profit pools?
A: Yes. SDS data planes enable software-only revenue/margin models + leaving the low-margin hardware to the server vendors + not needing to build hardware supply chains + software support models = new software-only players can enter the market with a new core business model.
Q: Do AFA hardware appliances do the same?
A: No. AFA hardware appliances do not disrupt industries and profit pools (unless the incumbent is too stupid/slow to embrace) because while they are a new technology – they don’t change the fundamental economic model.
If you read/study Clayton Christensen – acclaimed and brilliant author of “The Innovator’s Dilemma” – you know that the above topic is the thing that reshapes industries and leaves “mass extinction events”. I was lucky enough to hang out with him recently with the EMC/VMware/Pivotal leadership – and it was very eye-opening.
Don’t get me wrong, disruption takes time. I met a customer yesterday that uses EMC VMAXes for some of the world’s biggest Oracle RAC OLTP workloads – workloads that would be a GREAT performance/density fit for self-disruption with XtremIO, or with ScaleIO and servers packed with flash. Why not go that way? Answer: they desperately need T10 DIF (end-to-end data integrity from the Oracle DB layer down to the persistence layer), and they need to replicate across thousands of miles with huge consistency groups. They won’t use ScaleIO or VSAN, or purpose-built AFAs, anytime soon (they will use hybrids with mature data services, loaded with SSDs) – BUT they will move over time. How? New application workloads will skip this RDBMS + hardware-resilient architecture altogether.
The tendency of disruption is that it happens slowly, and then all at once.
I’m VERY glad to be at EMC, where we are embracing this self-disruption of software data planes for all types of storage with our full force and resources – not only ABLE to self-disrupt with software data services stacks technically, but WILLING (because we have a portfolio and aren’t dependent on a single hardware appliance profit pool and economic structure). Furthermore, we can (and will) manage and grow through this period of disruption with the full portfolio, including the hardware appliances, which will keep growing if we do it right. If I were an employee or shareholder at a place that looked the opposite (unwilling or unable), I would be freaked out.
Back to regularly scheduled programming….
“VSAN enables a new management model” and “Hyper-converged is the way”
As a reminder, my opinion on this: “It IS an accurate statement that VSAN is a quantum leap in simplicity, integrated management and VM-level operations/management.” and “It IS an accurate statement that VSAN is an example of a “hyper-convergence” architecture – and these architectures can be compelling in certain use cases.”
Don’t read another word on the simplicity of VSAN. Download it, try it, and come to your own conclusion. My conclusion – anyone trying to compete with VSAN on “easy for the VMware administrator” as their main argument is going to lose.
Now, on to the hyper-convergence point. The impact of these new “hyper-converged” models (that integrate storage and compute into an integrated server/flash/HDD scale-out platform) is very real, particularly for customers at smaller scales (again, not trying to claim that they DON’T scale – they do – rather that the economic value is greatest at smaller scales, in cases where compute/storage scale together, and where management simplicity is critical – which is common where there’s one person responsible for everything).
My experience with customers (and the tale of the tape of growth rates of converged platforms in all their forms – $1.2B of Vblocks – WOW, and the clear emergence of startups like Nutanix/SimpliVity, with which VSAN and VSAN appliances will start to compete) is that increasingly, customers want converged architectures (vs. mix and match).
SIDEBAR #2: Like sidebar #1, I can now talk about something I hinted at in Dec 2012 in the “Top 10 predictions” – how “Converged Infrastructure comes in many forms”, and that starting in 2013 that battle would start to be waged furiously. This was in the webcast…
The observation I’ve come to realize in 2012 is that while Vblock and an appliance constructed out of servers and VSAN are architecturally very different, they have the same core “value proposition” to a customer: “stop consuming, managing, and supporting servers/network/storage as ‘separate things’”.
Now, they are NOT the same architecturally – but it’s more about how those architectures manifest in “system characteristics”.
- Example – It’s very hard to scale a Vblock’s current architecture down into the domain of VSAN appliances (or things like Nutanix/SimpliVity).
- Example – when compute and storage don’t scale together, architectures like VSAN appliances aren’t as flexible as Vblock style architectures.
- Example - a VSAN appliance is awesome for VDI at smaller scales (particularly if you use linked clones). A Vblock with XtremIO is very compelling for VDI/EuC use cases at larger scales (regardless of linked/full clones).
- Example – if you’re a hyper-scale customer, even the Vblock’s current architecture (the step function of the Vblock 300 series or 700 series) is not ideal (and hyper-converged isn’t either). The analogy of Google File System and Hadoop/HDFS isn’t quite right, because general-purpose compute problems aren’t distributed the same way as a massively parallelizable job against local storage. In these use cases, those customers tend to bias toward rack-scale architectural models (look up Intel’s RSA efforts for a peek into that world – a category where we are doing a ton of stuff also).
But – all of the above share the CORE customer value proposition of the “Converged Infrastructure” class.
[UPDATE: Some readers started wild speculation about my underline of “current architecture” when referring to Vblock. My point was a more basic one. The value proposition of a Vblock is “mature CI for broad enterprise workloads” – one that nails “simple, accelerated infrastructure, acquired, supported, and maintained as a single converged infrastructure stack”. **IF** VCE decided to use technologies from VMware, Cisco and EMC (within the VCE charter of “Cisco for Compute/Network, VMware for Hypervisor and Vision integration, EMC for storage and protection”) to architect future Vblocks using different technologies to address certain segments or workloads – is it still a Vblock, and does it preserve that value proposition? YES. That could still be called a Vblock. A Vblock isn’t defined by the components, but by the value proposition.]
I’ve seen silly little stickers the Nutanix folks are putting out there (FWIW guys – going negative never ends well) suggesting that one architecture is the right way. As usual – when someone says “one way is always the right way”, it’s a hint they have one approach only (maybe a good approach for some, but never all) – guys – you’re better than that.
I suspect that in the same way there have been, and will continue to be, storage architectures in the “phylum” of Type 1, Type 2, Type 3, and Type 4 – we will see a similar “phylum” of Converged Infrastructures emerge (I call them Hyper-Converged, Independently Scaled, and Rack-Scale) – and you can count on us to participate in all variations, just like we innovate and lead in all 4 storage architectural variations.
Back to regularly scheduled programming….
“Every customer should use VSAN”
As a reminder, my opinion on this: “It IS an accurate statement that VSAN is a great new option for a lot of customers.”
Like all storage stacks – VSAN will take time to harden. I know VSAN has a strong and aggressive roadmap. I also know, based on experience with new storage stacks, that there will be bumps and things learnt. In my experience, VSAN is solid, and performs very well. I think people should absolutely start to evaluate and use VSAN for a broad set of use cases.
VSAN is real, it’s an additional valuable arrow in the quiver of choices. For some customers for whom it can cover all the use cases, it might be the best way to store VMs.
In fact, to ensure EMC folks in the field don’t fight VSAN based on the wrong motivations (something I expect will happen in the storage industry in general) – VSAN will be in the EMC Select catalog very shortly after GA – so if it’s the right choice for a given use case, the EMC team, the VMware team – or, more likely, our channel partners – can all offer it equally, with balanced motivation focused on what is right for the customer.
In closing – VSAN rocks, joins the pantheon of choices for customers and partners – and ultimately it doesn’t matter what I think… the market decides! I’m really glad this day is here, and once again – congrats to the VMware team for giving birth to a great new technology!
Thanks for spending the time to read these thoughts – and I would love your opinion – perhaps I’m wrong, perhaps I’m off my rocker… Discuss/debate!
Great blog Chad, thank you.
I'm interested to understand how you see the new hyper-scale hardware plays (such as Facebook's Open Compute Project) playing in a post-vSAN (ScaleIO) world.
As I understand it, these platforms have been designed for massive HDFS worksets, on nodes with large amounts of locally attached storage. So presumably, very appropriate for SDS and would arguably have a crazy low $ per GB (for SMB, basic workloads :) ?
Exciting times, thank you again it's always great to read your blogs.
Posted by: @infrx | March 07, 2014 at 04:53 AM
Hey Chad, Awesome post as usual. The point I'd like to make here is you may be slightly under-estimating the use case for hyperconverged architectures. As a current VMAX / VNX / Isilon / DD customer, we have a lot of EMC tech in our Datacenter. Recently our IT brethren in the UK (My company has 2 IT Depts, 1 in the UK and 1 for Americas) shared with us their cost for colo services and how many racks they have. Our Americas CIO was astounded that we have more than 2X the amount of racks they have. We have been asked to look at ways to consolidate footprints. We are already a Cisco UCS Blade Customer and we are over 90% virtualized so we started to look elsewhere. Awhile back I stumbled across Hyper-converged architectures like Nutanix and ScaleIO and started to immediately try to understand them. I must say that I'm pretty impressed with the Nutanix implementation specifically from a density perspective. If nothing else was a factor, like power and cooling, we could shrink from 42 Racks in 1 of our colo's to 3/4 of a single rack all things being equal. Obviously there would be testing to do to ensure IOP / latency requirements are met but looking at what we spend per year for colo makes this a very interesting architecture to keep looking at.
Posted by: Mike U | March 07, 2014 at 12:51 PM
Hey Chad -
I am 80% sure that Nutanix uses 1MB slices (called extents in their case). They create groups of four extents and create a backup copy on a different node. Nutanix will localize/relocate all data to the host that owns the VM. Their web site is pretty good at explaining how everything works.
I was surprised when you wrote that VSAN localizes data as well. When I asked this during a session at PEX, The answer was an absolute NO. I understand that it uses SSD for local caching, but the data may "live" somewhere else and will not be relocated based on which node owns the VM.
Posted by: Dave Convery | March 07, 2014 at 03:26 PM
Excellent post Chad, thank you.
I'm intrigued on the way EMC can enter into the hc space in this way, perhaps in the "ready-node" area.
With your access to great tech (extremesf etc) and obvious engineering talent on board, there's tremendous scope to execute.
The potential integration with other products like recoverpoint is exciting.
As a consumer, I'll be keenly watching this space :)
Posted by: Brett | March 07, 2014 at 07:06 PM
** Disclaimer: Duncan Epping - VMware employee **
Hey Chad,
Some technical inaccuracies in your post:
1) There are solutions which sit in the hypervisor which are supported. (Pernix has PVSP support as they do not fit in to a category but aren't doing hacks like some of the other vendors: http://www.vmware.com/resources/compatibility/vcl/partnersupport.php)
2) "Because they are suggesting that they have a poor IO path efficiency in general (a misconception people have struggled for years to correct). This would apply to any IO traversing the IO stack – which would be VERY problematic for all kinds of high load IO workloads in VMs." --> Not really, the argument here would be you are traversing the storage stack multiple times! A normal VM doesn't do this when going to an external array right.
3) "The thing that is truly different about VSAN is the fact that it has a VM-level awareness that is linked to VM HA behavior" --> There is no such a thing, HA will just restart the VMs where it feels it needs to restart them. Not really based on where the data sits.
4) "The design center of VSAN leverages the fact that it has awareness of VM objects as persistence structures, and works to keep VMs running and using a persistence layer that is LOCAL to the compute instance." --> There is no notion of data locality. DRS moves VMs around however it pleases and cache and writes/reads are done over the network when needed. Actually half of your read IO when running with "failures to tolerate = 1" will always come from the network even when your VM sits on a host which holds the object.
5) "It interacts with VM HA behavior to work to failover VMs to nodes that happen to have protection copies of data." --> It doesn't need to. It can and will access data remotely.
6) "The other factor here is that the performance of a given VM using VSAN is the performance of the local node (which given SSD “write absorption” can be very good)." --> not true, reads can come from many hosts depending on "the number of failures" and the "stripe width". Writes will also go to multiple nodes. You can stripe data across 12 nodes, and then there is the replication factor. Data is striped in this case with 1MB chunks... so it could come from anywhere at that point when you run with a large stripe width and number of failures.
Posted by: Duncan | March 09, 2014 at 04:22 AM
I was also going to mention PernixData and what FVP is able to do with kernel-mode integration. I personally like PernixData over VSAN if all I'm worried about is workload performance. Yes, it's a cache only software solution, but I still can keep my enterprise storage for tasks like cloning and replication - and decouple capacity requirements from performance.
I also would say your thoughts around "only in the kernel=performance FUD" are not accurate. I won't re-invent the wheel on this but Frank Denneman has an excellent blog that details why kernel mode integration is better than using a controller vm. There are multiple things to consider:
http://frankdenneman.nl/2013/06/18/basic-elements-of-the-flash-virtualization-platform-part-1/
Posted by: forbsy | March 09, 2014 at 01:27 PM
I found this post really insightful Chad - thanks for it.
In regards to Duncan's point, I don't see how those in the PVSP interact differently than VSAN does. Maybe he can post details on that on another blog post.
In regards to going through the hypervisor stack vs the efficiency of kernel modules, I'm in sharp agreement. The numbers are different but 99.999% of the time negligible to a virtualized workload. Workloads need consistency at a level centralized storage on its own has struggled to provide as its overloaded with the high-density of virtualized environments.
I'm a huge fan of Server SAN and Server-side caching as it offers alternatives to the tight coupling of storage capacity and performance. I'll add the disclaimer that I work for one of those vendors at Infinio, but that doesn't blind me from the benefits of approaching the problem from many angles.
Thanks again!
Posted by: Mjbrender | March 09, 2014 at 05:17 PM
Hi Chad, great post and it confirmed some of my thoughts. One question, Mike U touched on it but I wanted to confirm, if you looked at the total TCO of VSAN vs an external array + compute, I would have thought on certain densities VSAN would win hands down?
The additional cost overhead from external servers and interconnects would tally to a lot more than a number of VSAN nodes with compute within the hosts themselves. So while on the face of it, the $/GB on an external array dips below that of VSAN above say 80TB, I suspect its much higher when you include the cost of compute also to deliver the same outcome?
Posted by: Rob P | March 10, 2014 at 06:01 PM
For enterprises which are running premium apps which are licensed on a per processor core basis such as Oracle DB Enterprise or Business Intelligence Foundation Suite at $300k per core plus annual support, it would be extremely costly to move the storage processing workload from a traditional Type 1 SAN such as a VNX or FAS array to any host based storage such as VSAN, Nutanix or Simplivity. The licensing implications on these business critical apps must be calculated before assuming that any of these hyper-converged solutions will reduce system CAPEX not just storage CAPEX.
IMHO VVols will be a far more significant milestone in the history of Software Defined Storage. What are the odds that VVols will GA in 2014?
Posted by: Dave Stark | March 11, 2014 at 01:16 AM
Disclaimer: Frank Denneman– PernixData Employee
That’s a lengthy article Chad, luckily you are enjoying your holiday and are able to get some rest after writing this one.
All kidding aside, Kernel modules is the way to go. And regardless if you are VMware itself or a third party vendor, you can write software code that fits nicely in the kernel and scale together with the kernel. PernixData has done this. Granted we have a collection of extreme gifted engineers that understand the kernel code like no other, but we proved it could be done. VMware reviewed and tested our code and VMware certified the FVP software.
To quote you:
This is the very basic reason why ScaleIO has a kernel-loadable module for Linux kernels (used with KVM, Xen) and Windows (used with Hyper-V), but not vSphere (where it requires a virtual appliance model – with the corresponding “convoluted” IO path).
I’m curious if writing kernel extension modules is not the primary reason for performance, why is the Scale IO team investing time and energy in writing kernel code for Linux and Windows, but not for VMware. Why not use a common, transportable code for all platform? Open formats such as virtual machines can run on many different platforms and would reduce development greatly.
Why? Because many other people and I believe that kernel code is the only way to provide scalable performance, reduction of resource management and operational simplicity.
Storage kernels are purposely build to provide storage functionality to a variety and multiplicity of virtual machines. When extending the kernel modules, your code scales inherently with the hypervisor. Sitting at a lower layer allows you to play well with others. This is not the case with VM centric storage solutions.
Are Hypervisors built from the ground up to “offload” their functionality to a guest world? Talk about a convoluted path! Introducing guest worlds that are responsible for a major part of what is natively handled by the kernel. These storage VMs become dependent on other schedulers sitting lower in the kernel, interacting with each other. And vice versa, if the storage command cannot be executed or completed, the CPU scheduler waits for the commands to complete before it can schedule the storage VM. See the problem? With a couple of VMs and a storage VM it might not be as problematic as I describe, but what if your environment is running massive amounts of VMs?
Context switching is one thing, allowing a guest world to take responsibility for the majority of performance is something completely different. In my opinion, hypervisors were never designed to have a virtual machine assume the role of a storage scheduler. By introducing a service VM, virtual appliance (give it any other fancy name), you are bubbling up the responsibility to where it has no place. Exposing it to other schedulers who do not understand the importance of that particular virtual machine to the rest of the virtual infrastructure. You can create a lot of catch-22 situations if not designed correctly. Remember, VMware is continuously improving their kernel code to improve their internal structures. This is complex stuff.
Which alludes to the following problem, management overhead. There is a virtual machine, fully exposed between the rest of the virtual machines. You need to manage it from a resource management perspective, remember you can set a CPU reservation but that does not mean it can kick off resident and active threads on the CPUs. That’s the responsibility of the kernel and then you have the problem of security. In my days as an architect I’ve seen some “non-standard” environments where junior admins had full control. You don’t want to have the risk of accidental shutdowns on that layer. And if we talk about setting reservations, which other clustering service are you impacting? Think HA, think DRS, think convoluted design here.
Harden, ensure, encapsulate your basic compute and storage services, don’t leave them exposed and that’s what you are doing with a virtual machine running storage code.
And we can talk about scalability from east to west, horizontal throughout the cluster, but if I start, my comment might be as long as your article.
Posted by: FrankDenneman | March 11, 2014 at 08:45 AM