
March 06, 2014




Great blog Chad, thank you.

I'm interested to understand how you see the new hyper-scale hardware plays (such as the Facebook-initiated Open Compute Project) fitting into a post-VSAN (ScaleIO) world.

As I understand it, these platforms have been designed for massive HDFS working sets, on nodes with large amounts of locally attached storage. So presumably they are very appropriate for SDS, and would arguably have a crazy low $ per GB (for SMB, basic workloads :) ?

Exciting times, thank you again it's always great to read your blogs.

Mike U

Hey Chad, awesome post as usual. The point I'd like to make here is that you may be slightly under-estimating the use case for hyperconverged architectures. As a current VMAX / VNX / Isilon / DD customer, we have a lot of EMC tech in our datacenter. Recently our IT brethren in the UK (my company has two IT departments, one in the UK and one for the Americas) shared with us their cost for colo services and how many racks they have. Our Americas CIO was astounded that we have more than 2X the number of racks they have, and we have been asked to look at ways to consolidate footprints.

We are already a Cisco UCS blade customer and we are over 90% virtualized, so we started to look elsewhere. A while back I stumbled across hyperconverged architectures like Nutanix and ScaleIO and immediately started trying to understand them. I must say that I'm pretty impressed with the Nutanix implementation, specifically from a density perspective. If nothing else were a factor (like power and cooling), we could shrink from 42 racks in one of our colos to 3/4 of a single rack, all things being equal. Obviously there would be testing to do to ensure IOPS / latency requirements are met, but looking at what we spend per year on colo makes this a very interesting architecture to keep looking at.

Dave Convery

Hey Chad -
I am 80% sure that Nutanix uses 1MB slices (called extents in their case). They group extents into sets of four and keep a backup copy of each group on a different node. Nutanix will localize/relocate all data to the host that owns the VM. Their web site is pretty good at explaining how everything works.
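A rough sketch of the layout described above, for intuition only: the 1MB extent size and group-of-four figure are the commenter's numbers, while the placement function and all names below are hypothetical illustrations, not actual Nutanix code or terminology.

```python
# Hypothetical model: fixed-size extents grouped in fours, each group
# kept local to the VM's owner node with a replica on a different node.

EXTENT_SIZE_MB = 1
EXTENTS_PER_GROUP = 4

def place_extent_groups(vdisk_size_mb, nodes, owner_node):
    """Assign each extent group a primary on the owner node and a replica
    on another node (simple round-robin over the remaining nodes)."""
    group_size_mb = EXTENT_SIZE_MB * EXTENTS_PER_GROUP
    n_groups = -(-vdisk_size_mb // group_size_mb)  # ceiling division
    others = [n for n in nodes if n != owner_node]
    return [
        {"group": g, "primary": owner_node, "replica": others[g % len(others)]}
        for g in range(n_groups)
    ]

# A 40 MB vdisk -> 10 extent groups, all primaries local to node "A":
layout = place_extent_groups(vdisk_size_mb=40, nodes=["A", "B", "C"], owner_node="A")
```

The key property the comment describes falls out of the model: reads hit the primary on the VM's own host, while the replica on another node exists purely for failure tolerance.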

I was surprised when you wrote that VSAN localizes data as well. When I asked this during a session at PEX, the answer was an absolute NO. I understand that it uses SSD for local caching, but the data may "live" somewhere else and will not be relocated based on which node owns the VM.


Excellent post Chad, thank you.

I'm intrigued by the way EMC can enter the hyperconverged space like this, perhaps in the "ready-node" area.

With your access to great tech (XtremSF etc.) and obvious engineering talent on board, there's tremendous scope to execute.

The potential integration with other products like RecoverPoint is exciting.

As a consumer, I'll be keenly watching this space :)


** Disclaimer: Duncan Epping - VMware employee **

Hey Chad,

Some technical inaccuracies in your post:

1) There are solutions which sit in the hypervisor which are supported. (PernixData has PVSP support, as they do not fit into a category but aren't doing hacks like some of the other vendors: http://www.vmware.com/resources/compatibility/vcl/partnersupport.php)

2) "Because they are suggesting that they have a poor IO path efficiency in general (a misconception people have struggled for years to correct). This would apply to any IO traversing the IO stack – which would be VERY problematic for all kinds of high load IO workloads in VMs." --> Not really; the argument here would be that you are traversing the storage stack multiple times! A normal VM doesn't do this when going to an external array, right?

3) "The thing that is truly different about VSAN is the fact that it has a VM-level awareness that is linked to VM HA behavior" --> There is no such thing. HA will just restart the VMs wherever it feels it needs to restart them, not based on where the data sits.

4) "The design center of VSAN leverages the fact that it has awareness of VM objects as persistence structures, and works to keep VMs running and using a persistence layer that is LOCAL to the compute instance." --> There is no notion of data locality. DRS moves VMs around however it pleases, and caching, writes and reads are done over the network when needed. Actually, half of your read IO when running with "failures to tolerate = 1" will always come from the network, even when your VM sits on a host which holds the object.

5) "It interacts with VM HA behavior to work to failover VMs to nodes that happen to have protection copies of data." --> It doesn't need to. It can and will access data remotely.

6) "The other factor here is that the performance of a given VM using VSAN is the performance of the local node (which given SSD “write absorption” can be very good)." --> Not true. Reads can come from many hosts depending on the "number of failures to tolerate" and the "stripe width", and writes also go to multiple nodes. You can stripe data across 12 nodes, and then there is the replication factor on top. Data is striped in 1MB chunks... so with a large stripe width and number of failures, it could come from anywhere at that point.
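The arithmetic behind points 4) and 6) can be sketched in a few lines. This is a deliberate simplification for illustration, assuming reads round-robin evenly across replicas; it is not VSAN internals, and the function names are made up.

```python
# Back-of-the-envelope model of how FTT and stripe width spread an
# object's data across hosts, per the comment above.

def data_components(stripe_width, ftt):
    """Each of the (ftt + 1) replicas is striped across stripe_width
    components, so the object's data spans stripe_width * (ftt + 1)."""
    return stripe_width * (ftt + 1)

def remote_read_fraction(ftt, local_replicas=1):
    """If reads round-robin across all replicas and only `local_replicas`
    sit on the VM's host, this fraction of reads crosses the network."""
    replicas = ftt + 1
    return 1 - local_replicas / replicas

print(data_components(stripe_width=12, ftt=1))  # 24 components
print(remote_read_fraction(ftt=1))              # 0.5 -> "half your read IO"
```

With FTT=1 and one replica local, half of all reads are remote even in the best case, which is exactly the "half of your read IO will always come from the network" claim in point 4).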


I was also going to mention PernixData and what FVP is able to do with kernel-mode integration. I personally like PernixData over VSAN if all I'm worried about is workload performance. Yes, it's a cache-only software solution, but I can still keep my enterprise storage for tasks like cloning and replication - and decouple capacity requirements from performance.
I also would say your thoughts around the "only in the kernel = performance" FUD are not accurate. I won't re-invent the wheel on this, but Frank Denneman has an excellent blog post that details why kernel-mode integration is better than using a controller VM. There are multiple things to consider.



I found this post really insightful Chad - thanks for it.

In regards to Duncan's point, I don't see how the solutions in the PVSP interact differently than VSAN does. Maybe he can post details on that in another blog post.

In regards to going through the hypervisor stack vs. the efficiency of kernel modules, I'm in sharp agreement. The numbers are different, but 99.999% of the time negligible for a virtualized workload. Workloads need consistency at a level that centralized storage on its own has struggled to provide as it's overloaded with the high density of virtualized environments.

I'm a huge fan of Server SAN and server-side caching as they offer alternatives to the tight coupling of storage capacity and performance. I'll add the disclaimer that I work for one of those vendors, Infinio, but that doesn't blind me to the benefits of approaching the problem from many angles.

Thanks again!

Rob P

Hi Chad, great post, and it confirmed some of my thoughts. One question (Mike U touched on it, but I wanted to confirm): if you looked at the total TCO of VSAN vs. an external array + compute, I would have thought that at certain densities VSAN would win hands down?

The additional cost overhead of external servers and interconnects would tally to a lot more than a number of VSAN nodes with compute within the hosts themselves. So while on the face of it the $/GB of an external array dips below that of VSAN above, say, 80TB, I suspect it's much higher when you include the cost of the compute needed to deliver the same outcome?
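The comparison being asked for can be sketched numerically. Every price below is a made-up placeholder (only the ~80TB crossover figure comes from the comment); the point is purely structural: the external array's $/GB must be weighed against array + servers + fabric, not against the array alone.

```python
# Illustrative TCO comparison with hypothetical prices.

def external_array_tco(capacity_gb, array_cost_per_gb, n_servers,
                       server_cost, fabric_cost):
    """Array capacity cost plus the separate compute and interconnect
    it still requires."""
    return capacity_gb * array_cost_per_gb + n_servers * server_cost + fabric_cost

def vsan_tco(n_nodes, node_cost):
    """Each node bundles compute + local disk/SSD; no external fabric."""
    return n_nodes * node_cost

capacity_gb = 80_000  # the ~80 TB crossover point mentioned above
array = external_array_tco(capacity_gb, array_cost_per_gb=2.0,
                           n_servers=16, server_cost=12_000, fabric_cost=40_000)
hyperconverged = vsan_tco(n_nodes=16, node_cost=20_000)
print(array, hyperconverged)  # 392000.0 320000
```

With these placeholder numbers the hyperconverged build wins even at a capacity where the array's raw $/GB looks cheaper, which is exactly the effect the question describes.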

Dave Stark

For enterprises running premium apps licensed on a per-processor-core basis, such as Oracle DB Enterprise or Business Intelligence Foundation Suite at $300k per core plus annual support, it would be extremely costly to move the storage processing workload from a traditional SAN such as a VNX or FAS array to any host-based storage such as VSAN, Nutanix or SimpliVity. The licensing implications for these business-critical apps must be calculated before assuming that any of these hyperconverged solutions will reduce system CAPEX, not just storage CAPEX.
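The arithmetic behind that warning is stark. Only the $300k-per-core figure comes from the comment; the support rate and core count below are hypothetical placeholders for illustration.

```python
# Rough cost of licensing the extra cores consumed by host-based storage
# services on nodes running per-core-licensed software. All assumptions
# are labeled; this is a sketch, not a licensing calculation.

LICENSE_PER_CORE = 300_000     # commenter's figure
ANNUAL_SUPPORT_RATE = 0.22     # assumed typical support rate

def extra_license_cost(storage_overhead_cores, years=3):
    """License fee plus `years` of annual support for the cores that
    storage services now burn inside the licensed cluster."""
    license_fee = storage_overhead_cores * LICENSE_PER_CORE
    support = license_fee * ANNUAL_SUPPORT_RATE * years
    return license_fee + support

# If hyperconverged storage services consume ~4 cores across the licensed
# hosts, the added licensing alone can dwarf any storage-array saving:
print(extra_license_cost(storage_overhead_cores=4))
```

Under these assumptions, four cores of storage overhead costs roughly $2M over three years, which is why the comment insists on checking system CAPEX rather than storage CAPEX alone.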

IMHO VVols will be a far more significant milestone in the history of Software Defined Storage. What are the odds that VVols will GA in 2014?


** Disclaimer: Frank Denneman - PernixData employee **

That’s a lengthy article Chad, luckily you are enjoying your holiday and are able to get some rest after writing this one.

All kidding aside, kernel modules are the way to go. Regardless of whether you are VMware itself or a third-party vendor, you can write software that fits nicely in the kernel and scales together with the kernel. PernixData has done this. Granted, we have a collection of extremely gifted engineers who understand the kernel code like no other, but we proved it could be done. VMware reviewed and tested our code and certified the FVP software.

To quote you:

This is the very basic reason why ScaleIO has a kernel-loadable module for Linux kernels (used with KVM, Xen) and Windows (used with Hyper-V), but not vSphere (where it requires a virtual appliance model – with the corresponding “convoluted” IO path).

I'm curious: if writing kernel extension modules is not the primary reason for performance, why is the ScaleIO team investing time and energy in writing kernel code for Linux and Windows, but not for VMware? Why not use common, transportable code for all platforms? Open formats such as virtual machines can run on many different platforms and would greatly reduce development effort.

Why? Because many others and I believe that kernel code is the only way to provide scalable performance, reduced resource-management overhead, and operational simplicity.

Storage kernels are purpose-built to provide storage functionality to a variety and multiplicity of virtual machines. When you extend the kernel with modules, your code scales inherently with the hypervisor. Sitting at a lower layer allows you to play well with others. This is not the case with VM-centric storage solutions.

Are hypervisors built from the ground up to "offload" their functionality to a guest world? Talk about a convoluted path! You are introducing guest worlds that are responsible for a major part of what is natively handled by the kernel. These storage VMs become dependent on other schedulers sitting lower in the kernel, interacting with each other. And vice versa: if a storage command cannot be executed or completed, the CPU scheduler waits for the command to complete before it can schedule the storage VM. See the problem? With a couple of VMs and a storage VM it might not be as problematic as I describe, but what if your environment is running massive numbers of VMs?

Context switching is one thing; allowing a guest world to take responsibility for the majority of performance is something completely different. In my opinion, hypervisors were never designed to have a virtual machine assume the role of a storage scheduler. By introducing a service VM or virtual appliance (give it any other fancy name) you are bubbling the responsibility up to where it has no place, exposing it to other schedulers that do not understand the importance of that particular virtual machine to the rest of the virtual infrastructure. You can create a lot of catch-22 situations if this is not designed correctly. Remember, VMware is continuously improving its kernel code to refine its internal structures. This is complex stuff.

This leads to the next problem: management overhead. There is a virtual machine, fully exposed among the rest of the virtual machines. You need to manage it from a resource-management perspective; remember, you can set a CPU reservation, but that does not mean it can kick off resident and active threads on the CPUs - that's the responsibility of the kernel. And then you have the problem of security. In my days as an architect I've seen some "non-standard" environments where junior admins had full control. You don't want to risk accidental shutdowns at that layer. And if we talk about setting reservations, which other clustering services are you impacting? Think HA, think DRS, think convoluted design here.

Harden, ensure, encapsulate your basic compute and storage services; don't leave them exposed - and exposing them is exactly what you are doing with a virtual machine running storage code.

And we could talk about scalability from east to west, horizontally throughout the cluster, but if I started, my comment might be as long as your article.




  • The opinions expressed here are my personal opinions. Content published here is not read or approved in advance by Dell Technologies and does not necessarily reflect the views and opinions of Dell Technologies or any part of Dell Technologies. This is my blog; it is not a Dell Technologies blog.