One popular topic of discussion in VMware and storage land is the topic of how VMware is affecting core storage architecture design principles.
I commonly summarize this via this slide:
- On the left you have “Type 1” transactional systems (defined as being good for the small-block random workloads that tend to characterize many VM workloads) – the “classic” clustered-head arrays. Think of EMC VNX as an example (and you can easily think of its competitors).
- Typically a great choice at moderate scale, they are often easy to use (since the architectures scale down relatively well), and have the most “swiss army knife” (ergo do a bit of everything) characteristics, being one of the most mature architectural models. Their downside tends to be in failure conditions. Since storage is generally “owned” in some way (LUNs, filesystems), the failure behavior as seen by the workload is invariably more complex. There are great solutions to mitigate this (ALUA, NPIV, host failure handling), but it is an architectural thing. There’s also the challenge of balancing workloads across brains, and across storage platforms, as you get bigger and bigger.
- In the middle you have “Type 2” transactional systems. Think of EMC VMAX as an example (and you can easily think of its competitors). They “scale out” in the sense that any IO can be served through any port, by any “brain”.
- These tend to pull away for customers as they get larger, for several reasons – failure modes are much cleaner (as the storage device – block or NAS – is visible broadly). Done right, the storage model also deals with balancing workload across a broader set of pooled resources (network, memory, CPU, cache, etc.).
- On the right, you have the much more loosely coupled object storage models. Think of EMC Atmos as an example. These are usually pure software with no hardware dependency (and Atmos indeed can be layered on anything as a virtual appliance).
- They tend to be very cloud-like, with awesome characteristics for supporting a lot of next-gen web apps – but NOT good for transactional workloads (and therefore for hosting VMs). Making these systems transactional (with gateways, caches and the like) takes away their fundamental strengths.
When I present this, I’m commonly asked: “why do you say ‘NAS emerging’ in the middle column?”
The answer is twofold:
- First, you can’t call it a scale-out model if all the traffic for a given datastore ends up at a single brain, through a single interface. This is the case with NFS deployments in vSphere 4.x and VI3.x.
- Second, scale-out NAS systems (even the best) typically have transactional latencies that are about 2x-3x the latencies of “Type 1” block and NAS systems and “Type 2” block systems. They also typically have a higher practical $/IOps. These aren’t functions of “pricing models”, but reflections of substantial underlying engineering challenges. You can see this very clearly in the SPEC SFS results.
BUT – MAN, if you could scale-out that way… AND have workloads that are a fit, wouldn’t it be cool if….
…well, at VMworld 2011, in SP03977 we showed that you CAN.
Ok, so let’s take a look at the characteristics of a true scale-out NAS model:
- Multiple Nodes Presenting a Single, Scalable File System and Volume. This is obvious when you think about it. You want the filesystem and volume spread across all the nodes, all the way down to a single file.
- N-Way, Scalable Resiliency. This is also obvious. There are clear challenges where RAID double-parity, triple-parity, and other disk-level parity schemes fail as disks grow AND filesystem objects become petabyte scale. This is why erasure coding techniques are generally used at that scale. You also need a model where any one of the nodes can support IO for failed node(s), because BOY that would make failure behavior simple and elegant…
- Linearly Scalable IO and Throughput. This is important. If you are using a global namespace, but then directing the file IO to a single node that hosts the file – you’re not spreading the load across a “big global pool”, and invariably the failure models are more complex.
- Storage Efficiency. This is an outflow of the items above. In a true scale-out model there is NO SUCH THING as “this data is here” – it’s everywhere. That means that there is no “balancing” (unless, for whatever reason, you WANT to do that).
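To make the erasure-coding point concrete, here’s a minimal sketch of the recovery idea using a single XOR parity fragment (a hypothetical 3+1 layout – real scale-out NAS systems use Reed-Solomon-style N+M coding that tolerates multiple failures, but the principle is the same: any lost fragment can be rebuilt from the survivors, so no single node “owns” the data):

```python
from functools import reduce

def xor_bytes(a, b):
    # XOR two equal-length byte strings together
    return bytes(x ^ y for x, y in zip(a, b))

def encode(fragments):
    """Compute one parity fragment over equal-length data fragments."""
    return reduce(xor_bytes, fragments)

def rebuild(surviving, parity):
    """Recover a single lost data fragment from the survivors plus parity."""
    return reduce(xor_bytes, surviving, parity)

# three data fragments, each living on a different (hypothetical) node
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = encode(data)  # parity fragment stored on a fourth node

# the node holding data[1] fails; rebuild its fragment from the rest
recovered = rebuild([data[0], data[2]], parity)
assert recovered == b"BBBB"
```

The key property is that recovery reads from the surviving pool rather than a dedicated mirror – which is why any node can serve IO for a failed one.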
In vSphere 5, the NFS client is still NFSv3. This means that at any given moment, all the traffic from a given ESX host to a given datastore uses a single TCP connection. BUT there are improvements: if you mount using a DNS name, it will leverage DNS round robin to spread the datastores across a pool of IP addresses. (Yes, NFSv4 and more are coming in the future.)
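A toy sketch of the round-robin mechanism, with invented names and IPs (the real work is done by the DNS server rotating A records – this just models the observable effect of each host resolving the same name and pinning its single NFSv3 TCP connection to the first address returned):

```python
from itertools import cycle

class RoundRobinDNS:
    """Toy resolver: rotates a pool of A records on each lookup,
    mimicking DNS round-robin behavior."""
    def __init__(self, records):
        self._pool = cycle(records)

    def resolve(self, name):
        # each lookup hands back the next address in the rotation
        return next(self._pool)

# hypothetical pool of scale-out NAS node IPs behind one name
dns = RoundRobinDNS(["10.0.0.11", "10.0.0.12", "10.0.0.13"])

# three ESX hosts each mount "nas.lab" and land on different node IPs,
# so their single-connection NFSv3 traffic is spread across the pool
mounts = [dns.resolve("nas.lab") for _ in range(3)]
assert mounts == ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
```

The single-TCP-connection limit per datastore still holds – the balancing happens across hosts and datastores, not within one mount.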
Put this basic fundamental change together with the core architecture of EMC Isilon (which can present that same filesystem not only across a set of IP addresses, like VNX and NetApp can, but – unlike those more “classic” models – across a LARGE number of “brains”, where the underlying filesystem is completely distributed), and you get what we did in this amazing demonstration:
What did the demo show?
- That the configuration is EXTREMELY simple.
- That as ESX hosts are powered up, they automatically balance the load across nodes.
- That as filesystems are created (via our simple VSI plugin without leaving vCenter), they automatically balance across all the nodes. Same is true as filesystems are grown – up to a current limit of 15PB for a single filesystem.
- That as additional Isilon nodes are added (to grow IOps, MBps, and capacity), all loads are automatically balanced. BTW – this could be up to a current limit of 144 nodes.
- That as complete nodes are taken down, there is no disruption or failover behavior (since the filesystem is presented via all nodes all the time) at the vSphere layer.
Is that kick-a$$ or what?
It’s notable (and please, folks, correct me if I’m wrong) that there are no mainstream scale-out NAS competitors whose behavior would match what we showed in this demonstration.
This is what makes EMC Isilon a KILLER choice for customers today who are deploying VMs that are OK with the one remaining caveat – 2x-3x higher transactional latency. Think 10-30ms, as opposed to the 2-10ms you would see for block or NAS workloads on VNX or block workloads on VMAX. Also, the $/IOps characteristics of EMC Isilon mean that the fit is best for VMs that are relatively large, but do a relatively small number of IOps.
Now, put that all together: 10-30ms latency and a capacity-over-IOps profile are perfectly OK for many VMware workloads today – test/dev, many vCloud Director use cases, vFabric Data Director use cases, etc.
Imagine if hypothetically, we were able to make Isilon work well with transactional workloads… :-)
Cool stuff – no? Comments welcome!