One popular topic of discussion in VMware and storage land is how VMware is affecting core storage architecture design principles.
I commonly summarize this via this slide:
- On the left you have “Type 1” transactional systems (defined as being good for the small-block random workloads that tend to characterize many VMs) built as “classic” clustered heads. Think of EMC VNX as an example (and you can easily think of its competitors).
- Typically a great choice at moderate scale, they are often easy to use (the architectures scale down relatively well), and have the most “swiss army knife” (i.e. do a bit of everything) characteristics, being one of the most mature architectural models. Their downside tends to be in failure conditions. Since storage objects (LUNs, filesystems) are generally “owned” by one brain in some way, the failure behavior seen by the workload is invariably more complex. There are great ways to mitigate this (ALUA, NPIV, host failure handling), but it’s an architectural thing. There’s also the challenge of balancing workloads across brains, and across storage platforms, as you get bigger and bigger.
- In the middle you have “Type 2” transactional systems. Think of EMC VMAX as an example (and you can easily think of its competitors). They “scale out” in the sense that any IO can be served through any port, via any “brain”.
- These tend to pull away for customers as they get larger, for several reasons – failure modes are much cleaner (as the storage device – block or NAS – is broadly visible), and done right, the storage model also handles balancing workload across a broader set of pooled resources (network, memory, CPU, cache, etc.).
- On the right, you have the much more loosely coupled object storage models. Think of EMC Atmos as an example. These are usually pure software with no hardware dependency (and Atmos can indeed be layered on anything as a virtual appliance).
- They tend to be very cloud-like, with awesome characteristics for supporting a lot of next-gen web apps – but NOT good for transactional workloads (and therefore not for hosting VMs). Making these systems transactional (with gateways, caches and the like) takes away their fundamental strengths.
When I present this, I’m commonly asked: “why do you say ‘NAS emerging’ in the middle column?”
The answer is twofold:
- First, you can’t call it a scale-out model if all the traffic for a given datastore ends up at a single brain, through a single interface. This is the case with NFS datastores in vSphere 4.x and VI3.x.
- Second, scale-out NAS systems (even the best) typically have transactional latencies that are about 2x-3x the latencies of “Type 1” block and NAS systems, and “Type 2” block systems. They also typically have a higher practical $/IOps. These aren’t functions of “pricing models”, but reflections of substantial underlying engineering challenges. You can see this very clearly in the SPEC SFS results.
BUT – MAN, if you could scale out that way… AND have workloads that are a fit, wouldn’t it be cool if….
…well, at VMworld 2011, in SP03977 we showed that you CAN.
Ok, so let’s take a look at the characteristics of a true scale-out NAS model:
- Multiple Nodes Presenting a Single, Scalable File System and Volume – a single volume/filesystem served from multiple nodes. This is obvious when you think about it. You want the filesystem to be a volume spread across all the nodes, all the way down to a single file.
- N-Way, Scalable Resiliency. This is also obvious. There are clear challenges where RAID double-parity, triple-parity, and other disk-level parity schemes break down as disks grow AND filesystem objects reach petabyte scale. This is why erasure coding techniques are generally used at that scale. You also need a model where any one of the nodes can serve IO on behalf of failed node(s), because BOY that would make failure behavior simple and elegant… (there’s a toy sketch of this idea right after this list).
- Linearly Scalable IO and Throughput. This is important. If you are using a global namespace, but then directing the file IO to a single node that hosts the file – you’re not spreading the load across a “big global pool”, and invariably the failure models are more complex.
- Storage Efficiency. This flows from the items above. In a true scale-out model there is NO SUCH THING as “this data is here” – it’s everywhere. That means there is no “balancing” to do (unless, for whatever reason, you WANT to do that).
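Here’s that toy sketch I mentioned – a deliberately simplified Python illustration of the N-way resiliency idea (single XOR parity across made-up “nodes”), NOT OneFS’s actual Reed-Solomon-style protection scheme. The point is just that when a file is striped across all the nodes with protection, losing any one node is a rebuild, not an outage:

```python
# Toy sketch only: stripe a blob across hypothetical "nodes" plus one XOR parity
# chunk, then show that the loss of any single node can be repaired from the
# survivors. Real scale-out NAS uses N+M erasure coding, not simple parity.
from functools import reduce
from typing import List, Optional


def split_across_nodes(data: bytes, n_data_nodes: int) -> List[bytes]:
    """Split data into equal-size chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // n_data_nodes)                    # ceiling division
    padded = data.ljust(chunk_len * n_data_nodes, b"\x00")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n_data_nodes)]
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), chunks)
    return chunks + [parity]                                     # last element is parity


def rebuild(placed: List[Optional[bytes]]) -> List[bytes]:
    """Reconstruct a single missing chunk (marked None) by XOR-ing the survivors."""
    missing = placed.index(None)
    survivors = [c for c in placed if c is not None]
    repaired = list(placed)
    repaired[missing] = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), survivors)
    return repaired


if __name__ == "__main__":
    original = b"a single file spread across every node"
    placed = split_across_nodes(original, n_data_nodes=4)
    placed[2] = None                                             # simulate one node failing
    recovered = rebuild(placed)
    assert b"".join(recovered[:-1]).rstrip(b"\x00") == original
    print("lost a node, rebuilt the file from the survivors")
```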
In vSphere 5, the NFS client is still NFSv3. This means that, at any given moment, all the traffic from a given ESX host to a given datastore uses a single TCP connection. BUT there are improvements: if you mount using a DNS name, it will leverage DNS round robin to spread that access across a pool of IP addresses. (Yes, NFSv4 and more are coming in the future.)
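To make that concrete, here’s a minimal sketch (hypothetical zone name and IP pool, with a simulated DNS server rather than a real resolver) of why round-robin DNS matters when each NFSv3 mount is a single TCP connection: as each ESX host mounts the datastore by name, its one connection tends to land on a different node IP, so the host population spreads across the pool.

```python
# Minimal sketch: round-robin DNS spreading single-connection NFSv3 mounts across
# a pool of node IPs. The zone name and addresses are hypothetical, and the DNS
# server is simulated (a real server rotates the A-record order on each query).
import itertools
from collections import Counter

NODE_POOL = [f"10.0.0.{i}" for i in range(1, 9)]       # a hypothetical 8-node pool


def round_robin_resolver(pool):
    """Simulate a DNS server that rotates its A-record answers on every query."""
    rotation = itertools.cycle(pool)

    def resolve(_name: str) -> str:
        return next(rotation)                          # "first" record of the rotated answer
    return resolve


def mount_datastores(n_esx_hosts: int, resolve) -> Counter:
    """Each host mounts the datastore by name; one NFSv3 mount = one node IP."""
    placements = Counter()
    for _host in range(n_esx_hosts):
        placements[resolve("nas-zone.example.com")] += 1   # hypothetical FQDN
    return placements


if __name__ == "__main__":
    resolve = round_robin_resolver(NODE_POOL)
    # With 32 hosts and 8 node IPs, each node ends up with ~4 host connections.
    print(mount_datastores(n_esx_hosts=32, resolve=resolve))
```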
Put this fundamental change together with the core architecture of EMC Isilon (which can distribute that same filesystem not only across a set of IP addresses, like VNX and NetApp can, but – unlike those more “classic” models – across a LARGE number of “brains” where the underlying filesystem is completely distributed) and you get what we did in this amazing demonstration:
What did the demo show?
- That the configuration is EXTREMELY simple.
- That as ESX hosts are powered up, they automatically balance the load across nodes.
- That as filesystems are created (via our simple VSI plugin without leaving vCenter), they automatically balance across all the nodes. Same is true as filesystems are grown – up to a current limit of 15PB for a single filesystem.
- That as additional Isilon nodes are added (to grow IOps, MBps, and capacity), all loads are automatically balanced. BTW – you can grow to a current limit of 144 nodes.
- That as complete nodes are taken down, there is no disruption or failover behavior (since the filesystem is presented via all nodes all the time) at the vSphere layer.
Is that kick-a$$ or what?
It’s notable (and please, folks, correct me if I’m wrong) that there are no mainstream scale-out NAS competitors whose behavior would work the way we showed in this demonstration.
This is what makes EMC Isilon a KILLER choice for customers today who are deploying VMs that are OK with the one remaining caveat – 2x-3x higher transactional latency. Think 10-30ms, as opposed to the 2-10ms you would see for block or NAS workloads on VNX or block workloads on VMAX. Also, the $/IOps characteristics of EMC Isilon mean the fit is best for VMs that are relatively large, but do a relatively small number of IOps.
Now, if you put that all together – 10-30ms latency and a capacity-heavy (rather than IOps-heavy) profile – that’s perfectly OK for many VMware workloads today: test/dev, many vCloud Director use cases, vFabric Data Director use cases, etc.
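If you want a rough back-of-envelope way to think about that fit, here’s a tiny sketch. The threshold numbers are hypothetical illustrations, not EMC sizing guidance; the only figures taken from above are the 10-30ms and 2-10ms latency bands.

```python
# Rough back-of-envelope sketch (hypothetical thresholds, not a sizing tool):
# flag whether a VM profile leans "capacity-heavy / IOps-light" and latency-
# tolerant (a fit for scale-out NAS today) or "IOps-heavy / latency-sensitive"
# (still wants a 2-10ms Type 1 / Type 2 array).

def fits_scale_out_nas(vm_capacity_gb: float,
                       vm_avg_iops: float,
                       vm_latency_need_ms: float,
                       nas_latency_ms: float = 30.0,          # upper end of the 10-30ms band above
                       max_iops_per_gb: float = 0.5) -> bool:  # hypothetical density threshold
    """Return True if the VM's latency tolerance and IOps density suit scale-out NAS."""
    latency_ok = vm_latency_need_ms >= nas_latency_ms
    density_ok = (vm_avg_iops / vm_capacity_gb) <= max_iops_per_gb
    return latency_ok and density_ok


if __name__ == "__main__":
    # A big, quiet test/dev VM: 500 GB, ~50 IOps, tolerates 50 ms.
    print(fits_scale_out_nas(500, 50, 50))     # True  -> capacity-heavy, latency-tolerant
    # A small, busy transactional VM: 100 GB, ~2000 IOps, needs <10 ms.
    print(fits_scale_out_nas(100, 2000, 10))   # False -> still wants Type 1 / Type 2
```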
Imagine if, hypothetically, we were able to make Isilon work well with transactional workloads… :-)
Cool stuff – no? Comments welcome!
I am an NFS fan so I'm happy with that news.
But with namespace clustering, Nexenta can build scale-out infrastructure. And with the ZIL of ZFS, all the transactional operations are cached on SSD or RAM disk, so in fact all the problems are gone xD.
Posted by: nOon | September 01, 2011 at 11:03 AM
Chad,
Good stuff. Isilon is uber up-and-coming and really changes things. My question is: do you still have latency issues when Isilon acts as an iSCSI target? If so, is it the architecture that causes this, OR is it something else?
Posted by: Chappy | September 02, 2011 at 11:05 AM
So, basically an XIV-type solution for block, or GPFS or SONAS for NFS?
Many brains serving the same data.
Posted by: A Facebook User | September 07, 2011 at 02:35 AM
This behavior, "scale-out NAS systems (even the best) typically have transactional latencies that are about 2x-3x the latencies of “Type 1” block and NAS systems, and “Type 2” block systems"…is a problem specific to Isilon's OneFS filesystem (a known issue). It is not the same behavior in other NAS systems with scale-out capabilities. That is why Isilon is not (yet) suitable for VMDKs. Get those latencies worked out, and you'll have a great solution…as you have boasted. But I see you've covered yourself by saying "…no mainstream scale-out NAS competitors…", which compartmentalizes your claims to only EMC & NTAP, who own 70% of the market. Look at Oracle ZFS, IBRIX with 3Par, and Symantec-Huawei Filestore – all have scale-out and support transactional processing of small IO at low latency. Bring us API compatibility for Storage Awareness & Array Integration for Isilon, and folks may start to overlook the latency problems…
Posted by: mobiuslink | September 07, 2011 at 07:57 PM
@nOon - the point was that global namespace models (Nexenta as an example, but also all similar ones) don't have the same "all files accessible via all nodes" characteristic that makes the scaling (and the failure models) what they are. I'm a fan of open ZFS, but I worry a little about how Nexenta is going to run with that ball with OpenSolaris going the way it is.
@A Facebook User - XIV, EqualLogic, 3PAR, VMAX are all architectural examples of the block scaling model. We would argue about better/worse and functional capabilities - but yeah, they are examples of the model. GPFS and SONAS are (and don't take me literally here - I'm not an expert on other folks) examples of global filesystem models that are different - metadata hosts and redirection.
@chappy - the iSCSI latency thing is still there. It absolutely works - but it is a result of the core thing I'm pointing out (the architectural challenge of "loosely coupled" distributed systems). It's not a protocol thing (iSCSI doesn't make it worse or better), it's intrinsic. Believe me - it's the top priority (for NAS and iSCSI use cases) for folks at Isilon.
@mobiuslink - I'm not sure I agree. First of all, NetApp's approach is a global namespace approach, which is different (not saying it's not valid, but similar comments to @nOon apply). Also, all the data I've seen (look at the SPEC SFS results here: http://www.spec.org/sfs2008/results/sfs2008nfs.html) shows that SONAS, Filestore, IBRIX and other "loosely coupled" models have the same latency profile. They all (including Isilon) use variations of NVRAM and SSD to minimize the impact of the distributed metadata and loosely coupled model. If there's public data to share - please, please link it! Also notable in the "fine print" is the number of filesystems used in these tests.
Remember all - I'm not saying these models are BAD, but rather that until they can serve a 4K, 8K, or 16K IO in 5-10ms, there are some workloads they are JUST NOT a fit for…
Posted by: Chad Sakac | September 08, 2011 at 09:49 AM
Very good stuff, I'm setting up a big Isilon right now.
Posted by: Bsmith9999 | January 10, 2012 at 11:50 AM
Hi Chad,
I'm still working on the article related to our meeting in France at the beginning of February.
This post is quite helpful for understanding everything you told us that night :)
And guess what? 2013 is the year of storage for me, as I need to get more up to speed on it ;)
Best regards!
Thomas aka VirtTom
Posted by: VirtTom | March 01, 2013 at 11:45 AM