[UPDATED: April 28th, 2014, 2:55am ET – I’ve gotten feedback that it’s more logical to transpose what I had labelled “Type 2” and “Type 3” – because then they are in a linear sequence from “most tightly coupled” to “least tightly coupled”, which is more logical (as well as capturing visually the ability to “federate” (vs. scale out) Type 1). Blog updated to this effect, with the sequence below.]
This is a topic that is a perennial one – and I suspect it will keep evolving. What is the right way to classify architectural models of storage? How does one figure out what the heck is going on through all the marketing and positioning of the industry players (including EMC for that matter)?
WARNING – THIS POST IS NOT FOR THE INTELLECTUALLY LAZY – OR PEOPLE WHO ARE NOT INTO READING :)
… Don’t say I didn’t warn you :-)
It IS possible to have a taxonomy for storage/persistence architectures, a “phylum” (that’s the grouping of animals in biology, and translates from new latin as “class”) if you will. If you think of the “tree of life” (below – hey, that’s us Humans right next to bears and fungi!) as the world of technology domains, down the information persistence (vs. compute vs. network) “kingdom”, there are indeed “phylum” (classes).
This is a powerful intellectual tool – one that helped humans understand the living world around us.
When it comes to applying this “phylum” idea to the topic of storage architectures, it lets you group anything new you see in the whole world of “persisting data”, and QUICKLY understand its core architecture, and therefore its strengths and weaknesses. I challenge you all to put something on the table that doesn’t fit these broad-based buckets.
On another note, as you might imagine, there’s been a lot of discussion over the last 24 hours about Cisco UCS Invicta (formerly the Whiptail assets) inside EMC. It’s an old familiar player (Whiptail) on the field in a new form, and I’ll try to put them into this framework (input welcome!).
This isn’t a new topic (people that know me are perhaps sick of it :-). I presented part of the thinking on this in STO5420 at VMworld this year. Tyler Britten – an AWESOME EMC SE (you rock Tyler!)– took up the ball, and posted on the topic in a series here.
But like all ideas – it adapts, evolves. Read on
Persistence (things that store data, do stuff with information in ANY form) can be grouped into 4 architectural buckets. The key is not to get hung up on some things that get people off track.
People tend to mentally group platforms into buckets based on “physical” things like interconnect (“hey they all use IB between nodes!”), or protocol (“It’s block, or NAS, or multiprotocol!”) or whether they are hardware appliances or software-only models (“it’s just software on a server!”).
This is the WRONG taxonomy, the wrong way of grouping. ARCHITECTURE (and I mean software architecture) is actually the correct grouping (because it drives the fundamental characteristics). All the other elements are particulars of how a given architecture is constructed – and in the case of “hardware appliance” or “software-only” (or the seemingly new “server/storage hyperconverged”), they are all just variations.
Don’t get me wrong, these variations MATTER (in many ways!), but are not fundamental.
One last point before diving in… It’s in our human nature to ask “which of these is right/best?” The answer is: for any given situation or workload there might be a “best”, but one CANNOT say which is best for all workloads, or for “today” and “tomorrow”. Workloads drive architectures – not the other way around.
One of the funniest recent examples is the “Flash will drive to an ‘all flash datacenter’!” debate on Twitter. These usually come from one of my compatriots from an all-flash startup, or an EMCer who is an XtremIO specialist :-) I love passion in all forms – and there is NO QUESTION that flash is disrupting use cases where performance and latency are the primary determinants (which is many) – but guys, as much as I love you – you are wrong :-)
Here’s one recent web-scale EMC customer. Most of you use this every day whether you know it or not. Could this be served by all-flash? Umm, no.
That one is big. This one is HUGE. It’s 10 times bigger. Again – most of you use this every day whether you know it or not. Could this be served by all-flash? Umm, no.
The counter-argument (which, again, will always come from someone with a world-view where they are a hammer and everything looks like a nail) is that the flash roadmap (combined with dedupe – but of course not all workloads are dedupe- and compression-friendly) will eventually eclipse magnetic media.
Well – to some degree, yeah – flash will continuously get lower cost $/GB. I actually thought it would happen faster, but I just saw a 1TB SSD for $550 – that’s a lot of progress (UPDATE: note that larger and larger capacity cMLC is coming online all the time, and we will embrace and lead there too). Of course, the magnetic folks aren’t standing still either – and in the 2017-2018 window where they may intersect, there will be new players on the scene (Phase Change and Carbon Nanotube most likely, IMO – and where our advanced R&D is now). But – the point isn’t magnetic vs. flash, or even software vs. hardware (example 1 was a hardware appliance – Isilon; example 2 is an EMC software-only stack on COTS hardware) – but architectures.
Also note – these architectural design centers are SO FUNDAMENTAL that it is nearly impossible to change from one to another without in essence re-writing everything – so, building a different stack. So, if a persistence layer is born in one architecture category, it will eventually mature, expand, and then ultimately die in that category.
Here are the four categories that capture the entire current universe of persisting information:
Type 1: Clustered Architectures. These are defined by not sharing memory between nodes, and ultimately by the data “living” behind a single node. One “cue” that what you are looking at is a clustered architecture is that devices sometimes “trespass”, even if they can be “accessed via multiple nodes”. Another cue is that you can point to a brain and then point at some media and say “this brain accesses data that is behind this set of stuff”. They can be drawn like this diagram (where blue is some sort of CPU/memory/IO complex, and green is some sort of persistence media – commonly flash or magnetic media).
Invariably – the “path” of an IO in Type 1 architectures is SUPER direct, SUPER short. You have some sort of low latency interconnect between brains for HA purposes (because some sort of mirroring of IO and usually cache state is needed) – but otherwise you’re going through a relatively direct code path. This means they are low latency software stacks, and relatively simple. Which in turn means they are often the most feature rich (easy to add data services). It’s also not a coincidence that most startups start with a stack that looks like this.
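To make that short, direct write path concrete, here’s a minimal sketch (in Python, with invented names – this is no vendor’s actual API): the owning brain lands the write in its cache, mirrors it to its HA peer, and acks. That’s the whole path.

```python
# Hypothetical sketch of a Type 1 ("clustered") write path: the IO goes
# straight to the owning brain, which mirrors it to its HA peer's cache
# before acknowledging the host. Names are illustrative only.

class Brain:
    def __init__(self, name):
        self.name = name
        self.cache = {}        # volatile write cache
        self.peer = None       # HA partner

    def write(self, block_id, data):
        # 1. land the write in local cache
        self.cache[block_id] = data
        # 2. mirror to the HA peer so a brain failure loses nothing
        self.peer.cache[block_id] = data
        # 3. ack the host -- a short, direct code path (low latency)
        return "ack"

# two brains, cross-wired as an HA pair
a, b = Brain("SP-A"), Brain("SP-B")
a.peer, b.peer = b, a

a.write(7, b"hello")   # lands in both caches, then acks
```

Note how little code sits between the host and the ack – which is exactly why this architecture type tends to be low latency and easy to add data services to.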
(Note that non-HA server PCIe hardware (think Fusion-io cards, or EMC XtremCache cards) is a subset of this class – and dispenses with the HA piece. If you layer on some software for a distributed storage, coherency, and HA model – you can look at that software stack, and it will be one of the 4 architecture types I’m describing in this post.)
You can layer “federation models” on top of this “Type 1” architecture category to make them “more scaley-outy” from a management standpoint. They also then “bounce the IO” until it gets to the brain that has the data. Those “federated” models can also use data-mobility approaches to rebalance between brains and persistence pools. In VNX land, that’s “VDM mobility”, as an example. However, IMO it’s a stretch to call it a “scale out” architecture – and most customers in my experience tend to agree.
The reason is that the data itself always lives behind “a brain”, and in some cases “behind an enclosure”. It may be able to move – but it is always in one place or another. Upside: you maintain the relatively “low cost” (in terms of cycles) and “low latency” write. Downside: you still have the data being ultimately served by a single “brain” (though you can have some indirect access from another). Balancing and tuning are important – unlike in Type 2 and 3 architectures (which we’ll get to), where that tends to “just happen”.
Invariably (no value judgement!), this federation abstraction layer also adds some additional code for redirection – and adds some latency. This is similar to the additional latency/code complexity you see in Type 2 and 3 architectures, and offsets some of the good of the Type 1 model. In the case of UCS Invicta, it seems that this is the role of the “Silicon Storage Router”, as an example. In the case of NetApp FAS 8.x in cluster mode, it’s embedded into the code – but people more familiar than I can comment on the extra code complexity (and benefit) that comes from that.
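The “bounce the IO” behavior of that federation layer can be sketched in a few lines (again, hypothetical names – this is the idea, not any product’s code): any brain can receive an IO, but the data still lives behind exactly one owner, so non-owners forward the request, which costs an extra hop.

```python
# Hypothetical sketch of a federation layer over Type 1: each LUN still
# lives behind exactly one brain; a brain that doesn't own the LUN
# redirects ("bounces") the IO -- one extra hop, and some extra latency.

OWNER = {"lun0": "brain-1", "lun1": "brain-2"}   # illustrative ownership map

def handle_io(brain, lun, hops=0):
    if OWNER[lun] != brain:
        # redirection layer: forward to the brain that owns the data
        return handle_io(OWNER[lun], lun, hops + 1)
    return {"served_by": brain, "extra_hops": hops}

# hitting the owner directly costs no redirect; hitting the wrong brain
# costs one bounce -- the data itself never moves
direct = handle_io("brain-1", "lun0")
bounced = handle_io("brain-1", "lun1")
```

The point of the sketch: the federation layer changes where you can *send* an IO, not where the data *lives* – which is why I don’t call this “scale out”.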
A VNX or NetApp FAS, Pure, Tintri, Nimble, Nexenta – and (I believe) UCS Invicta/UCS are examples of clustered architectures. Note how some of these are “hardware appliances”, some are “software only”, and some are “software only packaged as hardware appliances”. Note they are WILDLY different in their data services and design centers (Pure and UCS Invicta/Whiptail, as examples, are designed around all-flash media), but all of the examples architecturally land in this bucket.
To highlight how wildly the data services in a given architecture can vary: when you hyper-attune the data services for backup workloads, the software stack becomes Data Domain – a NAS implementation designed with super-focus on being the best (and most successful) purpose-built backup appliance on the planet. But it can STILL be classified in our “phylum” – it’s a “Type 1” architecture.
On to the next architectural bucket!
Type 2: Tightly coupled, scale-out architectures. These are defined by using shared memory (think cache and some types of metadata) between nodes, with the data itself distributed across some number of nodes. This architectural model needs to deal with a VERY large amount of inter-node communication for all sorts of operations.
The defining element of shared memory models is critical to these designs. Historically, it enabled “symmetric” IO paths through all brains (the lower diagram). It was originally designed so that in failure (planned or unplanned) modes, IO operations would remain relatively balanced. This was the “origin story” of things like Symmetrix, IBM DS, HDS USP and VSP. They share their cache in some form – so that an IO can be serviced from any brain.
The upper diagram is XtremIO. Unlike the loosely coupled Type 3 architecture, which at first glance looks similar – the shared and distributed metadata model means that not only does it use IB, but it is also dependent on RDMA to be able to share that metadata between nodes (it also has each node as an HA pair). Note how, when people look at Isilon and XtremIO as “architecturally similar”, they are way off. Yes, they both have a scale-out architecture. Yes, they both use IB as an interconnect. But Isilon, unlike XtremIO, uses IB for extremely low-latency internode communication and remains loosely coupled. There is no “shared memory” between nodes. Isilon could use Ethernet between nodes (and in fact that’s how the Isilon virtual appliance works) – it would just increase the latency of IOs. XtremIO depends on RDMA.
While I’ve drawn the two diagrams differently – they are actually the SAME architecturally. Pairs of HA controllers, each using a shared memory model, with some form of very low latency interconnect. VMAX uses a proprietary interconnect, but could use IB in the future.
There is a TON of additional code complexity that comes in tightly coupled designs. It’s not a coincidence that they are “relatively more rare”. It’s also not a coincidence that often they don’t see the same rate of data services being added in their core code stack. It’s simply a more complex computing problem.
The upside is failure behavior (symmetric IO paths across brains), and in the XtremIO case, a very unique capability in AFA land.
In XtremIO, that core architecture means all data services are distributed. It is also the only scale-out AFA on the market (though dynamic node add/remove is not in the code yet). It’s also at the root of XtremIO having “dedupe that is always on and ‘free’” (i.e. it’s intrinsic). It does mean that XtremIO has taken longer to bake, and additional data services are coming. It is much harder to engineer.
BTW – sometimes it’s worth taking the time to do things right. On today’s Q4 analyst call – we talked about our Q4 XtremIO business. It was in a “Directed Availability” period for about 6 months in 2013. 2013 was a very, VERY active AFA year - a LOT of all-flash new entrant players. XtremIO was only Generally Available for 6 WEEKS in Q4. In that time period, we sold and shipped enough XtremIO for XtremIO to become the unit, volume, and revenue all-flash array market leader… (and it still doesn’t have snapshots, remote replication or dynamic scale-out… but will soon! :-)
Here’s the interesting thing to understand about “tightly coupled” (Type 2) and “loosely coupled” (Type 3) architectures. The more tightly coupled an architecture – the more it can deliver low latency, and PREDICTABLY low latency. Conversely, the more tightly coupled an architecture, the harder it is to add nodes and scale. This makes sense – when you share memory space, it’s one tightly coupled distributed system. The complexity of issues and bugs grows and grows. This is why a VMAX right now can have 16 brains (in 8 Storage Engines), and XtremIO can have 8 brains (in 4 X-Bricks) – soon to expand to 16 brains and 8 X-Bricks. Getting those architectures to double or quadruple that is an insanely difficult engineering problem. Conversely, VSAN’s scaling target is the “size of vSphere clusters” (currently 32), Isilon can have 100+ nodes, and ScaleIO can have hundreds – in some cases even 1000+ – nodes participating, but they are still transactional-class Type 3 architectures.
Type 3: Loosely coupled, scale-out architectures. These are defined by not using shared memory between nodes, but the data itself is distributed across some number of nodes. This architectural model needs to deal with a larger amount of inter-node communication on writes (an “expensive” operation in terms of cycles) – because the data is distributed in SOME way. BUT they are transactional – the writes are distributed, but always coherent.
Note a thing in the diagram – in these architectures, a node is generally not considered “HA”, the resilience comes from data copies and distribution.
Often latency is “hidden” by a low-latency write destage (NVRAM, SSDs, that sort of thing) – but ultimately there is always more “bouncing around” of communications and copies (ergo extra IOs) than in the simple “cluster” architectures, since the write is distributed (for protection, and for scale-out options on read operations).
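A minimal sketch of the Type 3 write path (Python, with invented names – not any product’s actual code): the block fans out to two nodes’ low-latency destage area, and the write isn’t “done” until every copy has landed. Distributed, but always coherent.

```python
# Hypothetical sketch of a Type 3 ("loosely coupled, scale out") write:
# a block is placed deterministically on 2 of the nodes and is only
# acknowledged once BOTH copies are durable -- distributed, yet
# transactional. Node names and placement scheme are illustrative.
import hashlib

NODES = {f"node-{i}": {} for i in range(4)}   # each node's destage area

def place(block_id, copies=2):
    # deterministic placement: hash the block id, pick `copies` distinct nodes
    names = sorted(NODES)
    start = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16) % len(names)
    return [names[(start + i) % len(names)] for i in range(copies)]

def write(block_id, data):
    targets = place(block_id)
    for node in targets:            # the write "fans out" over the network...
        NODES[node][block_id] = data
    return targets                  # ...and only acks once every copy landed

committed_to = write("blk-1", b"data")
```

The extra network round-trip per copy is exactly the “expensive” write the text describes – and why these stacks hide it behind NVRAM or SSD destage.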
Sometimes this type of architecture groups the nodes into some sub-grouping and has other nodes that are “mappers” (metadata nodes), but that is in effect the “federation” idea you saw above.
The upside of these designs is that they are simple in their operations and scaling. They can be very good on distributed reads in some cases (since the data is in more than one place, sometimes in many places and can be serviced by multiple nodes).
They are also interesting in that they are also a “good natural fit” for server/storage software and hyper-convergence for workloads that are transactional (because their distributed nature means you can put them on a non-HA server easily, and they are loosely coupled, so plain ol’ awesome ethernet can do just fine – nothing esoteric needed).
EMC ScaleIO is an example of this type of architecture. So is EMC Isilon. So is VSAN. So is Nutanix and Simplivity. Again, just like the previous example, it’s almost criminal to lump these together – as their data services, data distribution, scaling strengths/weaknesses and maximums/minimums all vary wildly.
Their loose coupling means that often they CAN scale to moderately (and in some, more than moderately!) large number of nodes. They DON’T share memory between nodes. Their code in each node is “independent” in a sense from other nodes. That said, the devil is in the details:
- The more distributed the data on writes, the higher the latency (one example: Isilon is SUPER distributed on a file IO – with most files landing on several nodes – so write latency tends to be higher) and the lower the effective IOps (as data is being put in many places). Isilon with each major release gets lower and lower latency – and customers should expect that – but the degree of extra IOs and distributed writes means it will never be a latency or IOps machine. It can, however, be a bandwidth monster.
- If you use a lower amount of distribution (even if the number of nodes in the cluster is higher), latencies can be lower – but the flip side is true: you don’t get the same degree of potential parallelism on reading data. That is the design center of VSAN, as an example. It uses the “VM is an object” model as the vehicle for maintaining multiple copies across the VSAN cluster. It’s expected that a VM will be accessed by a given host. In fact, VSAN uses this to “bias” a VM towards nodes that happen to have its data. Conversely, people using VSAN can see for themselves what increasing the object copy policy to large numbers does to latency and system-wide IOps – hint: more copies = higher and higher load on the system as a whole (and the effect is non-linear, as you would expect from that type of design). That’s not a problem for the VSAN design center – and “VM object awareness” is a big benefit.
- You can have low write latency AND high scaling AND high parallelism on reading – but only if you finely sub-stratify the data, BUT only write a small number of copies. That’s the ScaleIO design point. Each device/volume is sliced into many pieces (1GB – UPDATE: the correct value is 1MB, and it is tunable), and distributed across all the nodes participating. Reads, rebuilds, and redistribution are stratospherically high bandwidth with large cluster counts. Latency on writes can be sub-1ms with a robust network and SSDs/PCIe flash on the nodes in the cluster. However, each write is committed to 2 nodes. The parallelism comes into effect because you are grabbing data across many slices and nodes. Of course, unlike VSAN – it doesn’t have VM object awareness (but if it did, it wouldn’t scale the same way!)
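That slicing scheme can be sketched in a few lines (Python; slice size, node names, and the round-robin placement are all illustrative simplifications, not ScaleIO’s actual algorithm): a volume is cut into small slices, each with a primary and a mirror on different nodes, so reads and rebuilds pull from the whole cluster in parallel.

```python
# Hypothetical sketch of a finely-sliced volume layout: small (1MB) slices
# scattered across every node, each slice mirrored to a second node. The
# round-robin placement here is a stand-in for a real distribution scheme.

SLICE = 1 * 1024 * 1024                 # 1MB slices (tunable in reality)
NODES = [f"sds-{i}" for i in range(8)]  # 8 participating nodes

def layout(volume_bytes):
    slices = volume_bytes // SLICE
    # each slice gets a primary node and a distinct mirror node
    return [(i, NODES[i % len(NODES)], NODES[(i + 1) % len(NODES)])
            for i in range(slices)]

# a 16MB volume becomes 16 slices, spread over all 8 nodes -- a read of
# the whole volume can be serviced by the entire cluster in parallel
m = layout(16 * 1024 * 1024)
```

Every write still costs two commits (primary + mirror), but because the slices are tiny and everywhere, reads and rebuilds parallelize across the whole cluster – the trade-off the text describes.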
Also – remember – architecture is separate from implementation. If you look at those – you see some that use Ethernet to interconnect, others use IB. Some are software only, some are hardware/software. But – they do share the architectural design center.
One common thing: while the interconnect may vary, all the examples treat the distributed network on writes as a critical dependency. You distribute the write, and it’s not done until it’s sitting in more than one place. While distributed writes are the design center, they are transactional and atomic. They all have to implement careful controls to make sure this is always true. They also need to deal with “failure domains” getting big. These two things INVARIABLY set the “upper limits” of their scaling. Later, when we talk about Type 4 (which physically looks similar), these are critical differences.
This is the first litmus test to see if you’re following closely. While I don’t claim to be an expert on Cisco UCS Invicta – is it a Type 1 (Clustered) or Type 3 (loosely coupled scale out)? Physically, it looks like a Type 3. After all, it’s a set of UCS C-series servers, running the Invicta (formerly Whiptail) software stack, and interconnected with Ethernet, right? Hint: implementation varies, but the architectures are fundamental :-)
In the UCS Invicta case – the data is behind a node (a UCS server with MLC-based flash). A single appliance node is not HA; it’s a standalone server – an “Invicta Appliance”. It can directly present a storage target (LUN). This makes you lean toward “Type 3” – assuming that perhaps as you add more nodes, it “scales out” in a loosely coupled way, like ScaleIO or VSAN.
…But the way you have more than one node seems to be (again, Cisco team, please correct me if I’m off base) to configure it and migrate to an “Invicta Scaling Appliance”. In that configuration, you have some nodes that are “Silicon Storage Routers” (SSRs) which address storage from multiple appliance-style nodes that house the data. The data is accessed through a single SSR node (active), but can be “trespassed” to another SSR node that is acting as the HA pair. The data itself is always in a single UCS C-Series “data node” – and does NOT scale out or spread out. So – what type of architecture is it? Regardless of how it may physically LOOK, it’s a “Type 1”. The SSR is a cluster (there can be more than 2, apparently). In the Scaling Appliance configuration, each MLC-packed UCS C-Series server is performing a function analogous to what in a VNX or NetApp FAS is a “disk enclosure” – not connected via SAS, but similar architecturally. UCS Invicta is more like a software VNX or NetApp ONTAP stack running on a server than it is like VSAN or ScaleIO. It is a “Type 1”.
On to the fourth, and most “alien” (for most enterprises at least) architectural type!
Type 4: Distributed, shared-nothing architectures. These are defined by using NO shared memory between nodes, with the data itself distributed across some number of nodes – but in a lazy, non-transactional way. These architectures chunk up the data; it lives on one node, and then copies are periodically (sometimes) distributed for protection. They are not transactional.
There is inevitably some interconnect between the non-HA nodes (always Ethernet – but only because it’s low cost and ubiquitous). The key difference when compared with “Type 2” and “Type 3” is that the storage stack isn’t tightly dependent on “transactional-ness”. The distribution of data can be forced sometimes, or lazy at others. The “correctness” isn’t necessarily true universally (but often the app stack above checks to make sure it’s using the right data). In the case of some workloads (like HDFS), the data is chunked and distributed to be co-resident with the portion of compute that needs that particular data.
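The contrast with the Type 3 write path is easiest to see in code (Python; node names and the queue-based “anti-entropy” pass are illustrative stand-ins, not any product’s design): the write is acknowledged after ONE copy lands, and replication happens later, in the background.

```python
# Hypothetical sketch of the Type 4 design center: a write acks after
# landing on a SINGLE node; copies are made lazily by a background pass.
# Between the write and that pass, other nodes simply don't have the data
# yet -- the application layer is expected to tolerate (or verify) that.

NODES = {f"node-{i}": {} for i in range(3)}
pending = []                       # replication work queue

def put(key, value):
    NODES["node-0"][key] = value   # ack immediately -- only ONE copy exists
    pending.append(key)            # ...replicate "eventually"
    return "ack"

def replicate():
    # lazy background pass: push queued keys out to every node
    while pending:
        key = pending.pop()
        for node in NODES.values():
            node[key] = NODES["node-0"][key]

put("photo-1", b"jpeg-bytes")      # acked before any copy is distributed
```

Skipping the synchronous fan-out is precisely what removes the Type 3 scaling limits – and precisely what makes the stack non-transactional.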
This is actually the property that makes this type of architecture the most scalable of all four.
That’s just ONE of Type 4’s super-powers. They are super-simple. They are crazy easy to manage and operate at scale. They have absolutely no hardware dependencies – so can be on the lowest cost hardware possible, and are almost always software-only. They shrug off “petabyte” scale like others react to “terabytes”. They use object and non-POSIX filesystems (both often layered on top of a local filesystem on each node which is used for basic stuff).
You can layer or “front” these architectures with block and NAS transactional presentation models, but that only waters down what they are awesome at. Layering a Type 1, 2, or 3 stack on top of these Type 4 stacks is NOT THE RIGHT WAY to build a transactional stack :-)
When you have an application that can work around the limits, you can exploit the super-powers of this “Type 4” architecture.
The 2nd customer example I gave above (with more than 200,000 spindles) is one of those use cases. If you think of the persistence layer under DropBox, or Syncplicity, or iCloud, or the way pictures are stored in almost all Web 2.0 applications including Facebook and eBay, or YouTube videos – it’s all on Type 4 architectures. Hadoop clusters all process information persisted in a Type 4 architecture. This category is relatively rare in enterprises – but all around us at the same time, and growing like crazy.
What are implementation examples of Type 4 architectures? Well, AWS S3 (simple storage services) is one (BTW no one outside AWS knows how EBS works, but I would BET that it’s a Type 3). So is Haystack used in Facebook. So is Atmos from EMC, as well as the ViPR Object stack. So is Ceph. So is Swift in Openstack. So is HDFS. EMC Centera is one. Interestingly, while people think of us as the “Enterprise array company”, EMC provides the most widely deployed commercially available object stacks (Atmos, Centera and rapidly growing ViPR object) – as both software-only and software/hardware appliances.
To further complicate things – many of the list above are defined by their API, and can have multiple implementations. Saying “HDFS” is like saying “NFS”. Ditto with “Swift”. There are multiple implementations of some of the above. Example – the ViPR object stack can present via the S3, Swift, and Atmos object APIs (and in the future Centera!), and also (amazingly, simultaneously!) HDFS. It may be obvious, but Atmos has a long life ahead of it, and so does Centera – but both as APIs, not products. Implementations may vary, APIs are constant – which is very good for customers :-) The Atmos and Centera (and EMC versions of the HDFS, S3 and Swift) implementations are all merging into the ViPR object stack.
Note once again how “physical appearance” would make you misclassify Type 4 architectures into the wrong phylum – because Type 3 and Type 4 often physically look alike. At the PHYSICAL layer, this looks a lot like ScaleIO, or VSAN, or Nutanix – after all, it’s just servers with Ethernet! But those are transactional, and these Type 4 architectures are not.
This is the second litmus test to see if you’re following. Use this to “test” against the UCS Invicta design center. Yup, physically it LOOKS like it could be one of these (just servers connected via Ethernet!), but it will never scale for Type 4 workloads – because, architecturally, it’s not a Type 4. It’s a Type 1 (and one that, like Pure, is designed as an AFA).
My goodness, if you’re still with me – thank you for your investment of time and attention. Re-reading the above, it’s amazing that I managed to successfully get married and have children :-) You should read my internal emails/posts to my EMC teammates – they are equally long and verbose :-p
What’s the point of all this?
Persistence is a fascinating domain, a “kingdom” in the universe of IT.
There is a lot to learn, and play with. Yup – there is a lot there – and that’s complexity. Those stacks are so different; their diversity is like genetic pool diversity – a good thing! But for a customer, and for the industry – we MUST mask that complexity. BTW – that’s why we’re so maniacally focused on the ViPR Controller, and doing everything we can to make it OPEN and a PLATFORM (making it freely available – see here – is just the start).
But storage is so booooring… it’s all the way at the bottom. What do I mean?
Well, the “User” is at the top, and “applications” exist to serve the user – Infrastructure (including modern SDDC, virtualized, pooled, abstracted and automated infrastructure) is there to SERVE the application, and in turn the user. And, in turn, workloads drive architecture – not the other way around. Storage is all the way at the BOTTOM of this stack.
So, if in the “hierarchy of IT” you have: User->Application/SaaS->PaaS->IaaS->Infrastructure, and at the bottom of Infrastructure, storage – why the heck am I passionate about this topic?
Answer: In the end, any app, any PaaS stack must compute or process against SOMETHING. It’s not about storage – it’s about INFORMATION. These 4 architectural types are all ways of supporting different types of information, different workloads. Information pops right back up to the user in terms of importance. After all, the raison d’être of the application is to give the user some way of interacting with information. That’s why “persistence architectures” is so important a topic, and part of the world we live in. And that’s a really, really cool thing.
As always – input welcome!