One of my most popular and furiously debated blog posts in recent history was a long diatribe on the nature of the diverse “persistence ecosystem” – the “Storage Tree of Life” (here).
It outlined how you can “group” all storage architectures into four major “Phylum” (branches), and how all persistence architectures can be classified into these buckets. Each grouping has fundamental architectural strengths and weaknesses, and these behaviors are the same whether the stack “manifests” as software on commodity hardware or as a hardware appliance. They are:
- Type 1: Tightly Coupled Scale Up (Clustered) Architectures (great swiss army knives)
- Type 2: Tightly Coupled Scale Out (Distributed) Architectures (great “linear always”)
- Type 3: Loosely Coupled Scale Out (Distributed) Architectures (great scaling model and “transactional-ness”)
- Type 4: Distributed Shared Nothing Architectures (ultimate scaling model for non-classic, non-transactional models)
(Note – a lot of the feedback was to flip “which one was 2, and which one was 3” – which creates a more logical flow from “most tightly coupled” to “least tightly coupled”. Just like all “Phylum” groupings, these are somewhat arbitrary – particularly in naming – so, since it makes sense to others, I’ve updated the original post.)
Everything (I have yet to find an exception) can be classified in these categories (even local PCIe Flash DAS is a subset of Type 1, where the “controller count” is 1).
They look like this:
This raises an interesting question: are new “Phylum” evolving? Answer = YES.
Today, at EMC World, we shone a light on some of the crazy coolness behind a new “5th Phylum”.
Just like in the amazing diversity of life that surrounds us (and that we participate in!), persistence architectures are shaped by continued “mutation” (innovation) and “selection” (the emergence of new workloads is analogous to “environmental pressure”). Read on to learn a little more about what the heck I’m talking about!
In evolutionary biology, people furiously debate which is more important: mutation mechanics or selection mechanics. I think this is a weird (nonsensical?) debate. They are parts of the same system.
To understand what’s going on, you need to understand “environmental pressure”. This is the cumulative pressure of the ecosystem on the organisms that live in it. In natural history, there are periods of “punctuated equilibrium” where, all of a sudden, environmental pressures change (floods, volcanic eruptions, extra-terrestrial impacts, changing oceanic conditions) and diversity explodes or suddenly changes direction.
In the world of technology, an example of a “punctuated equilibrium” disruption is the thing often called “the 3rd Platform”, though I prefer to explain it in technical terms.
This picture sums it up in a nutshell. It’s all about the app, stupid :-)
In the “Platform 2” apps noted on the left (which are enormous in number and criticality, and will continue to grow!), the application stack inevitably had at the bottom a “data layer” that depended on a Relational Database (which used tight locking semantics and religiously enforced persistence logging to deliver an “absolute ACID correctness” model), and that in turn created expectations of “infrastructure resilience”.
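To make that “absolute ACID correctness” contract concrete, here’s a minimal sketch using SQLite as a stand-in for any relational database – the transaction either lands in full or not at all, and the engine uses locking plus a persistence journal under the hood to guarantee it. The table and values are invented for illustration.

```python
# A minimal sketch of the ACID contract, using SQLite as a stand-in for any
# relational database. The table and values are invented for illustration.
import sqlite3

con = sqlite3.connect("ledger.db", isolation_level=None)  # we manage transactions explicitly
con.execute("CREATE TABLE IF NOT EXISTS accounts (name TEXT PRIMARY KEY, balance INT)")
con.execute("INSERT OR REPLACE INTO accounts VALUES ('a', 100), ('b', 0)")

try:
    con.execute("BEGIN")   # take locks; start the persistence journal
    con.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'a'")
    con.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'b'")
    con.execute("COMMIT")  # the transfer is durable only once the journal says so
except sqlite3.Error:
    con.execute("ROLLBACK")  # on any failure, the journal undoes the partial work

print(con.execute("SELECT * FROM accounts").fetchall())  # [('a', 60), ('b', 40)]
```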
That basic stack created Oracle. It created EMC. A great deal of the “infrastructure resilient” IT ecosystem grew up around it. Even things that are relatively new (think virtualization and VMware) are pretty centered around (but not limited to) this stack that expects “transactional persistence” + “I expect the infrastructure to be a tank”.
SIDEBAR: this “classic” application stack is an area of ongoing innovation/disruption itself. Think of the AFA industry and things like XtremIO creating a huge “order of magnitude” improvement over magnetic media. Think of pervasive use of Flash in hybrid array designs. Think of new “hyper-converged stacks” like VSAN or ScaleIO that deliver “transactional persistence” + “infrastructure resilience” via SDS data planes running on commodity hardware. These are all examples of disruption in “traditional” virtualized application stacks. So… don’t think “new” vs. “old”. “Platform 2” is a vibrant, dynamic, expanding, and evolving ecosystem.
In “Platform 3” apps (which are smaller in number and – though widely used in new apps – relatively new in the classic “enterprise”, where they are an area of furious early growth), the application stack is different (and linked to the new social, mobile ways that people use apps and interact with big data). The Application Fabric is inevitably a PaaS model using one or more frameworks. The Data Fabric layer often has 3 very different buckets. It has a “data lake” which is a composite of:
- A big MapReduce capability (Pivotal HD, Cloudera, Hortonworks, etc.)
- One or more Distributed NoSQL layers (including those that expose SQL interfaces) – like HAWQ, Hive, HBase, Cassandra, MongoDB, etc. – depending on the specific needs (lots of variation here)
- One or more In-Memory Data layers – like key-value stores, SAP HANA, the new in-memory stuff in SQL Server 2014, Pivotal GemFire, etc. (again, lots of variation depending on the specific needs)
This is the FIRST thing to understand. This new application stack is the “new workload” that is driving furious innovation in the parts that “serve” it. It is the “Environmental Pressure”. Customers making choices about how to build the infrastructure to support the workloads are the “Natural Selection” mechanism. Disruptive new architectures/innovations are the gene mutations in this analogy that I’m REALLY stretching :-)
- MapReduce/Distributed NoSQL/SQL are driving requirements away from “infrastructure resilience” towards “application resilience” at the persistence layer.
- MapReduce/Distributed NoSQL/SQL are driving to giant HDFS/Object APIs. Developers generally never ask for a “POSIX-compliant filesystem” or “please give me a block stack”.
- When it’s all about bandwidth, MapReduce/Distributed NoSQL/SQL are driving towards more and more commodity off-the-shelf (COTS) + Software models – until people realize that they want that data accessed in all sorts of ways, without needing to copy it around (in some cases it’s so big you CAN’T). So things that can express themselves in NAS, Object and HDFS forms – that’s VERY powerful (a quick sketch of the idea follows).
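Here’s a hedged sketch of what “same data, many heads” looks like to a developer: the same bytes read once via an S3-style object API and once as an HDFS path, with no copy in between. The endpoint, bucket, key, and credentials are hypothetical placeholders (not any specific product’s values), and it assumes a platform that exposes both interfaces over one store.

```python
# A hedged sketch of "same data, many heads": one dataset read via an
# S3-style object API and again as an HDFS path, with no copying in between.
# The endpoint, bucket, key, and credentials are hypothetical placeholders.
import subprocess

import boto3  # pip install boto3

# Object head: read via an S3-compatible API.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example.com",  # hypothetical endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
body = s3.get_object(Bucket="datalake", Key="events/2014/05/05.json")["Body"]
print(body.read()[:80])

# HDFS head: the very same bytes, addressed as an HDFS path so a MapReduce
# job can consume them in place (standard `hdfs dfs` CLI from Hadoop).
subprocess.run(
    ["hdfs", "dfs", "-cat", "/datalake/events/2014/05/05.json"],
    check=True,
)
```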
This is why EMC is maniacally focused on ViPR Object and HDFS and Isilon stacks for the big swaths of the “data lake”, but there’s a clear opportunity around the In-Memory space.
But…
- …what do these new “in memory” models need – or, for that matter, low-latency distributed NoSQL/SQL databases? Frankly: commodity servers, but with gobs of bandwidth, INCREDIBLY low latency, and gobs of memory (and in this case, latency is everything).
On to the SECOND thing to understand. Innovation (just like mutation) is constant, and almost random. Success is about the innovation (mutation) fitting a workload (the environment) – and then natural selection takes its course (which isn’t random in the world of IT – but does hinge on execution capability).
I’ll try to make this simple in a picture.
This is a modern x86 server (think Xeon-E7).
You have 10’s of KB of registers and on-die SRAM L1, L2, L3 caches – with latencies in nanoseconds. When you leave the CPU itself, you have a Direct Memory Interface to DRAM, with latencies in 10’s of nanoseconds, and an “inter-CPU core” NUMA architecture with huge bandwidth and relatively low latencies.
When you leave the CPU complex (but are still “in the server”) and transit the PCIe bus – latencies pop up to 10’s of microseconds.
When you then get in and out of the flash itself (still “inside the server”, hanging off the PCIe bus), you add many tens of microseconds to read/write the flash itself – call it 50 microseconds. There’s also a ton of hardware innovation happening around the flash media handling (and software innovation on top of that).
If, god forbid, you leave the server entirely and get into an all-flash array (almost regardless of protocol – RDMA, FC, etc. – because it’s not about the link, it’s about the target), you are talking 100’s of microseconds in practice. This isn’t so much about “cables” and “media” but the SOFTWARE STACKS inside the AFAs. Even those that use direct object mapping (like XtremIO) have latencies in the hundreds of microseconds (inevitably they have some block stack). Then there are those that use journaling and other filesystem or pseudo-filesystem layout mechanics – it’s not a coincidence that their latencies pop up to MANY hundreds of microseconds, in some cases breaking the millisecond barrier. And of course, some suffer more than others as utilization climbs and system-level garbage collection kicks in. Furthermore – FC is plenty fast, but it serializes things much more than PCIe, and serialization + latency = bad when the goal is system-level latency behavior. But hey, 100’s of microseconds (even a millisecond) doesn’t sound like much, right?
Answer: it’s a big issue if you view latency as the problem – which it is for these workloads, and for in-memory extension. The reason this isn’t apparent to “everyone” is rooted in the fact that humans are bad at math.
50 microseconds is 50,000 nanoseconds. 500 microseconds is 500,000 nanoseconds. Still doesn’t sound like much?
Let’s imagine for a moment YOU are the CPU, and YOU’RE doing a task. Let’s humanize the timescales:
- Let’s say the task takes 1 nanosecond, but a nanosecond is “one second” in human time. That’s pretty fast in “human time”. When I’m REALLY cooking on something, I feel like I’m working at these timescales – I’m working with my mental “registers”.
- That means that every time you’re doing a task and you need to talk to DRAM, you need to pause for “10 seconds” and wait for it to get back to you. Not bad. That’s equivalent to talking to a colleague and working on your phone: “I’m multitasking on my phone and sending a tweet, but am still with you!” (which is annoying, but not fatal)
- If you need to traverse the PCIe bus and get in and out of flash on a PCIe card, you need to pause for “10 hours” (50 microseconds). Wow. That’s like working on your task, but every time you need anything, you have to hand-deliver a package to yourself as a serialized errand.
- If you need to leave the server entirely and get to an AFA with 500 microseconds of latency, you’re going to have to be REALLY PATIENT. It’s going to take about a work week of chilling out. That’s the equivalent of the voyages to the New World and back in the era of Christopher Columbus, where no one knew anything for a LONG TIME. A brave voyage indeed. (You can check my math in the sketch after this list.)
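If you want to verify the arithmetic yourself, here’s a tiny sketch that applies the same scaling (1 nanosecond of machine time = 1 second of human time) to the latencies above. The exact figures come out slightly less round than my storytelling (50 microseconds is really ~14 “hours”, and 500 microseconds ~6 “days”), but the punchline is the same.

```python
# Scale every latency by the factor that turns 1 nanosecond into 1 human second.
NS_TO_HUMAN_SECONDS = 1.0  # 1 ns of machine time == 1 s of human time

latencies_ns = {
    "CPU register / L1 work":    1,
    "DRAM access":               10,
    "PCIe flash read (~50 us)":  50_000,
    "All-flash array (~500 us)": 500_000,
}

for what, ns in latencies_ns.items():
    human_s = ns * NS_TO_HUMAN_SECONDS
    if human_s < 60:
        scale = f"{human_s:.0f} second(s)"
    elif human_s < 86_400:
        scale = f"{human_s / 3_600:.0f} hours"
    else:
        scale = f"{human_s / 86_400:.1f} days"
    print(f"{what:28s} -> {scale} in human time")
```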
Am I saying “AFA” = bad? Goodness no. For “non-in-memory workloads”, GOOD AFAs (consistent, predictable sub-millisecond latencies with data services and broad IO characteristics) are a “rocket” relative to hybrids (which themselves can use Flash to cache/buffer and tier – though they won’t have the “always flat” AFA characteristic).
But – this is an area where HARDWARE innovation is possible (there is a ton of software magic, but this isn’t going to be tagged as SDS by anyone) in the form of a “top of rack” shared pool of Flash. We took an initial look at this organically (as part of “Project Thunder” – some info here), and learned a lot (and much of that learning is making it into all parts of the portfolio).
The FIRST lesson learned through Project Thunder was the following: flash on its own isn’t fast enough. We learned that “really fast shared PCIe flash” designs that just mapped via RDMA didn’t have enough of a benefit over AFAs (which were evolving to offer richer data services for things that were OK with 250-500 microseconds of latency – relational databases, VDI, etc.).
If we really wanted to attack the latency problem, we needed an “in memory” version of the original “hey, what if we fronted slow magnetic media with cache and controllers?” idea that gave birth to the storage industry (the “cached array” hardware innovation).
And what better team to do it than the crack crew of Andy Bechtolsheim (founder of Sun Microsystems, Granite Systems and Arista), Bill Moore (first employee at 3PAR, primary CPU and server bring-up engineer at Sun Microsystems, co-led ZFS development at Sun, and served as Chief Engineer for Storage) and the broader DSSD team, which has been working on this problem in stealth mode for more than 3 years.
The SECOND lesson learned through Project Thunder that shaped our thinking on the topic: the ultimate manifestation of this would benefit from something even lower latency than PCIe or IB today – something closer to the Direct Memory Interface class of latency – to keep us in the “days” of latency, not the “weeks” (in human time).
…interesting stuff happening here – stay tuned ;-)
And there was a THIRD critical lesson learned through Project Thunder. In many cases, the ideal interface isn’t RDMA over IB/Ethernet (memory mapping); rather, the way the developer DESIRES to interact with the persistence layer is directly via the application API. You could do this with NVMe, and it could be possible to “bolt right in” to HDFS, key-value stores and others. This is what DSSD does.
Net: DSSD is TRUE “top of rack, pooled server-memory-flash”. It has a density and performance envelope in a “new category band”.
You can see why it was such a head scratcher for many (in EMC at least) when things that are clearly “AFA” software stacks running on servers were called “server memory”. Umm, if it presents a “LUN”, it’s not “server memory”.
While I’ve lightly touched on some of the main drivers and architectural ideas behind DSSD, I won’t give away the secret sauce yet. There’s a lot more there there – and we’ll talk a little more at Area 52 tonight.
Like with XtremIO, when EMC sees something really cool really early – we don’t hesitate. We were an early investor in DSSD, giving the startup the support, investment (and latitude and independence) it needs to keep going to the point where its baby has world impact. Just like with XtremIO, there will be a Directed Availability period before GA.
So – with all that said, a diagram perhaps?
The best way I can think of to draw it is like the below: DSSD is a very dense package of flash modules (for a ton of IOps), each with a small controller (all grouped and pooled together), plus even lower-latency memory to “buffer” the flash latency – and an interconnect to hosts via a super-low-latency link. How dense? Think “order of magnitude” over current options. Add in the software to manage, pool, automate – and ultimately present in different ways – and you have our “5th Phylum”.
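To see why that “buffer” tier matters, here’s a toy model of the expected read latency of such a design as the buffer hit rate climbs. Both latency constants are illustrative placeholders (roughly the raw-flash and DRAM-class numbers from the ladder earlier), not DSSD specifications.

```python
# A toy model of the "buffer the flash latency" idea: reads that hit the
# low-latency memory tier skip the flash access entirely. Both latency
# constants are illustrative placeholders, not DSSD specifications.
FLASH_READ_US = 50.0   # assumed raw flash read latency (per the ladder above)
BUFFER_READ_US = 1.0   # assumed DRAM-class buffer latency

def expected_read_latency_us(hit_rate: float) -> float:
    """Expected per-read latency for a given buffer hit rate."""
    return hit_rate * BUFFER_READ_US + (1.0 - hit_rate) * FLASH_READ_US

for hit_rate in (0.0, 0.5, 0.9, 0.99):
    print(f"buffer hit rate {hit_rate:>4.0%}: "
          f"~{expected_read_latency_us(hit_rate):.1f} us per read")
```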
There is no “block stack” inside the platform. There is no “file stack” (though these could be layered on top – and DSSD has already implemented a POSIX-compliant model on top of their stack as a “we could do this if needed” PoC).
Put another way: DSSD != “SCSI WRITE (block address)”, and DSSD != “file open (file pointer + byte offset)”.
DSSD doesn’t require any of those file/block semantics between the application and the flash read/write model. It can expose this via libHDFS or object semantics, or by mapping directly to key-value stores (over a PCIe/NVMe connection). If you want direct memory mapping over RDMA or over direct PCIe/NVMe, it can do that too!
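To make the contrast concrete, here’s a purely illustrative pseudo-API – the class and method names are invented for this sketch and are emphatically NOT DSSD’s real interface. The point is the shape of the call: the application’s own semantics (key/value here) map straight onto flash objects, with no SCSI WRITE or file open/seek in between.

```python
# Purely illustrative pseudo-API – names invented for this sketch, and
# emphatically NOT DSSD's real interface.

class BlockTarget:
    """Classic SAN-style semantics: the app speaks in block addresses behind a LUN."""
    def write_blocks(self, lba: int, data: bytes) -> None:
        raise NotImplementedError("SCSI WRITE to a logical block address")

class DirectFlashObjectStore:
    """The 'no block stack, no file stack' idea: the application's own API
    (key/value here; HDFS-style or memory-mapped would be analogous) maps
    straight onto flash objects over a PCIe/NVMe-class connection."""
    def __init__(self) -> None:
        self._objects: dict[bytes, bytes] = {}  # stands in for the flash object map

    def put(self, key: bytes, value: bytes) -> None:
        self._objects[key] = value  # no LUN, no POSIX file in between

    def get(self, key: bytes) -> bytes:
        return self._objects[key]

# A key-value store "bolting right in": no SCSI WRITE, no file open/seek.
store = DirectFlashObjectStore()
store.put(b"user:42", b'{"name": "Ada"}')
print(store.get(b"user:42"))
```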
Here’s how the stack compares with local server HDD and SSD/PCIe-attached “DAS” vs. DSSD’s shared direct access model:
If you think the “Top Secret” latency stack is a tease (it kinda is) – well, let’s just say “order of magnitude improvement”.
Note that if you want to compare the “non-local device” category – ergo a network-attached block device (via a SAN or NAS) – from the PCIe HBA on down there’s network latency (usually low microseconds), and then you hit the software stack of the array itself, which has some of the OS/kernel stack stuff in it (mapping and data services layers, POSIX F/S, LVMs) and then its own hardware stack.
This is why some recent coverage – where people look at arrays (even new AFA startups) as memory-mapped models, or at all-flash arrays as “extending the life of SANs” – well, I tend to think that’s a bad way to think. Most AFAs are a lower-latency version of the existing architectural model (that’s not a bad thing!) targeting workloads that are mainstream today. DSSD is something entirely different – targeting future workloads.
Interestingly – who is CLOSEST to this internal “direct flash object mapping” model? The only AFA on the market that took a “clean sheet” approach and asked the following questions:
- “Do I **NEED** a block mapping layer in the array, or just a block presentation of an object model?”
- “Do I **NEED** a POSIX filesystem inside the array?”
- “Do I **NEED** a log-based layout model for data services in the array?”
The only AFA startup that took that architectural approach to its design target was XtremIO, and it is in a class by itself in THAT market because of this FUNDAMENTAL architectural advantage. While admittedly catching up on certain features (compression and in-array replication as examples), architecture (which makes it scale out, which makes it so linear, which gives it an incredible inline dedupe and snapshot engine) tends to win over features – and you’ll see more from XtremIO this week :-)
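For the curious, here’s a toy sketch of the “direct object mapping” idea behind those questions: address data by a fingerprint of its content instead of a logical block address, and inline dedupe falls out almost for free. This is a teaching illustration, not XtremIO’s actual implementation.

```python
# A teaching sketch of content-addressed ("direct object mapping") layout:
# blocks are addressed by a fingerprint of their content, so identical
# writes dedupe inline for free. Not XtremIO's actual implementation.
import hashlib

object_store: dict[bytes, bytes] = {}  # fingerprint -> block contents
logical_map: dict[int, bytes] = {}     # logical block address -> fingerprint

def write_block(lba: int, data: bytes) -> None:
    fp = hashlib.sha256(data).digest()
    object_store.setdefault(fp, data)  # duplicate contents are stored once
    logical_map[lba] = fp

def read_block(lba: int) -> bytes:
    return object_store[logical_map[lba]]

write_block(0, b"A" * 4096)
write_block(7, b"A" * 4096)  # identical content – dedupes against block 0
print(len(object_store), "unique object(s) for 2 logical blocks")  # 1
```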
So, where does DSSD fit in the ecosystem of persistence stacks?
DSSD is the “ultra hot” edge – infrastructure that supports in-memory databases like SAP HANA and GemFire, and key-value stores like memcached. DSSD changes the economics of in-memory databases, because you can have far more “memory” at a different economic point.
DSSD changes the envelope of IOps and latency for very latency-sensitive distributed NoSQL/SQL databases used for real-time analytics.
The other architectures (whether software like ScaleIO, or hardware like XtremIO/VNX/VMAX, or their software-only variations like vVNX) are the continuum of the hot/warm persistence layer that supports transactional workloads (relational databases and the next band of distributed NoSQL/SQL models), and architectures like vOneFS/Isilon and ViPR Object/HDFS are the cold core (and underpin giant data lakes with HDFS, NAS and Object interfaces – yes Hadoop, but also some distributed databases).
Inevitably, there will be some (likely competitors trying to find a wedge and respond) saying “hey, what does this mean for XtremIO?” The answer is simple: nothing – particularly if you’re still following me in this blog post :-)
XtremIO is awesome. It’s taking the AFA market by storm. Customers and partners love it. In our opinion (likely biased :-) it’s the best AFA on the market.
It’s not the right architecture for an in-memory database, because it’s a storage array and storage target, not server memory. You can see how it fits into the overall ecosystem above, and also in table form below.
Net? Interesting how diversity is fundamentally good – in natural ecosystems, in organizations, and in IT too :-)
Welcome, DSSD, to the EMC family – and welcome, world, to a whole new 5th “Phylum” in the world of persistence! Input/commentary always welcome!
Excellent read, and good to understand that for in-memory DBs EMC already has a product.
To me this is the right answer for future SAP HANA storage systems.
Posted by: Markus Zappolino | May 05, 2014 at 03:30 PM
Let's wait for the latency figures, but there probably is a 6th phylum you have not touched on yet that gets us from Ultra to Extreme Hot :) – using DRAM for the data with an adjacent persistence layer (NVRAM/flash/disk) to deal with failures. An example of this is RAMCloud.
Posted by: Peter Marelas | May 06, 2014 at 12:49 AM
Good read but my brain hurts... Can't wait to learn about the data services. I hope that those advanced data services that customers have become accustomed to on EMC products are at or above what we have today.
Posted by: EMC SockMonkey | May 06, 2014 at 06:07 PM
Great write-up Chad! After last week's EMC World sessions and now finally re-reading (more importantly, understanding the DSSD effect)... I am super fired up about the face-melting innovation we are building for SAP HANA workloads. It's also super cool that EMC is willing to challenge the status quo, solving similar problems in a better, faster, and cheaper way (sometimes cannibalizing existing platforms).
P.S. I am planning to write a blog post on DSSD as it relates to SAP, however mine will not be this technical :-)
Disclosure - I am an EMC employee
Posted by: Henrikwagner73 | May 13, 2014 at 11:15 AM
What does DSSD stand for?
Posted by: gunauro | March 03, 2016 at 04:32 AM
Interesting that this would be a new phylum; to me it looks like a variant or maybe an evolution of Type 2. Data still resides on media which sits behind a couple of brains that have a complete view. Sure, it's not in the data path like the other Type 2's, but does that really matter from a tree-of-life view?
Posted by: Johan | September 26, 2016 at 05:35 AM
I'll reply back to my own post. The more I've come to think about this, the clearer it gets that it actually is a 5th branch. I first saw it as a stretch of Type 1 or 2, but then you realise it really is all about the app and how to interact with media via the shortest path possible, getting away from all the "stuff" that sits in between.
Posted by: Johan | September 29, 2016 at 05:14 AM