Truly, this is the year of All-Flash for transactional workloads – traditional workloads, new workloads, and emerging workloads that are only now a twinkle in people’s eyes.
Today is the big day: DSSD is born. For any product team, the first General Availability milestone is a huge event – the birth of a child, the culmination of their work, their passion, their intellect, their blood, sweat, and tears.
Congrats to the team!
What’s the punchline? DSSD is fast, dense, powerful – as in an order of magnitude more.
10 Million IOps in 5U.
100 GBps of bandwidth in 5U.
100 usec of latency.
144TB in 5U.
WOW.
We’re setting a new bar with several industry firsts (“quantum leaps” on a leap day – get it? :-)
- Response times from a shared device that match or exceed locally attached PCIe devices.
- Recoverability and availability in an extremely low-latency use of NAND (unlike local NAND, which depends on the host being alive).
- Connectivity – the first PCIe NVMe host attach at scale.
- Speed – in every dimension.
- Flash – density, power, cooling, you name it.
- Access – first examples of low-latency native HDFS, object, and POSIX pseudo-file access models.
This needs some explanation.
DSSD connects to hosts via a PCIe NVMe interface (not FC, not Ethernet – they're just not fast enough, just not parallel enough). Over time, I would wager that PCIe external interfaces and cabling will become very standard, but we're not there yet. It's just like the 10GbE LOM journey. Today, you need a client interface card in each host.
Each card has 2 PCIe gen3 x4 interfaces – with each lane supporting a bandwidth of 7.877 Gbit/s (984.6 MB/s)… But the key when comparing with Ethernet and FC: 1) the IO path isn't serialized through a fabric and an HBA stack; 2) NVMe is not a serialized protocol – it's massively parallel, with deep queues; 3) the transport-level latency is much lower.
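Quick back-of-envelope, assuming both ports on a client card are driven and ignoring protocol overhead: 4 lanes × 984.6 MB/s ≈ 3.9 GB/s per port, so roughly 7.9 GB/s of raw PCIe bandwidth per card.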
I’ve heard some say “there’s no point in host PCIe/NVMe attach relative to FC – as the bandwidth/latency is sufficient”. That answer is right for all-flash storage arrays (XtremIO, VMAX All-Flash, and their competitors). That answer is NOT right for Rack-Scale Flash.
Now – let’s talk about how one actually gets stuff in and out :-) There are three ways to interface with DSSD from a protocol standpoint:
- The block interface is via client kernel-mode drivers that make DSSD look like a SCSI block device. It's the most primitive model, but hey – there's still a lot of old-fashioned stuff out there.
- There is a native libHDFS library so it can directly “speak” HDFS. Super, super, SUPER fast HDFS. This is closer to the internal persistence model of DSSD, which is actually an object storage model.
- The Flood Direct Memory API (libFlood) is a semantically rich API that exposes the full set of things DSSD can do (“Flood” is the software stack of the device, not just the API). It can natively access objects, and even a POSIX-compliant, file-like semantic model. Of course, it can also be used to build all sorts of application libraries (in fact, libHDFS is an example). Any application can be modified, or new apps can be developed, to use the Flood Direct Memory API “verbs” – the most obvious example would be any sort of key-value store. FYI, all data is stored on the D5 as some type of object, so of course it supports that most simple object interface. The libFlood C library includes commands to create, modify, destroy, read, and write objects.
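To make that concrete, here's a tiny sketch of the create/write/read/destroy verb flow. Important caveat: every flood_* name below is hypothetical – I've invented them (with trivial in-memory stubs so the sketch compiles) purely to illustrate the shape of an object-verb API; the real libFlood signatures will differ, so check the actual SDK docs.

```c
/*
 * Illustrative only: every flood_* name here is hypothetical, invented to
 * show the create/write/read/destroy object-verb flow described above.
 * The stubs are trivial in-memory stand-ins so the sketch compiles and runs;
 * the real libFlood C API will have different names and signatures.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { void *buf; size_t size; } flood_obj_t;   /* stand-in object */

static flood_obj_t *flood_object_create(size_t size)      /* hypothetical */
{
    flood_obj_t *o = malloc(sizeof(*o));
    o->buf  = calloc(1, size);
    o->size = size;
    return o;
}

static int flood_object_write(flood_obj_t *o, uint64_t off,  /* hypothetical */
                              const void *src, size_t len)
{
    if (off + len > o->size) return -1;
    memcpy((char *)o->buf + off, src, len);
    return 0;
}

static int flood_object_read(flood_obj_t *o, uint64_t off,   /* hypothetical */
                             void *dst, size_t len)
{
    if (off + len > o->size) return -1;
    memcpy(dst, (char *)o->buf + off, len);
    return 0;
}

static void flood_object_destroy(flood_obj_t *o)             /* hypothetical */
{
    free(o->buf);
    free(o);
}

int main(void)
{
    /* The verb sequence a simple key-value store built on the API would use:
     * create an object, write a record, read it back, destroy the object.  */
    flood_obj_t *obj = flood_object_create(1 << 20);          /* 1 MiB */
    const char record[] = "key42:hello-rack-scale-flash";

    flood_object_write(obj, 0, record, sizeof(record));

    char readback[sizeof(record)];
    flood_object_read(obj, 0, readback, sizeof(readback));
    printf("%s\n", readback);

    flood_object_destroy(obj);
    return 0;
}
```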
So, where to use it? DSSD use cases are straightforward:
- A non-sexy, but really important, use case is traditional apps whose data layer is old-fashioned and you just can't find anything to make it go fast enough (think Oracle). These use the block driver – and it is wicked fast. Now, I want to be clear – for the MOST part, this domain belongs to all-flash storage arrays (AFAs) like XtremIO and the new VMAX All Flash, since they have all the rich data services that many traditional apps expect. However, we've seen through the directed availability period that some people are up against the wall, and need more performance through sheer brute force.
- Example 1. Our Oracle testing shows DSSD being smoking fast, not through any specific database optimizations, but by brute force. With just one DSSD 5U appliance we can deliver 5.3M 8K IOPS (a quick sanity check on that number follows this list). Customers should make their own comparisons to their other options – but we think we are 25% higher IOPS at 1/3 the latency and 1/3 the TCO – oh, and in 1/5th the rack space.
- Example 2. I met with a customer in Seattle a couple of months back who had a critical app running on SQL Server. All the IO was powered by honking monster hosts stuffed with the biggest, baddest FusionIO cards they could find, but increasingly that was problematic – partly because FusionIO is now part of SanDisk and support had become a struggle, but more importantly, the cards just weren't big enough, dense enough, fast enough. Now – we don't have a Windows block driver yet, but this customer's exact words to me were: “As soon as you can get it to me, I need 7 of these.”
- New real-time analytics apps – whether it's HBase or anything else that can leverage an HDFS persistence layer (pretty well the whole Apache Hadoop ecosystem) – that whole ecosystem just got a supercharger in DSSD. Historically, “HDFS” has been associated with “high bandwidth, but also high latency – so batch-type stuff and map-reduce”. DSSD does HDFS with performance on speedballs.
- Industry vertical apps. In this case, they write directly to a distributed filesystem sitting on the DSSD block interface, or better yet, they use the Flood API and integrate directly. Have an app that needs a wicked fast (latency, throughput, and bandwidth!), highly available persistence layer that's also dense and small? DSSD's your thing. Examples would be: EDA: chip design and verification (esp. the tape-out process); Life Sciences: bioinformatics and genomics; Financial Services: various types of parallel risk calculations; Manufacturing: Computational Fluid Dynamics (CFD), Computer Aided Engineering (CAE), Computer Aided Design (CAD); Oil & Gas: seismic processing.
- High Performance Computing. Distributed filesystems are everywhere in the HPC market – and they invariably need to go faster. The TACC example is a great one.
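Quick sanity check on the Example 1 number, assuming a pure 8K workload: 5.3M IOPS × 8KB ≈ 42-43 GB/s of sustained bandwidth – comfortably within the roughly 100 GB/s the 5U appliance is rated for.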
Again – a huge congratulations to the whole team, and to everyone who has been on the journey – wherever they are.
Now – if you want to know the full “story behind the story” (like how it is that cool things like DSSD “appear”, when the story started, how “big moves” get planned strategically, what the plan is with DSSD in VxRack, etc.)… Read on….
The story (from my eyes) is fascinating both from a technological and strategic lens, and starts a LONG time ago.
- In late 2010, EMC was evaluating the longer-term strategic impact of NAND (and the inevitable post-NAND persistence media like Phase-Change, Carbon Nanotube, etc) on the domain of storage, and application stacks.
- What was clear was that the impact was going to be massive and immensely disruptive in the world of “classic” transactional storage (“SANs”). This strategic planning was the precursor to the AFA vs. hybrid battles that now rage in the marketplace. It was in 2008 that we first started shipping SLC-based NAND in DMX (the precursor to the original VMAX), and even way back then there was an observation that this was going to turn the storage industry upside down. That said – while it was patently obvious that “adding SSDs” to traditional architectures was going to be important – it was also the least disruptive move. BTW – this was around the same time we started investing (via EMC Ventures) into the startup AFA and NAND supply-chain ecosystem. Venture funding is an important vehicle for all the giants – a way to tap into other innovation ecosystems (if you want a great peek behind the curtain of how it works at EMC, you have to listen to episode 8 of the Hot Aisle podcast). We invested in XtremIO (and others) around that time. BTW – this is why startups that claim the giants are asleep are foolish. Whether we can move fast and disrupt ourselves is another question. Some giants can still dance – but it's a struggle for all.
- For those that are curious – I pulled out an analysis from around that timeframe, and it pointed towards a $/GB inversion (the $/IOps inversion was almost immediate) in the 2015-2017 timeframe. We’re at that point now. It’s our PoV that a modern datacenter should have an all-flash architecture for transactional workloads – and that hybrids are close to “done” for this space (lots of room for magnetic media still in many non-transactional use cases). We were pretty close, timing wise.
- These “strategic planning” processes are also very much “out of the normal structure of the business” – and happen across the group of companies (EMC, VMware, Pivotal, RSA). While the NAND media transition had clear implications for EMC, there were implications for VMware as well. There were very clear implications for in-memory database land (and this led to acquisitions of both intellectual property and talent around GemFire, Redis, and other things). There were also strong implications for SDS and hyper-converged models becoming much more broadly viable (VSAN and ScaleIO started around that same time).
- There was yet another implication – in-memory database models eliminate the need for traditional “external storage” architectures entirely. There would clearly be innovations arriving as “close to CPU” extensions of DRAM/SRAM, with clear use cases – and things like SAP HANA were clearly also going to have an impact. But – on the flipside – there are clear benefits to aggregation and pooling of important but rare system resources, and the only thing rarer than gigabytes is IOps. This conclusion led to “pooled NAND and post-NAND NVM things will be important”.
Sidebar: It’s notable that the whole “composable” and “disaggregated” system architectures (also often called “Rack-Scale Architectures”) with very, very low-latency interconnects (directionally, photonics) are a reflection of this observation (that pooling and dynamic distribution of resources at the physical level is coming back into vogue).
In 2011, to tackle the “Pooled NAND” idea, we started an organic R&D project code-named “Project Thunder” to explore the space. “Project Thunder” was a rack-mounted, PCIe-host-connected thing that internally had a pile of (at the time) PCIe NAND cards (think FusionIO-like things). Here’s a YouTube video I did with my good friend Wade O’Harrow in 2012. He and I are idiots in the video (what else is new), and were hamming it up – but fast-forward to 14:45 to take a look at Project Thunder.
Via Project Thunder, we learned a lot, fast. Here’s what we were grappling with:
- It was hard :-) Project Thunder was never HA – and making it HA would be tough.
- IB would simply not cut it.
- The software layer was really, really different than the storage stacks we knew well.
- The density we could achieve when we used industry-standard supply-chain elements was very, very limited. Don’t get me wrong, it could smoke a “standard rack-mount server” (watch Chad’s World episode 12 to understand our state of the art at the time) in terms of packing in PCIe-based NAND cards – full-length ones, densely packed, with tweaked airflow…. But there was a gap.
Sidebar: Interestingly – that learning was an important moment for me. It made me realize that software and hardware are both areas of continued innovation. Yes, you can iterate much faster in code than in silicon – but this is why (at least for me) I’ve come to the conclusion that thinking all innovation is in software is as delusional as thinking all innovation is in hardware.
DSSD is as “hardware defined” as it gets :-)
Read this post where I put that observation into mathematical terms. I’ve always found there’s no easier way to make everyone understand an idea than to make it about math (tee hee). My musings on the f(x) = sin(x) and f(x) = cos(x) nature of software/hardware innovation cycles.
Hint – there are a lot (!) of tidbits on EMC’s long-term strategic thinking in Virtual Geek – you just need to follow. Really. Closely.
As the R&D phase on Project Thunder started to wrap up, we started to realize that this was an area where we should “place some side bets”. We looked, and found lots and lots of very, very flaky startups claiming advanced dense NAND and NVMe work. Due diligence would smoke out their flakiness – and most of them are now (only a couple of years later) completely vaporized. But… in the search, we found a gem. It was a team in the Bay Area led by the one and only Andy Bechtolsheim, but more importantly a great engineering team on many fronts – DSSD. Mental note to self – when hiring individuals or doing acquisitions, the raw talent is almost everything.
DSSD was working on the same core idea as Project Thunder, but in a much more radical way, tackling all the “what if we did modify the low-level hardware?” questions.
The DSSD team had already answered the questions we were struggling with – here are the same questions we were stuck on, and the DSSD team’s approach:
- It was hard :-) Project Thunder was never HA – and making it HA would be tough. ANSWER = design the HA up front, and keep it 100% out of the IO path. Invent new NAND controller designs and a crazy new “Cubic RAID” model for the super-dense NAND configs.
- IB would simply not cut it. ANSWER = Correct, IB isn’t going to cut it – so do something new. Work with Intel and others to drive NVMe over PCIe (practical and works).
- The software layer was really, really different than the storage stacks we knew well. ANSWER = don’t try to design a software stack for traditional storage persistence (RAID or object mirroring) and data services – this isn’t an array, it’s a new thing. Design the software stack in a different way – out of the IO path entirely. While data services like replicas/dedupe/compression (which are data-path in nature) may perhaps be possible in the future, you can see the effects (both as strengths and weaknesses) in DSSD at GA – the team’s focus was on extreme performance, extreme density, and targeting new use cases.
- The density we could achieve when we used industry-standard supply-chain elements was very, very limited. Don’t get me wrong, it could smoke a “standard rack-mount server” (watch Chad’s World episode 12 to understand our state of the art at the time) in terms of packing in PCIe-based NAND cards – full-length ones, densely packed, with tweaked airflow…. But there was a gap. ANSWER = use commodity hardware where it works, but don’t be fundamentalist about it – invent hardware as needed. Ditch SSDs and PCIe NAND cards for dual-ported, ultra-dense NAND Flash Modules.
…And so, in late 2012, we made an early investment into a very, very promising team, way, way before the idea was practical.
The answers they came up with to these interesting questions have given birth to a fascinating system level architecture:
Literally – the control plane, the persistence, and the IO interfaces are compartmentalized in a very interesting way, for a “storage thing”. They look a lot more like a “network thing”, where the network is a massive PCIe Fabric. Let’s look at the parts in more detail:
It starts with the Flash Modules (FM).
These are NOT “SSDs”. They are the densest, most powerful NAND packaging in the world right now. Out of the gate, they come in 2TB and 4TB densities (all the FMs in a DSSD appliance need to be the same type), with 8TB and 16TB right around the corner.
For perspective, they DON’T (yet) use the 3D V-NAND that we use in the VMAX All Flash (which enables 3.8TB SSD form factors). If that’s the case (and it is) – how are the Flash Modules so dense if they don’t use the densest NAND? Answer: 512 NAND dies, tightly packaged (huge parallelism – for comparison, most SSDs only use 8-64 NAND chips). That parallelism (along with the need for dual porting for redundancy) means each FM has dual-ported PCIe gen3 4-lane interfaces. Also, that much NAND draws a whackload of power that needs to be provided (far more than most server PCIe slots provide), and generates heat that needs to be sunk (the entire FM is a heatsink).
This all makes a single FM super fast (millions of IOps), and when you put 36 of them together in parallel, you get something obscenely fast….
…. But it also means that the route to 16TB is relatively simple and clear (use denser NAND) – even to 32TB and more. At 16TB Flash Modules, we’re north of half a petabyte in 5U. At 32TB Flash Modules – it’s a petabyte in 5U. Wow.
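If you want the arithmetic (assuming the same 36-module configuration as today): 36 × 4TB = 144TB, which matches the current headline number; 36 × 16TB = 576TB, north of half a petabyte; and 36 × 32TB = 1,152TB, a bit over a petabyte in 5U.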
BTW – what’s also important to understand is that the system architecture (not just the Flash Modules, but the whole appliance) has contemplated NGNVM (Next-Generation Non-Volatile Memory – Phase Change, X-Point, Carbon Nanotube based memory and more), and yup – we’re on the case.
This is the IO board (and there are two in a system for redundancy) – you can see 48 PCIe gen3 4-lane interfaces on the front, but you can also see the PCIe bridge fabric. Under each of the heatsinks is a PCIe bridge that makes for a non-blocking fabric – every port can get to every FM. A neat little aside – all the cutouts in the board? They are for cooling/airflow purposes. There’s a ton of dense horsepower in the DSSD appliance, and that requires power and heat dissipation. When you add it all up, the DSSD appliance can draw up to 2,000W (even the power supplies and fans are fine-tuned). 2kW in 5U sounds like a lot, until you realize that it can replace many, many rack-units worth of equipment – and actually save a truckload of energy.
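Rough arithmetic on those ports, using the per-lane figure from earlier and assuming every port runs at the full x4 gen3 rate: 48 × 4 × 984.6 MB/s ≈ 189 GB/s of raw port bandwidth per IO module, and roughly double that across the pair – comfortably above the ~100 GB/s the appliance is rated for, so the client ports aren’t the choke point.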
This is a DSSD Controller (and there are two for redundancy). It’s interesting – while the focus is on hardware, the DSSD team developed a TON of software – which runs on the controllers. Here are just some examples:
- DSSD keeps the entire FTL (Flash Translation Layer) in DRAM, versus what is usually done, which is caching only a subset of the FTL. This makes things faster (there’s a toy sketch of the idea after this list).
- DSSD does something called “Cubic RAID” – which is a fancy way of saying there are multiple dimensions of protection against NVM faults: within NAND chips, within a Flash Module, and across multiple modules. When you have this amount of NAND (and while this is orders of magnitude denser than other NAND-based systems, it’s only going to go up) – redundancy and resilience are really, really important.
- DSSD has a very cool internal object storage model – where everything (even the logical container that is used to group all the objects) is an object. Some of the team on DSSD were the folks who worked on ZFS originally at Sun – smart folks.
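On the FTL point above – this isn’t DSSD’s code (their implementation isn’t public here), just a toy C sketch of why a fully DRAM-resident translation map helps: the logical-to-physical lookup becomes a single constant-time array index, with no “map miss, go fetch a map page from flash” detour that cached-FTL designs have to take.

```c
/*
 * Toy illustration only (not DSSD's implementation): with the full flash
 * translation layer resident in DRAM, a logical-to-physical lookup is one
 * array index in DRAM -- constant time, and no path where a missing piece
 * of the map has to be fetched from flash first.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LOGICAL_PAGES (1ULL << 20)   /* kept small here; a real FTL maps TBs */

static uint64_t *ftl_map;            /* logical page number -> physical page */

static void ftl_init(void)
{
    ftl_map = malloc(LOGICAL_PAGES * sizeof(uint64_t));
    for (uint64_t lpn = 0; lpn < LOGICAL_PAGES; lpn++)
        ftl_map[lpn] = lpn;          /* identity mapping, just for the demo  */
}

/* One DRAM read, regardless of how much capacity is mapped. */
static uint64_t ftl_lookup(uint64_t lpn)
{
    return ftl_map[lpn];
}

int main(void)
{
    ftl_init();
    printf("LPN 12345 -> PPN %llu\n",
           (unsigned long long)ftl_lookup(12345));
    free(ftl_map);
    return 0;
}
```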
When you put all these parts together – it creates the world’s most dense PCIe Fabric – with Flash Modules connected to the control modules, and then meshed completely into the IO ports. Add it up – and you have a performance monster.
Ok – now we’re going to take a quick “right hand turn”. Question: How is this related to convergence and hyper-converged infrastructure stacks?
All things are connected. The topic of “next-generation persistence and its impact on system-level design” came up in parallel in 2013 as we were thinking about the converged infrastructure models of tomorrow (another strategic project in 2013 was the first phase of “where to take VCE next” – and it led to EMC doubling down and bringing VCE fully into EMC).
The Vblock was already doing OK, and we were looking at the architectures of the hyper-converged appliance models from Nutanix and SimpliVity, which were starting to appear materially on the scene. No joke – this is a slide from the way-back machine at that time:
Blue = compute, green = network, red/orange = storage/persistence.
Obviously since then, VCE moved past just “Block” converged models, and the picture on the right would now be inclusive of VxRail.
Sidebar: If I were to update that slide (it’s from 2013, about 3 years ago – I’m smarter now :-) ), the only thing I would change is that I would remove the word “hypervisor”. It’s really about the “logical infrastructure abstraction and pooling layer” (which could be any abstraction).
Our thesis was that in the core datacenter, disaggregated, composable architectures would be important over time – and this idea of “ultra low latency pooled persistence” would be important. Here’s the next slide, which has a take on “rack scale infrastructures”. Yes, one defining element is that you can see that the network is a central concept, but also you can see a lot of hardware disaggregation and pooling (aka “composable” infrastructure).
And a little more detail:
Now – this is a logical diagram of where I suspect VxRack will go in the coming years. On the far right you have a “lambda”-based model (only very hot, and very cold, persistence). Very quickly, DSSD will be pressed into the option for the red on the left (“near to the blue compute”).
The first versions of DSSD in a VxRack look like this example on the left. Incredibly dense, and designed to be deployed, pooled, and consumed at Rack-Scale (hence “Rack-Scale Flash”).
For the people following closely: yes, those are the same nodes that you find in VxRack and VxRail. I suspect that over time, you will see more pooled and composable hardware designs in our Rack-Scale offer. You can imagine the power of upcoming moves.
Hint, hint :-)
I hope you find this all interesting – not only the “what is the news today with DSSD going GA”, but a little “behind the scenes”. It highlights why customers partner with people over the long term. Note that the news of today started almost 5 years ago. It takes a while to build something cool and awesome.
It kind of makes you wonder… What are the cool things we’re starting to work on now, that will show up in 2021? Stay tuned :-)
I’m wondering what you think – please comment, and share!
Awesome! The thought of DSSD integrated with a VxRack ScaleIO Flex is just unreal. The possibilities...
Posted by: AF | February 29, 2016 at 09:14 AM
I can't *wait* till the Flood API gains some external viability - some really cool things I want to do with that.
Posted by: Mcowger | February 29, 2016 at 08:17 PM
Wow!! VxRack with DSSD smells like an Exadata killer.
Posted by: Fthib | March 02, 2016 at 09:26 AM
Great stuff as always! Hopefully we're placing bets with NAND alternatives too :)
http://www.adestotech.com/about-us/technologyip/
Posted by: vArch | March 02, 2016 at 06:55 PM