
May 04, 2016



yaron haviv


Please help me out with the math: if 400 nodes produce 1.5 TB/s, each contributes only 3.75 GB/s. How does that map to the 15 GB/s number?

Each 40 Gb/s link is 5 GB/s, so assuming 15 GB/s per box, the 8/16 wires are significantly over-subscribed, and at 3.75 GB/s they are even more over-subscribed. BTW, for a box that would GA in 2017, I suggest looking at 50/100 GbE as an option.

Re: the IOPS, the S210 model (out in 2014) does ~100K IOPS, so 250K IOPS is roughly aligned with Moore's law. I believe new all-flash solutions should aim at >1M IOPS, and I know it's possible even for file and object (if you have the right architecture).
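A quick back-of-the-envelope sketch of the arithmetic being debated here. The cluster size, link speed, and per-box figures are the ones quoted in this thread, not confirmed specs, and the 8-link count is an assumption taken from the "8/16 wires" remark:

```python
# Back-of-the-envelope check of the bandwidth and IOPS numbers above.
cluster_bw = 1.5e12                  # 1.5 TB/s aggregate, in bytes per second
nodes = 400
per_node = cluster_bw / nodes / 1e9
print(f"per-node contribution: {per_node:.2f} GB/s")          # ~3.75 GB/s

link_gbs = 40 / 8                    # a 40 Gb/s link moves 5 GB/s
links = 8                            # assumed lower bound of "8/16 wires"
wire_bw = links * link_gbs           # raw network bandwidth per box
for delivered in (15.0, per_node):
    print(f"wire vs delivered at {delivered:.2f} GB/s: {wire_bw / delivered:.1f}:1")

# IOPS sanity check: ~100K IOPS in 2014, doubling roughly every 2 years
print(f"Moore-style 2017 projection: {100e3 * 2 ** ((2017 - 2014) / 2):,.0f} IOPS")
```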


David Holmes

As I walked into the keynote yesterday, I said to a colleague, "You know, the only primary storage product that isn't all-flash is Isilon; I wonder when that will happen?" Then 14 minutes later it did...


After reading the complete BS premise of this post - that Chad Sakac was unaware of a "new bladed Isilon architecture" and a "complete OneFS re-architecture for Flash" - I couldn't read on and take the rest of this post seriously.

Things sure have gone down the toilet at EMC if the President of the technical Presales organisation and the President of VCE is unaware of this "facemelting" new project in the works. And if you were aware, then why say you weren't?


Chad Sakac

@yaron - thanks for the question! The final specifications won't be landed until we GA. From my understanding, some of the limits are indeed associated with the switching fabric. And yes, you can count on the fact that we will be evaluating the state of the state with 100GbE at that time. I would also expect the node performance specs to go up. The main point here isn't node IOPS (if all you need is IOPS, you would likely use a block target)... the main point is the very low latency (and the high file ops/s).

@David - glad you dug it!

@TheDude - ah, internet trolls :-) Hiding in anonymity and casting aspersions. The fact that I didn't know about a secret project inside the company? That's not BS - it's a fact. Personally, if you were in my shoes, you'd know that it sounds MORE crazy to claim that I track all the stuff that happens inside EMC, VMware, and Pivotal - there's a LOT. You know what I think is sad? Your Debbie-downer attitude. Have a GREAT day!


"...instead of discovering something that is half-baked, I discover something that is nearly done."
"(including emerging players efforts that are still NOT on the market)."
"Now – neither of these two are generally available yet...We’re aiming for... 2017."

Someone proofread this, right?
Should be writing for comedy or a Presidential candidate.
Discs aren't the only things still spinning at EMC.

yaron haviv


Thanks for the clarification. As a hardware guy, you would know that producing 12 GB/s on a distributed storage node requires at least 40 PCIe lanes, which sort of maxes out Intel dual-socket capabilities.
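For readers following along, a rough sketch of where a figure like 40 lanes comes from. The ~1 GB/s-per-lane number is standard PCIe Gen3 arithmetic; the 1.5x back-end multiplier is an assumed erasure-coding overhead, and none of this is a confirmed Nitro design:

```python
# Rough PCIe Gen3 lane budget for 12 GB/s of delivered node bandwidth.
LANE_GBS = 8e9 * (128 / 130) / 8 / 1e9   # ~0.985 GB/s usable per Gen3 lane

target = 12.0                             # GB/s served to clients
front = target / LANE_GBS                 # client-facing NIC lanes
back = 1.5 * target / LANE_GBS            # node-to-node traffic, assuming ~1.5x EC overhead
print(f"front-end: ~{front:.0f} lanes, back-end: ~{back:.0f} lanes")
# Roughly 12 + 18 lanes for networking alone, before any NVMe drive lanes.
# Round up to real x8/x16 NIC slot widths and add drives, and you land at
# or beyond a full 40-lane budget.
```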

Today, with the winds shifting to distributed cloud-native architectures, big data, IoT, etc., block is becoming somewhat irrelevant, and app developers need files, objects, and NoSQL. So we have to provide storage that is both faster and higher-level. I can see a bunch of IoT scenarios that easily drive millions of IOPS on small files/objects, so we can't rely on block for the performance part. Same for latency.

It is possible to drive millions of IOPS and bare-metal latency for the upper-layer abstractions; you do some of it in DSSD, but that is not something you can do with a 10-year-old stack. It requires a complete redesign.

You can read about the fundamental SW/HW principles to get there in:


Chad Sakac

Yaron - I read your blog, thanks for adding to the dialog. I've also updated the post (intentionally, we're keeping some of the node/blade relationship and blade details to ourselves).

Personally, I think you might (?) underestimate the difficulty in the upper levels of the stack, and your observation in your post that no one seems to have cracked it all might be rooted in "it's harder than you think". For example, a team of smart folks have been cranking on it in DSSD (which is, after all, a complete redesign and a native object storage model) - but note that it doesn't have the data services or scale-out models that many of these HPC/EDA use cases want.

Suggestion - you have a passion for this. Consider giving it a shot! If you do succeed in doing this in a software stack, you'll make a fortune and remake this part of the world. If you gather together an engineering team and build a prototype, I'm happy to help connect you with the VC community (at least those I know).

yaron haviv

Thanks for the suggestion. Keep an eye on what iguaz.io will announce in the not-too-distant future; you will be amazed :)

Indeed, this is far from trivial and requires an exceptional team...


Dave Graham


Hey Chad, long time, no talk! (It's Dave Graham from back in the Atmos days... lol). Nice article.


So, as a hardware guy, I think you're missing a few key pieces:
a.) PLX PCIe switches. 40 lanes of PCIe Gen3 from Intel Xeon E5s can easily be split into n-number of locally switched or orthogonal lanes within the complex of a 4RU box. As someone who was privy to some internal EMC architectures that have more recently come to market, this makes complete and logical sense. At some point, however, yes, you're contextually over-subscribed on your internal/external links for bandwidth.
b.) One thing that's always been curious to me (and is really a relic of the days when SM and Mellanox ruled the inner fabric bus of Isilon) is the relatively pervasive use of IB as a transport bus given the older 8b/10b encoding scheme. Moving to EDR/FDR obviously changes the encoding overhead (IIRC, 64b/66b) and allows for better utilization of bandwidth (see the sketch after this list), but with no specific tunneling offload present in Xeons... oy. This again points to an ASIC-based approach to handle front-end connectivity (more than likely Mellanox, though I know EMC has been rather loathe to use their technology in a directly correlated box).
c.) Intel Omniswitch/Omnifabric. This makes the most sense as an INNER node fabric technology. EMC has a significant vested interest in Intel (and vice versa) and their technology curves, and with something that keeps a proprietary edge (though it can be commodity-built, as evidenced by SM building Omnifabric switches) while looking like 100 Gbps EDR InfiniBand but combining the simplicity of a switched transport layer like Ethernet... well, you get where I'm going. Omnifabric has the capability of offload (e.g., through using a many-core solution like Knights Landing) or of DMA-style access to QPI/ring-bus technologies.
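On the encoding point in (b), a quick illustration of what 8b/10b versus 64b/66b does to usable bandwidth per 4x InfiniBand port. These are standard signaling rates, nothing Isilon- or Nitro-specific:

```python
# Usable bandwidth per 4x InfiniBand port under the two encoding schemes.
ports = {
    "QDR (8b/10b)":  (4 * 10.0e9,     8 / 10),
    "FDR (64b/66b)": (4 * 14.0625e9,  64 / 66),
    "EDR (64b/66b)": (4 * 25.78125e9, 64 / 66),
}

for name, (raw_bps, efficiency) in ports.items():
    usable = raw_bps * efficiency / 8 / 1e9   # GB/s left after encoding overhead
    print(f"{name}: ~{usable:.1f} GB/s usable")
# QDR gives up 20% of the wire to encoding; FDR/EDR give up only ~3%.
```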

I like the sound of Nitro and it'd be an interesting application of some "tools" that are lying around in the hardware/software space. Heck, if OneFS got more GPFS-like, that wouldn't be bad either...rumour has it, Lustre is getting there. :P



yaron haviv


I consider myself a software guy who knows a lot about HW :)

For the 12 GB/s math, you need 40 lanes without oversubscription just for the networking side: 16x for the front and 24x for the back, assuming they use erasure coding (not that I think it's realistic to expect 12 GB/s of erasure coding without HW offload, nor do I think they know how to process TCP/IP at those rates), or 32x for 2 replicas. If you have SATA/NVMe, that will be more lanes; you need at least 16x of those, which may be split using PLX.
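To make that lane budget concrete, a small sketch using the splits quoted above. The 16x/24x/32x figures and the 16x drive minimum are the comment's numbers, not confirmed Nitro specs:

```python
# Lane-budget comparison for the 12 GB/s case using the splits quoted above.
LANE_GBS = 0.985                      # usable GB/s per PCIe Gen3 lane
DRIVE_LANES = 16                      # minimum NVMe lanes, possibly behind PLX

budgets = {"erasure coding": 16 + 24, "2-way replica": 16 + 32}

for scheme, nic_lanes in budgets.items():
    total = nic_lanes + DRIVE_LANES
    print(f"{scheme}: {nic_lanes} NIC lanes (~{nic_lanes * LANE_GBS:.0f} GB/s) "
          f"+ {DRIVE_LANES} drive lanes = {total} lanes")
# 56-64 lanes is a large share of the 80 lanes a dual-socket Xeon E5 offers,
# which is the crux of the argument here.
```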

Knowing the Intel Omni-Path schedule and EMC test cycles (and the OneFS clustering stack), it's not practical to expect EMC to GA an Omni-Path-based product in 2017.

So my guess is Nitro has 2-4 blades per node, which helps sort out the math.

BTW, Mellanox 50/100 GbE NICs would make a better choice than IB, IMO, and the new model has a line-rate erasure-code engine I helped to design :)

