HDFS is becoming the “underlying strata” in the data fabric for many new applications. We’re all over it here at EMC, but it means we have 3 current answers when a customer says “I want infrastructure to support HDFS” (and remember, HDFS supports more than just Hadoop and MapReduce these days…)
Now, the battlegrounds over the last few years (block/NAS, mid-range/enterprise, AFA/Hybrid, hyper-converged/standalone) have a new variation… HDFS implementations.
Read on for my (personal) HDFS solution decoder…
When a customer says “I need storage for HDFS”, we at EMC have 3 immediate platforms, with a 4th (DSSD) that is relevant to include in the discussion as a “strategic future” (we’ve indicated it will be interesting to some in 2014, and more broadly in 2015).
Those 4 HDFS stacks are: 1) Isilon; 2) ViPR HDFS; 3) VNX as high-bandwidth block storage; 4) DSSD.
To understand how these bolt into Hadoop – please check out this post.
A quick thing to understand is that all HDFS implementations must be “HDFS 2.0/2.2 compliant” and WORK with existing distributions (Cloudera is #1, Hortonworks #2, PivotalHD #3; the others trail those in market share).
HDFS 2.2 is the latest Apache implementation; we support 2.0 in Isilon/ViPR for now, with 2.2 coming soon.
All of our solutions support HDFS in one of 3 ways:
- They use the existing HDFS client library in the distribution (Isilon works this way).
- They sit UNDERNEATH the distribution’s existing HDFS implementation as “better than DAS” (VNX).
- They require the customer to use their own HDFS client library, which is API compliant (ViPR HDFS and DSSD).
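For the first and third styles, pointing a cluster at an external HDFS implementation is typically just a client configuration change. A minimal core-site.xml sketch (the hostname is a hypothetical Isilon SmartConnect zone name, not a real deployment; Isilon answers HDFS RPC on port 8020 by default):

```xml
<!-- core-site.xml: point the distribution's stock HDFS client at an
     external HDFS implementation instead of a local NameNode.
     The hostname below is a hypothetical example. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://isilon-smartconnect.example.com:8020</value>
  </property>
</configuration>
```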
The “dominant” party in these HDFS discussions is the distribution (Cloudera/Hortonworks/PivotalHD) the customer has signed up for – so it’s important that we, as field architects and HDFS platform vendors, align and partner with the major distributions.
Each distribution has its own native HDFS implementation, and some of their distribution-specific upper-level analytics and management tools ONLY WORK with their particular HDFS implementation (though they shouldn’t!) – so this is tricky.
Isilon’s HDFS “super-powers” are:
- NAS import/export with HDFS access to the same data without moving it in or out. This can be huge.
- Performance, *WHEN* the customer understands “performance” as “total job time” (including import/export). Individual query performance is generally not as good as other approaches, but total job performance is often AWESOME.
- Better resilience than even the NameNode improvements in HDFS 2.2
- Isilon has data services like SyncIQ and SnapshotIQ, which apply to HDFS when it’s on Isilon – and enterprise customers dig them.
Net: IMO, Isilon’s primary (not only) use case is when a customer has a need for Isilon first (i.e. more of their enterprise data is in other forms, like NAS), with a side-helping of HDFS for large-dataset, non-realtime HDFS applications – which describes most enterprises. This is why the Isilon HDFS solution is so strong in enterprises, which have more NAS than they do HDFS (at least right now). Without the Isilon workflows (if a customer only had HDFS – which NEVER happens in enterprises), IMO there are better ways.
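The “total job time” point is easy to quantify: with a copy-in/copy-out pipeline, ingest and export are part of the job. A back-of-envelope sketch in Python – all the numbers are hypothetical, purely to illustrate the arithmetic:

```python
def total_job_time(ingest_s, compute_s, export_s):
    """Wall-clock seconds for one analytics job, end to end."""
    return ingest_s + compute_s + export_s

# Hypothetical numbers: a DAS cluster runs the queries faster, but must
# copy the dataset into HDFS first and copy results back out afterwards.
das_total = total_job_time(ingest_s=3600, compute_s=1800, export_s=600)

# In-place access (e.g. existing NAS data exposed over HDFS) skips both
# copies, even if the per-query compute phase is somewhat slower.
in_place_total = total_job_time(ingest_s=0, compute_s=2400, export_s=0)

print(das_total, in_place_total)  # 6000 2400
```

With these illustrative inputs, the in-place pipeline loses on raw query speed but wins on total job time – which is exactly the framing above.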
ViPR Object/HDFS stack’s HDFS “super-powers” are:
- Object import/export and edit with HDFS access to the same data without moving it (same as Isilon, but for object, not NAS). This can be huge.
- This will get even more confusing when ViPR gets NAS on Object (coming later this year); ViPR will then be able to do NAS/Object/HDFS (but it will be a very basic, non-transactional NAS on top of the ViPR storage engine).
- Very good data protection (geo erasure coding + data center availability) and geo availability of data
- Scales to VERY large scale (PB/EB) very well.
- Better resilience than even the NameNode improvements in HDFS 2.2
- ViPR can be offered TODAY as “software only, bring your own commodity off-the-shelf (COTS) hardware” (ViPR) or as “software plus EMC-provided COTS hardware in an integrated appliance” (EMC Elastic Cloud Storage) – a good fit for the “data tub” use cases.
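The protection-efficiency angle versus stock HDFS is simple arithmetic: stock HDFS protects with 3x replication, while an erasure code stores k data fragments plus m parity fragments, consuming (k+m)/k raw capacity per usable byte. A quick sketch (the 12+4 layout is a generic illustrative choice, not necessarily ViPR’s actual scheme):

```python
def raw_per_usable(k_data, m_parity):
    """Raw capacity consumed per byte of usable data for a k+m erasure code."""
    return (k_data + m_parity) / k_data

# Stock HDFS default: 3 full copies of every block (1 data + 2 replicas).
replication_3x = raw_per_usable(1, 2)   # 3.0x raw per usable byte

# A generic 12+4 erasure-coded layout (hypothetical example) survives
# the loss of any 4 fragments at a fraction of the capacity overhead.
ec_12_plus_4 = raw_per_usable(12, 4)    # ~1.33x raw per usable byte

print(replication_3x, ec_12_plus_4)
```

At PB/EB scale, the gap between 3.0x and ~1.33x raw capacity is a lot of disk.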
IMO, ViPR’s primary (not only) use case is when a customer needs object and HDFS – which describes most SPs and Web 2.0 customers (and is why the ViPR play is so strong with SPs, Web 2.0 customers, and customers focusing on Object/HDFS FIRST).
VNX's HDFS “super-powers” are:
- As a basic block stack, VNX plugs right under ANY HDFS implementation – for when a customer is locked into (or convinced of) a particular distribution’s HDFS implementation.
- VNX has a low $/GB, good IOps, and great bandwidth relative to DAS. If you look at VNXe, it also beats DAS in density.
- VNX manageability (Unisphere Central, ViPR Controller automation, etc.) and availability (you aren’t constantly recovering failed data nodes) are much better relative to DAS, which is important as customers scale up big (but are locked into a distribution’s native HDFS implementation).
IMO, VNX’s primary use case is when a customer is fully committed to the native HDFS implementation of the distribution. In these cases VNX is our best solution. For the customer, the value is: “imagine a cost point close to that of DAS, but with better performance, manageability, density and HA.” No joke – as these HDFS clusters get really big, data node failure at scale is bad (customers tell me).
In these cases, where the customer is committed to the native HDFS implementation in the distribution, changing the HDFS implementation with Isilon or ViPR is a move the customer needs to consider carefully. BTW – in these cases, the competition is often DDN or NetApp E-Series (which, like VNX, are fast, high-bandwidth block stacks).
DSSD’s HDFS “super-powers” are:
- DSSD is hyper-low latency (think 10x lower than AFAs)
- DSSD is hyper IOps and capacity density (think 10x denser than AFAs).
- DSSD is a native HDFS/Object implementation
IMO, DSSD is clearly not a material product for this dialog for the bulk of this year. It may also sound like these are “esoteric” use cases – they are NOT. They are very important to some customers, and I would wager they will move from “rare” to “common” over the coming years.
For DSSD, the best fit is adjacent to the other 3 HDFS architectures, focusing on in-memory or HDFS-layered transactional realtime use cases (and NoSQL/SQL DBs like HAWQ, HBase, MongoDB, etc.).
Hi Chad, is there any reason why you do not see ScaleIO as a suitable Storage for HDFS?
ScaleIO's HDFS “super-power” would be that ScaleIO does not add any external (HW) component to the original HDFS grid-of-nodes architecture, while bringing higher data availability.
Posted by: Iciliop | May 15, 2014 at 03:45 AM