This week has seen big moves, big changes in the big data landscape – and that’s big :-)
First, the industry took a dramatic turn – led by HortonWorks, Splunk, Pivotal and others – with the creation of the Open Data Platform (ODP) initiative.
This isn’t a naming or marketing exercise. It’s a fundamental shift for HortonWorks (see what they say here) and Pivotal – and they are joined by many big industry players like IBM, SAS, Splunk and others.
Most importantly, it’s a shift where Pivotal is materially contributing more code to open source – specifically, three big pieces of intellectual property that it is open sourcing into this initiative:
- The SQL-on-Hadoop engine HAWQ (a complete SQL transactional DB layer, and a higher-performance, more complete alternative to other choices like Hive)
- GemFire (an ultra-high-performance transactional in-memory DB)
- GreenPlum (a Massively Parallel Processing analytical DB)
HortonWorks and Pivotal started to see this need for a “hardened enterprise open core” as they collaborated on Ambari as an industry-standard, open way of managing Hadoop clusters (Ambari itself being one of the Hadoop-related Apache projects).
Today, BTW, the topic of “how do you manage Hadoop clusters” – along with the analytics elements on top of the data engines – is where there is a lot of forking (and proprietary value) between the various distributions.
Together with the other open-source elements in Apache® Hadoop® (follow the link, and you’ll see why this ODP initiative is important – you have some tightly coupled parts, and some more loosely coupled projects), these are packaged together into the Pivotal HD3 distribution (as well as into other distributions, like HortonWorks’).
What’s this all about? Read on!
Hadoop has been a very successful open-source project, and is governed via the Apache® Hadoop® project. That said – unlike other big open-source projects, there hasn’t been a strong governance model as it moved into a phase with many large commercial distributions that all started running in different directions.
This isn’t just my own thinking – customers themselves have started telling me that the various distributions are creating feelings of “lock-in” further up the stack – where value-added components (whether specific analytics tools or management tools) “only work with _____ distribution”. BTW – the same can be said about the various Linux distributions.
This “forking” is mostly not in the “trunk” code (Hadoop Common, HDFS, MapReduce, YARN); it gets trickier in the “newer ground” of the SQL data engines (Hive, HAWQ, etc.), NoSQL (HBase, Cassandra) and in-memory data engines (Spark, Shark, GemFire). It’s notable that Pivotal’s effort and contribution has been in projects outside the “trunk”, on projects like Tachyon (associated with in-memory). This “forking” and incompatibility is trickiest of all in the domain of management and in the analytics tools on top of the data engines themselves.
There’s clearly a need for commercial entities to differentiate and provide value on top of open source – but there’s a balance, and that balance is what the “governance” efforts are all about.
Are governance models necessary in open source? There’s real and rational debate on this topic. In my humble opinion, when projects start – NO. As they grow and become commercial forces – YEAH. Now, there are different types of governance models – so there’s no one formula that works. What is always necessary is for the governance body to really be a steward – not of any one commercial entity, but of the project itself.
I’ll give you an example that makes this clear for me.
I don’t think Linux would be where it is without governance. But governance comes in many forms. In the early days, Linux “governance” was essentially Linus Torvalds himself, which is absolutely amazing to me – because, at least from where I sit, to govern a massive project by sheer force of moral authority and individual code contribution says something profound about that dude.
Later on (and continuing now), the Linux Foundation has played a critical role – because Linux has huge commercial forces at play, and there is a need for stewardship above and beyond what any one superman can do.
Another interesting example is the OpenStack Foundation. Similarly, in the earliest days of OpenStack, there was no big need for a foundation, no need for a tech committee, and no need for a Board of Directors. There was just “code contribution” as the only guide.
In later days (again, like right now), for OpenStack, “code contribution” remains the single most important measure of moral authority on any of the sub-projects. BUT now, with huge commercial forces at work (and dramatic forking of those commercial offers), the governance of the Tech Committee and Board of Directors plays a critical “stewardship” role.
Yet another example is how Pivotal and others like IBM created the Cloud Foundry Foundation, making Cloud Foundry itself a Linux Foundation Collaborative Project.
Why the heck would Pivotal do that – doesn’t that “cede control”? YES. So why do it? Because Cloud Foundry was becoming so successful as the open PaaS standard that having no open governance model would have started to retard its use – because any single vendor, regardless of noble intent, will pervert the project.
When an open-source project becomes successful and widely deployed, you start to have huge commercial forces (most notably Pivotal Cloud Foundry and IBM BlueMix) at play. There is no conflict here – passionate IBM developers work hand in hand with Pivotal folks (and all sorts of others, including EMC) on Cloud Foundry – but as these move mainstream and commercial offers start to diverge/fork with proprietary, closed work on top of the open foundation – just like Linux, just like OpenStack – Cloud Foundry needed an open governance model.
It’s pretty clear that the core governance provided through the Apache® Hadoop® software project is working, and working well, for the stuff that is at the fundamental trunk and the APIs at the core: HDFS, YARN, MapReduce. It’s also working for lots of other Apache projects (Pig, Hive, etc.) that are part of the Apache Hadoop framework.
Hortonworks, Cloudera, MapR, Intel and Pivotal are all contributing – Hortonworks very notably. But you are seeing increasing material divergence in the commercial offerings, and as I noted – I’m hearing customers frustrated with higher-level functions and elements that are “locked in” or proprietary.
That’s reason #1 for ODP.
I think that there’s another reason. I think that Pivotal realized they should have been contributing more from the earliest days, and the fact that they are open-sourcing products (HAWQ, GemFire, GreenPlum) that generate huge revenues (implying customer value) shows they are serious about code contribution. I think the experience with Cloud Foundry – and how important openness and governance were for an open-source project as huge commercial entities started contributing and playing – taught important lessons.
That’s reason #2 for ODP.
Now, where’s Cloudera in this (as one of the larger contributors to Apache Hadoop and commercial Hadoop distributions)?
Frankly, Cloudera (and MapR) are notably absent – and by that I really don’t mean to imply good or bad, just… absent.
There’s a very interesting post here that outlines perhaps some of their rationale, and in the spirit of openness, I think it’s worth visiting and understanding their perspective.
I think if you read that blog post, and then look at the things I noted above, some of the arguments made in the Cloudera post don’t make as much sense (at least for me). Pivotal IS contributing code. This effort isn’t analogous to OCF, but more to the Linux Foundation and OpenStack Foundation.
While there is no doubt that Cloudera is one of the largest single players in the Hadoop ecosystem (along with Hortonworks, MapR, Pivotal and others) – I don’t think some of the arguments hold up (at least for me).
You can answer for yourself why Cloudera might not participate (at least for now). Perhaps they like the idea of a more closed ecosystem on top of the Apache Hadoop open source base.
Now, I want to be clear here – Pivotal and EMC are separate entities. This is one of the core structural elements of the Federation model that people don’t always grasp. OPEN = each element of the federation partners openly in all directions. So, it’s not a surprise that EMC partners with Cloudera like crazy, and this doesn’t change that dynamic one bit (even as we participate in ODP ourselves).
In fact, this very same week (yesterday, in fact), at Strata + Hadoop World in San Jose, we launched a huge update to Isilon – adding HD400 nodes (huge – up to 50PB clusters!) to the existing mix of X, S, and NL nodes, along with full support for HDFS 2.3 and 2.4 and native integration with Apache Ambari.
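To make “native HDFS support” a little more concrete: from a Hadoop client’s point of view, an Isilon cluster presents itself as an HDFS endpoint served by OneFS, so standard Hadoop tooling just points at it. Here’s a minimal sketch using the stock Hadoop FileSystem API – the hostname and port below are hypothetical placeholders; the real values come from your Isilon HDFS access zone and SmartConnect configuration.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsSmokeTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoint: an Isilon SmartConnect zone name and the HDFS
        // port configured on the OneFS access zone. Substitute your own values.
        URI isilonHdfs = URI.create("hdfs://isilon-zone.example.com:8020");

        Configuration conf = new Configuration();

        // The same client code works against a classic NameNode on DAS --
        // which is the whole point of speaking the standard HDFS protocol.
        try (FileSystem fs = FileSystem.get(isilonHdfs, conf)) {
            for (FileStatus status : fs.listStatus(new Path("/"))) {
                System.out.printf("%s\t%d bytes%n", status.getPath(), status.getLen());
            }
        }
    }
}
```

The design point is that nothing distribution-specific is needed at this layer – the distribution’s configuration (pushed by Ambari or otherwise) typically just points fs.defaultFS at the Isilon endpoint instead of a NameNode.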
Isilon is VERY commonly deployed in support of Cloudera and Splunk (there are more deployments supporting those than there are for Pivotal HD) – just look at what you find if you google “Isilon Cloudera” and “Isilon Splunk”.
Why do people deploy things like Cloudera and Splunk on Isilon? Simple. Relative to native HDFS implementations on DAS, at scale people find it easier to manage, more feature-rich, a lot more dense, and a better TCO – and in many cases it shortens total query time and simplifies workflows. It’s a little counter-intuitive, but customers in droves are starting to realize this (it’s a huge factor behind Isilon’s massive growth rate). Breaking it down:
- Easier to manage: simple at any scale – whether you are starting with 3 nodes or have scaled up to 50PB, it’s a simple way to see, manage, and add capacity.
- More feature-rich: rich capabilities for snapshots and remote data protection. Traditional “backup” doesn’t make sense for data lakes – but the concepts of data protection are still needed.
- More dense: the SMALLEST Isilon node (S-node) is up to 28.8TB per 2U. The densest (HD-node) is up to 354TB per 4U. That’s “Hulk”-level density, and it crushes the density you can achieve just using a generic server.
- Better TCO: on MANY levels and dimensions, but the most obvious one is erasure-coding protection (roughly 1.5x raw capacity) vs. 3-copy replication (3x raw capacity) – there’s a rough sketch of the math right after this list. Another example is that it’s really easy to scale capacity and compute independently – which, more often than not, is necessary.
- Shortened query time: being able to access the same data (without moving it) via NFS, HDFS, and even OpenStack Swift means that for many queries that involve MapReduce just for import/export, the total time is dramatically reduced.
- Simplified workflow: frankly, for many enterprises, Hadoop started as an “island”, and now needs to link with a ton of data that is in other unstructured forms. The fact that Isilon clusters can be a “data lake foundation” for all forms of unstructured data in an enterprise can accelerate existing workflows, and enable entirely new ones.
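On the TCO point above, here’s the back-of-the-envelope arithmetic – just a sketch, not a sizing tool. The 1.5x protection overhead is the approximate figure cited above (actual overhead depends on cluster size and protection level), and the 1PB working set is a made-up example.

```java
public class RawCapacitySketch {
    public static void main(String[] args) {
        double usableTB = 1000.0;       // example only: 1 PB of data landing in the lake
        double hdfsReplication = 3.0;   // default HDFS 3-copy replication on DAS
        double isilonOverhead = 1.5;    // the "1.5x-ish" erasure-coding overhead cited above

        double dasRaw = usableTB * hdfsReplication;    // raw TB needed with 3x copies
        double isilonRaw = usableTB * isilonOverhead;  // raw TB needed with erasure coding
        double savedPct = 100.0 * (dasRaw - isilonRaw) / dasRaw;

        System.out.printf("Raw TB needed, DAS + 3x replication: %.0f%n", dasRaw);
        System.out.printf("Raw TB needed, erasure coding:       %.0f%n", isilonRaw);
        System.out.printf("Raw capacity saved:                  %.0f TB (%.0f%%)%n",
                dasRaw - isilonRaw, savedPct);
    }
}
```

Run with those example numbers, that’s 3PB of raw disk versus 1.5PB for the same 1PB of data – half the raw capacity before you even factor in density or the ability to scale compute and capacity independently.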
Here are findings (not marketing) from an actual PoC at Adobe Digital (it happened to be with Pivotal HD, but the customer is evaluating the various distributions, and wanted to create a data lake foundation that could work across them). BTW, thanks to the Adobe Digital team for being a great, cool customer – and Paul Joyce, you are an unbelievable EMC SE, my brother!
At the Hadoop Summit in June, Adobe Digital will be sharing what they learned through the project – check it out here.
And – in 2015, customers should expect EMC to come out swinging with all sorts of cool things including more on Isilon, but also some additional (for now) secret things to support all the Apache Hadoop distributions, and many of the Hadoop-related projects like Spark.
So – what’s the takeaway?
- The Open Data Platform initiative is an open effort to try to extend standardization and interoperability above and beyond the foundational Apache Hadoop core.
- It is OPEN. It is modeled on other successful governance models for mature, widely deployed enterprise open-source projects that have lots of commercial players all vying for “control”.
- Experience has shown that as open-source projects move into a phase of mature, competing commercial distributions, these governance models (done well) don’t constrain, but guide with the intent of advancing the project as a whole (which can come at the expense of any GIVEN commercial entity).
- Some entities inevitably dig that – others don’t. That’s a little instructive about their business-model pursuits and strategic intent.
- The Federation model is open. EMC participates in ODP with Pivotal, but also partners like crazy with some that may not love ODP – a prime example is Isilon and Cloudera.
What do you think? ODP – good thing, or bad thing? Using Isilon with Cloudera/Splunk/Pivotal – how’s it going? Share!