Since launching Bacon Wrapped Data two months ago, I’ve written numerous posts on Big Data but have yet to say much about the world outside of massively parallel RDBMSs. In recent weeks, Hadoop has roared back into the headlines on a number of fronts, so let’s take a look at the current state of the Hadoop movement and, more broadly, the overall state of MapReduce.
It’s now midway through 2011. Hadoop is closing in on four years of public releases, leading commercial vendor Cloudera has recently released CDH3 to much fanfare, and both IBM and EMC now have their own distributions of Hadoop. On the relational side, Aster Data’s industry-leading SQL/MR has been available for almost three years, and most other Big Data RDBMS vendors have since jumped on the MapReduce bandwagon (either with their own versions of MapReduce, or with some form of Hadoop integration). Yet despite its increasing momentum, growing media hype, and all-around potential, MapReduce is still very much a niche technology, used by specialized teams of programmers in a small (albeit growing) number of companies. Why?
Last week, Forrester’s James Kobielus posted his thoughts on the current state of Hadoop and what it will take for the open source platform to gain broader adoption in the enterprise. He pointed to three factors he believes necessary for Hadoop to be considered “mature”:
- An increase in adoption by EDW vendors,
- Convergence on a core Hadoop “stack”, and
- Formal standardization of that stack
I disagree with Mr. Kobielus on all of these points. In fact, I believe there’s a single underlying reason why Hadoop, and MapReduce more generally, have failed to gain widespread adoption. To hijack an already-overused catchphrase:
It’s the API, stupid!
Another Brief History of MapReduce
Let’s start with a quick history of MapReduce, which I think is important because so many people misunderstand what Hadoop and MapReduce really are. MapReduce, as it’s known today, was first described in an academic paper published by Google in 2004. At a high level, it’s a programming model intended to facilitate large-scale parallel processing on top of a distributed data store.
Key Point #1: MapReduce is a general method for designing programs to do parallel processing of data, not a specification. (It’s a paradigm, not an API.)
Key Point #2: An implementation of MapReduce operates on a distributed data store of some kind.
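To make Key Point #1 concrete, here’s a toy sketch of the paradigm in plain Python, with no framework involved. The user supplies only a map function and a reduce function; everything else (partitioning, shuffling, running the phases in parallel across nodes) is the runtime’s job. This is an illustration of the idea, not any particular implementation.

```python
# A toy, single-process illustration of the MapReduce paradigm:
# the user writes map_fn and reduce_fn; the "runtime" below does the rest.
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # "Map" phase: each record independently yields (key, value) pairs,
    # which is what lets a real runtime process records in parallel.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)   # the "shuffle": group values by key
    # "Reduce" phase: each key's values are folded independently.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The canonical example: word count.
counts = map_reduce(
    ["the quick brown fox", "the lazy dog"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: sum(ones),
)
```

Note that nothing in `map_reduce` cares where `records` came from, which is exactly the point of Key Point #2: the paradigm is agnostic to the underlying data store.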
Given these two points, a general parallel processing architecture based on MapReduce looks like this:
In Google’s case, that underlying data store is a proprietary distributed file system called the Google File System (GFS), described in a 2003 paper of its own. Here’s what Google’s parallel processing architecture looks like:
Hadoop* is an open source implementation of these two academic papers.
- The Hadoop Distributed File System (HDFS) is an interpretation of the Google File System paper
- Hadoop’s MapReduce Engine, which comprises the JobTracker, TaskTrackers, and related APIs, is an interpretation of Google’s MapReduce paper
Key Point #3: Hadoop is not MapReduce. Hadoop is the combination of one implementation of the MapReduce paradigm (which is different from Google’s implementation) and the HDFS data store.
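For a flavor of what Hadoop’s own interface looks like in practice, consider the Hadoop Streaming contract: the mapper and reducer are ordinary programs that read lines on stdin and write tab-separated `key\tvalue` lines on stdout, with the framework sorting mapper output by key in between. The sketch below is word count again, written against that contract (the local simulation at the bottom stands in for the framework’s sort-and-shuffle step):

```python
# Word count written against the Hadoop Streaming contract: plain programs
# exchanging "key\tvalue" lines, with a sort by key between the two phases.
from itertools import groupby

def mapper(lines):
    """Emit a (word, 1) pair, as a tab-separated line, for every word seen."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(sorted_lines):
    """Sum counts per word; Streaming guarantees input arrives sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # In a real job, mapper and reducer would be two separate scripts passed
    # to hadoop-streaming via -mapper and -reducer; here we simulate the
    # framework's sort-and-shuffle locally on a tiny input.
    mapped = sorted(mapper(["to be or not to be"]))
    for line in reducer(mapped):
        print(line)
```

Same word count, noticeably different API than the toy sketch above, even though the underlying paradigm is identical.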
Here’s what Hadoop’s parallel processing architecture looks like:
*For the purposes of the above discussion, the term “Hadoop” refers to Apache Hadoop. There are now a number of flavors of Hadoop, which share varying amounts of code (in other words, you could split the above “Hadoop” architecture into Apache Hadoop, Cloudera Hadoop, IBM Hadoop, etc.).
Hadoop grew out of the open source Nutch project and matured largely at Yahoo!, but Yahoo! was far from the only one taking note of what was coming out of Google. The GFS and MapReduce papers quickly became the focus of much of academia – especially at Stanford, where Google was formed. Stanford, of course, was also the birthplace of Aster Data (11 of the first 12 employees at Aster Data were Stanford grads – we had one Michigan alum for diversity’s sake).
Our foremost goal while designing nCluster was to enable users to gain the greatest possible insights into their data, and we all agreed very early on that SQL was not sufficient for this. While exploring different ways to expand beyond SQL, we quickly converged on the major revelation captured in Key Point #2 above: the MapReduce paradigm could work on top of any distributed data store; it didn’t have to be a distributed file system like GFS.
So, why not put an implementation of MapReduce on top of a distributed relational data store…?
Key Point #4: Aster Data’s SQL/MR is the combination of yet another implementation of the MapReduce paradigm (which is different than both Google’s and Hadoop’s implementations) and Aster Data’s distributed relational data store.
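To see what Key Point #4 means in practice, here’s a hypothetical sketch (deliberately not Aster Data’s actual SQL/MR syntax) of the same map/reduce pattern applied to rows of a relational table rather than lines in a distributed file system. The table, column names, and functions are invented for illustration:

```python
# Hypothetical illustration of MapReduce over a relational store: the input
# is typed table rows, and the output is itself a row set that further SQL
# could consume. This is a concept sketch, not Aster Data's API.
from collections import defaultdict

rows = [  # a toy "clicks" table: (user_id, page)
    (1, "home"), (1, "pricing"), (2, "home"), (2, "home"), (3, "docs"),
]

def map_fn(row):
    user_id, page = row
    yield (page, 1)          # emit one count per page view

def reduce_fn(page, counts):
    return sum(counts)       # total views per page

groups = defaultdict(list)
for row in rows:                       # in a distributed relational store,
    for key, value in map_fn(row):     # this loop would run per partition
        groups[key].append(value)
views = {page: reduce_fn(page, vals) for page, vals in groups.items()}
```

The map and reduce logic is the same shape as before; only the data store underneath has changed, which is precisely the flexibility Key Point #2 promised.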
Here’s what Aster Data’s parallel processing architecture looks like:
Now, let’s compare all of these architectures and see what they have in common:
All of them are clearly interpretations of the “reference” architecture, but other than that…not much! And therein lies the problem.
Back To The Task At Hand
So let’s return to Mr. Kobielus’ post, but first, let’s broaden the discussion. Instead of asking what it will take to increase adoption of Hadoop, let’s ask what it will take to increase adoption of MapReduce, in general. Why? Because that’s what the real market for advanced analytics is today. When customers are looking for analytical platforms capable of going beyond SQL, they’re not only looking at Hadoop, but also Big Data RDBMSs with MapReduce support, as well as emerging platforms that offer other alternatives (such as the recently announced Hadapt platform, which looks to provide a MapReduce interface atop two distributed data stores).
Let’s start by addressing Mr. Kobielus’ first point: that Hadoop (MapReduce) success requires an increase in support by EDW vendors. This is simply not true. There’s nothing special about the current set of EDW vendors (either in their own minds or those of their customers) which anoints them guardians of all things analytical. In this respect, I agree with Curt Monash’s response to Mr. Kobielus’ post: Hadoop and MapReduce do not require EDW support in order to gain legitimacy in the enterprise. (If anything, those vendors who fail to jump on the bandwagon will be left behind as support for MapReduce becomes an expected feature.)
Mr. Kobielus’ second and third points are related: he claims that in order for Hadoop to increase in adoption, it’s necessary for vendors to converge upon a stack and submit that stack for standardization. This couldn’t be further from the truth! The true power of MapReduce comes from the fact that it can operate on any distributed data store (Key Point #2). There’s absolutely no reason why we should want to limit ourselves to a single underlying data store. Doing so would stifle competitive innovation while doing absolutely nothing to increase adoption. Would someone considering IBM’s Hadoop distribution really reject it because the underlying data store isn’t compatible with EMC’s? Or for that matter, Aster Data’s?
Of course not. The problem isn’t that the stacks aren’t compatible, it’s that the APIs aren’t. Think about it:
- Programmers and data scientists using a given MapReduce implementation can’t easily transfer their analytics between platforms
- Ecosystem vendors (BI vendors, ETL vendors, etc.) can’t easily optimize their products across all MapReduce platforms
- Library vendors can’t easily create sets of functions that support all MapReduce platforms
Not to mention the fact that, with the exception of Aster Data’s SQL/MR, the only people who can actually use a MapReduce platform today are the subset of programmers who understand the paradigm!
In part two of this post, I’ll expand upon my claim that the lack of a standardized MapReduce API is the real limiting factor in adoption of MapReduce platforms, both for users and ecosystem partners, and discuss ways in which the broader Big Data industry can spur adoption.