It’s The API, Stupid! (Part 3)

Last week was definitely a busy one in the MapReduce world!  At the annual Hadoop Summit, Yahoo! officially announced the spinoff of HortonWorks (possibly the worst-kept secret in the Hadoop community), and Cloudera and MapR both announced new distributions.  With even more fragmentation coming to the Hadoop community, what better time to wrap up this series on the state of MapReduce?

In my previous two posts, I argued that the current state of MapReduce APIs is the fundamental limiting factor in widespread adoption of MapReduce-based technologies, such as Hadoop.  Specifically, I argued that there are two big problems with today’s MapReduce APIs:

  1. The APIs of various MapReduce-based technologies are generally not compatible with one another
  2. With a few notable exceptions, the APIs of most MapReduce-based technologies are not accessible to the majority of users

In Part 2 of this series, I discussed at length the impact of a lack of compatibility on ecosystem development, and how that holds back enterprise adoption.  In this post, I’ll tackle the second problem with today’s MapReduce APIs: most people can’t use them!

The Impact of API Accessibility

Let’s start with why MapReduce is so appealing.  Whatever some SQL devotees may believe, the fact of the matter is that SQL has its limitations.  While SQL can be made Turing complete, practically speaking, there are many things that are not feasible, especially when large amounts of data are involved.  Moreover, expressing a complex algorithm in SQL is often so cumbersome that (a) the query becomes difficult for even its author to understand, and (b) the need to work within the confines of SQL syntax produces an unnecessarily complex execution path (and, therefore, a query that takes an unnecessarily long time to execute).

The MapReduce paradigm, in contrast, provides an elegant, language-agnostic method of executing arbitrary algorithms in parallel against large datasets.  At Aster Data, we frequently demonstrated this with examples showing that a 100-line SQL query could be reduced to fewer than 10 lines of MapReduce code.  Not only was the MapReduce version easier to write, it also executed drastically faster.  So if MapReduce is so elegant, why not use it for everything?
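
For readers who haven’t seen the paradigm up close, here is a minimal, single-process sketch in Python.  It is not the API of Hadoop, Aster Data or any other product; the `map_reduce` helper, the `emit_page` and `total` functions and the click log are all hypothetical names made up for illustration.  The point is only the shape of the model: a mapper emits key/value pairs, the framework groups them by key, and a reducer collapses each group.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy, in-memory stand-in for a MapReduce framework (illustrative only)."""
    groups = defaultdict(list)
    # Map phase: each record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: collapse each key's values into a single result.
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical click log of (visitor, page) pairs.
clicks = [("ann", "/home"), ("bob", "/home"), ("ann", "/cart"), ("ann", "/home")]


def emit_page(record):
    visitor, page = record
    yield page, 1  # one view of this page


def total(page, counts):
    return sum(counts)


page_views = map_reduce(clicks, emit_page, total)
# page_views == {"/home": 3, "/cart": 1}
```

In a real implementation the map and reduce calls run on many machines at once, which is where both the power and the complexity come from.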

Today’s MapReduce APIs Are Complicated

The problem is that, as of today, the primary means of interfacing with MapReduce implementations is through complex programming APIs.  To use MapReduce effectively, one must not only be a relatively accomplished programmer, but also be able to grasp the nuances of parallel programming.  While many programmers undoubtedly find this reasonable, we need to remember that there are many intelligent, insightful people who haven’t the faintest idea what a class statement is.  The lack of an easy-to-use interface effectively renders MapReduce inaccessible to a large percentage of the workforce.

We programmers sometimes get so caught up in our own world – one where anyone who can’t code must not have much to contribute – that we forget that we make up only a tiny fraction of the workforce.  The next thing you know, people are predicting that Hadoop is going to take over the world and doomsayers are calling for the death of the RDBMS.  Kind of reminds you of the folks who twenty years ago predicted that Windows was dead and Linux would rule the world, doesn’t it?

SQL’s Not Going Anywhere, So Get Over It

Here’s the problem: the people predicting that SQL will go the way of the dinosaur are conveniently forgetting that the vast majority of the workforce doesn’t know how to program, much in the same way that the Linux advocates of years past ignored the fact that most people freak out when they see a command line.  Just as the number of Linux users remains small relative to the total workforce, so too will the number of MapReduce programmers.  In order to appeal to the broader enterprise workforce, MapReduce implementations need an interface that’s accessible to and usable by that workforce.  And that interface is going to be SQL.

Whether some like it or not, SQL is a prevalent, easy-to-learn declarative language that is widely used by non-programmers.  There are literally hundreds of thousands, if not millions, of business analysts, planners, marketers and other valuable employees who can (and do) use SQL on a daily basis, either directly or indirectly (via BI and other ecosystem tools).  For MapReduce platforms to tap into this workforce, a similarly easy-to-use/easy-to-learn interface needs to be developed – and standardized.

Unfortunately, the level of attention paid to usability at last week’s Hadoop Summit is best captured by the following tweet from Cloudera’s CTO and Co-Founder, Amr Awadallah:

@awadallah (Amr Awadallah):
We (@cloudera) just launched SCM Express for free, I think even my mom can quickly install a Hadoop cluster now :) http://goo.gl/cUYfm

(It’s a shame his mom won’t have a clue what to do after the installation completes!)

The first effort at bridging SQL and MapReduce, the natural first step to tapping into the broader enterprise workforce, came from Hadoop power user Facebook, which presented Facebook Hive (the basis for Apache Hive) to the world at the March 2008 Hadoop Summit.  Later that year, Big Data RDBMS vendors Aster Data and Greenplum announced their MapReduce platforms.  [Corrected based on feedback from Jeff H. below.]  These interfaces still need to mature and standardize (as discussed in my previous post) before we see real adoption, but they’re a step in the right direction.  Of course, bridge interfaces only allow SQL analysts to access MapReduce programs written by programmers.  True acceleration will come when a vendor develops an easy-to-use interface for non-programmers to create their own MapReduce functions, but that’s clearly many years away.  For now, the focus needs to be on providing SQL users with easy access to MapReduce programs written by others.
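
To illustrate the division of labor such a bridge implies, here is a small analogy using Python’s built-in `sqlite3` module.  To be clear, this is not Hive, Aster Data or Greenplum syntax, and nothing here runs in parallel; SQLite’s user-defined aggregates simply happen to show the same pattern on a laptop.  The `LongestStreak` class, the `longest_streak` function name and the `visits` table are all hypothetical.  A programmer writes the procedural logic once, and an analyst then calls it from plain SQL.

```python
import sqlite3


class LongestStreak:
    """Programmer-written logic: longest run of consecutive days in a group."""

    def __init__(self):
        self.days = set()

    def step(self, day):
        # Called once per row in the group.
        if day is not None:
            self.days.add(int(day))

    def finalize(self):
        # Called once per group; walk each run from its first day.
        best = 0
        for day in self.days:
            if day - 1 not in self.days:
                length = 1
                while day + length in self.days:
                    length += 1
                best = max(best, length)
        return best


conn = sqlite3.connect(":memory:")
# The programmer registers the function under a SQL-callable name.
conn.create_aggregate("longest_streak", 1, LongestStreak)

conn.execute("CREATE TABLE visits (visitor TEXT, day INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("ann", 1), ("ann", 2), ("ann", 3), ("ann", 7), ("bob", 4), ("bob", 6)],
)

# The analyst only ever sees this part: ordinary SQL.
rows = conn.execute(
    "SELECT visitor, longest_streak(day) FROM visits "
    "GROUP BY visitor ORDER BY visitor"
).fetchall()
# rows == [("ann", 3), ("bob", 1)]
```

The analyst never has to read the class definition, which is exactly the property a SQL-MapReduce bridge needs to offer.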

Wrapping It All Up

In this three-part series, I discussed the current state of MapReduce and argued that the fundamental limiting factor in its adoption by enterprises is the state of its APIs, namely the issues of API fragmentation across implementations and the lack of interfaces that are accessible to non-programmers.  There’s obviously a lot of activity in the MapReduce world, and both the general paradigm and the open source Hadoop implementation are clearly here to stay.  Vendors are jumping on the bandwagon as fast as they can, but none of their offerings yet appeal to enterprises beyond early adopters.  The first vendor to get this right, and provide both broad ecosystem support and interfaces which are accessible to both the MapReduce programming community and the broader SQL community, will be well positioned in the battle for Big Data supremacy.

3 Responses to It’s The API, Stupid! (Part 3)

  1. Jeff Hammerbacher says:

    Hey Chris,

    You have the ordering wrong. Aster Data and Greenplum (8/08) both followed the announcement of Hive (3/08). Microsoft’s SCOPE was in the mix as well–the paper wasn’t presented until 8/08 at VLDB, but I recall reading it in the spring of 2008, well before the Aster Data and Greenplum announcements.

    Regards,
    Jeff

  2. Chris says:

    Jeff,

    You’re obviously 100% correct as to the order of announcements, although it does raise an interesting question:

    I had used the date of the first stable release of Apache Hive as my reference point, rather than the talk given by the Facebook team at the March 2008 Hadoop Summit, as I was trying to identify the point in time when the vendors/communities consciously sought to “[tap] into the broader enterprise workforce”. For Greenplum and Aster Data, there can be no question as to when that occurred, but for Hadoop, was it the point that Facebook sought to do that (at or before March 2008), the point when the first stable Apache release occurred (April 2009), or somewhere in between?

    In retrospect, I think it’s reasonable to say that the greater Hadoop community was at least thinking about this question before the Greenplum/Aster Data announcements, so I stand corrected.

    - Chris

  3. Jeff Hammerbacher says:

    Hey Chris,

    Hive was available via Apache in June of 2008: https://issues.apache.org/jira/browse/HADOOP-3601.

    Regards,
    Jeff
