The Whys and Hows of Hadoop

I've been directly or indirectly involved in a fair number of Hadoop applications and deployments over the last year or so. In that time I've come to sort Hadoop usage into two "whys" and two "hows".

The two reasons companies adopt Hadoop are that they either have too much data for any traditional tool to handle, or they have a large number of applications that benefit from a single filesystem namespace and a distributed execution space.

The first reason is pretty straightforward and is why many 'web 2.0' startups have Hadoop installations.

The second reason is why I became involved with Hadoop. The effect of tying any number of nodes into a single virtual computer is just too useful. I've expanded on this point a couple of times in earlier posts. This model is probably best characterized as Platform as a Service (PaaS).

Of note is that once the 'cost' of deploying a Hadoop cluster is paid, more and more applications are developed for it.

These applications fall into two primary use cases. The first and most common is querying the data stored on the Hadoop cluster. No shock there.

The second is data integration and processing. That is, pulling large (if not downright huge) data sets on a schedule and processing them for load into other systems (like an RDBMS or a key/value store like HBase).

Some would try to position this as ETL (Extract, Transform, and Load). But there is typically a much stronger orchestration and workflow component to it. More importantly, these processes become first-class citizens in the data center, not long-ignored cron jobs off in a dark corner.

Obviously, real-world applications don't always fit neatly into these buckets. But these use cases do explain why Pig, Hive, and Cascading exist.

If you are primarily doing ad-hoc queries against large data, you are likely an analyst who is comfortable with SQL-like dialects, and you are probably only running these queries once or a few times before moving on.

For this kind of work against large data, Pig and Hive are attractive. That said, they probably make little or no sense under the PaaS model; there you might as well stick with an RDBMS.

But if you need to clean and push 100 GB of data every day into an Aster Data nCluster (for mission-critical complex analysis), or you need to build a web crawler that better fits your business model, you are likely a developer, and you are likely building production, business-critical applications.

For this, Cascading is your only choice if you don't want to write raw MapReduce applications.
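To give a sense of what "raw MapReduce" asks of a developer, here is a minimal in-process sketch of the model in Python. It is not Hadoop's API and it does not run on a cluster. The names `map_words`, `reduce_counts`, and `run_job` are made up for illustration, and the shuffle step that Hadoop performs across nodes is simulated here with a dictionary.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce step: collapse all values seen for one key into a total."""
    return word, sum(counts)


def run_job(
    lines: Iterable[str],
    mapper: Callable[[str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> Dict[str, int]:
    """Hypothetical single-process driver: map, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            # Stand-in for the shuffle: collect every value under its key.
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["hadoop ties nodes together", "nodes run hadoop jobs"]
    print(run_job(sample, map_words, reduce_counts))
```

Even a word count needs a mapper, a reducer, and a driver. A real data integration job chains many such steps together, which is the bookkeeping a higher-level tool takes off your hands.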

And with Amazon's recently announced Elastic MapReduce service, the 'cost' of deploying a production Hadoop cluster has just come down for those who don't want to know the details of how a Hadoop cluster is configured and maintained.

The great thing is that Cascading supports this model quite well, and officially supports Elastic MapReduce. There is even a simple command line tool for slicing and dicing large datasets called MultiTool (source code).

