Cascading and Apache PIG


A recent thread on the Hadoop mailing list prompted me to throw out some quick comments on Cascading and how it compares to Apache PIG. I'm reposting those comments here with minor edits for clarity.

At one level, Cascading is a MapReduce query planner, just like PIG. The difference is that the Cascading API is meant for public consumption and is fully extensible, whereas in PIG you typically interact with the PigLatin text syntax. With Cascading, you can layer your own syntax on top of the API. Currently there is Groovy support (Groovy is used only to assemble the work; it does not run on the mappers or reducers). I hear rumors of Jython support elsewhere.

A couple of Groovy examples (note that these are obviously trivial; the DSL can absorb tremendous complexity if need be)...

Since Cascading is in part a 'planner', it internally builds a new representation of what the developer assembled via the API and, at runtime, renders out the necessary map/reduce jobs and transparently links them. As Hadoop evolves, the planner will incorporate the new features and leverage them transparently. There are also opportunities for identifying patterns and applying different strategies (hypothetically, map-side vs. reduce-side joins, for one). It is also conceivable (but untried) that different planners could target systems other than Hadoop, making your code and libraries portable. Much of this is true for PIG as well.
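To make the planner idea concrete, here is a toy sketch in Python. It is not Cascading's API or its actual planner: the `Job` class, the `plan` function, the operation kinds (`each`, `group_by`, `every`), and the splitting rule are all assumptions made up for illustration. It only shows the general shape of the idea, where a linear assembly goes in and a chain of map/reduce jobs comes out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    """One hypothetical MapReduce job (illustrative only, not a Cascading class)."""
    map_steps: list = field(default_factory=list)
    group_key: Optional[str] = None
    reduce_steps: list = field(default_factory=list)


def plan(assembly):
    """Split a linear list of (kind, name) operations into chained jobs.

    Toy rule (an assumption): operations before a grouping run map-side,
    operations after it run reduce-side, and a second grouping forces a
    new job whose input is the previous job's output.
    """
    jobs = [Job()]
    for kind, name in assembly:
        current = jobs[-1]
        if kind == "group_by":
            if current.group_key is not None:
                # The current job already has its one shuffle; start another.
                current = Job()
                jobs.append(current)
            current.group_key = name
        elif kind == "each":
            side = current.map_steps if current.group_key is None else current.reduce_steps
            side.append(name)
        elif kind == "every":
            if current.group_key is None:
                raise ValueError("'every' requires a preceding 'group_by'")
            current.reduce_steps.append(name)
        else:
            raise ValueError(f"unknown operation kind: {kind}")
    return jobs


# Word count, followed by a histogram of the counts themselves.
assembly = [
    ("each", "split_words"),
    ("group_by", "word"),
    ("every", "count"),
    ("group_by", "count"),
    ("every", "count"),
]
jobs = plan(assembly)
for number, job in enumerate(jobs, start=1):
    print(number, job)
```

The developer describes one assembly with two groupings and never says "two jobs"; the split falls out of the planning rule. A smarter planner could change that rule without the assembly changing at all.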

Cascading Technical Overview

Also, Cascading will at some point provide a PIG adapter, allowing PigLatin queries to participate in a larger Cascading 'Cascade' (the topological scheduler). Cascading is also great at integration, connecting things outside Hadoop with the work to be done inside Hadoop. And PIG looks like a great way to concisely represent a complex solution and execute it. There isn't any reason they can't work together (that has always been the intention).
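For a rough picture of what a topological scheduler does, here is a small Python sketch. It is not how Cascading implements a 'Cascade'. The `schedule` function, the flow names, and the rule that a flow depends on any other flow writing one of its inputs are assumptions for illustration. The only real library call is `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter


def schedule(flows):
    """Order flows so that every flow runs after the flows producing its inputs.

    `flows` maps a flow name to a (sources, sinks) pair of sets of
    hypothetical dataset names.
    """
    sorter = TopologicalSorter()
    for name, (sources, _sinks) in flows.items():
        sorter.add(name)
        for other, (_other_sources, other_sinks) in flows.items():
            if other != name and sources & other_sinks:
                # `name` reads something `other` writes, so `other` goes first.
                sorter.add(name, other)
    return list(sorter.static_order())


# Hypothetical pipeline: import data, let a PIG step clean it, then report.
flows = {
    "report": ({"cleaned"}, {"summary"}),
    "import": ({"raw_logs"}, {"staged"}),
    "pig_clean": ({"staged"}, {"cleaned"}),
}
order = schedule(flows)
print(order)
```

The flows are declared in arbitrary order; the scheduler recovers the run order from what each one reads and writes, which is what would let a PigLatin step sit in the middle of a larger process.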

The takeaway is that with Cascading and PIG, users do not think in MapReduce. With PIG, you think in PigLatin. With Cascading, you can use the pipe/filter based API, or use your favorite scripting language and build a DSL for your problem domain.
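As a loose illustration of thinking in pipes and filters rather than in mappers and reducers, here is a tiny Python sketch that runs locally on a list. Nothing in it is Cascading's API: `each`, `count_by`, and `pipeline` are hypothetical helpers invented for this example.

```python
from collections import Counter
from functools import reduce


def each(fn):
    """Stage that applies fn to every record and flattens the results."""
    return lambda records: (out for rec in records for out in fn(rec))


def count_by(key):
    """Stage that counts records per key value, emitting sorted (key, count) pairs."""
    return lambda records: iter(sorted(Counter(key(r) for r in records).items()))


def pipeline(*stages):
    """Compose stages left to right into a single callable."""
    return lambda records: reduce(lambda data, stage: stage(data), stages, records)


# The problem is stated as "split lines into words, then count by word".
word_count = pipeline(
    each(lambda line: line.lower().split()),
    count_by(lambda word: word),
)

result = list(word_count(["the quick fox", "The lazy dog"]))
print(result)
```

The description of the work never mentions a map phase or a reduce phase; deciding where those boundaries fall is left to whatever executes the pipeline.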

Many companies have built similar things internally, but these tend to be nothing more than a scriptable way to write map/reduce jobs and glue them together. You still think in MapReduce, which in my opinion doesn't scale well.

My (biased) recommendation is this:

Build out your application in Cascading. If part of the problem is best represented in PIG, no worries: use PIG, and let Cascading feed PIG and clean up after it. And if you see a solvable bottleneck, and we can't convince the planner to recognize the pattern and plan better, replace that piece of the process with one or more custom MapReduce jobs.

Solve your problem first with Cascading, then optimize the solution, if need be.
