Repost of "How fast is Cascading compared to Pig or Hive?"


Here is a repost of an answer I gave on Quora: "How fast is Cascading compared to Pig or Hive?"

Assuming you are actually creating business processes that need to be optimized because they are run repeatedly over new and existing data, Cascading will almost always be faster in the long run.

This is true because the developer is given direct access to the query planner and reusable processing operations instead of having them hidden behind a feature-poor syntax (Mahout exists for a reason).

As you begin to understand your data and your business goals, you can implement optimizations that can’t be hidden or abstracted away. Computing is always messy and abstractions are never perfect; see the “Law of Leaky Abstractions”.

http://www.joelonsoftware.com/articles/LeakyAbstractions.html

For example, “Parallel Set-Similarity Joins” can easily be implemented with Cascading in hours in order to shave days off processing time, as can iterative processes like computing PageRank or finding the connected components of a graph.

http://asterix.ics.uci.edu/fuzzyjoin-mapreduce/

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5076317&tag=1
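
To make "iterative process" concrete, here is a minimal sketch of connected components by label propagation. It is plain Java, not Cascading code, and the class and method names are made up for illustration. The assumption is that each pass of the loop stands in for one job in a chain, and the chain stops when a pass changes nothing.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a single-machine stand-in for an
// iterative job chain. None of these names come from the Cascading API.
class ConnectedComponents {

    // Labels every vertex with the smallest vertex id in its component.
    static Map<Integer, Integer> label(List<int[]> edges) {
        Map<Integer, Integer> labels = new HashMap<>();

        // Start with every vertex labeled by its own id.
        for (int[] e : edges) {
            labels.put(e[0], e[0]);
            labels.put(e[1], e[1]);
        }

        // Repeat passes over the edge list until no label changes.
        // Labels are updated in place, so this may converge in fewer
        // passes than a strictly synchronous chain of jobs would.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : edges) {
                int min = Math.min(labels.get(e[0]), labels.get(e[1]));
                if (labels.get(e[0]) != min) {
                    labels.put(e[0], min);
                    changed = true;
                }
                if (labels.get(e[1]) != min) {
                    labels.put(e[1], min);
                    changed = true;
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Two components: {1, 2, 3} and {4, 5}.
        List<int[]> edges = Arrays.asList(
                new int[] {1, 2}, new int[] {2, 3}, new int[] {4, 5});
        System.out.println(label(edges));
    }
}
```

The point is the "loop until nothing changes" control flow. When you own the job chain through an API, that stopping condition is yours to write and tune.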

Because Cascading is an API, you can build reusable frameworks and libraries and write unit tests in any JVM-based language. See Cascading.JRuby, Cascading-Clojure, Cascalog and Bixo for great examples (many of which have great user/developer communities and consultants).

http://www.cascading.org/modules.html

Companies that are in the business of data are re-tooling their existing frameworks or building new infrastructures directly on Cascading.

http://aws.amazon.com/solutions/case-studies/razorfish/

http://www.sdtimes.com/blog/post/2009/08/18/9-reasons-FlightCaster-s-the-Future.aspx

http://delicious.com/cwensel/cascading

The developers at these companies and many others are building their businesses on Cascading and Hadoop-powered frameworks tailored specifically for their business needs because they have more control, flexibility, and opportunities for state-of-the-art optimizations.
