I was approached by a fellow at Strata who asked me why Cascading wasn't free or open-source. I replied that it is both: "it is open-source and totally free". He then asked me why I don't accept patches, clearly a misconception that's been floating around (see the end). Of course I replied, "because your patch will suck". Then I explained why.
At the SF Hadoop User Group last night, a question was posed as to what factors justify the use of an Apache Hadoop cluster vs. traditional approaches.
The answer isn't black and white but can be broken down into three intertwined heuristics.
Hadoop is more likely justified,
- the larger the corpus of data needed to satisfy the business problem (big-data).
- the more complex the processes and algorithms required to satisfy the business problem.
- and, the more distinct business problems need concurrent or overlapping access to a corpus of data (multi-tenancy).
Thus Hadoop isn't strictly about huge data-sets, but also about absorbing complexity while maintaining scale.
FlightCaster, a Cascading user, doesn't have huge amounts of data, but they do have a very hard business problem, and Hadoop for them is completely justified.
Facebook, on the other hand, has huge data, but their tool of choice, Hive, doesn't encourage solutions to complex problems by virtue of being syntax- and SQL-based. It was initially used to extract small data from the cluster for use by other systems or custom Hadoop jobs.
Cascalog is one of their secret weapons, a Clojure-based query language for Hadoop that makes it simple for them to analyze their data in new ways. Inspired by the venerable Datalog, and built on top of Cascading, it allows you to write queries in Clojure and define even complex operations in simple code. Unlike alternatives such as Pig or Hive, it's written within a general-purpose language, so there's no need for separate user-defined functions, but it's still a highly structured way of defining queries.
I've just pushed Cascading 1.2 up. It has a number of performance improvements everyone will benefit from out of the box.
Here is a repost of an answer I gave on Quora: "How fast is Cascading compared to Pig or Hive?"