This blog isn't dead yet, just spending my energy growing my company.
If interested in helping, we are seed funded and hiring.
A 40-minute video of me butchering common patterns in MapReduce at Buzzwords this year.
Chris Wensel COMMON MAPREDUCE PATTERNS from ntc GmbH on Vimeo.
I was approached by a fellow at Strata who asked me why Cascading wasn't free or open-source. I replied, "It is open-source and totally free." He then asked me why I don't accept patches, clearly a misconception that's been floating around (see the end). Of course I replied, "Because your patch will suck." Then I explained why.
At the SF Hadoop User Group last night, a question was posed as to what factors justify the use of an Apache Hadoop cluster vs. traditional approaches.
The answer isn't black and white but can be broken down into three intertwined heuristics.
Hadoop is more likely justified,
Thus Hadoop isn't strictly about huge data-sets, but also about absorbing complexity while maintaining scale.
FlightCaster, a Cascading user, doesn't have huge amounts of data, but they do have a very hard business problem, and Hadoop for them is completely justified.
Facebook, on the other hand, has huge data, but their tool of choice, Hive, doesn't encourage solutions to complex problems by virtue of being a SQL-based syntax. Hive was initially used to extract small data from the cluster for use by other systems or custom Hadoop jobs.
From Secrets of BackType's Data Engineers:
Cascalog is one of their secret weapons, a Clojure-based query language for Hadoop that makes it simple for them to analyze their data in new ways. Inspired by the venerable Datalog, and built on top of Cascading, it allows you to write queries in Clojure and define even complex operations in simple code. Unlike alternatives like Pig or Hive, it's written within a general-purpose language, so there's no need for separate user-defined functions, but it's still a highly-structured way of defining queries.
It's worth noting that Cascalog is a distant child of cascading-clojure, created and used by FlightCaster. FlightCaster was acquired this week.
I've just pushed Cascading 1.2 up. It has a number of performance improvements everyone will benefit from out of the box.
Here is a repost of an answer I gave on Quora: "How fast is Cascading compared to Pig or Hive?"
The new Strata Conference has just been announced with a Call for Proposals ending Sept 28. This new conference is on the 'business of data' and is the sister conference to Velocity. I'm excited to be a committee member.
I'll be at (and a sponsor of) the BigDataCamp the night before the Hadoop Summit. Sign up if you haven't.
The first ever Cascading User Group will be this Thursday, September 24th, at RapLeaf.
There will be discussions on the future of Cascading, the work done by the FlightCaster folk integrating Cascading with Clojure, and various tips and techniques.
Hope to see you there.