As a concrete extension to my thoughts on Wide Virtualization, I've started a new project called Cascading. Put simply, it is a pipe and filter abstraction over map/reduce as implemented by Hadoop.
What's interesting here is that pipe assemblies are 'compiled' at runtime into an optimized set of map/reduce jobs, which are then executed on a Hadoop cluster. Among many benefits, this makes assemblies reusable and keeps the code concise.
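To make the 'compilation' idea concrete, here is a toy sketch of how a linear pipe assembly might be split into map/reduce jobs. It is not Cascading's planner or API: `ToyPlanner`, `Step`, `Kind`, and `plan` are hypothetical names, and the only rule assumed is that one job can contain a single grouping step, with the steps before it running map-side and the steps after it running reduce-side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy planner; none of these names come from Cascading.
final class ToyPlanner {
    enum Kind { EACH, GROUP_BY }

    // One element of a linear pipe assembly.
    static final class Step {
        final Kind kind;
        final String name;

        Step(Kind kind, String name) {
            this.kind = kind;
            this.name = name;
        }
    }

    // Splits the assembly into jobs. Each job holds at most one GROUP_BY,
    // so a second GROUP_BY closes the current job and starts a new one.
    static List<List<Step>> plan(List<Step> assembly) {
        List<List<Step>> jobs = new ArrayList<>();
        List<Step> current = new ArrayList<>();
        boolean grouped = false;
        for (Step step : assembly) {
            if (step.kind == Kind.GROUP_BY) {
                if (grouped) {
                    jobs.add(current);
                    current = new ArrayList<>();
                }
                grouped = true;
            }
            current.add(step);
        }
        if (!current.isEmpty()) {
            jobs.add(current);
        }
        return jobs;
    }
}
```

Under this assumption, an assembly of parse, group by word, count, group by count, and take the top entries would come out as two jobs: the first three steps in one, the last two in the other. A real planner would also have to handle branching assemblies and joins, which this sketch ignores.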
Unlike Pig and Jaql, it is an API, not a new language. Consequently, Pig and Jaql could be used as participants in a user-defined assembly. Thus they are complementary to Cascading.
Further, Cascading allows the user to compose reusable pipe assemblies into 'make'-like processes that rebuild target data sets only if they are stale. For jobs that can run for days, or result data sets that have many dependencies, this can significantly reduce complexity, and it can cut running times when there are failures, since data sets that are already up to date need not be rebuilt.
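The 'make'-style rule itself is simple, and a minimal sketch of it follows. This is an illustration under stated assumptions, not Cascading's implementation: `StaleCheck` and `isStale` are hypothetical names, staleness is judged purely by modification time, and the file-based overload reads the local file system rather than HDFS.

```java
import java.io.File;
import java.util.List;

// Hypothetical sketch of a 'make'-style staleness check; not Cascading's API.
final class StaleCheck {
    // Timestamps are epoch milliseconds. 0 means "does not exist", which is
    // what java.io.File.lastModified() returns for a missing file.
    static boolean isStale(long targetModified, long... sourceModified) {
        if (targetModified == 0L) {
            return true; // the target has never been built
        }
        for (long source : sourceModified) {
            if (source > targetModified) {
                return true; // a source changed after the target was built
            }
        }
        return false;
    }

    // Convenience overload that reads timestamps from the local file system.
    static boolean isStale(File target, List<File> sources) {
        long[] times = new long[sources.size()];
        for (int i = 0; i < times.length; i++) {
            times[i] = sources.get(i).lastModified();
        }
        return isStale(target.lastModified(), times);
    }
}
```

A driver would walk the assemblies in dependency order and run only those whose target reports as stale, so a rerun after a failure skips everything that already finished.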
As it says on the Cascading website, it is still being readied for public release, but we are soliciting alpha and beta testers.