The prolific tech people at Facebook have started the process of contributing their Hive framework to Apache. You can read about it in HAPOOP-3601. To quote, "Hive is a data warehouse built on top of flat files (stored primarily in HDFS)." Since the most common question I get is how does Cascading relate to Pig, I thought it best to offer up my comments on how Cascading relates to Hive.
After reading the Hive tutorial, the previous quote nicely sums it up. It is data-warehousing for Hadoop.
First off let me say the Hive team have done an outstanding job. The idea of Pig is immediately comprehensible, provides a concise language for interacting with data in a Hadoop cluster via a roughly SQL syntax. This is immediately attractive to managers who fear the risk of introducing something as foreign as MapReduce to their environment. Hive has clearly taken this to a new level.
The Hive team has taken what's familiar to the Enterprise and layered it over Hadoop. Hive could be one of the tools that helps bring Hadoop into the traditional Enterprise to compete with incumbents like Oracle.
But I see three difficulties. First, data-warehouses are simply a pre-populated cache. Second, the cached data is typically stuffed into a 'one size fits all' schema. Third, manipulation of the cached data is via a narrow and incomplete syntax.
Let me try to briefly explain my issues.
The typical data-warehouse has four components. The original source data, the ETL processes that load the raw data into the warehouse, the warehouse database itself (schema and stored data), and reporting and mining tools.
In this model, there are two useful architectural components being employed. First, the reduced coupling through a hub and spoke data model. That is, raw data is normalized into a common schema so external applications are insulated from changes to the systems producing the raw data. But due to resource constraints, the schema tends to be designed as 'one size fits all' (think star schema or dimensional models).
Second, is the data-warehouse as a cache. The schema reduces coupling, but for performance sake, the data is pre-loaded, whether it is needed or not.
You are always pre-caching the data, and because of the size of the data and time it takes to process, developers amortize the cost by using a generic data structure to store the data. Forcing some data to be processed and stored that isn't currently needed.
Instead of a schema, what would be useful is a View or a Facade. A Facade kinda like a Schema for programming interfaces. Applications couple to the Facade implementation, and behind the scenes the Facade manages the complexity to various interfaces it abstracts. IoC applications are a great example (see Spring Framework).
With Hadoop, the developer no longer needs to build caching into the architecture. A properly designed system will store raw data on a Hadoop filesystem, and have the ETL processes really look like Facades (think views of views) to the data for dependent applications and processes.
This way the developer can choose how much data to expose to dependent applications, or the framework can adapt on the fly and optimize the processing.
Further, if this is provided through a feature complete programming language, there won't be any restrictions on the kinds of processing, or the need to augment the original queries from the query language with another full language. The later typically manifests as stand-alone ETL used to feed the query engine or extract query results.
So let me phrase this a bit differently.
If you start with stored raw data, developers can layer facades over the data that render out the raw data into the data they need. Over time, new applications can couple to this facade, or more abstract ones can be refactored out. Effectively you end up with a tree of dependencies from the raw data to the various applications using the data. This satisfies the hub/spoke model of managing complexity, especially where the trunk of the tree is thickest.
If some nodes in this tree are heavily used, they can be cached as intermediate data transparently, this would not require an architectural change, just how the data is passed from one node to the next through the tree.
Further, if the raw data changes in some form, the facades can be used to insulate the changes.
Hadoop allows the developer to make fewer decisions based on being resource constrained because data isn't tied to a particular cpu or disk array.
Now data can be lazily evaluated, and where it isn't being evaluated fast enough, it can be cached for use.
This is the fundamental model behind Cascading.
To sum up...
Hive enforces a static model of data, and pushes ETL processes off to the side, possibly increasing complexity of the whole system.
Cascading takes a lazy view, and allows the developer to decide how she wants to couple data, and whether or not to cache the data between dependent processes.
That all said, Hive isn't wrong, there are many applications where the static data-warehouse model is best. But there are many where this legacy architecture model isn't best. So in applications where you really do need Hive, make sure you use Cascading to define the processes that load and extract data to and from Hive.
Hi Chris,
Great essay.
From what I understand of Hive, it sort of does work like a stream. A table in Hive is really just a 'view' into raw data. An HQL query is translated into a number of MapReduce jobs which internally probably look something similar to a flow in Cascading.
You mention caching:
How does that work in Cascading? Is it smart enough to not run a stage in the pipes if, say, the input data is older than the output data?
Yes, when a set of Flow instances are run via a Cascade, they will not be re-run if their sinks aren't stale or the 'skip strategy' says they aren't stale.
Just like a make/ant build.