Even though I am not a Hadoop team member or committer, I do spend a good amount of time visiting companies evangelizing Hadoop and, of course, Cascading. But my message is not about how fast Hadoop applications run; that would be missing the point of Hadoop almost entirely.
Hadoop scales, and it scales linearly. It scales in storage capacity, and in compute capacity.
Along with scaling, it is fault tolerant in the face of hardware failures.
This has tremendous implications for the Enterprise. I've mentioned these points before when describing Wide Virtualization.
Most developers are attracted to Hadoop to solve their Big Data / Complex Algorithm problem. That is, they have this huge file they can't really fit into an RDBMS, they have some complex routines in Python that do not map to SQL, or, most commonly, they have both issues.
So they boot up Hadoop and, after much fussing about, they have a solution, and it leaks into production. Sometimes they even discover it is still running after a node or disk failed and no one noticed.
But there is another side to this that is much more interesting. This is where Google gets its true efficiencies. That is, MapReduce+GFS enabled Google to process web-scale data, but it also simplified their data center.
A Hadoop cluster has two main components: a global, or universal, filesystem namespace, and a global execution space. Together, they make a single computer. A computer that can grow or shrink in capacity, shrinking whether it was intended to or not.
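To make the "single namespace" part concrete, here is a minimal sketch using the Hadoop FileSystem API (the path below is hypothetical). The point is that a client or job names data by path alone; which machines actually hold the blocks never enters the picture.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NamespacePeek
  {
  public static void main( String[] args ) throws Exception
    {
    Configuration conf = new Configuration(); // picks up the cluster's settings
    FileSystem fs = FileSystem.get( conf );   // the one shared namespace

    // list a dataset by path; where its blocks physically live is irrelevant
    for( FileStatus status : fs.listStatus( new Path( "/data/logs/2008-09-01" ) ) )
      System.out.println( status.getPath() + " " + status.getLen() );
    }
  }
```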
Why is this important?
First, it simplifies the jobs of the operations folk. They only manage and deploy Hadoop for data processing and analytics. And if there are hardware failures, there is no serious rush to get the node back online. Also, there is no more NFS, and all the mess associated with its security and networking goes with it.
Second, it removes much of the variability in how to deploy your data processing application. You don't need to find free disk space to store authoritative data, intermediate data, or result data. And you don't need to find a machine that isn't already overused to host and execute your binaries. Nor do you need to include the data center operations or DBA folk in your deployment planning, unless you need to increase the size of the cluster.
This alone improves asset utilization cluster-wide. Jobs run on machines that are currently underutilized. There is no need to move data or applications when newer or beefier hardware is deployed either, and older machines can still stay in the game. Try that with VMware and Xen.
Third, dependent applications are coupled by data, not by network topology. The whole "where do I get the data from" question has just been simplified to a filesystem path and a data format. The "how" has been abstracted away, where "how" could have been "over HTTP", "over JDBC", etc.
This last point is, of course, predicated on new data actually being pushed onto Hadoop. Emerging projects are giving the Hadoop "OS" syslog-like functionality, but that is a topic for another post.
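To make the third point concrete, here is a sketch of what "coupled by data" looks like with Cascading's Tap and Scheme abstractions (the 1.x-era API; names and signatures may differ slightly across versions, and the paths and field names here are hypothetical). Everything a downstream application needs to know about its input is a path in the shared namespace plus a format.

```java
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class DataCoupling
  {
  public static void main( String[] args )
    {
    // the entire "contract" with the upstream team: a path and a format
    Tap upstreamResults = new Hfs( new TextLine( new Fields( "line" ) ), "/data/warehouse/clicks/current" );

    // the same pairing describes where this application's own results go
    Tap ourResults = new Hfs( new TextLine( new Fields( "line" ) ), "/data/warehouse/sessions/current" );

    // no hostnames, ports, JDBC URLs, or HTTP endpoints are involved
    System.out.println( upstreamResults + " -> " + ourResults );
    }
  }
```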
The one thing Hadoop does not help with is providing a simple means to develop real-world applications. Hadoop works in terms of MapReduce jobs. But real work consists of many, if not dozens, of MapReduce jobs chained together, working in parallel and serially.
Cascading provides a simple API for developing these applications. More can be read on the website.
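For a flavor of the API, here is a rough sketch of the canonical word-count example (again the 1.x-era API, so exact class names and constructors may differ from whatever release you pick up; the paths are hypothetical and the regex is deliberately simplistic).

```java
import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCount
  {
  public static void main( String[] args )
    {
    // source and sink are just paths plus formats in the global namespace
    Tap source = new Hfs( new TextLine( new Fields( "line" ) ), "/data/docs" );
    Tap sink = new Hfs( new TextLine(), "/data/wordcounts" );

    // build the assembly: split lines into words, group, and count
    Pipe assembly = new Pipe( "wordcount" );
    assembly = new Each( assembly, new Fields( "line" ), new RegexGenerator( new Fields( "word" ), "\\S+" ) );
    assembly = new GroupBy( assembly, new Fields( "word" ) );
    assembly = new Every( assembly, new Count() );

    // the planner turns the assembly into however many MapReduce jobs it needs
    Flow flow = new FlowConnector().connect( source, sink, assembly );
    flow.complete();
    }
  }
```

You work in terms of Taps, Pipes, and Flows; the planner decides how many MapReduce jobs that becomes, and multiple Flows can themselves be chained and scheduled together, which is how those "dozens of jobs" stay manageable.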
And yes, with a cluster of reasonable size, your applications will run very fast. But the takeaway is that if it isn't fast enough to start with, the cluster's compute capacity can be increased. That is not an option with traditional architectures without significant expense, and even then there are no guarantees, and the result is still effectively monolithic and failure-prone, the kind of failure where everything stops.
But since Hadoop + Cascading represents a generalized platform for application development and deployment, other applications can migrate to the Hadoop cluster and leverage any low utilization periods left over by higher priority processes.
Altogether, we now have a proper platform for data processing and analytics, one that scales linearly, is tolerant in the face of hardware failures, and has a simple development interface that scales in the developer's head and across projects.
In practice, what we have is a single platform for storing data and deploying applications. A single interface for developers to push code and data. One that can grow or shrink to maintain proper utilization cluster-wide, and that normalizes heterogeneous hardware (old and new) into a single abstraction, ultimately saving people, hardware, and energy costs.
The title of this post mentions Scalability. Scalability isn't just about computing; it's also about organizations. Organizations need to scale just as applications do. Scaling is about absorbing the demand put on a system.
Cascading can absorb application complexity, Hadoop can absorb data center complexity, and these together help organizations absorb demand on their services in part by reducing internal complexity and costs.
One last thought: if you need to learn MapReduce or, more importantly, just want to get off on the right foot with Hadoop, sign your company up for a class over at Scale Unlimited. In full disclosure, I wrote and teach the Hadoop Boot Camp class. But don't let that stop you.