For years I've used or relied upon virtualization technologies such as User Mode Linux, FreeBSD Jail, VMWare, Parallels, and Xen (by virtue of EC2). All of these represent a kind of vertical or narrow virtualization, where you have a single box and slide into it as many virtual systems as possible. The converse is a kind of wide virtualization that spans many machines.
Again, with narrow virtualization, you have one machine, but it runs many virtual machines. When the utilization of a given application is less than the capacity of its host, it makes sense to get as much from your hardware as possible by adding more applications. Complexity and coupling are managed by sandboxing each application in its own operating system.
Wide virtualization makes sense when your application needs more resources than a single host machine can provide. Or when you want to reduce system-wide complexity by not coupling applications to particular machines, even though those applications are already coupled to one another through shared libraries or data.
A distributed filesystem is a kind of wide virtualization. Files live in a global file space; where they are persisted on particular hardware is an implementation detail abstracted away from the consuming applications. A key feature usually found in these systems is replication of data across multiple machines to mitigate data loss in the face of hardware failure.
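As a rough sketch (assuming HDFS as the distributed filesystem; the path and replication factor below are made up), the consuming application deals only in paths and a replication factor, never in machine names:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
  public static void main(String[] args) throws Exception {
    // The client names a file in the global file space; which datanodes
    // actually hold its blocks is an implementation detail of HDFS.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/data/events.log"); // illustrative path

    // Ask for three copies of each block; the namenode decides where they live.
    fs.setReplication(file, (short) 3);
  }
}
```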
But more interesting is wide virtualization as a global execution space. Grid and parallel processing clusters are one instance. Here execution proceeds in parallel across the available resources and, by virtue of that, is resistant to node failure.
Couple a distributed filesystem with a parallel processing execution environment and you end up with tools like Google's MapReduce, or the Hadoop framework.
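To make that concrete, here is a sketch of the canonical word-count job against Hadoop's Java MapReduce API (the standard tutorial example, lightly condensed; input and output paths come from the command line, and details vary across Hadoop versions). The developer writes a map and a reduce function; the framework spreads the work, and the data, across the cluster:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every token in a line; runs in parallel across the cluster.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word; the framework handles grouping and shuffling.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values)
        sum += value.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```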
If you consider this coupling a sort of operating system, you have yourself a virtual machine spanning many physical machines. Along with concepts like a Perpetual Computing Cluster, you have a very durable environment for the continuous processing of data, resilient to both failure and environmental change. If you will, a Wide Virtual Machine.
This, somewhat ironically, seems most easily achieved by running a Wide Virtual Machine over a cluster of Narrow Virtual Machines: Hadoop over EC2.
This is the kind of virtual world I would rather have my IT staff spawning one-off awk jobs in, or have business-dependent cron jobs pegged to. Not randomly distributed, under-documented Linux systems that, when they fail, take forgotten business processes down with them.
Obviously the next problem is how to partition (sandbox) "co-located" applications in such a global context. But the adoption of such systems within trusted networks is a worthy first step.
One obstacle to that adoption is that the programming model these systems expose to developers can be somewhat complex. This is where Cascading fits in. It abstracts away much of that complexity and offers developers a means to rapidly build scale-free data processing applications.
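To give a flavor of that, here is a minimal sketch of the same word count expressed as a Cascading pipe assembly, modeled on Cascading's published word-count example (the paths and the word regex are illustrative). The developer describes how tuples flow through operations; Cascading plans the assembly into the underlying MapReduce jobs and runs them:

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    // Taps bind the assembly to locations in the global file space.
    Tap source = new Hfs(new TextLine(new Fields("line")), args[0]);
    Tap sink = new Hfs(new TextLine(), args[1], SinkMode.REPLACE);

    // Describe the work as a flow of tuples, not as explicit map and reduce steps.
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+")); // split lines into words
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // Cascading plans the assembly into MapReduce jobs and runs them to completion.
    Flow flow = new FlowConnector(new Properties()).connect(source, sink, assembly);
    flow.complete();
  }
}
```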
If you have a large cluster, you obviously need to get your data out of it fast, but there is no value in that if you can't develop the applications that render that data in a timely manner. Or if you accrue so much complexity over time, from various one-off applications, that the whole system becomes brittle.