As of late, I've been using JSON libraries to render out complex data structures for values emitted from Map tasks in my Hadoop map/reduce jobs. During prototyping, this makes the task of coupling dependent jobs a bit easier and more resilient.
Consider the fact that if you are using Hadoop, your jobs are likely large in time and space. Hadoop helps with both by providing the map/reduce framework and a distributed file system.
Subsequently you now have a stronger need for loosely typed, future proofed data structures. Re-running jobs solely to update data structures can be a waste of time.
Rendering your map values as JSON text offers a simple way to store arbitrarily complex data so that 'future' applications can read and inspect the data without using a shared (and compatible) Writable data structure.
Obviously you can still make mistakes and need to re-generate your data. But I think there are fewer dimensions contributing to the set of possible mistakes that can be made.
Leave a comment