At the SF Hadoop User Group last night, a question was posed as to what factors justify the use of an Apache Hadoop cluster vs. traditional approaches.
The answer isn't black and white but can be broken down into three intertwined heuristics.
Hadoop is more likely justified:
- the larger the corpus of data needed to satisfy the business problem (big-data).
- the more complex the processes and algorithms required to satisfy the business problem.
- and, the more distinct business problems needing concurrent or overlapping access to the same corpus of data (multi-tenancy).
Thus Hadoop isn't strictly about huge data-sets, but also about absorbing complexity while maintaining scale.
FlightCaster, a Cascading user, doesn't have huge amounts of data, but they do have a very hard business problem, so Hadoop is completely justified for them.
Facebook, on the other hand, has huge data, but their tool of choice, Hive, doesn't encourage solutions to complex problems by virtue of being SQL based. Hive was initially used to extract small data-sets from the cluster for use by other systems or custom Hadoop jobs.