Recently in Technical Category
At the SF Hadoop User Group last night, a question was posed as to what factors justify the use of an Apache Hadoop cluster vs. traditional approaches.
The answer isn't black and white but can be broken down into three intertwined heuristics.
Hadoop is more likely justified:
- the larger the corpus of data needed to satisfy the business problem (big-data).
- the more complex the processes and algorithms required to satisfy the business problem.
- and the more distinct business problems that need concurrent or overlapping access to the same corpus of data (multi-tenancy).
Thus Hadoop isn't strictly about huge data-sets; it's also about absorbing complexity while maintaining scale.
FlightCaster, a Cascading user, doesn't have huge amounts of data, but they do have a very hard business problem, and Hadoop for them is completely justified.
Facebook, on the other hand, has huge data, but their tool of choice, Hive, doesn't encourage solutions to complex problems by virtue of being SQL-based. Hive was initially used to extract smaller data-sets from the cluster for use by other systems or by custom Hadoop jobs.
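To make the contrast concrete, here is a minimal sketch of what a Cascading pipe assembly looks like. This is the canonical word count against the Cascading 1.x API of this era, not FlightCaster's actual flow; the class name, paths, and the whitespace-tokenizing regex are placeholders.

```java
import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.FlowConnector;
import cascading.operation.aggregator.Count;
import cascading.operation.regex.RegexGenerator;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.TextLine;
import cascading.tap.Hfs;
import cascading.tap.SinkMode;
import cascading.tap.Tap;
import cascading.tuple.Fields;

public class WordCountFlow {
  public static void main(String[] args) {
    String inputPath = args[0];   // placeholder HDFS path of raw text
    String outputPath = args[1];  // placeholder HDFS output path

    // taps bind the assembly to concrete HDFS locations
    Tap source = new Hfs(new TextLine(new Fields("line")), inputPath);
    Tap sink = new Hfs(new TextLine(), outputPath, SinkMode.REPLACE);

    // split each line into words, group by word, count each group
    Pipe assembly = new Pipe("wordcount");
    assembly = new Each(assembly, new Fields("line"),
        new RegexGenerator(new Fields("word"), "\\S+"));
    assembly = new GroupBy(assembly, new Fields("word"));
    assembly = new Every(assembly, new Count(new Fields("count")));

    // plan the assembly into one or more MapReduce jobs and run it
    Flow flow = new FlowConnector(new Properties())
        .connect(source, sink, assembly);
    flow.complete();
  }
}
```

The point isn't word count itself, but that taps, pipes, and operations compose in plain Java, so arbitrarily complex flows can be built up and planned into chains of MapReduce jobs, which is exactly the kind of complexity a single SQL statement resists.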
Click through for a couple of Hadoop-related videos.
A couple of interesting quotes from the Hadoop user list. Not indicative of anything in particular, but noteworthy.
Hadoop 0.17.0 is now generally available. This also means there are new scripts for managing EC2 clusters that take advantage of the new EC2 features, like 'availability zones', the new optimized kernels, 32- and 64-bit images, and Ganglia. It looks like Tom has already packaged new public AMIs as well. You can read about the changes here on the Hadoop Wiki EC2 page. Here also is the JIRA issue with the patches.