A 40-minute video of me butchering common patterns in MapReduce at Buzzwords this year.
Chris Wensel COMMON MAPREDUCE PATTERNS from ntc GmbH on Vimeo.
At the SF Hadoop User Group last night, a question was posed as to what factors justify the use of an Apache Hadoop cluster vs. traditional approaches.
The answer isn't black and white but can be broken down into three intertwined heuristics.
Hadoop is more likely justified,
Thus Hadoop isn't strictly about huge data-sets, but also about absorbing complexity while maintaining scale.
FlightCaster, a Cascading user, doesn't have huge amounts of data, but they do have a very hard business problem, and Hadoop for them is completely justified.
Facebook, on the other hand, has huge data, but their tool of choice, Hive, doesn't encourage solutions to complex problems by virtue of being syntax- and SQL-based. It was initially used to extract small data from the cluster for use by other systems or custom Hadoop jobs.
The new Strata Conference has just been announced with a Call for Proposals ending Sept 28. This new conference is on the 'business of data' and is the sister conference to Velocity. I'm excited to be a committee member.
Click through for a couple of Hadoop-related videos.
A couple of interesting quotes from the Hadoop user list. Not indicative of anything in particular, but noteworthy.
Hadoop 0.17.0 is now generally available. This also means there are new scripts for managing EC2 clusters using the new EC2 features like 'availability zones', the new optimized kernels, 32- and 64-bit images, and Ganglia. It also looks like Tom has already packaged new public AMIs as well. You can read about the changes here on the Hadoop Wiki EC2 page. Here also is the JIRA issue with the patches.
Thought I would share a few helpful hints to keep in mind when using EC2 and S3. Nothing mind-blowing here, just some things worthy of note to the beginner. All of them were born of fire while managing Cascading / Hadoop clusters.
Check out theinfo.org, it's "for people with large data sets".
A couple quick links worth sharing. First is an article in BusinessWeek discussing in part how Hadoop is entering the classroom, in Wisdom of Clouds. Second, Communications of the ACM has a brief perspective on MapReduce, in The Data Center Is The Computer.
Hoping to get myself a Python binary that runs on my Infrant NAS device, I built out a cross compiler on an EC2 instance and created an AMI for it. Now I have a Python SPARC binary compiled on Linux with the Infrant patches for glibc.