Recently in Tools Category

When is Hadoop Justified


At the SF Hadoop User Group last night, a question was posed as to what factors justify the use of an Apache Hadoop cluster vs. traditional approaches.

The answer isn't black and white but can be broken down into three intertwined heuristics.

Hadoop is more likely justified,

  • the larger the corpus of data needed to satisfy the business problem (big-data).
  • the more complex the processes and algorithms required to satisfy the business problem.
  • and, the more distinct business problems need concurrent or overlapping access to a corpus of data (multi-tenancy).

Thus Hadoop isn't strictly about huge data-sets, but also about absorbing complexity while maintaining scale.

FlightCaster, a Cascading user, doesn't have huge amounts of data, but they do have a very hard business problem, and Hadoop for them is completely justified.

Facebook, on the other hand, has huge data, but their tool of choice, Hive, doesn't encourage solutions to complex problems by virtue of being syntax- and SQL-based. It was initially used to extract small data sets from the cluster for use by other systems or custom Hadoop jobs.

From Secrets of BackType's Data Engineers:

Cascalog is one of their secret weapons, a Clojure-based query language for Hadoop that makes it simple for them to analyze their data in new ways. Inspired by the venerable Datalog, and built on top of Cascading, it allows you to write queries in Clojure and define even complex operations in simple code. Unlike alternatives like Pig or Hive, it's written within a general-purpose language, so there's no need for separate user-defined functions, but it's still a highly-structured way of defining queries.

It's worthy of note that Cascalog is a distant child of cascading-clojure, created and used by FlightCaster. FlightCaster was acquired this week.

Git - My Favorite Feature


I've been using Git for some months now. Even though it isn't natively supported in any of my tools, and the plugins are a bit buggy, it just really doesn't matter. And the reason is pretty simple.

DocBook Tools on GitHub


I've been doing a fair bit with DocBook recently. Both the Cascading User Guide and a section on Cascading, which I hope will be included in the upcoming Hadoop: The Definitive Guide, were written in DocBook. Unfortunately, finding a reasonable DocBook tool chain was difficult, so I had to adopt the Velocity DocBook Framework and make some modifications. I've published a draft of my efforts on GitHub: DocBook Framework and DocBook Template.

TeamCity + Amazon EBS


Within moments of checking my inbox and seeing that Amazon had finally released its Elastic Block Store for EC2, I jumped over to RightScale and saw they already had support implemented. Within a couple of hours, I had TeamCity installed on a volume, and I now have an on-demand continuous integration server for remote testing.

Hadoop vs GridGain


Thought I would quickly post this link to the Hadoop wiki comparing GridGain to Hadoop. In summary, Hadoop was designed for large data applications. GridGain is simply a re-imagining of tuple-spaces with constraints on available JVM memory (as implied by the comparison). Hopefully I'll post my own opinions at a later date. [Update] A reaction to the comparison has been posted. [Update Sept 2008] GridGain isn't even a data-grid, but a means to distribute apps into running remote kernels, with a bit of Spring-like pluggability. A comparison is disingenuous.

There is much buzz regarding the recently announced feature additions to EC2, namely Elastic IP Addresses and Availability Zones. But if you look closely, you will see that four new features were added. The other two are User Selectable Kernels and New Public AMIs and Kernels (32-bit and 64-bit).

Notes On Using EC2 and S3


Thought I would share a few helpful hints to keep in mind when using EC2 and S3. Nothing mind-blowing here, just some things worthy of note to the beginner, all of them born of fire while managing Cascading / Hadoop clusters.

Clouds Gathering


I'm not a fan of the name 'Cloud Computing'. As a metaphor it dissipates rather quickly. Nevertheless, IBM has recently announced their new cloud initiative, Blue Cloud. And Sun will announce the private beta of theirs, Project Caroline, tomorrow (Feb 21). Competition with Amazon is welcome. But more welcome is direct support for Hadoop in both of these infrastructures. And, by extension, more reasons to use Cascading.

Hadoop Machine Learning


Looks like a new Lucene sub-project named Mahout has just been announced.