Obligatory EC2 Remarks, Hadoop Clusters


There is much buzz about the recently announced feature additions to EC2, mostly about Elastic IP Addresses and Availability Zones. But if you look closely, you will see that four new features were added. The other two are User Selectable Kernels and New Public AMIs and Kernels (32-bit and 64-bit).

Why does this matter?

Well, the newer, recompiled kernels are better tuned for network throughput at higher CPU loads. Couple that with pinning a cluster to a single availability zone, and Hadoop scales much better.

Before using the new kernel and pinning my clusters to a specific availability zone, I was unable to scale my clusters past 40 nodes without running into various HDFS networking issues. Afterwards I was able to hit 50 nodes without a single connection/write failure or retry.

For grins I may try a larger cluster with my test datasets, but doubling the number of nodes in a Hadoop cluster doesn't necessarily double the speed, especially for complex workloads with a reasonable number of dependencies between jobs (currently Cascading is rendering out 43 map/reduce jobs, from 31 Flows, managed under a single Cascade).
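A rough way to see why is Amdahl's law, which is a simplifying model I am assuming here, not something measured on these clusters. The sketch below treats some fraction of the workload as perfectly parallel and the rest (job dependencies, setup, stragglers) as serial. The function name and the 95% figure are hypothetical, chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, nodes: int) -> float:
    """Ideal speedup over a single node under Amdahl's law.

    parallel_fraction is the share of the work that scales with node count;
    the remainder is assumed to run serially no matter how many nodes exist.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the work parallelizes cleanly.
    for nodes in (25, 50, 100):
        print(f"{nodes:>3} nodes: {amdahl_speedup(0.95, nodes):.1f}x")
```

With these made-up numbers, going from 50 to 100 nodes raises the ideal speedup by only about 16%, which is why watching actual utilization is more useful than simply adding nodes.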

If you are interested in using Hadoop with the new EC2 features, see my Hadoop contrib/ec2 patch, HADOOP-2410. It builds an AMI with ganglia configured, so you can see your cluster utilization firsthand and decide whether to increase or decrease your cluster size.

> hadoop-ec2
Usage: hadoop-ec2 COMMAND
where COMMAND is one of:
  list                                 list all running Hadoop EC2 clusters
  launch-cluster <group> <num slaves>  launch a cluster of Hadoop EC2 instances - launch-master then launch-slaves
  launch-master  <group>               launch or find a cluster master
  launch-slaves  <group> <num slaves>  launch the cluster slaves
  terminate-cluster                    terminate all Hadoop EC2 instances
  login  <group|instance id>           login to the master node of the Hadoop EC2 cluster
  screen <group|instance id>           start or attach 'screen' on the master node of the Hadoop EC2 cluster
  proxy  <group|instance id>           start a socks proxy on localhost:6666 (use w/foxyproxy)
  push   <group> <file>                scp a file to the master node of the Hadoop EC2 cluster
  <shell cmd> <group|instance id>      execute any command remotely on the master
  create-image                         create a Hadoop AMI
Using the 'proxy' command will let you see the ganglia reports. See my notes on EC2 for more details.
