There is much buzz regarding the recently announced feature additions to EC2, namely Elastic IP Addresses and Availability Zones. But if you look closely, you will see that four new features were added. The other two are User Selectable Kernels and New Public AMIs and Kernels (32-bit and 64-bit).
Why does this matter?
Well, the modern, recompiled kernels are better tuned for network throughput at higher CPU loads. Coupled with pinning a cluster to a specific availability zone, this lets Hadoop scale much better.
Before using the new kernel and pinning my clusters to a specific availability zone, I was unable to scale my clusters past 40 nodes without various HDFS networking issues. Afterwards, I was able to hit 50 nodes without a single connection/write failure or retry.
For grins I may try a larger cluster with my test datasets, but doubling the number of nodes in a Hadoop cluster doesn't necessarily double the speed, especially for complex workloads with a reasonable number of dependencies between jobs (currently Cascading is rendering out 43 map/reduce jobs, from 31 Flows, managed under a single Cascade).
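One rough way to see why more nodes give diminishing returns is Amdahl's law, a textbook model rather than anything measured on this cluster. The sketch below assumes a hypothetical 5% of the work is serialized (for example, jobs waiting on upstream jobs); both the `amdahl_speedup` helper and the 5% figure are illustrative assumptions, not numbers from these runs.

```python
def amdahl_speedup(nodes: int, serial_fraction: float) -> float:
    """Ideal speedup over a single node under Amdahl's law.

    serial_fraction is the share of the work that cannot be spread
    across nodes (e.g. jobs blocked on their dependencies).
    """
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)


# Hypothetical value for illustration only, not measured on a real cluster.
SERIAL_FRACTION = 0.05

speedup_50 = amdahl_speedup(50, SERIAL_FRACTION)
speedup_100 = amdahl_speedup(100, SERIAL_FRACTION)

# Relative gain from doubling the cluster from 50 to 100 nodes.
gain = speedup_100 / speedup_50

print(f"50 nodes:  {speedup_50:.1f}x over one node")
print(f"100 nodes: {speedup_100:.1f}x over one node")
print(f"doubling the cluster yields about {gain:.2f}x")
```

Under that assumption, going from 50 to 100 nodes yields only about a 1.16x improvement, not 2x. Real Hadoop workloads will differ, since scheduling, data locality, and network behavior are not captured by this model.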
If you are interested in using Hadoop with the new EC2 features, see my Hadoop contrib/ec2 patch, HADOOP-2410. It builds an AMI with ganglia configured, so you can see your cluster utilization firsthand and decide whether to increase or decrease your cluster size.
> hadoop-ec2
Usage: hadoop-ec2 COMMAND
where COMMAND is one of:
  list                                 list all running Hadoop EC2 clusters
  launch-cluster <group> <num slaves>  launch a cluster of Hadoop EC2 instances - launch-master then launch-slaves
  launch-master <group>                launch or find a cluster master
  launch-slaves <group> <num slaves>   launch the cluster slaves
  terminate-cluster                    terminate all Hadoop EC2 instances
  login <group|instance id>            login to the master node of the Hadoop EC2 cluster
  screen <group|instance id>           start or attach 'screen' on the master node of the Hadoop EC2 cluster
  proxy <group|instance id>            start a socks proxy on localhost:6666 (use w/foxyproxy)
  push <group> <file>                  scp a file to the master node of the Hadoop EC2 cluster
  <shell cmd> <group|instance id>      execute any command remotely on the master
  create-image                         create a Hadoop AMI

Using the 'proxy' command will let you see the ganglia reports. See my notes on EC2 for more details.