Posted by: JohnMWillis
amazon, aws, elastic map reduce, hadoop
Amazon has launched a new web service called Elastic Map Reduce that provides Hadoop as a hosted service. Hadoop is a Java-based framework that implements Map Reduce, a programming model that lets a program break a job up into hundreds or even thousands of separate parallel processes. The idea is that you can take a simple task (like counting the words in a book), break it up into many parts that run at the same time (i.e., the Map), then collect all of the partial results back into summary counts (i.e., the Reduce). This lets a programmer process extremely large data sets in a timely manner. The Map Reduce model was popularized by Google's in-house implementation, and Hadoop itself is used by companies like Yahoo, AOL, IBM, Facebook, and Last.fm, to name a few.
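To make the word-count example above concrete, here is a minimal single-process sketch of the Map Reduce idea in Python. This simulates the model, not Hadoop itself; on a real cluster each map call would run as a separate task on a separate machine.

```python
# A toy simulation of Map Reduce for word counting.
# The "map" step turns each chunk of text into (word, 1) pairs,
# and the "reduce" step sums the counts for each word.
from collections import defaultdict

def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of the book."""
    return [(word.lower(), 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Sum the counts emitted by all of the map tasks."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# In Hadoop, each chunk would be handled by its own parallel task.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = []
for chunk in chunks:  # these map calls are independent of one another
    pairs.extend(map_phase(chunk))
counts = reduce_phase(pairs)
```

The key property is that the map calls share no state, which is what makes it safe to fan them out across thousands of machines before the reduce step gathers everything back up.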
The new Amazon Elastic Map Reduce service is another pay-as-you-go service, priced from $0.015 per hour up to $0.12 per hour per instance. This is an additional charge on top of your standard EC2 and S3 usage. For example, if your Hadoop job reads data from S3 and starts up 4 EC2 instances, you will be charged the normal EC2 and S3 rates plus the additional Elastic Map Reduce charge. In effect, Amazon is charging you for the setup and administration of Hadoop – as a service.

The Elastic Map Reduce service is extremely simple to set up and run. Basically, you upload your application and any data you wish to process to an S3 bucket. You then create a job that specifies the S3 locations of your input and output data sets and your Map Reduce program. The current implementation supports writing Hadoop Map Reduce programs in Java, Ruby, Perl, Python, PHP, R, and C++. You also configure the number of EC2 instances you want to run for the Map Reduce job, and you can add advanced arguments and use more complex processing methods if you choose. The AWS Management Console has been updated with a new tab for the Elastic Map Reduce service.

This new service hides all of the system administration complexity of setting up a Hadoop environment, which can be considerable. A Hadoop setup across multiple systems is not a simple task. Hadoop runs as a cluster of machines with its own file system, called HDFS (the Hadoop Distributed File System). The cluster has a number of worker servers called DataNodes, which store the data blocks, and a master server called the NameNode, which keeps track of where the data lives. Here is a diagram of an HDFS environment:
This morning I had an opportunity to test out the new Elastic Map Reduce service using the Amazon Management Console. I set up one of their sample jobs to do a word-count Map Reduce. There is an excellent video from Amazon on how to get started here. Here are some screen shots of my testing this morning:
I also had an opportunity to speak to Don Brown, a local Atlanta entrepreneur and founder of twitpay.me, about the significance of the new Amazon Elastic Map Reduce service. Don pointed out two significant aspects of the announcement. First, by creating this new service, Amazon has put a higher level of significance on the use of Map Reduce and Hadoop. Organizations exploring new techniques for processing large data sets, or new ways to do data warehousing, will start hearing about Map Reduce and Hadoop, and with Amazon's new service their Google searches will show Amazon as a leading player in this space. That adds instant credibility to the technology. In Don's words, "Amazon is sort of like the new IBM when it comes to cloud computing . . . you can't go wrong with Amazon!" All of this should speed up the adoption of Map Reduce as a kind of new data warehouse technology, which Don and I both agree is a good thing. Second, Don suggests that at $0.015 to $0.12 per hour, the administration Amazon takes off an organization's hands is a bargain. He has done a number of Hadoop consulting engagements, and says hiring a consultant to do a Hadoop setup is not cheap.
We both agree, however, that despite all the hoopla around this announcement, it is still hard to implement a Hadoop solution, and there are not that many experts. So, as exciting as the new Amazon announcement is, it still only gets you halfway there. Learning to develop code with Map Reduce and Hadoop requires a completely different way of thinking than traditional programming paradigms, and most traditional programming shops will have to re-tool to take advantage of it. On the upside, freshman CS students at the University of California, Berkeley are now required to learn Hadoop in their first year. All in all, this announcement, in my opinion, puts Amazon in a class of its own when it comes to "Cloud Computing."