Hadoop Database for Big Data – Part III
Wide column stores like Hadoop with KVP support are relevant for workloads that are mostly insert-only: crawling and indexing (both on the web and in enterprises), blogs and wikis, Facebook-like applications, search-based retrieval (as opposed to query-based retrieval), and batch-oriented or in-memory aggregations and computations.
Hadoop provides a simplified programming model that allows users to quickly write and test distributed applications. It automatically and efficiently distributes data and work across machines, exploiting the underlying parallelism of the CPU cores.
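The programming model referred to here is MapReduce: the user supplies only a map function and a reduce function, and the framework handles distribution. The sketch below simulates that model in a single JVM to show the shape of the two phases; the class name and helper methods are hypothetical, and a real Hadoop job would instead extend `org.apache.hadoop.mapreduce.Mapper` and `Reducer` and run across many nodes.

```java
import java.util.*;

// A minimal single-JVM sketch of the MapReduce programming model
// (hypothetical simplification; real jobs run distributed on Hadoop).
public class WordCountSketch {

    // "map" phase: each input line is split into (word, 1) pairs,
    // grouped here by key as the framework's shuffle/sort would do.
    static Map<String, List<Integer>> map(List<String> lines) {
        Map<String, List<Integer>> shuffled = new HashMap<>();
        for (String line : lines) {
            for (String word : line.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) {
                    shuffled.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
                }
            }
        }
        return shuffled;
    }

    // "reduce" phase: all values collected for one key are summed.
    static Map<String, Integer> reduce(Map<String, List<Integer>> shuffled) {
        Map<String, Integer> counts = new TreeMap<>();
        shuffled.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = List.of("big data big clusters", "big data");
        System.out.println(reduce(map(input)));
        // {big=3, clusters=1, data=2}
    }
}
```

Because map calls are independent per input split and reduce calls are independent per key, the framework can parallelize both phases freely, which is what makes the model attractive for large clusters.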
Hadoop provides no security model and no safeguards against maliciously inserted data. It is, however, designed to handle hardware failure and data congestion very robustly.
Like Bigtable, Hadoop uses a master-slave model in which the master can become a single point of failure. Since version 0.20.0, HBase supports multiple Masters to provide higher availability. This multi-master feature does not add cooperating Masters, however: there is still only one working Master while the others wait as backups. If you start 20 Masters, only one is active and the remaining 19 wait for it to die. The failover usually takes zookeeper.session.timeout plus a couple of seconds.
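In practice, standby Masters can be declared in HBase's conf/backup-masters file, and the failover delay is governed by the zookeeper.session.timeout property in hbase-site.xml. A sketch of both (the hostnames and the timeout value are illustrative, not recommendations):

```
# conf/backup-masters -- one standby Master host per line (hypothetical hosts)
master2.example.com
master3.example.com
```

```xml
<!-- hbase-site.xml: a shorter session timeout means faster Master failover,
     at the cost of more false positives under load or GC pauses -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>30000</value> <!-- milliseconds -->
</property>
```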
An ideal highly available cluster would have five or more dedicated Zookeeper servers, two to three dedicated Master servers (one per rack, for example), one very reliable Namenode/Job Tracker server with redundant hardware, and the usual Datanode/Task Tracker/Region Server stack on the remaining machines.
Yahoo clusters can be used for scaling tests that support the development of Hadoop on larger clusters. The Yahoo developer network also provides deep insight into best practices for Hadoop: http://developer.yahoo.com/blogs/hadoop/posts/2010/08/apache_hadoop_best_practices_a/
Other points on Hadoop
- Hadoop was created by Doug Cutting, who named it after his son’s stuffed elephant.
- Yahoo is the chief contributor to Hadoop.
- In 2008, Hadoop won the Terabyte Sort Benchmark (the first time for a Java or an open source program). One of Yahoo's Hadoop clusters sorted 1 terabyte of data in 209 seconds, beating the previous record of 297 seconds.
- Hadoop is in production use at A9.com, AOL, Booz Allen Hamilton, EHarmony, eBay, Facebook, Fox Interactive Media, Freebase, IBM, ImageShack, ISI, Joost, Last.fm, LinkedIn, Meebo, Metaweb, The New York Times, Ning, Powerset (part of Microsoft now), Rackspace, StumbleUpon, Twitter and Veoh.
- A list of Hadoop use cases is available at http://wiki.apache.org/hadoop/PoweredBy.
- The IBM Distribution of Apache Hadoop is a joint project between the IBM Software Group Emerging Technology team and the Information Management analytics development team.
- IBM’s new portfolio of products – IBM InfoSphere BigInSights – for the analysis and visualization of Big Data is powered by Apache Hadoop.
- IBM Big Sheets is a non-relational data-analysis tool that provides a spreadsheet-style analysis front end for Hadoop.
- Cloudera’s Distribution for Hadoop (CDH) is the commercial offering that provides additional features such as job scheduling, workflow sequencing, and the ability to control streaming data sources.
- Cloudera Enterprise is aimed at helping companies put Hadoop into production. The offering consists of CDH plus a superset of management tools and support.