The Apache Hadoop project – sometimes dubbed as “Bigtable clone” – is open-source software framework for reliable, scalable, distributed computing. Hadoop is the only NoSQL implementation that has the most adoption today. IBM uses as well as distributes Hadoop. IBM’s new portfolio of products, M2 (the enterprise data analysis platform), InfoSphere BigInSights (Visualization of Big Data) are powered by Hadoop.
- HDFS – A distributed file system (similar to GFS) that provides high throughput access to application data
- HBase – A scalable, distribution database (similar to Bigtable) that supports structured data storage for large tables.
- MapReduce – A software framework for distributed processing of large data sets on compute clusters
- Hadoop Common – the common utilities that support the Hadoop subprojects
- Chukwa – A data collection system for managing large distributed systems.
- Pig – A high level data flow language and execution framework for parallel computation (that produces a sequence of MapReduce programs)
- ZooKeeper – A high performance co-ordination service for distributed applications.
Hadoop, in relation to other distributed systems, provides the benefit of Flat Scalability curve. Once a Hadoop program is written and tested for a small number of nodes (say ten), very little work is required to run the same program on a much larger number of nodes. The underlying Hadoop platform will manage the data and hardware resources and provide dependable performance growth proportionate to the number of machines available.
Executing Hadoop on a limited amount of data on a small number of nodes may not be effective as the overhead involved in starting Hadoop programs is relatively high. But as the number of machines increase, the price paid in performance and engineering effort increases linearly (unlike other distributed systems which scale non-linearly).
Hadoop Distributed File System (HDFS)
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware (typically running GNU/Linux). Therefore, detection of hardware faults and quick, automatic recovery from them is a core architectural goal of HDFS.
HDFS assumes simple data coherency issues and enables high throughput data access. Typical HDFS applications need a write-once-read-many access model for files. A file once created, written, and closed need not be changed. There is a plan to support appending-writes to files in the future.
HDFS is designed more for batch processing rather than interactive use by users. The emphasis is on high throughput of data access rather than low latency of data access. HDFS relaxes a few POSIX requirements to enable streaming access to file system data.
A HDFS cluster primarily consists of a NameNode that manages the file system metadata and DataNodes that store the actual data.
An HDFS cluster – has a master / slave architecture – and consists of:
- single NameNode – a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes.
- number of DataNodes – usually one per node in the cluster, which manage storage attached to the nodes that they run on. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. The DataNodes perform block creation, deletion, and replication upon instruction from the NameNode. The DataNodes are also responsible for serving read and write requests from the file system’s clients.
HDFS can be accessed from applications in many different ways. A Java API and a C language wrapper for this Java API are available. In addition, an HTTP browser can also be used to browse the files of an HDFS instance. Work is in progress to expose HDFS through the WebDAV protocol.
HDFS provides interfaces for applications to move themselves closer to where the data is located as it minimizes network congestion and increases the overall throughput of the system.
HDFS Data Organization
HDFS supports write-once-read-many semantics on files. An HDFS file is typically chopped up into 64 MB chunks, and if possible, each chunk is placed on a different DataNode.
HDFS uses client side caching to improve performance. Initially the HDFS client caches the file into a temporary local file to which the application writes are transparently redirected. Once the data reaches one HDFS block size, the NameNode is contacted. The NameNode inserts the file name into the file system hierarchy, allocates a data block and respond to the client with the identity of the DataNode and the destination data block. The client directly flushes the block of data from the local temporary file to the specified DataNode. When the file is closed, the remaining data is transferred to the DataNode and the client informs the NameNode. The NameNode commits the file creation into a persistent store. Hence the file is lost if the NameNode dies before the file is closed.
HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. The block size and replication factor are configurable per file. Files in HDFS are write-once and have strictly one writer at any time.
Typically large Hadoop clusters are arranged in racks and network traffic between different nodes within the same rack is much more desirable than network traffic across the racks. HDFS uses a rack-aware replica placement policy to improve data reliability, availability and network bandwidth utilization.
The NameNode periodically receives the following from the DataNodes:
- a Heartbeat implying that the DataNode is functioning properly and
- a Blockreport containing the list of all blocks on a DataNode
The NameNode determines the rack id of each DataNode. Using a simple (but non-optimal) policy, it places the replicas on unique racks preventing losing data when an entire rack fails and allows use of bandwidth from multiple racks when reading data.
Evenly distributing replicas across racks in the cluster makes it easy to balance load on component failure. But this increases the cost of writes as a write needs to transfer blocks to multiple racks. So if the replication factor is three, HDFS’s policy is to put one replica in the local rack, another on a node in a remote rack and the last one on a different node in the same remote rack. So instead of three racks, it now involves only two racks cutting the inter-rack write traffic. As the chance of rack failure is far less than the node failure, this does not impact data reliability and availability guarantees.
The typical placement policy can be stated as “One third of replicas are on one node, one third of replicas are on one rack, and the other third are evenly distributed across the remaining racks”. This policy improves write performance without compromising data reliability or read performance.
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader (in the order of same node, same rack, local data center and remote replica).
Due to multiple competing considerations – placing replicas on the same node as application, spreading replicas across racks, preferring same rack as the node, spreading data uniformly across the cluster, nodes added to the cluster – data might not be uniformly placed across the DataNodes. HDFS provides a tool that analyzes block placement and allows the administrator to rebalance data across the DataNodes. Rebalancing design is well described in https://issues.apache.org/jira/secure/attachment/12368261/RebalanceDesign6.pdf.
As described earlier, when a client is writing data it is first written to a local file. When the local file accumulates a full block of user data, the client retrieves a list of DataNodes (depending on the replication factor) from the NameNode. The client flushes the data block to the first DataNode. The first DataNode starts receiving the data in small portion (4 KB), write it to the local repository and transfers that portion to the second DataNode. The second DataNode in turn writes to its local repository and transfers to the third DataNode. Thus, the data is pipelined from one DataNode to the next for making replicas.
NameNode detects the absence of a Heartbeat message from a DataNode and marks such DataNodes as dead and does not forward any new IO requests to them. The NameNode constantly tracks the block to be replicated and initiates re-replication whenever necessary. Re-replication may be needed for various reasons: a DataNode becoming unavailable, replica getting corrupted, a disk failure on a DataNode, or replication factor of a file getting increased.
A block of data fetched from a DataNode may be corrupted because of faults in a storage device, network faults or buggy software. HDFS can detect this using the checksum checking on the contents of the file. When a client creates an HDFS file, it computes a checksum of each block and stores these in a separate hidden file in the same namespace. When a block is fetched, the client verifies that the data it receives matches the checksum already stored. If it is doesn’t match, the data is assumed to be corrupted. The client can opt to retrieve that block from another replica.