MapReduce – Programming Paradigm for massively parallel computations
Most NOSQL database use the MapReduce programming model developed at Google, aimed at implementing large scale search and text processing on massive web data stored using BigTable and GFS distributed file system. MapReduce in turn relies on Key/Value pair concept.
Typically Key-value pairs has an important restriction, namely being able to access results by key alone compromising on the rapid retrieval capability and ad hoc reporting.
It is important to keep this restriction in mind while look at the possibility of any of the NoSQL databases. Nevertheless, MapReduce has proven its capability for handling large volumes of data using parallel computations on thousands of commodity machines. When completed, MapReduce was used to completely regenerate Google’s index of the World Wide Web, and replaced the old ad hoc programs that updated the index and ran the various analyses. MapReduce is also used across a wide range of domains – large scale machine learning, clustering for Google News, extraction of web page properties for new products (like localized search) and experiments, extraction of data to produce reports for popular queries (Google Zeitgeist) – within Google.
Google keeps implementing special-purpose computations that process large amounts of raw data (crawled document, web request logs etc.) and derive various data (inverted indices, graph structure representations, summaries of pages crawled, most frequent queries at a given time etc.). These computations are actually straightforward and the complexity is due to the facts that
- the input data is very large and
- the results have to be obtained in a reasonable amount of time and
- the ability to use hundreds or thousands of machines in parallel is required to make it cost effective.
Parallel computing has been there for decades mainly used in scientific computing. Over years, the databases started exploiting the parallel computing capabilities of multi-processor servers as well computing power of clusters connected via high-speed network. Shared memory, shared disk and shared nothing are the three significant parallel database architectures. All the three traditional relational databases – Oracle, DB2 and SQL Server – support parallelism in various ways. Similarly specialized systems designed for data warehousing such as Teradata, Vertica and Netezza have parallel architectures involving closely coupled clusters with a few dozen processors.
With a limited number of processors involved in parallel computing, it was safe to assume that the processors would not fail during computation and in case of failure the control could be transferred to a ‘standby system’ and in the worst case of failure the computation can be restarted. But in case of Google a typical MapReduce computation processes many terabytes of data on thousands of commodity machines, and so the MapReduce that evolved to meet their needs included fault tolerant implementation that was transparent to the application developers.
MapReduce framework is inspired by the map and reduce primitives available in Lisp and other functional languages. The computations involve two operations – a map operation to each logical record in the input to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key to get the appropriate derived data.
The advantage of MapReduce is that it allows for distributed processing of the map and reduction operations (number of processors at each phase is configurable). If there are M mapper processors, each mapper processor typically handles 1/M th of the input, provided each mapping operation is independent of the other and hence can be performed in parallel. Similarly the reduction phase can be handled by a set of reducers and all that is required is that all outputs of the map operation which share the same key are presented to the same reducer, at the same time. Each Reduce invocation typically produces just zero or one output value. The intermediate keys are supplied to the reduce function via an iterator which enables handling of large list of values.
While this process appears inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. This approach also enables recovering from partial failure of servers or storage during the operation, if one mapper or reducer fails, the work can be rescheduled assuming the input data is still available.
The standard example that is used in conveying this Map and Reduce operations is counting the number of occurrences of each word in a large collection of documents. The user written map function would emit each word plus an associated count of occurrences and the reduce functions sums together all counts emitted for a particular word.
The user writes the code in a mapreduce specification object with the names of the input, output and optional tuning parameters. Then the user invokes the MapReduce function, passing the specification object. The user’s code get linked together with the MapReduce library. MapReduce libraries have been written in C++, C#, Erlang, Java, Python, Ruby, F#, R and other programming languages.
Another example of MapReduce usage is Reverse Web-link Graph, where the map function outputs (target, source) pairs for each link to the target URL found in a page named source. The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).
Conceptually the map and reduce functions supplied by the user have associated types:
map (k1,v1) → list(k2,v2)
reduce (k2,list(v2)) → list(v2)
Range of applications that MapReduce is applied for include: “distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation…”.
MapReduce uses the master-worker node approach for distributing the map and reduce functions. MapReduce’s stable inputs and outputs are usually stored in a distributed file system. The transient data is usually stored on local disk and fetched remotely by the reducers. The master node attempts to schedule reduce operations on the same node, or in the same rack as the node holding the data being operated on thereby conserving the data center’s network bandwidth.
The master distributes the functions to the worker nodes and maintains the regular heartbeat communications with each worker. The master store the state (idle, in-progress, or completed) for each map task and reduce task along with the identity of the worker machine. If no response is received from a worker in a certain amount of time, the master marks the worker as failed. Any map tasks completed by the worker are reset back to their initial idle state (as the completed task outputs are stored on the local disks of the failed machine and hence not accessible), and therefore become eligible for scheduling on other workers. Similarly, any map task or reduce task in progress on a failed worker is also reset to idle and becomes eligible for rescheduling.
Google claims that the programmers find implementation of MapReduce as easy to use and hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.