MapReduce replacing complex SQL queries
Many NoSQL systems, such as CouchDB, MongoDB, Hadoop and Redis, support MapReduce, the programming paradigm for parallel computing pioneered (and also patented!) by Google. Support for MapReduce in traditional database products is increasing every day. MapReduce is fast turning out to be the common factor between the traditional database products and the new NoSQL movement.
Enterprises are exploring with keen interest whether the MapReduce model can be applied to large-scale, fault-tolerant computations in suitable applications. Hadoop is an open source implementation of the MapReduce model and is available on pre-packaged AMIs in the Amazon EC2 cloud platform.
Google points out that MapReduce is a powerful tool that can be applied for a variety of purposes including distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning and statistical machine translation. A much longer list of MapReduce applications is available at http://www.dbms2.com/2008/08/26/known-applications-of-mapreduce/.
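To make one item on that list concrete, here is a minimal sketch of inverted index construction written in the map/reduce style. It runs in a single Python process, so it only illustrates the shape of the computation, not the distribution or fault tolerance a real framework provides. The names map_doc, reduce_postings and run_mapreduce, as well as the sample documents, are invented for this illustration and do not come from Google's or Hadoop's APIs.

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map step: emit one (word, doc_id) pair per distinct word in a document."""
    for word in set(text.lower().split()):
        yield word, doc_id

def reduce_postings(word, doc_ids):
    """Reduce step: collapse all document ids seen for one word into a sorted list."""
    return word, sorted(set(doc_ids))

def run_mapreduce(inputs, mapper, reducer):
    """Single-process stand-in for a framework: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return dict(reducer(k, vs) for k, vs in groups.items())

# Hypothetical sample documents: (document id, text)
docs = [("d1", "map reduce scales"), ("d2", "SQL and map reduce")]
index = run_mapreduce(docs, map_doc, reduce_postings)
```

In a real deployment the grouping step inside run_mapreduce is what the framework performs across machines; the developer typically supplies only the two small functions.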
Traditional databases are providing MapReduce support in addition to the standard SQL interface. While DBAs are expected to continue using SQL, more and more developers are using MapReduce instead of complex SQL queries. As Curt Monash, President of Monash Research, points out, “Companies that are integrating MapReduce and SQL are increasing its applicability and giving developers and DBAs the ability to work together on a common parallel data processing infrastructure.”
Because MapReduce excels at aggregation and computation, data warehousing and business intelligence have been the first areas to adopt it. A very interesting article on how MapReduce is relevant to data warehousing products is available at http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing/.
Greenplum, the leading data warehousing and analytics software, is one of the early adopters of MapReduce among traditional database products. Greenplum’s parallel dataflow engine (the heart of the Greenplum database) can execute both MapReduce and SQL. A primary benefit of this capability is that customers can combine SQL queries and MapReduce programs into unified tasks that are executed in parallel across hundreds or thousands of cores. This combination of MapReduce for programmers and SQL for DBAs is much appreciated by clients, who point out that with MapReduce it is now possible to replace complex SQL queries with a few lines of Perl or Python code.
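As a rough illustration of what "a few lines of Python" can look like, the sketch below expresses a simple aggregation, the equivalent of SELECT region, SUM(amount) FROM sales GROUP BY region, as a map step and a reduce step. It is plain single-process Python and does not use Greenplum's MapReduce interface; the sales rows and the function names map_sale, reduce_total and total_by_region are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows of a "sales" table: (region, amount)
sales = [("east", 120), ("west", 80), ("east", 30), ("north", 50), ("west", 20)]

def map_sale(row):
    """Map step: turn one row into a (key, value) pair keyed by region."""
    region, amount = row
    return region, amount

def reduce_total(region, amounts):
    """Reduce step: sum every amount that was emitted for one region."""
    return region, sum(amounts)

def total_by_region(rows):
    """Local stand-in for the framework: map, sort by key (shuffle), reduce."""
    # Sorting brings equal keys together so groupby sees each region as one run.
    mapped = sorted((map_sale(r) for r in rows), key=itemgetter(0))
    return dict(
        reduce_total(region, (amount for _, amount in pairs))
        for region, pairs in groupby(mapped, key=itemgetter(0))
    )

totals = total_by_region(sales)
```

A query this simple is easier to write in SQL; the map/reduce form tends to pay off when the per-row or per-group logic is awkward to express declaratively, since each step is an ordinary function.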
Aster Data, another leading data warehousing product, provides in-database MapReduce, also referred to as SQL/MR. Aster Data nCluster now includes a column data store with a unified SQL-MapReduce framework on a hybrid row and column massively parallel processing (MPP) database. Aster Data’s suite includes more than 1,000 MapReduce-ready analytic functions that promise high-performance analytics on large data volumes, where data can be stored in either a row or a column format. According to Aster Data, with hybrid row and column stores and SQL/MR support, the sky is really the limit for anyone to build powerful analytic apps.
IBM’s new portfolio of products, M2 (the enterprise data analysis platform) and InfoSphere BigInsights (visualization of Big Data), is powered by Hadoop MapReduce. IBM is also expected to improve Netezza by including a Hadoop MapReduce distribution for the parallel processing of large amounts of information or complex data types on hardware clusters.
The Oracle website provides a writeup at http://www.oracle.com/technetwork/database/features/bi-datawarehousing/twp-indatabase-mapreduce-128831.pdf on how to implement MapReduce programs within the Oracle database using parallel pipelined table functions and parallel operations.
A Microsoft Research project code-named Dryad is investigating programming models for writing parallel and distributed programs that scale from a small cluster to a large data center. According to Microsoft, Dryad is a general-purpose distributed computing engine, more flexible than MapReduce or Hadoop, designed to simplify the task of implementing distributed applications on clusters of Windows computers.
Skynet is an open source Ruby implementation of Google’s MapReduce framework, created at Geni. Skynet is an adaptive, self-upgrading, fault-tolerant, and fully distributed system with no single point of failure.
FileMap provides file-based MapReduce support. FileMap is a lightweight system for applying Unix-style file processing tools to large amounts of data stored in files.
Amazon Elastic MapReduce is a web service that enables businesses, researchers, data analysts, and developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop framework running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3).
Disco is an open source distributed computing framework based on the MapReduce paradigm. Disco includes tools to index billions of data points and query them in real time. Disco can be installed on a laptop, a cluster or a cloud.
MapReduce is fast becoming a skill that most developers need. Whatever database your enterprise runs, there are several ways to pick up that skill and to solve a complex problem in the process.