Key-value Pair (KVP) – the backbone of NoSQL
Most NoSQL databases – BigTable, Hadoop, CouchDB, SimpleDB, memcached, Redis – use the key-value pair (KVP) concept. A key-value pair (KVP) is a set of two linked data items:
1. a key (say account number or part number), which is a unique identifier for some item of data, and
2. the value, which is either the data that is identified or a pointer to the location of that data.
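To make the idea concrete, here is a minimal in-memory sketch of a key-value store in Python (the account numbers and values are made up for illustration):

```python
# A minimal in-memory key-value store: each key (e.g. an account
# number) maps to exactly one value, and lookups happen by key alone.
store = {}

def put(key, value):
    store[key] = value

def get(key):
    # Only key-based access is possible; there is no way to
    # query by the contents of the value.
    return store.get(key)

put("acct-1001", {"name": "Alice", "balance": 250})
put("acct-1002", {"name": "Bob", "balance": 125})

print(get("acct-1001")["name"])  # Alice
```

Note that there is no query language here: if you want "all accounts with balance over 200", you must scan every value yourself, which is exactly the restriction discussed below.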
Key/value stores have been around for a long time – Unix’s dbm, gdbm and Berkeley DB are all key/value stores. The key-value pair concept has also been frequently used in traditional RDBMS applications for lookup tables, hash tables and configuration files.
KVP comes with an important restriction: results can be accessed by key alone. This restriction yields huge performance gains – massive speed improvements and the ability to partition data over multiple nodes without impacting the ability to query each node in isolation.
Just as moving away from normalization meant compromising on consistency, restricting access to key-based lookups compromises the flexible retrieval capability provided by relational databases and makes reporting (especially ad hoc reporting) difficult. Continued »
Data Replication – the key to distributed applications
Data replication – the copying of data in databases to multiple locations – is an essential tool for supporting distributed applications (where the applications sit closer to the line of business). As distributed data storage becomes more prevalent in large enterprises, the need to maintain an increasing number of data copies to support timely local response will keep growing.
Replication provides the benefits of fault tolerance in distributed computing environment, whereby the system can continue to function even when a part of the environment is down. Replication typically provides users with their own local copies of data. These local, updatable data copies can support increased localized processing, reduced network traffic, easy scalability and cheaper approaches for distributed, non-stop processing.
Data replication brings the challenge of maintaining the consistency and integrity of the data. With more nodes (or machines) in the cluster, the tightly coupled two-phase commit process becomes impractical. The newer replication approaches decouple applications from the underlying multiple data copies and allow the applications to proceed. Continued »
Eventual consistency gives scalability and availability
The success of high-volume web sites such as Google, Amazon, Twitter and Facebook in using NoSQL to achieve massive parallelism, near-unlimited scalability and high availability has fueled the interest. Consistency – the basic feature of relational databases – is no longer the key. Trading off consistency enables higher levels of scalability and availability, and the new generation of websites is willing to make that trade. Amazon claims that just an extra one-tenth of a second on their response times would cost them 1% in sales. Google noticed that just a half-second increase in latency caused traffic to drop by a fifth.
The trend can be understood from what Amazon’s CTO says: “Each node in a system should be able to make decisions purely based on local state. If you need to do something under high load with failures occurring and you need to reach agreement, you’re lost. If you’re concerned about scalability, any algorithm that forces you to run agreement will eventually become your bottleneck. Take that as a given.”
To be fault tolerant and serve millions of concurrent users, the data is duplicated into multiple copies. This raises the issue of how to keep those copies consistent. As the CAP theorem implies, the more you relax your consistency requirement, the more you can gain in availability and partition tolerance. Continued »
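A toy sketch of one common way to relax consistency is "last write wins" reconciliation: each replica accepts writes independently, and a background process eventually converges all replicas to the newest version of each key. The replica layout, keys and timestamps below are purely illustrative:

```python
import time

# Each replica stores (value, timestamp) per key; writes land on one
# replica and propagate lazily, and conflicts are resolved by keeping
# the version with the latest timestamp ("last write wins").
replicas = [dict(), dict(), dict()]

def write(replica, key, value, ts=None):
    replicas[replica][key] = (value, ts if ts is not None else time.time())

def anti_entropy():
    # Background reconciliation: every replica converges to the newest
    # version of each key - hence "eventually" consistent.
    merged = {}
    for r in replicas:
        for key, (value, ts) in r.items():
            if key not in merged or ts > merged[key][1]:
                merged[key] = (value, ts)
    for r in replicas:
        r.update(merged)

write(0, "cart", ["book"], ts=1)
write(2, "cart", ["book", "pen"], ts=2)  # a later write on another node
anti_entropy()
print(replicas[1]["cart"][0])  # every replica now holds ['book', 'pen']
```

Between the conflicting write and the reconciliation pass, different replicas return different answers – that window of disagreement is exactly the consistency being traded away for availability.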
Database Sharding – Horizontal Partitioning for massive scalability
Database sharding (made popular by Google) is a method of horizontal partitioning in a database; the approach is highly scalable and provides improved throughput and overall performance. Database sharding can be defined as a “shared-nothing” partitioning scheme for large databases across a number of servers, each with its own CPU, memory and disk. Simply put: break the database into smaller chunks called “shards” (each becoming a smaller database in itself) and spread those across a number of distributed servers.
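One simple way to implement this is hash-based routing: a key is hashed, and the hash decides which shard owns it. The shard count and key names below are illustrative:

```python
import hashlib

# Route each key to one of N "shared-nothing" shards by hashing it;
# each dict stands in for an independent database server.
NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    # Key-based access touches exactly one shard - no cross-server
    # coordination is needed for a lookup.
    return shards[shard_for(key)].get(key)

put("customer:42", "Alice")
print(get("customer:42"))  # Alice
```

Because the same key always hashes to the same shard, each lookup hits a single server, which is what lets throughput grow near-linearly as shards are added.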
The main advantage of the sharding approach is improved scalability, which grows in a near-linear fashion as more servers are added to the network. Continued »
Google BigTable – distributed data storage
Google’s data volumes are measured in petabytes, and a typical requirement can translate to joining two tables distributed over 100,000 nodes – relational databases are not the right fit for that. Google’s motivation for developing its own solutions is driven by its need for massive scalability, better control of performance characteristics, and the ability to run on commodity hardware so that each new service or increase in load results in a small incremental cost.
Google developed BigTable (a distributed storage system for structured data), built on top of Google’s other services – specifically GFS, Scheduler, Chubby Lock Service, and MapReduce – and it has been in active use since February 2005. The data model is simple and gives users dynamic control over data layout and format. Continued »
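At its core, BigTable is a sparse, sorted, multidimensional map indexed by (row key, column, timestamp). This toy dict-of-dicts mimics that shape; the row and column names follow the common web-table example and are purely illustrative:

```python
# BigTable's data model, sketched as nested dicts:
# (row key, column family:qualifier, timestamp) -> value.
table = {}

def put(row, column, timestamp, value):
    table.setdefault(row, {}).setdefault(column, {})[timestamp] = value

def get_latest(row, column):
    # Each cell keeps multiple timestamped versions; return the newest.
    versions = table[row][column]
    return versions[max(versions)]

put("com.cnn.www", "contents:", 1, "<html>v1</html>")
put("com.cnn.www", "contents:", 2, "<html>v2</html>")
print(get_latest("com.cnn.www", "contents:"))  # <html>v2</html>
```

The map is "sparse" in the sense that a row stores only the columns it actually has values for – there is no fixed schema forcing every row to carry every column.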
Google File System (GFS) – massively parallel and fault tolerant distributed file system
Google File System (GFS) is optimized for handling Google’s core data storage and usage needs, which involve generating and retaining enormous amounts of data. GFS is designed to manage very large files using a large distributed cluster of commodity servers connected by a high-speed network. It is designed to expect and tolerate hardware failures, even while reading or writing a file, and to support parallel reads, writes and appends by multiple client programs.
GFS splits large files into chunks of 64 MB that are rarely overwritten or shrunk, but typically appended to or read. The chunks are stored on cheap commodity servers (nodes) called chunk servers, so the design has to take precautions against the high failure rate of individual nodes and the consequent data loss. Another design decision is to go for high data throughput, even if it comes at the cost of latency. Continued »
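The chunking itself is simple to picture: a file becomes a sequence of fixed-size pieces, each of which can then be placed (and replicated) on different chunk servers. A minimal sketch, using a tiny chunk size so it runs quickly:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses 64 MB chunks

def split_into_chunks(data, chunk_size=CHUNK_SIZE):
    # A file is stored as a sequence of fixed-size chunks; in GFS each
    # chunk is replicated on several chunk servers to tolerate node
    # failures. Here we only show the splitting step.
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

# Tiny chunk size for illustration: 300 bytes -> 128 + 128 + 44.
chunks = split_into_chunks(b"x" * 300, chunk_size=128)
print(len(chunks))  # 3
```

Because chunks are large relative to typical block sizes, a client can stream a whole chunk from one server, which favors throughput over latency – matching the design decision above.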
Some more Titbits on CouchDB
An earlier blog post (http://itknowledgeexchange.techtarget.com/enterprise-IT-tech-trends/couchdb-nosql-for-document-based-applications/) covered some technical details behind CouchDB. Here, I am just listing some interesting points I could gather about this NoSQL database. Continued »
MapReduce – Programming Paradigm for massively parallel computations
Most NoSQL databases use the MapReduce programming model developed at Google, aimed at implementing large-scale search and text processing on massive web data stored using BigTable and the GFS distributed file system. MapReduce in turn relies on the key/value pair concept.
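The canonical illustration of the model is word counting: a map step emits (word, 1) pairs, a shuffle step groups the pairs by key, and a reduce step sums each group. A minimal single-machine sketch (a real MapReduce run would distribute the map and reduce tasks across many nodes):

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (key, value) pair for every word.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all values under their key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values into one result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
print(counts["the"])  # 3
```

Everything flows through key/value pairs: map output, shuffle grouping and reduce input are all keyed, which is why the KVP restriction carries over to MapReduce-based systems.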
As noted earlier, key-value pairs carry an important restriction – results can be accessed by key alone – which compromises flexible retrieval and ad hoc reporting.
It is important to keep this restriction in mind while evaluating any of the NoSQL databases. Nevertheless, MapReduce has proven its capability for handling large volumes of data using parallel computations on thousands of commodity machines. Continued »
CouchDB – NoSQL for document-based applications
CouchDB – as the name suggests – has “relax” as the byline of CouchDB’s official logo, and when you start CouchDB it says: “Apache CouchDB has started. Time to relax!”
CouchDB creators found existing database tools too cumbersome to work with during development or in production, and decided to focus on making CouchDB easy, even a pleasure, to use.
Like most NoSQL databases, CouchDB relies on Brewer’s CAP theorem, which states that it is impossible for a distributed computer system to simultaneously provide all three guarantees – Consistency, Availability and Partition tolerance. Continued »
Commoditization – the next logical step for IT
Business-IT alignment has long been recognized as the key to effective use of IT, and we have gone through various stages – standardization, rationalization, COTS, ERP and SOA – to achieve it, yet it still remains elusive.
Cloud computing is the next panacea that enterprises are looking to. The success of cloud in the SMB segment has been very encouraging, and it would indeed be a mistake for enterprises not to look at this option to achieve their objectives. Pay-as-you-use pricing (which allows effective use of resources in a cost-sensitive fashion) and the seemingly unlimited availability of resources (allowing unlimited scalability and the ability to handle unexpected spikes) are the benefits the cloud has to offer. Continued »