Google BigTable – distributed data storage
Google’s data volumes are in petabytes, and a typical requirement can translate to joining two tables distributed over 100,000 nodes; relational databases are not the right fit for this. Google’s motivation for developing its own solutions is driven by its need for massive scalability, better control of performance characteristics, and the ability to run on commodity hardware, so that each new service or increase in load results in only a small incremental cost.
Google has developed BigTable (a distributed storage system for structured data), built on top of Google’s other services – specifically GFS, the Scheduler, the Chubby lock service, and MapReduce – that has been in active use since February 2005. Its data model is simple and gives users dynamic control over data layout and format.
Each table is a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by a row key (up to 64 KB in size), a column key, and a timestamp. Each value in the map is treated as an uninterpreted array of bytes (though clients can serialize various forms of structured and semi-structured data into these values).
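The data model above can be pictured as a sorted map from a (row key, column key, timestamp) triple to a byte string. The sketch below is purely illustrative (the names and the `com.cnn.www` example row are not part of Bigtable's API) and shows only the logical shape of the map, not Bigtable's actual storage layout:

```python
import bisect

# Minimal sketch of the logical data model: a sorted map from
# (row_key, column_key, timestamp) -> uninterpreted bytes.
class SortedMap:
    def __init__(self):
        self._keys = []    # kept in sorted order
        self._values = {}  # (row, column, timestamp) -> bytes

    def put(self, row, column, timestamp, value: bytes):
        key = (row, column, timestamp)
        if key not in self._values:
            bisect.insort(self._keys, key)  # maintain sort order on insert
        self._values[key] = value

    def get(self, row, column, timestamp):
        return self._values.get((row, column, timestamp))

table = SortedMap()
table.put("com.cnn.www", "contents:", 3, b"<html>...")
print(table.get("com.cnn.www", "contents:", 3))  # b'<html>...'
```

Because keys sort by row key first, rows with nearby keys end up adjacent in the map, which is what makes short row-range scans cheap.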
The table maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned into tablets of approximately 100–200 MB, and each machine stores around 100 tablets. This makes reads of short row ranges efficient, as they involve only a small number of machines. Similarly, tablets allow fast recovery (a failed machine’s tablets can be picked up by many others) and fine-grained load balancing.
Columns store arbitrary name–value pairs, with column keys named using the syntax column-family:label (both strings). The set of column families for a table is fixed at the time of table creation, and the number of distinct column families in a table is expected to be small and to rarely change. The actual columns (i.e., labels) within a column family, however, can be created dynamically at any time.
Similar to column-oriented databases, column families are stored close together, resulting in efficient data access. A sales analysis may read only the data belonging to the location column family, while a market analysis may use only the product column family. Access control, and both disk and memory accounting, are done at the column-family level.
Each cell (row, column) can contain multiple versions of the data, indexed by timestamp. The versions are stored in decreasing timestamp order so that the most recent can be read first. Timestamps, when assigned automatically, represent real time in microseconds; applications that need to avoid collisions must generate unique timestamps themselves. Using the automatic garbage collection of cell versions, the client can specify that only the last n versions of a cell, or only new-enough versions (say, from the last d days), be retained.
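The two garbage-collection policies can be sketched as a simple filter over a cell's version list. This is a hedged illustration of the policy semantics only (the function name and representation are invented here, not Bigtable's interface); versions are held newest-first, matching the storage order described above:

```python
import time

# Sketch of per-cell version garbage collection: keep only the latest
# n versions, or only versions newer than max_age_seconds.
def gc_versions(versions, keep_last_n=None, max_age_seconds=None, now=None):
    """versions: list of (timestamp_micros, value), newest first."""
    now = now if now is not None else time.time() * 1_000_000
    kept = versions
    if keep_last_n is not None:
        kept = kept[:keep_last_n]          # newest-first, so slice keeps latest n
    if max_age_seconds is not None:
        cutoff = now - max_age_seconds * 1_000_000
        kept = [(ts, v) for ts, v in kept if ts >= cutoff]
    return kept

cell = [(300, b"v3"), (200, b"v2"), (100, b"v1")]  # newest first
print(gc_versions(cell, keep_last_n=2))  # [(300, b'v3'), (200, b'v2')]
```

Both policies can also be combined, in which case a version survives only if it satisfies both conditions.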
The BigTable implementation involves three major components: a library linked into every client, one master server, and many tablet servers (which can be added dynamically). The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, tablet-server load balancing, garbage collection, and handling schema changes such as table and column-family creation. Each tablet server manages the read and write requests for a set of tablets (ten to a thousand tablets per server).
BigTable uses a three-level hierarchy similar to that of a B+ tree to store tablet location information.
1. The first level is a file, stored in Chubby, containing the location of the root tablet (the first METADATA tablet, which is never split).
2. The root tablet contains the locations of all the tablets of a special METADATA table.
3. Each METADATA tablet contains the locations of a set of user tablets (each METADATA row is around 1 KB).
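A back-of-the-envelope calculation shows why three levels are enough. Assuming METADATA tablets of 128 MB (a figure not stated above, so treat it as an assumption) and roughly 1 KB per location row:

```python
# Capacity of the three-level location hierarchy, assuming 128 MB
# METADATA tablets and ~1 KB per location row.
rows_per_metadata_tablet = (128 * 2**20) // (1 * 2**10)   # 2**17 rows per tablet
# The root tablet points at up to 2**17 METADATA tablets, each of which
# points at up to 2**17 user tablets:
addressable_user_tablets = rows_per_metadata_tablet ** 2  # 2**34 user tablets
print(addressable_user_tablets)  # 17179869184
```

Even at 128 MB per user tablet, 2^34 tablets is far more capacity than any single cluster needs.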
Clients do not go through the master to find tablets; the client library caches tablet locations and also prefetches them (reading the metadata for more than one tablet at a time). As a result, clients rarely need to consult the master for tablet locations, ensuring that the master is lightly loaded in practice and does not become a bottleneck.
The API provides functions for creating and deleting tables and column families, and for changing cluster, table, and column-family metadata. Applications can write or delete values, look up values for individual rows, or iterate over a subset of the data in a table. Single-row transactions, which support atomic read-modify-write sequences on data stored under a single row key, are provided.
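The single-row transaction guarantee can be sketched as follows. This is not Bigtable's actual client API (the class and method names here are invented); it only models the contract that all reads and writes within one transaction on one row are atomic, using a per-row lock and a counter increment as the example:

```python
import threading

# Hedged sketch of a single-row atomic read-modify-write.
class Row:
    def __init__(self):
        self._lock = threading.Lock()
        self.cells = {}  # column -> value

    def read_modify_write(self, column, fn):
        # Everything inside the lock is atomic with respect to other
        # transactions on this row; missing cells default to 0 for this
        # counter illustration.
        with self._lock:
            old = self.cells.get(column, 0)
            self.cells[column] = fn(old)
            return self.cells[column]

row = Row()
print(row.read_modify_write("anchor:count", lambda v: v + 1))  # 1
print(row.read_modify_write("anchor:count", lambda v: v + 1))  # 2
```

Note that the guarantee stops at the row boundary: two such updates on different rows are independent transactions, which is exactly why general cross-row transactions are not covered by this mechanism.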
Currently, BigTable does not support general transactions across row keys; instead, an interface for batching writes across row keys is provided. Tables can be used in conjunction with the MapReduce framework for running large-scale parallel computations, and wrappers allow Bigtable to be used both as an input source and as an output target for MapReduce jobs.
In addition, the execution of client-supplied scripts in the address spaces of the servers is supported. For Bigtable, scripts written in Sawzall (a language developed at Google) allow various forms of data transformation, filtering, and summarization. At present, client scripts are not allowed to write back into Bigtable.
The Google SSTable file format is used internally to store Bigtable data. An SSTable can also be completely mapped into memory, which allows lookups and scans without touching disk. Updates are committed to a commit log (which stores redo records). Of these, the recently committed updates are kept in memory in a sorted buffer called a memtable, while older updates are stored in a sequence of SSTables. In effect, the persistent state of a tablet is stored in GFS. A read operation is executed on a merged view of the sequence of SSTables and the memtable.
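The merged read view can be sketched as a lookup that consults the memtable first and then the SSTables from newest to oldest. This simplified sketch models each store as a dictionary and omits deletion markers (tombstones), which the real merge must also honor:

```python
# Sketch of the tablet read path: a read sees a merged view of the
# in-memory memtable and the on-disk SSTables, newest first.
def read(key, memtable, sstables):
    """memtable: dict; sstables: list of dicts, newest to oldest."""
    if key in memtable:           # most recent updates live here
        return memtable[key]
    for sstable in sstables:      # then newest SSTable first
        if key in sstable:
            return sstable[key]
    return None

memtable = {"row1": b"new"}
sstables = [{"row2": b"mid"}, {"row1": b"old", "row3": b"oldest"}]
print(read("row1", memtable, sstables))  # b'new'  (memtable shadows SSTables)
print(read("row3", memtable, sstables))  # b'oldest'
```

Since both the memtable and each SSTable are sorted by key, the real merged view can also serve range scans efficiently by merging sorted iterators rather than probing dictionaries.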
The memtable is the only mutable data structure accessed by both reads and writes. The master removes obsolete SSTables via a mark-and-sweep garbage collection over the set of SSTables, where the METADATA table contains the set of roots.
Tablet servers use two levels of caching – the Scan Cache and the Block Cache – to improve read performance. The Scan Cache is useful for applications that tend to read the same data repeatedly, while the Block Cache helps applications that tend to read data close to recently read data (e.g., sequential reads).
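A hedged sketch of the two cache levels: the Scan Cache holds key/value pairs already returned by the SSTable layer, while the Block Cache holds raw SSTable blocks read from GFS. Both are modeled here as simple LRU caches (the eviction policy is an assumption of this sketch, not something the text above specifies):

```python
from collections import OrderedDict

# Generic LRU cache used to model both cache levels.
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used

scan_cache = LRUCache(capacity=1000)   # (row, column) -> value pairs
block_cache = LRUCache(capacity=100)   # block offset -> raw block bytes
scan_cache.put(("row1", "col1"), b"v")
print(scan_cache.get(("row1", "col1")))  # b'v'
```

The split matters because the two caches pay off for different access patterns: repeated point reads hit the Scan Cache without touching the SSTable layer at all, while nearby reads hit the Block Cache and avoid a GFS read.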
There is a single commit log per tablet server; the mutations for different tablets are commingled in the same physical log. Using one log provides significant performance benefits by reducing the number of disk seeks and enabling an effective group-commit optimization. The trade-off is that it complicates recovery for individual tablets (when a tablet server dies, its tablets may be moved to a large number of other servers, each of which must then pick out the relevant entries from the shared log).
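The shared-log design and its recovery cost can be sketched in a few lines. This is an illustration of the idea only (entries for many tablets interleaved in one append-only log, filtered per tablet at recovery time), not Bigtable's log format:

```python
# Sketch of the shared commit log: mutations for many tablets are
# appended to one physical log, tagged with the owning tablet.
log = []  # one append-only log per tablet server

def append(tablet, mutation):
    log.append((tablet, mutation))  # single sequential file: few disk seeks

def recover(tablets_to_reload):
    # A recovering server must scan the whole shared log and keep only
    # the entries for the tablets it is taking over.
    return [(t, m) for t, m in log if t in tablets_to_reload]

append("tabletA", "put r1")
append("tabletB", "put r9")
append("tabletA", "del r2")
print(recover({"tabletA"}))  # [('tabletA', 'put r1'), ('tabletA', 'del r2')]
```

This is exactly the complication mentioned above: if a dead server's tablets are spread across many new servers, each of them must filter the same shared log for its own entries.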
Clients can control whether SSTables are compressed, and also which compression format is used. Compression is applied to each SSTable block separately – though this is not ideal for saving space, it enables reading small portions of an SSTable without decompressing the entire file.
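The benefit of per-block compression can be shown in a few lines. The sketch below uses zlib purely for illustration (Bigtable uses its own compression schemes, and the block contents here are made up); the point is only that each block decompresses independently:

```python
import zlib

# Per-block compression: each SSTable block is compressed independently,
# so a reader can decompress just the block containing the wanted row.
blocks = [b"rows aaa...", b"rows bbb...", b"rows ccc..."]
compressed = [zlib.compress(b) for b in blocks]

# Reading only block 1 requires decompressing only that block:
print(zlib.decompress(compressed[1]))  # b'rows bbb...'
```

Compressing the whole file as one stream would yield a better ratio (more shared context for the compressor) but would force a reader to decompress everything up to the wanted offset, which is the trade-off the text describes.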
BigTable is now used by a number of Google applications – Google Reader, Google Maps, Google Book Search, “My Search History”, Google Earth, Blogger.com, Google Code hosting, Orkut, YouTube and Gmail.
BigTable is currently not distributed or used outside of Google, although Google offers access to it as part of Google App Engine. NoSQL databases such as Google’s Datastore, Amazon’s SimpleDB, Apache Cassandra, Kosmix’s KDI, and HBase (for Hadoop) are based on similar data models.