Apache Cassandra - Distributed Database – Part I
The Apache Cassandra is highly available, incrementally scalable, eventually consistent, distributed database, bringing together Amazon Dynamo’s fully distributed design and Google Bigtable’s Column Family-based data model.
Schemaless and Rich Data model
Cassandra is considered a schema-less datastore. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems. A column oriented model can be visualized as a key/value pair where the value is a collection of other Key/Value elements.
Keyspaces are the upper-most namespace in Cassandra and typically there will be exactly one for each application (similar to Database in RDBMS). For each keyspace there are one or more column families. A column family is the namespace used to associate records of a similar kind. A column family can be compared with a table in a RDBMS as they are fixed when a Cassandra database is created. But columns can be added to a family at any time and the number and names of the columns can vary from one row to another (schema-less!).
Cassandra offers record-level atomicity within a column family when making writes, and queries against them are efficient. Column family definitions also include a comparator which controls the order in which the records are sorted. Unlike relational Cassandra doesn’t support secondary indices enabling queries by field other than the key. The answer is to create our own inverted index that maps the secondary field (say username) to the primary key (say UUID), and that is the purpose of this column family.
Column is the basic element composed of a name, value and timestamp. The timestamp is set by the client and this has an architectural impact on clients’ clocks synchronization. A single column value may not be larger than 2GB
There is also a concept called SuperColumn which is in turn a column in which a dynamic list of columns can be stored. The notion of “Row” as in RDBMS doesn’t exist – but a list of columns identified by a row key under SuperColumns come closer to that.
Memtable and SSTable
Cassandra is suitable for applications that cannot afford to lose data even when the entire data centre goes down. Durability, the property that ensures that the write once completed will survive permanently typically requires calling “fsync” telling the OS to flush its write-behind cache to disk. If done for each write, this would be very slow and so not supported and writes to commit log instead.
Cassandra does not use b-tree and in-place updates on disk and uses the Memtable and SSTable similar to Google’s Bigtable. Cassandra writes are first written to the CommitLog (append only), and then to a per-ColumnFamily structure called a Memtable. A Memtable is a write-back cache of data rows that can be looked up by key. Unlike a write-through cache, writes are batched up in the Memtable until it is full, before being written to disk as an SSTable.
As writes incur no random I/O, Cassandra writes faster than the reads which has to merge row fragments from multiple SSTables on disk. This tradeoff is considered acceptable as scaling writes which is harder gets handled gracefully. The default value for concurrent reads is “8″ (thumb rule – 4 per processor core) and concurrent writes is “32″ (modifying this value is not recommended; in general it can exceed the number of cpu-cores on the ring).
Cassandra allows configuring commit Log to be synchronized periodically (default is 10000 milliseconds) and there is a potential loss of some data in case of a crash. Alternatively the batch mode (fully durable mode) can be selected where Cassandra guarantees that it synchronizes (i.e., the commit log has been fsynced to disk) before acknowledging writes. Batch mode will also wait for other writes (based on value of CommitLogSyncBatchWindowInMS) before performing the sync. For Batch mode, splitting the commit log to a separate device is recommended.
Clusters and Data Partitioning
Cassandra uses the peer-to-peer distribution model – which in turn drives the consistency model – ensuring that there is no single point of failure. Every node in the cluster is identical. There are no network bottlenecks.
The nodes in the cluster can be considered to be as in a ring and follows O(1). Cassandra lets the nodes of the cluster (and not the client) partitioning the data based on the row key. Each Cassandra server (node) is assigned a unique token that determines what keys it is the first replica for. When all tokens are sorted, the range of keys each is responsible for is (PreviousToken, MyToken. The machine with the lowest token gets all keys less than token as well as all keys greater than the largest token (referred to as “wrapping range”).
Cassandra mainly uses two different algorithms to distribute data over the nodes:
1. The RandomPartitionner that gives you an equally and hash-based distribution. The keys are converted to the token range (from 0 to 2127) by MD5 hashing and compared with the token of the node to choose the node. This is more suited for better load balancing as the key are more equally partitioned across the different nodes.
2. The OrderPreservingPartitioner that guarantees the Key are organized in a natural way. This facilitates the range queries as fewer nodes are hit to get a range of data. Tokens are UTF8 strings in the range of 1 to ∞.
If nodes are added to the cluster, the ring becomes unbalanced and only way to get perfect balance is to compute new tokens for every node and assign them to each node manually by using nodetool move command. Additionally the replica placement can be customized using IReplicaPlacement Strategy in the configuration file. In case of RackUnawareStrategy, replicas are always placed on the next (in increasing token order) N-1 nodes along the ring.
In case of RackAwareStrategy, replica 2 is placed in the first node along the ring that belongs to another data center than the first (for fault-tolerance) and the remaining N-2 replicas are placed on the first nodes along the ring in the same rack as the first. In this case, succeeding nodes along the ring should alternate data centers to avoid hot spots (say nodes A and C in DC1 and B and D in DC2). In other words, when a second DC is added then add as many nodes as you have in the first.
Like most NoSQL databases, Cassandra has chosen Availability and Partition tolerance over Consistency. Cassandra offers different levels of consistency that have different meaning based on whether it a read or write operation:
1. Zero – Applicable for Write ONLY. Guarantees nothing and the write happens asynchronously in background.
2. ANY – Applicable for Write ONLY. Ensure that the write has been written to at least 1 node. Unlike Option one, this includes even writing to HintedHandoff recipients (where the hinted write is not immediately readable).
3. ONE – Ensure that the write has been written to at least one replica’s commit log and memory table before responding to the client. During read, the data will be returned from the first node where it is found and this could result in older data getting returned as the node that is hit may not have the last version. A consistency check happens in the background so that any subsequent calls will have the correct data (Read repair).
4. QUORUM – Ensure that the data has been written to <Replication Factor>/2 + 1 replicas before responding to the client. The read returns the most recent record (timestamp) after reading <Replication Factor>/2 + 1 replicas ensuring that the data is consistent. The remaining replicas are checked in the background.
5. DCQUORUM – Similar to QUORUM but takes into account the rack aware placement strategy where the reads are kept within a data center.
6. ALL – Ensures that the write and read involve all the replicas. Any unresponsive replica will fail the operation.
If R=read replica count, W= Write replica count, N=replication factor, and Q (QUORUM) = N / 2 + 1;
You will have consistency, when W + R > N;
If W=Q and R=Q, then you have consistency and remaining replicas are checked in background
If W=1 and R=N (OR) W=N and R =1, you have consistency as either all replicas have the latest version or all replicas are read for finding the latest
If W=1 and R=1, you can get older data and Read Repair happens in the background.
Thus Cassandra use BASE and offers eventual consistency – the latest version is on some node in the cluster while the older versions are still there in other nodes and eventually replication happens so that all nodes see the latest version.