Apache Cassandra – Distributed Database – Part II
As seen in consistency, Cassandra offers the users the choice of synchronous and asynchronous replication. Reducing replication factor is easy as it only requires running cleanup afterwards removing extra replicas. Highly available asynchronous operations are optimized with features like:
- Hinted Handoff – If a node which should receive a write is down, a hint will be written to a live replica node indicating that the write needs to be replayed to the unavailable node. If there are no live replica nodes for the key the coordinating node will write the hint locally. This reduces the time required for a temporarily failed node to become consistent again. Hinted Handoff is a way by which Cassandra provides extreme write availability when consistency is not required. It is not sufficient for the ConsistencyLevel requirements of ONE, QUORUM or ALL (because these guarantee Consistency and writing somewhere doesn’t meet that requirement).
- Read Repair – When a query is made against a key, the query is performed against all the replicas of the key. If the ConsistencyLevel specified is low, then this is done in the background after returning the data from the closest replica to the client.
Conflict resolution is made during the read requests by way of “read repair”. Conflict resolution is based on timestamp specified by the client when you insert the row or the column. The higher timestamp wins and the node you are reading the data is responsible for that. Thus, all Cassandra clients’ need to be synchronized (based on an NTP for instance) in order to ensure the resolution conflict be reliable. The current version of Cassandra does not provide a Vector Clock conflict resolution mechanisms (should be available in the version 0.7).
Availability and Fault Tolerance
High availability is achieved using replication and actively replicating data across data centers is recommended. Since eventual consistency is the mantra of the system, reads execute on the closest replica and data is repaired in the background for increased read throughput.
Cluster membership is maintained via Gossip style (computer-to-computer communication protocol inspired by the form of gossip seen in social networks) membership algorithm. Failures of nodes within the cluster are monitored using an Accrual Style Failure Detector. A family of failure detectors, called accrual failure detectors, is defined whereby each monitoring process associates to each of the monitored processes, a real number that changes over time. The value represents a suspicion level, with zero implying that the process is not suspected at all and larger the value, the stronger the suspicion. Unlike binary failure detectors, accrual failure detectors leave the task of interpreting the suspicion level to applications.
Failure detection comprises three basic tasks:
1. Monitoring -Allows the failure detector to gather information about other hosts and their processes. Typically done by samplingb heartbeat arrivals or query-response delays.
2. Interpretation – Makes sense of the information obtained through monitoring.
3. Actions – Executed as a response to triggered suspicions
If a node goes down entirely, it is recommended that you bring up the replacement node with a new IP address and AutoBootstrap set to true in storage-conf.xml. This will place the replacement node in the cluster in the appropriate position. Then bootstrap process begins and once done the node is ready to receive reads. Remove the token of the dead node by running “nodetool removetoken” once. Also run “nodetool cleanup” on each node to remove old Hinted Handoff writes stored for the dead node. Alternative, is to bring up a replacement node with the same IP and taken as the old one and run “nodetool repair”.
Cassandra version 0.6 onwards authenticates (using login function) with the cluster for operation on the specified keyspace using the Authentication Request credentials. If the credentials are invalid, it throws Authentication Exception and if the credentials are valid, but not for the specified keyspace, it throws Authorization Exception.
Other Points on Cassandra
- Cassandra was started by Facebook in 2007 and open-sourced in 2008.
- Cassandra was a prophetess in Troy during the Trojan War whose predictions were always true but never believed!
- Cassandra is in production use at Digg, Facebook, Twitter, Reddit, Rackspace, Cloudkick, Cisco, SimpleGeo, Ooyala, OpenX, and companies that have large, active data sets. The use cases are in http://wiki.apache.org/cassandra/CassandraUsers
- Cassandra’s largest production cluster is said to have over 100 TB of data in over 150 machines.
- The Use cases in the website are still incomplete (http://wiki.apache.org/cassandra/UseCases)
- Riptana provides professional Cassandra support and services. Similarly ONZRA provides enterprise grade Cassandra architecture and development services (these are not endorsed by Apache).
- Several heavy users of Cassandra deploy in the cloud, e.g. CloudKick on Rackspace Cloud Servers and SimpleGeo on Amazon EC2.
- Rackspace’s VMs offer better performance for Cassandra because of CPU bursting, raided local disks, and separate public/private network interfaces.
Cassandra provides a structured key-value store with eventual consistency. Keys map to multiple values, which are grouped into column families. Columns are added only to specified keys, so different keys can have different numbers of columns in any given family. The values from a column family for each key are stored together, making Cassandra a hybrid between a column-oriented database and a row-oriented store.
Cassandra has interfaces for Ruby, Python (Telephus and Pycassa) and, Scala. Thrift is the Cassandra driver-level interface that these clients build on. Hadoop map/reduce jobs can retrieve data from Cassandra. The StorageProxy API (Thrift calls translate to this) is available to JVM-based clients and could provide some efficiency gains. But it is not recommended as backward compatibility is not guaranteed. Cassandra runs on a JVM and exposes JMX properties. Monitoring information can be collected using jConsole or any JMX compliant tool.
Cassandra provides elastic scaling as read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications. Cassandra is aptly described as a BigTable data model running on an Amazon Dynamo-like infrastructure.