Posted by: Michael Tidmarsh
IT Books, NoSQL
The following is an excerpt from the book Making Sense of NoSQL from Manning Publications.
By Dan McCreary and Ann Kelly
The four main patterns—key-value store, graph store, Bigtable store, and document store—are the major architecture patterns associated with NoSQL. As with most things in life, there are always variations on a theme. In this article, part of Making Sense of NoSQL, the authors discuss a representative sample of the types of pattern variations and how they can be combined to build NoSQL solutions in organizations.
We’re giving away a free copy of Making Sense of NoSQL to one lucky ITKE member. Share your data management story with us and we’ll pick the most compelling tale.
Variations of NoSQL Architectural Patterns
In this article, we will look at how each of the NoSQL patterns—key-value store, graph store, Bigtable store, and document store—can be varied by focusing on a different aspect of system implementation. We’ll look at how the architectures can be varied to use RAM or solid state drives (SSDs) and then talk about how the patterns can be used on distributed systems or modified to create enhanced availability. Finally, we’ll look at how database items can be grouped together in different ways to make navigation over many items easier.
Customization for Random Access Memory (RAM) or Solid State Drive (SSD) Stores
Some NoSQL products are designed to work specifically with one type of memory. For example, memcache, a key-value store, was designed specifically to cache items in RAM across multiple servers. A key-value store that uses only RAM is called a RAM cache; it’s flexible and provides general tools that application developers can use to store global variables, configuration files, or intermediate results of document transformations. A RAM cache is fast and reliable and can be thought of as another programming construct like an array, a map, or a lookup system. There are, however, several items that should be considered:
- Simple RAM-resident key-value stores are generally empty when the server starts up and can only be populated with values on demand.
- You need to define the rules about how memory is partitioned between the RAM cache and the rest of your application.
- RAM-resident information must be saved to another storage system if you want it to persist between server restarts.
The key is to understand that RAM caches must be re-created from scratch each time a server restarts. A RAM cache that contains no data is called a “cold cache”; cold caches are the reason some systems get faster the longer they run after a reboot.
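To make the cold-cache behavior concrete, here is a minimal sketch in Python of a RAM-resident key-value cache. The `RamCache` class and its `compute` parameter are hypothetical illustrations, not the API of any real product: the store starts empty on every restart and is populated only on demand, just as described above.

```python
class RamCache:
    """Minimal sketch of a RAM-resident key-value cache (hypothetical).

    Like memcache, it starts cold: the store is empty after every
    restart and is populated only on demand.
    """

    def __init__(self, max_items=1024):
        self._store = {}             # all data lives in RAM only
        self._max_items = max_items  # crude memory-partitioning rule

    def get(self, key, compute=None):
        """Return the cached value, computing and caching it on a miss."""
        if key in self._store:
            return self._store[key]   # warm-cache hit
        if compute is None:
            return None               # cold-cache miss, nothing to do
        value = compute()             # the expensive work happens once
        if len(self._store) < self._max_items:
            self._store[key] = value
        return value


cache = RamCache()
# First call is a cold-cache miss and runs the computation...
v1 = cache.get("config", compute=lambda: {"timeout": 30})
# ...subsequent calls are served straight from RAM.
v2 = cache.get("config")
```

Note how the first lookup after a restart always pays the full computation cost; only later requests benefit from the cache.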
SSD systems provide permanent storage and are almost as fast as RAM for read operations. The Amazon DynamoDB key-value store service uses SSD for all its storage, resulting in very high-performance read operations. Write operations to SSD can often be buffered in large RAM caches, resulting in very fast write times until the RAM becomes full.
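The write-buffering idea can be sketched as follows. This is a hypothetical illustration, not how DynamoDB is implemented: a plain dictionary stands in for the RAM buffer, another for the SSD, and `flush_threshold` is an invented parameter marking the point where the RAM buffer fills up.

```python
class BufferedStore:
    """Hypothetical sketch of RAM-buffered writes in front of slower storage."""

    def __init__(self, flush_threshold=3):
        self.buffer = {}                      # fast RAM write buffer
        self.disk = {}                        # stands in for the SSD
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.buffer[key] = value              # fast path: write lands in RAM
        if len(self.buffer) >= self.flush_threshold:
            self._flush()                     # slow path once RAM fills up

    def get(self, key):
        # Check the RAM buffer first, then fall back to the SSD.
        if key in self.buffer:
            return self.buffer[key]
        return self.disk.get(key)

    def _flush(self):
        # One bulk write to the SSD is cheaper than many small ones.
        self.disk.update(self.buffer)
        self.buffer.clear()


store = BufferedStore(flush_threshold=3)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)   # third write fills the buffer and triggers a flush
```

Writes stay fast only while the buffer has room; once it fills, a write must wait for a flush, which is exactly the slowdown described above.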
As you will see, using RAM and SSD drives efficiently is critical when using distributed systems that provide for higher volume and availability.
In this part, we look at how NoSQL data architecture patterns vary as you move from a single processor to multiple processors that are distributed over data centers in different geographic regions. The ability to elegantly and transparently scale to a large number of processors is a core property of most NoSQL systems. Ideally, the process of data distribution is transparent to the user, meaning that the API does not require you to know how or where your data is stored. However, knowing that your NoSQL software can scale and how it does this is critical in the software selection process.
If your application uses many web servers, each caching the result of a long-running query, it is most efficient to have a method that allows the servers to work together to avoid duplication. Memcache provides this mechanism. Whether you’re using NoSQL or traditional SQL systems, RAM continues to be the most expensive and precious resource in an application server’s configuration. If you don’t have enough RAM, your application won’t scale.
The solution used in a distributed key-value store is to create a simple, lightweight protocol that checks whether any other server has an item in its cache. If one does, the item is quickly returned to the requester and no additional searching is required. The protocol is simple: each memcache server has a list of the other memcache servers it is working with. Whenever it receives a request for a key that is not in its cache, it checks with its peer servers by sending them the key.
The memcache protocol shows that we can create simple communication protocols between distributed systems to make them work efficiently as a group.
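The peer-check idea can be sketched in a few lines of Python. This is a hypothetical, in-process illustration of the concept, not the actual memcache wire protocol: the `PeerCache` class and its `peers` list are invented names, and real servers would communicate over the network rather than reading each other’s dictionaries.

```python
class PeerCache:
    """Hypothetical sketch of memcache-style peer lookup."""

    def __init__(self):
        self.local = {}   # this server's own RAM cache
        self.peers = []   # the other cache servers it works with

    def get(self, key):
        if key in self.local:
            return self.local[key]        # local hit: done
        # Local miss: ask each peer for the key before recomputing.
        for peer in self.peers:
            value = peer.local.get(key)
            if value is not None:
                self.local[key] = value   # remember it locally too
                return value
        return None                       # miss everywhere


a, b = PeerCache(), PeerCache()
a.peers, b.peers = [b], [a]
b.local["report:42"] = "cached result"   # only server b has done the work
found = a.get("report:42")               # a finds it via its peer
```

Server `a` never recomputes the long-running query; it borrows the result from its peer, which is the duplication-avoidance the protocol is designed for.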
This type of information sharing can be extended to other NoSQL data architectures such as Bigtable stores and document stores. We can generalize the key-value pair to other patterns by just referring to them as “cached items.”
Cached items can also be used to enhance the overall reliability of a data service by replicating the same items in multiple caches. If one server goes down, other servers quickly fill in so that the application gives the user the feeling of service without interruption.
To provide a seamless data service without interruption, the cached items need to be replicated automatically on multiple servers. If the cached items are stored on two servers and the first one becomes unavailable, the second server can quickly return the value; there is no need to wait for the first server to be rebooted or restored from backup.
In practice, almost all distributed NoSQL systems can be configured to store cached items on two or three different servers. The decision about which server stores which key can be determined by implementing a simple round-robin or random distribution system. There are many trade-offs about how loads can be distributed over large clusters of key-value store systems and how the cached items in unavailable systems can be quickly replicated onto new nodes.
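A replication-with-failover scheme along these lines can be sketched as follows. The `ReplicatedStore` class is a hypothetical illustration: it hashes each key to pick a deterministic set of replica servers (one of many possible distribution schemes), writes to all of them, and on read skips any server marked as down.

```python
import hashlib


class ReplicatedStore:
    """Hypothetical sketch: each item is stored on N replica servers."""

    def __init__(self, num_servers=4, replicas=2):
        self.servers = [dict() for _ in range(num_servers)]
        self.replicas = replicas
        self.down = set()   # indexes of currently unavailable servers

    def _targets(self, key):
        # Deterministic placement: hash the key, then take the next
        # `replicas` servers in ring order.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.servers)
        return [(start + i) % len(self.servers) for i in range(self.replicas)]

    def put(self, key, value):
        for i in self._targets(key):      # write to every replica
            self.servers[i][key] = value

    def get(self, key):
        for i in self._targets(key):
            if i not in self.down:        # skip failed servers
                return self.servers[i].get(key)
        return None                       # all replicas are down


store = ReplicatedStore()
store.put("user:1", "Ann")
store.down.add(store._targets("user:1")[0])   # first replica fails
result = store.get("user:1")                  # second replica answers
```

Because the second replica already holds the item, the failed server causes no interruption; there is no need to wait for it to be rebooted or restored from backup.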
NoSQL systems are common in organizations that have large collections of data items, and it becomes cumbersome to deal with these items if they can only be accessed in a single linear listing.
Web pages can be stored in a key-value store using a website URL as the key and the web page as the value. We can extend this construct to file systems as well. In a file system, the key is the directory or folder path and the value is the file content. However, unlike web pages, file systems have the ability to list all the files in a directory without having to open the files. If the file content is large, it would be inefficient to load all of the files into memory each time you want a listing of the files.
To make this easier and more efficient, a key-value store can be modified to include additional information in the structure of the key indicating that one key-value pair is associated with another, creating a collection: a general-purpose structure used to group resources. While each key-value store system may call these structures something different (folders, directories, or buckets), the concept is the same.
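One simple way to do this is to embed a path in the key itself, so a collection is just a shared key prefix. The sketch below is a hypothetical illustration of that idea (the `PathKeyStore` name and its methods are invented): it can list the members of a collection by scanning key prefixes, without ever loading the potentially large values.

```python
class PathKeyStore:
    """Hypothetical sketch: keys are slash-delimited paths, so a
    'collection' is simply a shared key prefix."""

    def __init__(self):
        self.items = {}

    def put(self, path, value):
        self.items[path] = value

    def list_collection(self, prefix):
        # List member keys without loading the (possibly large) values.
        prefix = prefix.rstrip("/") + "/"
        return sorted(k for k in self.items if k.startswith(prefix))


store = PathKeyStore()
store.put("/invoices/2013/inv-001.xml", "<invoice>...</invoice>")
store.put("/invoices/2013/inv-002.xml", "<invoice>...</invoice>")
store.put("/reports/q1.pdf", b"...")
members = store.list_collection("/invoices/2013")
```

Like a file system’s directory listing, `list_collection` touches only the keys, so listing stays cheap even when each value is megabytes of content.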
The implementation of a collection system can also vary dramatically based on which NoSQL data pattern you use. Key-value stores have several methods to group similar items based on attributes in their keys. Graph stores associate one or more group identifiers with each triple. Bigtable stores use column families to group similar columns. Document stores use the concept of a document collection. Let’s take a look at some examples used by key-value stores.
One approach to grouping items is to have two key-value data types: the first called resource keys and the second collection keys. Collection keys store a list of the keys that belong to a collection. This structure allows you to store a resource in multiple collections and also to store collections within collections.
Using this design poses some complex issues that require careful thought and planning: what should be done with a resource if it is in more than one collection and one of those collections is deleted? Should all resources in the deleted collection automatically be deleted as well?
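The deletion question can be made concrete with a small sketch. The key prefixes (`resource:`, `collection:`) and the `delete_collection` policy below are hypothetical, one of several reasonable answers: a resource is removed only when no surviving collection still references it.

```python
# Hypothetical two-key-type layout: 'resource:' keys hold content,
# 'collection:' keys hold lists of member resource keys.
store = {
    "resource:logo.png": b"...image bytes...",
    "resource:terms.pdf": b"...pdf bytes...",
    "collection:marketing": ["resource:logo.png"],
    "collection:legal": ["resource:terms.pdf", "resource:logo.png"],
}


def delete_collection(store, coll_key):
    """One possible policy: delete a member resource only when no
    other collection still references it."""
    members = store.pop(coll_key, [])
    for res in members:
        still_used = any(
            res in v for k, v in store.items() if k.startswith("collection:")
        )
        if not still_used:
            del store[res]   # orphaned resource is removed too


delete_collection(store, "collection:marketing")
# logo.png survives because collection:legal still references it.
```

Even this small example shows why the design needs planning: the delete operation must scan every collection to decide a single resource’s fate, which gets expensive as collections multiply.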
To simplify this process and subsequent design decisions, key-value systems can include the concept of collection hierarchies and require that a resource be in one and only one collection. The result is that the path to a resource is essentially a distinct key for retrieval. Also known as a simple document hierarchy, this familiar concept of folders and documents resonates well with end users.
Once we have established the concept of a collection hierarchy in a key, we can use it to perform many functions on groups of key-value pairs, for example:
- Associate metadata with a collection (in other words, who created the collection, when it was created, when it was last modified, and who last modified it).
- Associate an owner and a group with the collection and grant access rights to the owner, the group, and other users in the same way UNIX file systems use permissions.
- Create an access control permission structure on a collection, allowing only users with specific privileges the ability to read or modify the items within the collection.
- Create tools to upload and/or download a group of items into a collection.
- Set up systems that compress and archive collections if they have not been accessed for a specific period of time.
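The metadata and permission ideas from the list above can be sketched together. Everything here is a hypothetical illustration (the field names, the `can_read` helper, and the UNIX-style permission classes are invented for the example), not the schema of any particular product.

```python
import time

# Hypothetical collection record: metadata plus UNIX-style permissions
# attached to the collection key, checked before any member access.
collection = {
    "members": ["resource:report.xml"],
    "metadata": {
        "created_by": "dan",
        "created_at": time.time(),
        "modified_by": "ann",
    },
    "owner": "dan",
    "group": "analysts",
    "perms": {"owner": "rw", "group": "r", "other": ""},
}


def can_read(coll, user, groups):
    """Resolve the user's permission class the way UNIX does:
    owner first, then group, then everyone else."""
    if user == coll["owner"]:
        cls = "owner"
    elif coll["group"] in groups:
        cls = "group"
    else:
        cls = "other"
    return "r" in coll["perms"][cls]


ann_ok = can_read(collection, "ann", ["analysts"])   # group member: read allowed
bob_ok = can_read(collection, "bob", ["sales"])      # outsider: no access
```

Attaching one permission record to the collection means a single check governs every item inside it, which is exactly the economy of scale that makes collection-level access control attractive.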
If you’re thinking “That sounds a lot like a file system!” you’re right. The concept of associating metadata with collections is universal, and many file systems and document management systems use concepts very similar to key-value stores as part of their core infrastructure.
Each data architecture pattern is useful for classifying many commercial and open-source products and understanding their core strengths and weaknesses. The challenge is that real-world systems rarely fit into a single category. They may start with a single pattern, but then so many features and plug-ins are added that they become difficult to place neatly in a single classification system. Many products that began as simple key-value stores now have features common to Bigtable stores. So it is best to treat these patterns as guidelines rather than rigid classification rules.