Using Memcached for speeding up web sites and reducing database load
Memcached is a open source, high-performance, distributed memory object caching system (not a database like MemcacheDB) intended for use in speeding up dynamic web applications by alleviating database load. It caches data and objects in RAM so as to reduce the number of times an external data source (a database or API) must be accessed.
The key components are:
- Client software, which is given a list of available Memcached servers.
- A client-based hashing algorithm, which chooses a server based on the “key” input.
- Server software, which stores the values with their keys into an internal hash table.
- Server algorithms, which determine when to throw out old data (if out of memory), or reuse memory.
Memcached is, by default, a Least Recently Used cache plus expiration timeouts. It is designed to have items expire after a certain amount of time (up to 30 days). If the server runs out of memory, it first looks for expired items to replace, and then the least used items so that the frequently requested information can be retained in memory. So Memcached should be treated as a transitory cache.
Memcached uses a lazy expiration and doesn’t use extra CPU just for expiring items. When an item is requested (a get request) it checks the expiration time to see if the item is still valid before returning it to the client. “Forgetting Data” is a feature of Memcached that simplifies how it works. There are no pauses waiting for a garbage collector and free space is reclaimed as and when needed.
The cache can be seamlessly integrated with the application by ensuring that the cache is updated at the same time as the database. While updating an object in the database, you can either set the new data into Memcached (preferred) or send a delete to remove old data from Memcached.
Memcached is a simple key/value store (btw, it is classified as a NoSQL) where items are made up of a key, an expiration time, optional flags, and raw data. The server does not understand data structures and the data uploaded must be pre-serialized. Some commands (incr/decr) may operate on the underlying data, but the implementation is simplistic.
Keys with Null or Boolean values have to avoided as in such cases memcached doesn’t work as expected. Using sprint() or similar function, when creating keys is an option. Memcached doesn’t support namespaces. To avoid key collision between different types of data, prefixing your key with a useful string is recommended (e.g.,”user_12345″, “article_76890”).
Memcached is not for large media and streaming huge blobs. Keys are up to 250 bytes long and values can be at most 1 megabyte large. Shorter keys are recommended as they save memory and use less bandwidth.
Memcached behaves like a giant hash table, looking up key = value pairs. Given a key, you can set or get some arbitrary data. All individual commands sent to Memcached are atomic. If an object gets a set and get request in parallel, they are serialized and executed one after another.
But a series of commands is not atomic. If you ‘get’ an item, modify it and then try to ‘set’ it back, you are not guaranteed to be the only process working on that value. You could end up overwriting a value set by some other process – in between your get and set.
In version 1.2.5 and higher, there are the “gets” (get a set) and “cas” (check and set) commands that can be used to deal with this. Issuing a “gets” command to fetch a key, gives you a unique identified back with that value. Sending the identifier back with the “cas” command for overwriting the data ensures that the write is successful only if identifier stored in Memcached is same as the one you supplied. If another process had modified the key in the meantime, the identifier would have changed and your write would fail. In general, updating memcached based on data in memcached is considered tricky.
Compared to local cache (where each server has its own cache), in Memcached, all of the servers are looking into the same virtual pool of memory. This means that a given item is always stored and always retrieved from the same location in the entire web cluster.
Memcached uses a two-stage hash approach:
- When doing a lookup, the client hashes the key against the whole list of servers and locates the server where the data is located. Each client knows all the servers and uses the same hashing process – ensuring that for the given key the same server is contacted. For invalidating a cache entry, clients can direct in on the exact location of data to be invalidated (instead of broadcasting data to all available hosts).
- The server hashes the key again to determine where to store or read the corresponding value. Memcache’s hashing algorithm does not take into account memory sizes across servers. A workaround for load balancing is to run multiple memcached instances on the server with more memory (maintaining the same size cache for all instances).
Memcached servers are generally unaware of each other. There is no crosstalk, no synchronization, no broadcasting. The lack of interconnections means adding more servers will usually add more capacity as needed.
Memcached is called a non blocking server as it doesn’t support anything that could cause the server to pause and not respond to requests momentarily. All of Memcached non-debug commands – add, set, get, flush – take near constant time to execute every time no matter how much data is in the cache (also referred to as O(1) executions).
Iteration on a set of items is not supported as any command which needs to iterate cache items will get progressively slower as there is more data in the cache and other commands other commands get blocked (waiting for the slow iterative command to finish). Memcached does not support any type of wildcard deleting.
Similarly loading the cache from a dump is not supported. To avoid empty caches during peak hours, “warming” your cache is advised. Warming methods include a script to walk the website and cache common pages, a command lien tool that run through the list of users online and filling caches based on the same.
Memcached doesn’t have any data redundancy as it is meant to be only a caching layer. If the node loses all its data, the application should be able to fetch it from the original data source. Applications should be designed to be able to handle losing Memcached instances.
Memcached doesn’t do anything in case of Memcached node failure (as there is no central authority) and handling failover is left up to the user. Removing the dead nodes from the list of servers is not a good option, as the clients hashing against the list would get different servers (it is like restarting all the nodes at the same time). Other options are:
- Start up a hotspare with the same IP address as the dead node (this also prevents hashing chaos)
- Use Consistent hashing algorithms to add and remove servers to the mix. Since the client does one layer of hashing, it becomes trivial to add dozens of nodes to the cluster.
Memcached doesn’t have an authentication mechanism which is what makes the clients and server lightweight (new connections are fast and server configuration is nonexistent). To restrict access, we need to use a firewall or have Memcached listen via Unix domain sockets. If running in an insecure environment, Memcached is not recommended as anyone can just telnet to any Memcached server. For applications deploying Memcached in untrusted networks, to exercise control over the clients, they can optionally comply with SASL.
Using Memcached for developing new application(s) is quite simple and straight-forward. It is clearly stated that “Memcached is a developer tool, not a “code accelerator, nor is it database middleware”. Adding memcached support to an existing application can be a lot of work. Setting up Memcached for a third party application (purchased or downloaded) is not possible, unless the product owner has explicitly provided the capability.
While Memcached gives massive volume of cached data in a high speed farm, local cache would make sense for very small amount of data that gets accessed on nearly even page load. Using a cache hierarchy involving localized cache and Memcached is possible. Local caches can help lower page rendering time as well as improve reliability in case of node failures.
Memcached is currently used by a large number of websites including LiveJournal, Wikipedia, Flickr, Bebo, Twitter, Typepad, Yellowbot, Youtube, Digg, WordPress, Craigslist, and Mixi.