MongoDB, the NoSQL favourite – Part 1
MongoDB seems to be the favourite choice when NoSQL is considered, mainly because it is simple and stays close to object-oriented concepts and relational database usage. MongoDB aims to bridge the gap between key-value stores (which are fast and highly scalable) and traditional RDBMSs (which provide rich queries and deep functionality).
MongoDB is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language. In MongoDB, a database consists of one or more collections, the documents in those collections, and an optional set of security credentials for controlling access. MongoDB uses type-rich BSON as the data storage and network transfer format for “documents”. In addition to the basic JSON types of string, integer, boolean, double, null, array and object, BSON types include date, object id, binary data, regular expression and code.
Though the documents in a collection usually have the same structure, this is not required, as MongoDB is a schema-free database. Data within a Mongo collection tends to be contiguous on disk, which makes table scans of the collection possible and efficient. Creating multiple collections gives performance benefits and also allows “repeating data” to be kept in a separate collection and referenced by other objects.
Data in MongoDB typically needs little or no normalization. Typically there is one collection for each of the top-level objects, and the other objects can be embedded. Let us take an example with students, addresses, courses and scores. In a relational model, each of them would have been a table with foreign keys. In MongoDB, there can be just two collections: students and courses. A student document can embed the address and the “score” documents, which refer to the course. Because these documents are self-contained, the data is co-located on disk and the extra round trips to the database are eliminated. The alternative to embedding is to reference the objects. Each reference traversal would be a query to the database. So if performance is an issue, embed.
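The students and courses example can be sketched with plain Python dictionaries standing in for documents. This is only an in-memory model, not the MongoDB API: the field names, the sample values and the `find_course` helper are all assumptions made for illustration. It shows that embedded data is read without further lookups, while each reference traversal costs one simulated query.

```python
# In-memory model only: plain dicts stand in for MongoDB documents.
# All names and values here are hypothetical.

# Embedded design: the student document carries its address and scores.
student = {
    "_id": 1,
    "name": "Asha",
    "address": {"city": "Pune", "zip": "411001"},
    "scores": [
        {"course_id": "math101", "score": 91},
        {"course_id": "phy101", "score": 78},
    ],
}

# The second collection: courses, keyed by _id.
courses = {
    "math101": {"_id": "math101", "title": "Calculus"},
    "phy101": {"_id": "phy101", "title": "Mechanics"},
}

stats = {"lookups": 0}  # counts simulated round trips to the courses collection


def find_course(course_id):
    """Simulate one query against the courses collection."""
    stats["lookups"] += 1
    return courses[course_id]


# Address and scores are embedded, so reading them needs no extra lookup.
city = student["address"]["city"]
score_values = [s["score"] for s in student["scores"]]

# Following each course reference costs one simulated query per score.
titles = [find_course(s["course_id"])["title"] for s in student["scores"]]

print(city, score_values, titles, stats["lookups"])
```

In this model, reading the address and the scores touches a single document, whereas resolving the two course titles adds two lookups.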
Some rules of thumb on when to reference are:
- Many-to-many relationships are generally by reference.
- Embedded objects are harder to reference than “top level” objects in collections, as you cannot have a DBRef (reference between documents) to an embedded object.
- It is more difficult to get a system-level view for embedded objects. For example, it would be easier to query the top 100 scores across all students if scores were not embedded.
- If the amount of data to embed is huge (many megabytes), the limit on size of a single object may be reached.
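The point about system-level views can be illustrated with a small in-memory sketch. It is not MongoDB code, and the sample data and variable names are hypothetical. When scores are embedded, every student document has to be unpacked before the scores can be ranked; when scores live in their own collection, they can be ranked directly.

```python
import heapq

# Hypothetical data: scores embedded inside each student document.
students = [
    {"name": "Asha", "scores": [{"course": "math101", "score": 91},
                                {"course": "phy101", "score": 78}]},
    {"name": "Ben", "scores": [{"course": "math101", "score": 85}]},
    {"name": "Chen", "scores": [{"course": "phy101", "score": 96},
                                {"course": "math101", "score": 67}]},
]

# Embedded design: flatten every student's scores before ranking them.
flattened = [
    {"student": s["name"], "course": sc["course"], "score": sc["score"]}
    for s in students
    for sc in s["scores"]
]
top_embedded = heapq.nlargest(2, flattened, key=lambda r: r["score"])

# Referenced design: scores are already a flat, top-level collection.
scores_collection = [
    {"student": "Asha", "course": "math101", "score": 91},
    {"student": "Asha", "course": "phy101", "score": 78},
    {"student": "Ben", "course": "math101", "score": 85},
    {"student": "Chen", "course": "phy101", "score": 96},
    {"student": "Chen", "course": "math101", "score": 67},
]
top_referenced = heapq.nlargest(2, scores_collection, key=lambda r: r["score"])

print([(r["student"], r["score"]) for r in top_embedded])
```

Both designs give the same answer; the difference is the extra flattening step that the embedded layout needs for a query that cuts across all students.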
One of MongoDB’s best capabilities is its support for dynamic (ad hoc) queries. These are served efficiently using indexes, which are conceptually similar to those in RDBMSs. Once a collection is indexed on a key, random access on query expressions that match the specified key is fast. An index is always created on _id, which enforces uniqueness for its keys. MongoDB supports secondary indexes, including single-key, compound, unique, non-unique and geospatial indexes. With MongoDB we can even index on a key inside an embedded document. The MongoDB profiling facility provides useful information about where a missing index should be added.
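As a rough illustration of why an indexed lookup is fast, the sketch below keeps a sorted list of keys and uses binary search, which is a much simpler stand-in for a B-Tree. It is an in-memory model under stated assumptions, not MongoDB's implementation: the helper names `get_path`, `build_index` and `find_ids`, the dotted-key notation and the sample documents are all hypothetical.

```python
from bisect import bisect_left, bisect_right

# Hypothetical documents with an embedded address.
students = [
    {"_id": 1, "name": "Asha", "address": {"city": "Pune"}},
    {"_id": 2, "name": "Ben", "address": {"city": "Leeds"}},
    {"_id": 3, "name": "Chen", "address": {"city": "Pune"}},
    {"_id": 4, "name": "Dara", "address": {"city": "Cork"}},
]


def get_path(doc, dotted_key):
    """Follow a dotted key such as 'address.city' into embedded documents."""
    value = doc
    for part in dotted_key.split("."):
        value = value[part]
    return value


def build_index(docs, dotted_key):
    """Return parallel sorted lists of key values and the matching _ids."""
    entries = sorted((get_path(d, dotted_key), d["_id"]) for d in docs)
    keys = [key for key, _ in entries]
    ids = [doc_id for _, doc_id in entries]
    return keys, ids


def find_ids(index, value):
    """Binary-search the sorted keys instead of scanning every document."""
    keys, ids = index
    lo = bisect_left(keys, value)
    hi = bisect_right(keys, value)
    return ids[lo:hi]


city_index = build_index(students, "address.city")
pune_ids = find_ids(city_index, "Pune")
print(pune_ids)
```

The index here is built on a key inside an embedded document, mirroring the capability described above, and a value with no matches simply yields an empty list.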
MongoDB provides a “multikey” feature that can automatically index arrays of an object’s values. If there is an article tagged with category names, we can index on the tag array resulting in the database indexing each element of the array. Then, we can easily query for a particular tag value.
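The idea behind a multikey index can be sketched as an inverted index with one entry per array element. This is a conceptual in-memory model, not how MongoDB stores its indexes, and the `build_multikey_index` helper and the sample articles are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical articles, each tagged with an array of category names.
articles = [
    {"_id": 1, "title": "Intro to MongoDB", "tags": ["nosql", "mongodb"]},
    {"_id": 2, "title": "Redis basics", "tags": ["nosql", "key-value"]},
    {"_id": 3, "title": "SQL joins", "tags": ["sql"]},
]


def build_multikey_index(docs, array_key):
    """Index every element of the array, not the array as a whole."""
    index = defaultdict(list)
    for doc in docs:
        for element in doc.get(array_key, []):
            index[element].append(doc["_id"])
    return index


tag_index = build_multikey_index(articles, "tags")

# Querying for one tag value is now a single dictionary lookup.
nosql_ids = tag_index.get("nosql", [])
print(nosql_ids)
```

Each article appears once per tag in the index, which is why a query for a single tag value does not need to inspect the arrays of every document.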
Each index adds a certain amount of overhead for inserts and deletes because, in addition to writing data to the base collection, keys must also be added to or removed from the B-Tree indexes. Thus, indexes are best for collections where the number of reads is much greater than the number of writes. We can use sort() to return data in order without an index if the data set to be returned is small (< 4 MB).
MongoDB’s query optimizer tries to select the fastest query plan. We can see the index being used with the ‘explain’ function and choose a different index with the ‘hint’ function.
In addition to supporting indexes and queries similar to SQL (this being the key differentiator when compared to CouchDB), MongoDB also supports MapReduce functions. MapReduce is useful for batch manipulation of data and aggregation operations, and is recommended in situations where we would have used GROUP BY in SQL.
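The relationship between MapReduce and GROUP BY can be sketched in a few lines of Python. This is a generic in-memory illustration of the pattern, not MongoDB's MapReduce interface: the `map_reduce`, `map_fn` and `reduce_fn` names and the sample scores are hypothetical. The result corresponds to what `SELECT course, SUM(score) FROM scores GROUP BY course` would return in SQL.

```python
from collections import defaultdict

# Hypothetical score documents.
scores = [
    {"course": "math101", "score": 91},
    {"course": "phy101", "score": 78},
    {"course": "math101", "score": 85},
    {"course": "phy101", "score": 96},
    {"course": "math101", "score": 67},
]


def map_fn(doc):
    """Emit one (key, value) pair per document: the grouping key and the score."""
    yield doc["course"], doc["score"]


def reduce_fn(key, values):
    """Combine all values emitted for one key into a single result."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Group emitted values by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


totals = map_reduce(scores, map_fn, reduce_fn)
print(totals)
```

The map step plays the role of choosing the GROUP BY column, and the reduce step plays the role of the aggregate function.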
We can create a capped collection by specifying the collection size (including headers) in bytes. The collection data space is then pre-allocated. A collection can also be capped by the number of objects. Once the limit is reached, the oldest items roll out first, that is, in the order in which they were inserted. In a capped collection, the natural order is guaranteed to be the insertion order. The natural order feature is a very efficient way to store and retrieve data in insertion order (much faster than indexing on a timestamp).
Capped collections have a very high-performance automatic FIFO age-out feature (age-out is based on insertion order). They provide a high-performance means of storing logging documents in the database. With this age-out mechanism built in, the risk of using excessive disk space for logging is avoided. A capped collection provides a simple way of auto-archiving, where the data automatically “rolls out” over time as it ages.
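The age-out behaviour of a collection capped by number of objects can be modelled with a bounded queue. The sketch below is an in-memory analogy only, not the MongoDB API: the `CappedCollection` class, its methods and the log messages are hypothetical, and capping by size in bytes is not modelled.

```python
from collections import deque


class CappedCollection:
    """In-memory stand-in for a collection capped by number of documents."""

    def __init__(self, max_docs):
        # A deque with maxlen silently discards the oldest item when full.
        self._docs = deque(maxlen=max_docs)

    def insert(self, doc):
        self._docs.append(doc)

    def find(self):
        """Return documents in natural (insertion) order."""
        return list(self._docs)


log = CappedCollection(max_docs=3)
for n in range(1, 6):
    log.insert({"seq": n, "msg": f"event {n}"})

# Only the three most recently inserted documents remain, oldest first.
remaining = [doc["seq"] for doc in log.find()]
print(remaining)
```

After five inserts into a collection capped at three documents, the first two have rolled out and the rest come back in insertion order without any sorting or timestamp index.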