Using MongoDB
NoSQL has recently become a buzzword in web development. Some projects move their architecture's bottlenecks to such databases, while others migrate to them entirely. These databases differ from relational ones in their simple architecture and high scalability. Here we are interested in MongoDB, one of the most popular NoSQL databases. We'll try to highlight its strengths and weaknesses, consider some peculiarities of development and its field of application in Drupal, and work through the Map/Reduce concept with a practical example.
MongoDB sits somewhere between key-value storages (which are usually fast and scalable) and traditional relational databases (MySQL, PostgreSQL, etc.), which provide advanced queries and rich functionality.
MongoDB (derived from "humongous") is a scalable, high-performance, open-source, document-oriented database server written in C++. MongoDB features:
- Document-oriented data storage: JSON-style documents with dynamic schemas offer simplicity and power.
- Full index support: indexes on any attribute (including embedded ones), just as we are used to.
- Replication and high availability: mirroring across LANs and WANs for scale and peace of mind.
- Auto-sharding: horizontal scaling without compromising functionality.
- Querying: rich, document-based queries.
- Fast data updates: simple atomic operations.
- Map/Reduce: flexible data aggregation.
- GridFS: storage of files of any size without complicating your stack.
- Commercial support: enterprise support, training, and consulting are available.
When is it possible to use MongoDB
You can start using MongoDB in your web applications right now. The latest versions are stable enough for production use. The project is developed by a dedicated team of experts, bugs get fixed and new features are released regularly (see Project Ideas), and it is used by a considerable number of Internet projects, including SourceForge. Drivers have also been written for the popular programming languages.
As for Drupal, a module has been written that promises to implement the features above (mostly in Drupal 7), but it is still under development. There is also a driver for DBTNG (Database Layer: The Next Generation), which lets us use a MongoDB server just like any other Drupal database (excluding, of course, the execution of SQL queries). Drupal 7 has a new feature that allows fields (the CCK fields replacement) to be stored not only in the shared database but also individually in different storages; this feature is also supposed to be implemented in the module above.
In a nutshell, we can easily use this database server in the new Drupal release to optimize performance and scalability.
Peculiarities of use
Let’s take a look at some of the differences between working with MongoDB and the usual SQL. All code samples are written in JavaScript and can be tested in the interactive console we’ll talk about next.
Console tools
Similarly to the way we execute SQL queries in the mysql console, the MongoDB interactive console allows you to issue commands to the database server in JavaScript. The SpiderMonkey engine is used by default, and it can be swapped for V8. The official website provides an online version of the console that runs directly in your browser, plus a small introduction for beginners; that’s where we recommend you start.
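A quick sketch of a console session (the blog database and the comments collection are just example names):

// list the available databases
show dbs
// switch to (and lazily create) a database
use blog
// show the first five documents of a collection
db.comments.find().limit(5)
// print statistics for the current database
db.stats()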
Information Structure
The following hierarchy exemplifies the data structure in a relational system such as MySQL: Database -> Table -> Row -> Field + Value.
How it looks in MongoDB: Database -> Collection -> Document -> Key + Value.
In relational databases tables must be strictly structured, while MongoDB allows documents of arbitrary structure.
Sample JSON-style document:
{
  doc_id: 153,
  title: 'Some title',
  body: '...A lot of text...',
  author_id: 73,
  date: 'Sun Oct 31 2010 03:00:00 GMT+0300 (MSK)',
  additional: {
    location: {
      country: 'Russia',
      city: 'Moscow'
    },
    category: 'books'
  },
  tags: ["tag1", "tag2", "tag3"]
}
As you can see, we have complete freedom in embedding document keys, which relieves us of the need to normalize data as in SQL: related pieces of information don’t have to be stored in separate documents. Just like SQL databases, MongoDB has indexes, and supports them fully. We can build an index on any key of the document above, including the embedded country or the tags array, and composite indexes can be built on multiple document keys. The rules for optimizing indexes in SQL mostly apply to MongoDB as well and are described in the documentation.
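For instance, sticking to the sample document above, such indexes could be created like this (a sketch; ensureIndex was the index-creation command at the time of writing):

// an index on an embedded key
db.documents.ensureIndex({"additional.location.city": 1});
// an index on an array key: every tag gets its own index entry
db.documents.ensureIndex({tags: 1});
// a composite index on two keys
db.documents.ensureIndex({author_id: 1, date: -1});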
Selection
Suppose we need to select from a certain collection all the documents where the city value is "Moscow".
In SQL we’d have to use JOINs, because in a unified system like Drupal we can’t always write all the information into one table. For example:
SELECT docs.title, loc.city FROM documents docs
INNER JOIN doc_location d_loc ON d_loc.doc_id = docs.doc_id
INNER JOIN location loc ON loc.loc_id = d_loc.loc_id
And this is how it’s done in MongoDB, given that everything is stored within one document:
db.documents.find({"additional.location.city": "Moscow"});
Similarly to JOIN, we can also store references to objects in other collections; drivers can follow such references for us, so we don’t have to write the extra queries for related documents by hand.
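A minimal sketch of such a reference, assuming a separate, hypothetical authors collection keyed by the author’s identifier:

// fetch a document, then follow its reference manually
var doc = db.documents.findOne({doc_id: 153});
var author = db.authors.findOne({_id: doc.author_id});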
MongoDB comfortably handles large numbers (millions) of documents. As in SQL, selection speed is improved by indexes and by limiting the number of documents returned by a single query; and just as in ordinary relational databases, indexes have a negative impact on write speed. There’s also a familiar EXPLAIN operation that serves the same purpose it does in MySQL.
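Both techniques look like this in the console:

// inspect the query plan, much like EXPLAIN in MySQL
db.documents.find({"additional.location.city": "Moscow"}).explain();
// cap the number of documents returned by a single query
db.documents.find({"additional.location.city": "Moscow"}).limit(10);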
Entries
SQL has an INSERT operation for adding and an UPDATE operation for updating entries.
Creating an entry in MongoDB implies the use of three functions: insert, save and update.
The save function is a wrapper around update that simplifies the command syntax.
Examples:
// $doc – any document
// insert a document ("users" is just an example collection)
db.users.insert($doc);
// updating or creating a document can be done in one of two ways
db.users.save($doc);
// or
db.users.update({name: "Joe"}, $doc, true); // the first argument is the condition, the second is the new document, and the third requests an insert if no matching document was found
// atomic operation: increase the counter key by 1
db.users.update({name: "Joe"}, {$inc: {counter: 1}});
MongoDB supports several types of atomic operations.
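For instance, besides $inc there are modifiers such as $set and $push (a sketch; the keys are illustrative):

// set a single key without replacing the whole document
db.users.update({name: "Joe"}, {$set: {city: "Moscow"}});
// append a value to an array key
db.users.update({name: "Joe"}, {$push: {tags: "new-tag"}});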
Writes can be synchronous or asynchronous; the asynchronous kind is the default and is faster, because the application doesn’t have to wait for the server’s response.
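To make a write effectively synchronous from the shell, you can ask the server whether the last operation succeeded (a sketch):

db.users.insert({name: "Joe", counter: 0});
// returns the error of the last operation on this connection, or null on success
db.getLastError();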
MongoDB is recommended for a large number of simultaneous queries (over a thousand per second), especially when there’s a large share of write operations. Judging by the numerous reviews, fast writes are one of this database’s main advantages.
Speaking of writes, we should mention Capped Collections.
The thing is, when writing to a regular collection, an _id key holding a unique document identifier is implicitly added to each document, and an index is built on that key. The size of such collections is dynamic.
In capped collections things are somewhat different. The _id key is still generated, but by default no index is built on it, which speeds up writes. In addition, the space such a collection may occupy is preset, which adds speed as well. There are restrictions, however: a document may only be updated if its size hasn’t changed, deleting individual documents is not supported, and when the collection runs out of space new documents overwrite the oldest ones. With capped collections the write speed is comparable to that of writing system logs. You can use this feature, for example, for statistics, caching, and logging systems.
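Creating and using a capped collection might look like this (the name and size are illustrative):

// preallocate a capped collection of 1 MB
db.createCollection("log", {capped: true, size: 1048576});
db.log.insert({event: "user login", time: new Date()});
// $natural order is insertion order, which suits log-style reads
db.log.find().sort({$natural: -1}).limit(10);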
Aggregation
Suppose our task is to select from a table of comments the total number of votes for each author.
SQL offers a quite simple solution:
SELECT author, SUM(votes) FROM comments GROUP BY author;
MongoDB offers a more complex but also a more advanced solution called Map/Reduce.
In simple terms, it is an alternative to the GROUP BY operation and aggregate functions (SUM, MAX, MIN, ...) for NoSQL (in our context). In general, this is how it works:
- First, the necessary documents are selected from the database. Since this is essentially an ordinary selection query, the general rules for optimizing such operations apply, such as adding indexes and limiting the amount of selected data.
- A map function is written in JavaScript; it scans each document found in the previous step and gathers the information needed for aggregation.
- Then a reduce function is written in JavaScript; it receives the mapped data, grouped by the key defined in the map function, and the long-awaited aggregation takes place.
- Optionally, you can define a finalize function to run after reduce and perform final processing of the data.
- We obtain the result of the aggregation and use it in our application.
Let’s try to understand how it happens through an example (taken from here).
Document Structure
Suppose we have a collection of comments of the following structure (JSON):
{
  text: "lmao! great article!",
  author: 'kbanker',
  votes: 2
}
This document represents a single comment by "kbanker" that has received two votes.
Step by step.
Map function (mapping stage)
As we noted earlier, map is a JavaScript function that scans each document and collects the necessary data as key -> value pairs. A pair is produced by the emit operation:
// Key – the author’s user name;
// Value - the number of votes for a current comment.
var map = function() {
  emit(this.author, {votes: this.votes});
};
Reduce function (aggregation stage)
Every reduce call (one per key) receives two arguments: the key and an array of the values collected during the mapping stage. In our example, the reduce call for the author "kbanker" will look something like this:
reduce('kbanker', [{votes: 2}, {votes: 1}, {votes: 4}]);
And now let’s describe the function for counting the votes:
var reduce = function(key, values) {
  var sum = 0;
  values.forEach(function(doc) {
    sum += doc.votes;
  });
  return {votes: sum};
};
Let’s run two commands in the console, the first to start the operation and the second to fetch the results:
var op = db.comments.mapReduce(map, reduce);
db[op.result].find();
As a result we get the grouped data and can use them in our application.
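Given the sample reduce call above (2 + 1 + 4 votes), each document in the resulting collection should look roughly like this:

{ _id: "kbanker", value: {votes: 7} }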
There’s a more detailed description in the original article (follow the link above), as well as in the documentation.
It is important to take one peculiarity into account: Map/Reduce, and the mapping stage in particular, is rather slow. It is recommended to keep the map function’s JavaScript highly optimized; in a nutshell, opt for fewer assignments and loops (JS-code optimization is a separate story, and you can use different JS engines in MongoDB). You can also speed the operation up by scaling: if we spread our database across a cluster, mapping runs on all machines simultaneously, each processing its own share of the data, so mapping speed is directly proportional to the number of servers in a sharded database. This is perhaps one of the main disadvantages of this database when used on a single server.
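Another practical lever is to narrow the input before mapping: mapReduce accepts extra options for that (a sketch; the query and the output collection name are illustrative, and the exact option set depends on the server version):

var op = db.comments.mapReduce(map, reduce, {
  // only documents matching the query reach the map function
  query: {votes: {$gt: 0}},
  // store the result in a named collection
  out: "votes_by_author"
});
db.votes_by_author.find();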
There is also a group function for aggregation, but we won’t dwell on it: it doesn’t work in a sharded architecture and, although it runs much faster than Map/Reduce, it doesn’t solve the problem above and still becomes slow past a certain number of entries.
Administering
Let’s have a look at some aspects of database administration. You can find the complete documentation here.
Transferring DB / Backup
To create a database dump, use the mongodump utility, which comes with the server package by default. Backing up an entire database comes down to one command, for example: mongodump -d DATABASE_NAME
As a result we get a folder with BSON-format files. Use this folder to restore the backup, for example: mongorestore BACKUP_FOLDER.
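The same utility can also dump a single collection into a folder of your choice, for example:

mongodump -d DATABASE_NAME -c COLLECTION_NAME -o BACKUP_FOLDER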
Read the related documentation section.
Interactive Console
The interactive console described above is a great tool for database administration. We can use it to test queries and connections, create indexes, view the status of current operations, and perform any other administrative task.
Administrative interface
There are several visual administration tools for MongoDB, including native clients for OS X and .NET, as well as web interfaces in PHP, Python, and Ruby. You can find them in the relevant documentation section.
Summary
Evaluating MongoDB in terms of its use in Drupal, we have to admit it certainly can’t completely replace the usual MySQL or PostgreSQL. It can, however, improve the performance of individual elements of the architecture, especially at bottlenecks. For example, it’s a good idea to use this database for statistics, caching, storing user sessions, keeping the watchdog log, queue management, and so on. In addition, modules can actively use it to store their own data.
Got anything to add?