Exploring NoSQL: MongoDB

MongoDB is one of the most popular open source NoSQL databases. Supported by 10gen and boasting a long list of deployments including Disney, Craigslist, and SAP, MongoDB has a remarkably simple application programming interface (API) and all the tools necessary for massive scalability.

MongoDB is a document-oriented NoSQL data store, but the word document might mislead. Really, MongoDB stores objects much like an object database. More technically, MongoDB stores documents in BSON, which stands for Binary JSON. JSON in turn stands for JavaScript Object Notation, a text-based format for storing structured data, much like XML but less verbose, with more colons and curly braces than angle brackets.
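To make the JSON format concrete, here is a minimal sketch using Python's standard json module. The customer record and its field names are invented for illustration, and the example shows only the text form of JSON, not the binary BSON encoding that MongoDB uses on disk and on the wire.

```python
import json

# A hypothetical customer record; the field names are made up for this example.
customer = {
    "name": "Ada Lovelace",
    "zip": "10001",
    "phones": ["212-555-0100", "212-555-0101"],
    "address": {"city": "New York", "state": "NY"},
}

# Serialize to JSON text: colons and curly braces rather than angle brackets.
text = json.dumps(customer, indent=2)
print(text)

# Parsing the text gives back an equal Python structure.
restored = json.loads(text)
```

Note that the nested address and the list of phone numbers travel inside the one object, which is the property the rest of this post leans on.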

MongoDB stores each object as a document, which might be thought of as akin to a record in the relational world. While there is no schema in MongoDB (that is, no defined set of columns), developers would normally store like objects (objects created from the same class) together in a collection, which might be thought of as a table. A database would typically consist of several such collections. And while there are no enforced relationships between collections, a JSON object can itself contain a hierarchy of other JSON objects or fields that link to objects in other collections, so it is possible to model a variety of relationships.
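The sketch below models these ideas with plain Python lists and dictionaries. It is not the MongoDB API, only a stand-in to show the shape of the data: two collections, documents with differing fields, an embedded sub-object, and a link field (the hypothetical customer_id) that the application must follow by hand. The _id field mirrors the name MongoDB uses for a document's identifier.

```python
# Plain-Python stand-ins for two collections; this is not the MongoDB API.
customers = [
    # This document embeds an address object.
    {"_id": 1, "name": "Ada", "address": {"city": "New York", "zip": "10001"}},
    # This one has no address and an extra field: no schema forbids that.
    {"_id": 2, "name": "Grace", "nickname": "Amazing Grace"},
]

orders = [
    {"_id": 100, "customer_id": 1, "items": [{"sku": "A-1", "qty": 2}]},
    {"_id": 101, "customer_id": 1, "items": [{"sku": "B-7", "qty": 1}]},
]


def orders_for(customer_id):
    """Follow the link field by hand; nothing in the data store enforces it."""
    return [order for order in orders if order["customer_id"] == customer_id]


ada_orders = orders_for(1)
```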

Developers access MongoDB through a driver, which maps a MongoDB document to a familiar construct in the developer’s language of choice. JSON objects, as the name implies, map easily to JavaScript objects. In Python, a document maps to a dictionary. In C#, a document maps to a specially defined class called a BsonDocument. Regardless of the language, the API is quite straightforward.

Object databases remove the grunt work of mapping classes to relational data models. But object databases never caught on, perhaps because of the difficulty of ad hoc queries. MongoDB does provide a variety of query mechanisms. It allows for queries by object properties, queries based on regular expressions, and complex queries using JavaScript methods. Whether any of these approaches satisfies requirements for ad hoc queries depends on the specific application scenario.
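As a rough illustration of the first two query styles, the following sketch matches documents by property value and by regular expression. The find function here is a hypothetical helper written in plain Python over an in-memory list. It imitates the idea of passing a criteria object to a query, but it is not the MongoDB driver and does not reproduce its query operators.

```python
import re

people = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "Alan Turing", "city": "London"},
    {"name": "Grace Hopper", "city": "New York"},
]


def find(collection, criteria):
    """Return the documents whose fields satisfy every criterion.

    A criterion is either a literal value (tested for equality) or a
    compiled regular expression (searched against a string field).
    """
    def matches(doc):
        for field, wanted in criteria.items():
            if field not in doc:
                return False
            value = doc[field]
            if isinstance(wanted, re.Pattern):
                if not isinstance(value, str) or not wanted.search(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [doc for doc in collection if matches(doc)]


# Query by object property.
londoners = find(people, {"city": "London"})

# Query by regular expression: names beginning with "A".
a_names = find(people, {"name": re.compile(r"^A")})
```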

MongoDB achieves scalability through sharding, which divides objects in a collection between different servers based on a key. If the developer defines zip code as the shard key, for example, a customer object with a New York City zip code might be stored on a different server than a customer object with a San Francisco zip code. MongoDB handles the work of distributing the data.
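Here is a minimal sketch of the routing idea, assuming range-based partitioning on the zip code. The split points and shard names are invented for the example, and real MongoDB chooses and rebalances ranges on its own, so treat this as a picture of the concept rather than of the actual mechanism.

```python
import bisect

# Hypothetical split points: zips below "30000" go to the first shard,
# "30000" up to (but not including) "70000" to the second, the rest to the third.
SPLIT_POINTS = ["30000", "70000"]
SHARDS = ["shard-east", "shard-central", "shard-west"]


def shard_for(zip_code):
    """Pick the shard whose key range contains the given zip code."""
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, zip_code)]


customers = [
    {"name": "Ada", "zip": "10001"},    # New York City
    {"name": "Grace", "zip": "94103"},  # San Francisco
]

# Record which shard each customer would land on.
placement = {name: [] for name in SHARDS}
for customer in customers:
    placement[shard_for(customer["zip"])].append(customer["name"])
```

The application code never calls anything like shard_for itself. The point of the sketch is only that a key value deterministically selects a server.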

Sharding is distinct from replication. Each MongoDB shard can be configured as a replica set, which provides asynchronous master-slave replication with automatic failover and recovery of member nodes. In production, a replica set generally consists of at least three nodes running on separate machines, one of which may serve as an arbiter. The arbiter holds no data; it breaks ties when the cluster needs to elect a new primary node. Drivers automatically detect when a replica set’s primary node changes and begin to send writes to the new primary. The replication process uses logs in much the same way as relational databases. And while non-primary replicas could be used for reads to speed performance, the primary purpose of replication is reliability. (Paragraph updated 3 Feb 2012.)
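The tie-breaking role of the arbiter can be sketched with a toy model of a majority vote. The function and member names below are hypothetical, and the model ignores priorities, data freshness, and everything else a real election considers. It captures only two rules: a new primary needs votes from more than half of the set's members, and an arbiter can vote but cannot itself become primary.

```python
def can_elect_primary(members, reachable):
    """Decide whether the reachable members can elect a primary.

    members:   dict mapping member name to {"data": bool}; an arbiter
               has "data" set to False.
    reachable: set of member names that can still see one another.
    """
    voters = [name for name in members if name in reachable]
    has_candidate = any(members[name]["data"] for name in voters)
    has_majority = len(voters) > len(members) // 2
    return has_candidate and has_majority


# Two data-bearing nodes plus an arbiter.
with_arbiter = {
    "node-a": {"data": True},
    "node-b": {"data": True},
    "arbiter": {"data": False},
}

# Two data-bearing nodes and nothing else.
without_arbiter = {
    "node-a": {"data": True},
    "node-b": {"data": True},
}

# node-a fails: with the arbiter's vote, node-b still has 2 of 3 votes.
survives = can_elect_primary(with_arbiter, {"node-b", "arbiter"})

# Without an arbiter, node-b alone holds 1 of 2 votes, which is not a majority.
stuck = can_elect_primary(without_arbiter, {"node-b"})
```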

Reliability brings to mind transactions. Relational databases support transactions across multiple records and tables; MongoDB restricts transactions to single documents. While this might appear overly restrictive, it could be made to work in many scenarios. Recall that a document can include a hierarchy of objects. If an application’s data is modeled such that all of the data requiring all-or-nothing modification resides within one document, the MongoDB approach would suffice.

However, this restriction on transactions calls attention to the design objectives of MongoDB: speed, simplicity, and scalability. Expanding transactions to encompass multiple objects stored across shards would significantly impact performance and could lead to complex deadlock problems. Indeed, by default, a save function call to MongoDB returns immediately to the application without waiting for confirmation that the save was successfully persisted. If a networking or disk failure prevents the write, the application continues without awareness of the error. However, the MongoDB API does provide options for safe writes that wait for a success response. There are even options to specify how many replication slaves must get updated before the save is considered a success. So while speed is the default, reliability is a possibility.
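The trade-off between the default and a safe write can be pictured with a toy model. The ToyReplicaSet class, its save method, and the WriteNotConfirmed exception are all hypothetical names for this sketch, not the MongoDB API. The w parameter is modeled on the driver option of the same name, counting the primary plus however many replicas must confirm, and the sketch assumes w=0 means no confirmation is awaited at all.

```python
class WriteNotConfirmed(Exception):
    """Raised when fewer nodes than requested confirmed the write."""


class ToyReplicaSet:
    """A toy stand-in for one primary plus replicas; not the MongoDB API."""

    def __init__(self, replicas_up, primary_ok=True):
        self.replicas_up = replicas_up  # replicas able to confirm a write
        self.primary_ok = primary_ok    # whether the primary can persist it

    def save(self, doc, w=0):
        """Save a document, waiting for w nodes to confirm.

        w=0 returns at once and reports nothing, w=1 waits for the
        primary, and w=N waits for the primary plus N-1 replicas.
        """
        if w == 0:
            return None  # fire and forget: the caller learns nothing
        confirmations = 1 + self.replicas_up if self.primary_ok else 0
        if confirmations < w:
            raise WriteNotConfirmed(f"wanted {w}, got {confirmations}")
        return confirmations


broken = ToyReplicaSet(replicas_up=0, primary_ok=False)

# Default behavior: the failed write goes unnoticed.
unnoticed = broken.save({"name": "Ada"})

# Safe write: the same failure surfaces as an error.
try:
    broken.save({"name": "Ada"}, w=1)
    noticed = False
except WriteNotConfirmed:
    noticed = True
```

The same call against a healthy set with one replica would succeed for w=1 and w=2 and fail for w=3, which is the sense in which the caller chooses how much reliability to pay for.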

While I mentioned earlier that MongoDB provides drivers for a variety of languages, it holds particular appeal to JavaScript devotees. JSON documents were designed to store JavaScript objects. JavaScript is the MongoDB language for complex queries. And the command line tool for managing MongoDB is built on top of a JavaScript shell. So if you have mastered JavaScript for the coding of dynamic web pages, MongoDB provides an opportunity to expand its use.

I recommend visiting MongoDB.org and trying out the online shell. In a few minutes, you’ll get a sense of the API. Then take a look at the tutorial. And to experiment further, download and install MongoDB for yourself. (I managed to install it on Windows 7 in a few minutes, but somehow got stuck installing the package on Ubuntu.)

I'm the Director of Threat Solutions at Shape Security, a top 50 startup defending the world's leading websites and mobile apps against malicious automation. Request our 2017 Credential Spill Report at ShapeSecurity.com to get the big picture of the threats we all face. See my LinkedIn profile at http://www.linkedin.com/in/jamesdowney and follow me on Twitter at http://twitter.com/james_downey.

Posted in MongoDB, NoSQL
4 comments on “Exploring NoSQL: MongoDB”
  1. Brian Adler says:

    Thanks for your thoughts on your experiences with MongoDB. As you mention, MongoDB is of particular appeal to folks familiar with JavaScript, but we actually see it used almost exclusively in PHP implementations (granted, this is most likely due to PHP’s dominant market presence). The question I did want to pose was in regard to your comment that “Each shard in a MongoDB cluster could be configured to replicate to one or more slave instances…”. I assume you are using “slave” as a general term in that what we are seeing (almost exclusively) is the use of three-node replica sets (per shard), as the MongoDB community has gone away from the master/slave configuration in favor of the replica set implementation. While some customers run two nodes and an arbiter, the recommendation we make is to use all three nodes as data nodes – you have three servers running, and storage is typically cheap, so it makes sense to populate data on all the “voters” so any of them can take the role as the primary node.

    Looking forward to your thoughts on the other members of the NoSQL club…

  2. James Downey says:

    Thanks again for the thoughtful comment. Yes, I went too far in simplifying the way replication works. I’ll try to revise within a few days.

  3. James Downey says:

    By the way, I blog as a way to learn. I research and try out new technologies and use the blog to put my thoughts together. And I hope that somebody in the community will point out where I’m wrong, precisely as you have done. Thanks.
