Eventual Consistency: How Eventual? How Consistent?

In my last post on Riak, I discussed how an application developer could use the values of N (number of replication nodes), R (read quorum), and W (write quorum) to fine tune the availability, latency, and consistency trade-offs of a distributed database.

The idea of fine tuning consistency with the values of N, W, and R was first made popular in 2007 when Amazon engineers published an article explaining the design principles of Dynamo, a highly available and scalable distributed database used internally at Amazon. The principles popularized by the article were incorporated into three popular open-source NoSQL databases: Riak, Cassandra, and Project Voldemort.

Recall that W+R>N assures consistency. Such a system of overlapping nodes is referred to as a strict quorum, in that every read is guaranteed to return the latest write. [See subsequent post that clarifies this point.] Many developers, however, choose to configure W+R≤N for the sake of greater availability or lower latency. Such a system, known as a weak or partial quorum, does not guarantee consistency. But these systems do use various mechanisms to guarantee eventual consistency, meaning that in the absence of additional writes to a key, the key's value will eventually become consistent across all N nodes. (For a good summary of eventual consistency, see the post on the topic by Werner Vogels, Amazon's CTO.)

But eventual consistency is an odd guarantee. How long might eventual take? What is the probability of an inconsistent read? Perhaps because the answers to these questions depend on many factors, NoSQL providers do not provide specifics.

Now a group of computer science graduate students at Berkeley is in pursuit of answers. Using data samples from Internet-scale companies and statistical methods (most notably Monte Carlo analysis), the team has put together mathematical models and a simulation tool to determine both average and upper-bound answers to these key questions regarding eventual consistency. They refer to the model as Probabilistically Bounded Staleness (PBS). The model factors in the values of N, W, and R as well as sample latencies for read and write operations. As of yet, the model does not account for nodes entering or leaving the cluster.
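To get a feel for what such a model estimates, here is a toy Monte Carlo sketch in Python. It is not the PBS model: the exponential latency distribution, its parameters, and the read delay are all invented, and it ignores node membership changes entirely. It simply estimates how often a read issued some milliseconds after a write misses the newest value for given N, W, and R.

```python
import random

def stale_read_probability(n, w, r, t_ms, trials=100_000):
    """Estimate P(stale read) t_ms after a write, under toy latency assumptions."""
    stale = 0
    for _ in range(trials):
        # Hypothetical per-replica write-propagation delays (ms), exponentially distributed.
        write_arrival = sorted(random.expovariate(1 / 10.0) for _ in range(n))
        # The write is acknowledged once W replicas have it.
        ack_time = write_arrival[w - 1]
        # A read starts t_ms after the acknowledgment and contacts R random replicas.
        read_time = ack_time + t_ms
        replicas = random.sample(range(n), r)
        # The read is stale if none of the R contacted replicas has the value yet.
        if all(write_arrival[i] > read_time for i in replicas):
            stale += 1
    return stale / trials

if __name__ == "__main__":
    for t in (0, 5, 20):
        print(t, stale_read_probability(n=3, w=1, r=1, t_ms=t))
```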

PBS is a brilliant application of statistics to a computer science problem. I enjoyed hearing Peter Bailis, a member of the research team, describe the research at a Basho-sponsored meet-up this Tuesday. To learn the details of PBS, visit the team’s web site. If your business depends on the tradeoffs of eventual consistency, this research is of tremendous importance.

Posted in Uncategorized

Exploring NoSQL: Fine Tuning CAP with Riak

In previous posts, I explored MongoDB and Redis. MongoDB, a document-oriented database, allows developers to work with JSON documents in interesting ways. Redis offers developers a set of fundamental data structures upon which to build applications. Riak, in contrast, offers only the simplest of interfaces: put a key-value pair and get a value by its key. Values are simple binary strings, of which Riak understands nothing. It cannot operate on these values as JSON objects or sets or hashes. It cannot even increment a scalar numeric value. But Riak does provide developers with the power to fine tune the trade-offs between consistency, availability, and partition tolerance.

To understand the why of Riak, let us review the CAP Theorem, the conjecture that distributed databases must sacrifice one of three desirable characteristics: consistency, availability, or partition tolerance. Partitioning, in this context, means a network failure that prevents some of the nodes in a database from communicating with others even though each node may remain accessible to some clients.  At a large enough scale of distribution, with hundreds or thousands of nodes scattered around the globe, the possibility of partitioning becomes real enough that it must be anticipated. And if partitioning is anticipated, it follows that it may not always be possible to distribute a write across all nodes such that the value remains consistent. And if a write cannot be distributed across all nodes, the database must either reject the write, making the system unavailable, or commit the write to only a subset of nodes, making the system inconsistent.

So scalability demands distribution, and distribution demands trade-offs, trade-offs that developers must understand, control, and accommodate. Riak enables developers to fine tune these trade-offs through three numeric dials: N, W, and R. Each value in Riak will be replicated to N nodes. Each write will be considered a success once the data has been replicated to W nodes. And each read will return successfully after results are returned from R nodes. Adjusting these three values creates data stores of fundamentally different natures.

The difference by which N exceeds W defines how many nodes may become unavailable before a write would fail. The greater this difference, the greater the availability, but the greater the likelihood of inconsistency. Likewise, the difference by which N exceeds R defines how many nodes may become unavailable before a read would fail. The lower R, the greater the availability, the greater the load capacity, the lower the latency, but the greater the chance of inconsistency. If W plus R is greater than N, then the database achieves a form of consistency; because of the overlap, the read will include at least one node holding the latest value. [See subsequent post that clarifies this point.]
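To see the overlap argument concretely, here is a minimal sketch (with N fixed at 3 purely for illustration) that enumerates every possible choice of W write nodes and R read nodes and checks whether a read can miss all of the nodes that acknowledged the write.

```python
from itertools import combinations

N = 3
nodes = range(N)

def read_can_miss_write(w, r):
    """True if some choice of W write nodes and R read nodes shares no node."""
    return any(
        not (set(writers) & set(readers))
        for writers in combinations(nodes, w)
        for readers in combinations(nodes, r)
    )

print(read_can_miss_write(2, 2))  # False: W + R = 4 > 3, every read overlaps the write
print(read_can_miss_write(1, 1))  # True:  W + R = 2 <= 3, a read may hit only stale nodes
```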

If a network partitions such that more than one partition contains at least W servers, it is possible for more than one client to update the same value, creating separate but valid versions. Riak refers to these forked versions as siblings. And while Riak can resolve the conflict in favor of the latest write, it could also be configured to return multiple siblings, leaving the work of resolving the conflict to the client application, which makes sense because the client understands the business logic.

And how does Riak distribute values to N nodes? Riak hashes each key using consistent hashing, producing an integer that falls within one of many equally sized partitions of a 160-bit integer space, an addressing system Riak refers to as a ring. Each Riak vnode (one physical node hosts multiple vnodes) claims a partition, making it responsible for storing values with key hashes falling within the address range of that partition. A value is replicated to the vnodes claiming the N consecutive partitions that follow the key's hash on the ring, which forms the basis of replication. Each node tracks the vnodes available in the cluster and their respective partitions, making it possible for any node to direct a write across the necessary number of vnodes.
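Below is a simplified sketch of that scheme, not Riak's actual code: SHA-1 places a bucket/key pair on the 160-bit ring, the ring is divided into a fixed number of equal partitions (64 here, matching Riak's default ring size), and the key's preference list is taken as the N consecutive partitions starting at the one its hash falls into. How Riak combines bucket and key is an implementation detail; here they are simply concatenated.

```python
import hashlib

RING_SIZE = 2 ** 160       # SHA-1 produces a 160-bit integer
NUM_PARTITIONS = 64        # equally sized partitions on the ring
N = 3                      # replication factor

def preference_list(bucket, key):
    """Return the N consecutive partitions responsible for this bucket/key."""
    digest = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    position = int.from_bytes(digest, "big")            # point on the ring
    first = position // (RING_SIZE // NUM_PARTITIONS)   # partition containing it
    return [(first + i) % NUM_PARTITIONS for i in range(N)]

print(preference_list("users", "alice"))   # e.g. [41, 42, 43]
```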

Riak supports three application programming interfaces: a native Erlang interface, a Protocol Buffers interface, and a RESTful HTTP interface. The REST interface provides a really simple means to put and get values. And Basho, the company behind Riak, provides drivers for several popular languages: Erlang, JavaScript, Java, PHP, Python, and Ruby.
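As a sketch of how simple the REST interface is, here is a put and a get using Python's requests library; the node address, bucket, key, and quorum values are all arbitrary, and the /riak/<bucket>/<key> URL layout reflects the Riak releases available at the time of writing, served on the default port 8098.

```python
import requests

base = "http://127.0.0.1:8098/riak/carts/cart-1001"

# Write the value, asking Riak to wait for 2 replicas to acknowledge.
requests.put(base, params={"w": 2},
             data=b'{"items": ["book", "pen"]}',
             headers={"Content-Type": "application/json"})

# Read it back, satisfied with a response from a single replica.
resp = requests.get(base, params={"r": 1})
print(resp.status_code, resp.text)
```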

To get started setting up a development server and exploring the API, I suggest following along with the Riak fast track tutorial.

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Memcached
Exploring NoSQL: Redis
Exploring NoSQL: Couchbase

Posted in NoSQL, Riak

MongoDB Aggregation Framework: The SQL of NoSQL

No, it's not really SQL. Its syntax resembles JSON. It returns a collection of JSON documents. But the Aggregation Framework addresses for MongoDB what many have found lacking in non-relational databases: the capability for ad hoc queries that SQL handles so well.

Actually, MongoDB already supports a variety of query mechanisms: queries on object properties, regular expression queries, and complex queries using JavaScript methods. And for those aggregation queries for which a SQL developer would use a group by clause, a developer using MongoDB could use MongoDB’s built-in facilities for MapReduce, a programming pattern for grouping data into buckets and iterating through each bucket.

But JavaScript-based queries require a developer to write a function. And MapReduce requires two functions, one for the map (grouping) and one for the reduce (analysis of the group). While doable, it is quite a bit of work compared to the ease of SQL. And it is this problem that the Aggregation Framework addresses.

The Aggregation Framework provides a declarative syntax for creating query expressions. The declarative statements work together like piped commands in a UNIX shell. The statements very much resemble their SQL counterparts. There is a match statement to identify records for inclusion, a sort statement for sorting, and a group statement for grouping. Grouping supports all of the aggregation functions one would expect: average, first, minimum, maximum. And there are string and date extraction statements similar to SQL's.

I found the unwind statement particularly interesting. Unwind pulls out elements from an array. MongoDB documents are JSON objects, which may include a hierarchy of child objects, including arrays of objects. The unwind statement returns a copy of the parent document for each element within the child array, but instead of the entire array, each document displays only the value of one array element. The result looks much like you would expect from a SQL join.
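As a sketch of how these stages chain together, here is a hypothetical pipeline written in Python with the pymongo driver, assuming a driver and server new enough to expose the framework through aggregate(); the orders collection and its fields are invented.

```python
from pymongo import MongoClient

# Hypothetical "orders" documents: {"status": ..., "items": [{"sku": ..., "qty": ...}, ...]}
orders = MongoClient().shop.orders

pipeline = [
    {"$match": {"status": "shipped"}},           # like a WHERE clause
    {"$unwind": "$items"},                       # one document per array element
    {"$group": {"_id": "$items.sku",             # like GROUP BY sku
                "total_qty": {"$sum": "$items.qty"}}},
    {"$sort": {"total_qty": -1}},                # like ORDER BY ... DESC
]

for doc in orders.aggregate(pipeline):
    print(doc)
```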

The Aggregation Framework is not yet included in MongoDB, though it is targeted for inclusion in version 2.2 due out in March. Last night at the SF Bay MongoDB Meet-up, Chris Westin of 10gen gave a preview and demo to the crowd. For more information, see his slide deck at http://www.slideshare.net/cwestin63/mongodbs-new-aggregation-framework.

For any organization considering NoSQL databases, the Aggregation Framework will certainly ease the SQL to NoSQL transition.

Posted in MongoDB, NoSQL

Exploring NoSQL: Redis

Redis differs so dramatically from traditional databases that you’d be forgiven for not recognizing the family resemblance. Indeed, it’s possible to run Redis as a key-value memory cache, mimicking the functionality of Memcached, which itself is not a database. Like Memcached, Redis neither indexes nor searches upon values. And like Memcached, Redis may store anything as a value—a string, an integer, a serialized object, a video.

So what makes Redis a database? It supports persistence. Actually, Redis supports two modes of persistence. One mode is snapshotting: at regular intervals, Redis takes a snapshot of the data in its memory and stores it in an RDB file, a compact, point-in-time representation of the data. The other mode, AOF (append-only file), persists changes to a file either per command or at a regular interval. According to the Redis documentation, persisting per command is slow but reliable. Redis recommends persisting once per second to provide good performance and reasonable reliability (at most, one second's worth of data can be lost). It is possible to turn on one, both, or neither of these persistence options. Using both together would most closely match the persistence of a relational database.

Moreover, Redis does support transactions. Redis executes each command within its own transaction. The increment command, which both retrieves and adds to a value, is guaranteed to operate atomically. No two clients may retrieve and increment a value in such a way that one increment overwrites the other. And Redis supports multi-command transactions. A programmer need only issue a multi command, then the set of commands required in the transaction, and then an exec command, which triggers the running of the commands within a single transaction.
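A brief sketch using the redis-py client (key names invented): the first INCR runs atomically on its own, while pipeline(transaction=True) queues commands behind MULTI and sends EXEC when execute() is called.

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# INCR is atomic on its own: no other client's increment can be lost.
r.incr("page:views")

# MULTI ... EXEC: both commands run as a single transaction.
pipe = r.pipeline(transaction=True)
pipe.incr("page:views")
pipe.rpush("recent:pages", "/home")
pipe.execute()
```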

Redis also supports replication from a master to one or more slaves. However, unlike MongoDB, Redis does not support the notion of a replica set. If a master fails, the system will not fail over without intervention. The Redis team may include automatic failover in Redis Cluster, scheduled for release by the end of 2012. (See http://redis.io/download.)

Unlike MongoDB and Memcached, Redis does not support scalability in an automated fashion. In general, scaling a key-value data store is accomplished through distributing objects across multiple servers based on keys. Memcached automates this functionality through hashing algorithms implemented in client drivers. Redis provides no such automatic key distribution. Application programmers need to take on this work themselves. Again, Redis Cluster may include this capability.

While Redis, like Memcached, neither indexes nor queries on values, it is not quite accurate to say Redis understands nothing about the values stored in its memory. It cannot parse JSON documents in the manner of MongoDB, but Redis provides several data types of its own, all highly useful and familiar to any student of computer science. Indeed, the documentation describes Redis as a “data structures server,” obviously not a phrase dreamed up by a marketing executive.

Each data type includes a set of commands traditionally associated with the data structure. Here is a complete list of the types and a partial list of the commands supported by each; a short sketch using a few of these commands from Python follows the list. Note that each command takes a key as its first argument.

  • The String data type stores binary data.
    • SET: Associates a value with a key
    • GET: Retrieves a value based on a key
    • DEL: Removes a value based on a key
    • INCR: Increments a numeric value
    • DECR: Decrements a numeric value
  • The List data type stores a sequence of ordered elements.
    • RPUSH: Adds a value to the end of the list
    • LPUSH: Adds a value at the beginning of the list
    • LRANGE: Retrieves values within the specified range
    • LLEN: Returns the length of the list
    • RPOP: Returns and removes the item at the end of the list
    • LPOP: Returns and removes the item at the beginning of the list
  • The Set data type stores a set of unordered, non-duplicate elements.
    • SADD: Adds an element to the set
    • SREM: Removes an element from the set
    • SUNION: Combines two sets
    • SISMEMBER: Checks whether an item is a member of the set
    • SMEMBERS: Returns all members of the set
  • The Sorted Set data type stores a set that is sorted by an associated score.
    • ZADD: Adds an element to the set with a score
    • ZRANGE: Returns elements from the set within the specified range
    • ZRANGEBYSCORE: Returns elements from the set based on score
  • The Hash data type maps string fields to string values.
    • HSET: Sets a field's value within the hash
    • HGET: Retrieves a field's value from the hash
    • HDEL: Deletes a field from the hash
    • HKEYS: Retrieves all field names
    • HVALS: Retrieves all values
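As a quick taste of these commands from Python, here is a sketch using the redis-py client with invented keys; note that redis-py returns raw bytes unless configured to decode responses, and the dictionary form of ZADD reflects recent versions of the client.

```python
import redis

r = redis.Redis()

r.set("greeting", "hello")                    # String
r.incr("counter")                             # String holding a numeric value
r.rpush("queue", "a", "b", "c")               # List
print(r.lrange("queue", 0, -1))               # [b'a', b'b', b'c']
r.sadd("tags", "nosql", "redis")              # Set
print(r.smembers("tags"))
r.zadd("scores", {"alice": 98, "bob": 75})    # Sorted Set
print(r.zrange("scores", 0, -1, withscores=True))
r.hset("user:1", "name", "Ada")               # Hash
print(r.hgetall("user:1"))
```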

The easiest way to get a sense of the API is to try out the online tutorial at http://try.redis-db.com. Redis, which is free and open source, has really clear documentation at http://redis.io/documentation. And getting Redis up and running on Ubuntu Linux took me just a few minutes (http://redis.io/download).

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Memcached
Exploring NoSQL: Riak
Exploring NoSQL: Couchbase

Posted in NoSQL, Redis

Exploring NoSQL: Memcached

Memcached might seem an odd place to venture in an exploration of NoSQL databases. It is not a database. It provides no persistence. It purges items from memory to free space as needed. And there is no replication. If data gets lost, an application retrieves it anew from a persistent data store.

Rather than a database, Memcached is a distributed memory cache of key-value pairs. Its application programming interface follows the pattern of a hash table. Using a key, programmers set and get values, which could be anything. Yes, memcached increments and decrements numeric values, but in general it understands nothing about the structure of values stored in its memory. It neither indexes nor searches based on values. To use a value, a programmer must retrieve it via a key and convert it to an object defined in the programming language.

To use Memcached, install and run the service on one or more servers. (I literally had it up and running on Ubuntu in minutes.) And install the driver for your programming language of choice. In code, indicate the Memcached servers in use and begin to set and get key-value pairs. The client code included in the driver distributes your data across servers based on a hashing algorithm. The Memcached servers do not need to communicate with each other. The system is remarkably simple, which explains its appeal.
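Here is a minimal sketch using the python-memcached driver, with invented server addresses; drivers for other languages follow the same pattern, and it is the client, not the servers, that decides which server holds each key.

```python
import memcache  # the python-memcached driver

# List the Memcached servers; the client hashes each key to pick one of them.
mc = memcache.Client(["10.0.0.5:11211", "10.0.0.6:11211"])

mc.set("session:42", {"user": "ada", "cart": [101, 305]}, time=300)  # expire in 5 minutes
session = mc.get("session:42")   # returns None if the item was evicted or expired
```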

Memcached users include Twitter, YouTube, Flickr, Craigslist, and WordPress.com, the host of this blog. Indeed, you are now reading words that were likely cached in Memcached.

So why consider Memcached in this exploration of NoSQL? I take this detour because as we explore NoSQL databases, it will be useful to compare their functionality to Memcached, with Memcached serving as an example of extreme simplicity. Hopefully, this point will become more apparent in my next post as I explore Redis.

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Redis
Exploring NoSQL: Riak
Exploring NoSQL: Couchbase

Posted in NoSQL

Harmonic: A MongoDB Success Story

After writing my last post on MongoDB, I attended a meet-up at the Mozilla office in San Francisco to hear the tale of a real company in the process of migrating from Microsoft SQL Server to MongoDB.

The company, Harmonic, sells enterprise software for managing workflows around video. Videos come in, go through checks, conversions, and other processing, and get distributed over multiple channels.  (Ok, vast simplification, but that’s the gist of it.) The architecture consists of a GUI and other management tools on the front-end, a set of services for processing videos on the back-end, and a workflow engine that orchestrates the process. The workflow engine stores its state in a database, and that database had been Microsoft SQL Server.

The marketing staff demanded that engineering reduce complexity for customers, increase scalability, and keep costs low. Nick Vicars-Harris, who manages the Harmonic engineering team, experimented with MongoDB. It took just a few days to tweak the data layer, written in C# and utilizing LINQ, to work with MongoDB rather than SQL Server. According to Vicars-Harris, Harmonic removed code that had been needed for object relational mapping, refactored, and produced more intuitive code. Rather than normalizing workflow state across over twenty tables, Harmonic could now store each job and its related tasks in a single document. In addition to removing complexity, the solution passed the test for scalability.

Harmonic also took advantage of MongoDB's simplified deployment model to create what Vicars-Harris calls smart nodes, nodes that communicate with each other and self-configure, a solution that met the requirements for simplified deployment and maintenance.

After listening to the presentation, I was impressed with the ease of transition from SQL to NoSQL. Clearly, the workflow use case fits in well with document-oriented databases.

Posted in MongoDB, NoSQL, Workflow

Exploring NoSQL: MongoDB

MongoDB is one of the most popular of open source NoSQL databases. Supported by 10gen and boasting a long list of deployments including Disney, Craigslist, and SAP, MongoDB has a remarkably simple application programming interface (API) and all the tools necessary for massive scalability.

MongoDB is a document-oriented NoSQL data store, but the word document might mislead. Really, MongoDB stores objects much like an object database. More technically, MongoDB stores BSON documents; BSON stands for Binary JSON. JSON stands for JavaScript Object Notation, a text-based format for storing structured data, much like XML but less verbose, with more colons and curly braces than angle brackets.

MongoDB stores each object as a document, which might be thought of as akin to a record in the relational world. While there is no schema in MongoDB, that is, no defined set of columns, developers would normally store like objects (objects created from the same class) together in a collection, which might be thought of as a table. A database would typically consist of several such collections. And while there are no relationships between collections, a JSON object can itself contain a hierarchy of other JSON objects or fields that link to objects in other collections, so it is possible to model a variety of relationships.

Developers access MongoDB through a driver, which maps a MongoDB document to a familiar construct in the developer's language of choice. JSON objects, as the name implies, map easily to JavaScript objects. In Python, a document maps to a dictionary. In C#, a document maps to a specially defined class called a BsonDocument. Regardless of the language, the API is quite straightforward.

Object databases remove the grunt work of mapping classes to relational data models. But object databases never caught on, perhaps because of the difficulty of ad hoc queries. MongoDB does provide a variety of query mechanisms. It allows for queries by object properties, queries based on regular expressions, and complex queries using JavaScript methods. Whether any of these approaches satisfies requirements for ad hoc queries depends on the specific application scenario.
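A short sketch of the document and query APIs from Python using pymongo, with an invented customers collection; the method names reflect current versions of the driver.

```python
from pymongo import MongoClient

customers = MongoClient().store.customers   # database "store", collection "customers"

# A document is just a dictionary, nested objects and all; there is no schema to declare.
customers.insert_one({
    "name": "Ada Lovelace",
    "zip": "10001",
    "orders": [{"sku": "B-17", "qty": 2}],
})

# Query by object property...
print(customers.find_one({"zip": "10001"}))

# ...or by regular expression.
for doc in customers.find({"name": {"$regex": "^Ada"}}):
    print(doc)
```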

MongoDB achieves scalability through sharding, which divides objects in a collection between different servers based on a key. If the developer defines zip code as the shard key, for example, a customer object with a New York City zip code might be stored on a different server than a customer object with a San Francisco zip code. MongoDB handles the work of distributing the data.
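A hedged sketch of defining zip code as the shard key from Python; the host, database, and collection names are invented, and the commands assume a sharded cluster fronted by a mongos query router.

```python
from pymongo import MongoClient

# Connect to a mongos query router, not directly to a shard (address is hypothetical).
client = MongoClient("mongos.example.com", 27017)

client.admin.command("enableSharding", "store")
client.admin.command("shardCollection", "store.customers", key={"zip": 1})
```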

Sharding is distinct from replication. Each MongoDB shard can be configured as a replica set, which provides asynchronous master-slave replication with automatic failover and recovery of member nodes. In production, a replica set generally consists of at least three nodes running on separate machines, one of which serves as an arbiter. The arbiter breaks ties when the cluster needs to elect a new primary node. Drivers automatically detect when a replica set's primary node changes and begin to send writes to the new primary. The replication process uses logs in much the same way as relational databases. And while non-primary replicas could be used for reads to speed performance, the primary purpose of replication is reliability. (Paragraph updated 3 Feb 2012.)
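Connecting a driver to a replica set is mostly a matter of naming the set and a few of its members; a sketch with invented host names:

```python
from pymongo import MongoClient

# The driver discovers the full member list and tracks which node is primary;
# if the primary changes, subsequent writes are routed to the new one.
client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)
db = client.store
```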

Reliability brings to mind transactions. Relational databases support transactions across multiple records and tables; MongoDB restricts transactions to single documents. While this might appear overly restrictive, it could be made to work in many scenarios. Recall that a document can include a hierarchy of objects. If an application’s data is modeled such that all of the data requiring all-or-nothing modification resides within one document, the MongoDB approach would suffice.

However, this restriction on transactions calls attention to the design objectives of MongoDB: speed, simplicity, and scalability. Expanding transactions to encompass multiple objects stored across shards would significantly impact performance and lead to complex deadlock problems. Indeed, by default, a save call to MongoDB returns immediately to the application without waiting for confirmation that the save was successfully persisted. If a networking or disk failure prevents the write, the application continues without awareness of the error. However, the MongoDB API does provide options for safe writes that wait for a success response. There are even options to specify how many replication slaves must get updated before the save is considered a success. So while speed is the default, reliability is a possibility.
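A sketch of that speed/reliability dial using write concerns in current pymongo (older drivers expressed the same choice through a safe/w option on each write); the collection and documents are invented.

```python
from pymongo import MongoClient, WriteConcern

events = MongoClient().telemetry.events

# Fire-and-forget: the call returns without waiting for any acknowledgment.
fast = events.with_options(write_concern=WriteConcern(w=0))
fast.insert_one({"type": "ping"})

# Wait until the primary and at least one secondary have the write.
safe = events.with_options(write_concern=WriteConcern(w=2))
safe.insert_one({"type": "purchase", "amount": 42})
```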

While I mentioned earlier that MongoDB provides drivers for a variety of languages, it holds particular appeal to JavaScript devotees. JSON documents were designed to store JavaScript objects. JavaScript is the MongoDB language for complex queries. And the command line tool for managing MongoDB is built on top of the JavaScript shell. So if you have mastered JavaScript for the coding of dynamic web pages, MongoDB provides an opportunity to expand its use.

I recommend visiting MongoDB.org and trying out the online shell. In a few minutes, you’ll get a sense of the API. Then take a look at the tutorial. And to experiment further, download and install MongoDB for yourself. (I managed to install it on Windows 7 in a few minutes, but somehow got stuck installing the package on Ubuntu.)

Related Posts:
Exploring NoSQL: Memcached
Exploring NoSQL: Redis
Exploring NoSQL: Riak
Exploring NoSQL: Couchbase

Posted in MongoDB, NoSQL