Thoughts on CouchConf: Data Modeling for Flow

Last Friday, I had the pleasure of attending CouchConf in San Francisco to learn about the new and exciting features of Couchbase 2.0. With the new release, Couchbase becomes a full document-oriented database. Meanwhile, Couchbase keeps its focus on performance, reporting some very impressive statistics comparing Couchbase to other NoSQL databases. Of all the exciting presentations at the conference, I most enjoyed the session on data modeling by Chris Anderson, Chief Architect for Mobile at Couchbase.

Document-oriented databases such as Couchbase are schemaless, making data modeling more flexible. While in the relational world there is generally one right answer for how to divide data across tables to achieve fifth normal form, in the document-oriented world there are fewer clear-cut rules. For example, whether blog comments get stored with the post or in separate documents depends more on expected usage than on any inherent constraints of the system or rigid design rules.
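
To make the example concrete, here is a hypothetical sketch of the two shapes a blog post and its comments might take (all field names are my own invention):

    // Option 1: comments embedded in the post document
    {
      "_id": "post-42",
      "title": "Data Modeling for Flow",
      "body": "...",
      "comments": [
        { "author": "alice", "text": "Great post!" },
        { "author": "bob", "text": "Agreed." }
      ]
    }

    // Option 2: each comment as its own document, linked by post ID
    { "_id": "comment-1001", "post_id": "post-42", "author": "alice", "text": "Great post!" }

Embedding suits a blog that always renders a post together with its comments; separate documents suit heavy comment traffic, since each new comment then writes a small document rather than rewriting an ever-growing post.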

Without the schema constraints of the relational model, developers become free to think more about data flow. Chris Anderson uses phrases such as “emergent schema” and “runtime-driven schema” to describe this new emphasis. He concludes that “thinking about throughput, latency, update and read patterns is the new data modeling.”

In my experience, emphasizing data flow, or data in motion as opposed to data at rest, makes developers better at understanding user needs. In my early career, when playing the role of business analyst, I went too far in allowing the relational model to shape my thinking about business requirements. With a relational schema always in the back of my mind, I focused my questions on whether data relations were one-to-one, one-to-many, or many-to-many. While I did write use cases, normalization rules shaped my thoughts.

Then several years ago, I worked with a colleague who focused more on process. He’d start conversations with “…now walk me through your process.” And that led us to more fruitful discovery. At some point I’d still ask questions to pin down the data model, but we were in a better place by the time those questions came up.

Of course, developers have always thought about use cases and update and read patterns. But if Chris Anderson is right and document databases make it easier to focus on data in motion rather than data at rest, then I think we’ll do a better job of thinking about user needs.

Posted in Couchbase, NoSQL

Not NoSQL, NewSQL

This Monday at the Silicon Valley NewSQL meetup in Mountain View, Michael Stonebraker bashed in turn both the established relational databases (Oracle, DB2, SQL Server, PostgreSQL) and the NoSQL newcomers (MongoDB, Cassandra, Riak), proposing a third alternative: VoltDB, a NewSQL database.

Stonebraker—a leading researcher in the field of database design, former UC Berkeley professor and now MIT professor, winner of the IEEE John von Neumann Medal, and a designer of PostgreSQL—argued that the established databases have become legacy bloatware incapable of scaling to modern requirements without complete redesign. According to Stonebraker’s research, these systems, all of which follow a similar design, spend only a small percentage of CPU cycles (about 4%) on useful work. The bulk of CPU cycles goes to overhead, divided fairly evenly into four categories of roughly 24% each:

  • Managing the buffer pool of disk pages cached in memory
  • Multi-row locking for transactions
  • Latching memory objects such as b-trees to prevent corruption in a multi-threaded environment
  • Write-ahead logging to disk

The NoSQL databases, according to Stonebraker, solve these problems, but they do so by jettisoning SQL and ACID. Giving up SQL, Stonebraker argued, makes no sense. The SQL standard has proven itself as a time-saving, high-level language that successfully depends on compilation to generate low-level commands. Going backwards to row-level commands and unique APIs for each database, Stonebraker claimed, is comparable to giving up C for assembler.

Stonebraker also argued against giving up ACID, a requirement (or potential requirement) for almost all applications. If a database does not provide ACID, application developers must write this complex code themselves.

Stonebraker proposed instead his own product, VoltDB, a relational database that supports ACID and most of the SQL standard. VoltDB avoids the overhead of buffer management by keeping all data in memory. It avoids the overhead of row locking and memory object latching by using a single thread per partition. Only one thread touches memory objects, and transactions run sequentially on that one thread. And instead of write-ahead logging of data, VoltDB takes periodic snapshots of the database and logs only commands, which is faster but still capable of rebuilding the database from disk in case of failure. (See the VoltDB Technical Overview for more details.)
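
A toy sketch of the command-logging idea (my own illustration, far removed from VoltDB's actual implementation): rather than logging every changed row, log each deterministic command and replay the log on top of the last snapshot.

    // Toy command log: replaying deterministic commands on a snapshot
    // reconstructs the state without write-ahead logging of data
    var snapshot = { balance: 100 };          // periodic snapshot taken earlier
    var commandLog = [
      { op: "deposit", amount: 50 },          // commands logged since the snapshot
      { op: "withdraw", amount: 30 }
    ];

    function replay(state, log) {
      log.forEach(function (cmd) {
        if (cmd.op === "deposit") state.balance += cmd.amount;
        if (cmd.op === "withdraw") state.balance -= cmd.amount;
      });
      return state;
    }

    console.log(replay(snapshot, commandLog)); // { balance: 120 }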

Like most of the NoSQL databases, VoltDB supports scalability across commodity hardware by sharding data based on keys. According to Stonebraker, the choice of key is critical to performance, as joins and transactions that cross partitions degrade performance, a problem that cannot be solved even by eliminating the overhead of the traditional RDBMS. VoltDB makes scaling possible, but application developers must still give careful thought to how to partition data so that most operations touch only a single partition.

One could argue that this latter admission proves the NoSQL case against relational databases, namely that a database supporting ACID cannot scale. VoltDB scales only as long as transactions do not cross partitions. In a sense, VoltDB can be thought of as many small, fast databases that support ACID or one large database that supports ACID but does not scale. In other words, VoltDB does not solve the CAP dilemma.

Certainly, VoltDB will make sense for certain use cases, where there is a need for lightning speed and transactional integrity, where data can be sharded into largely autonomous partitions, and where VoltDB’s only partial implementation of the SQL standard fulfills requirements. But VoltDB will not replace the traditional RDBMS anytime soon, as it lacks much of the functionality that enterprises expect, bloatware though that might be.

Nor will VoltDB eliminate the demand for NoSQL, because many organizations will find a NoSQL database out there that fits well with their specific requirements. If all you need is a key-value store, why not choose a database that specializes in that function? If your data takes the shape of a graph, why not choose a database tailor-made for the purpose?

Moreover, Stonebraker overstates the case against NoSQL APIs. Yes, SQL is a proven high-level language for data access, but it does not always fit well with the object models used by application developers, which is why organizations have turned to object-relational mapping solutions. In many ways, objects map more intuitively to document-oriented databases than to SQL. Besides, a central tenet of NoSQL is the embrace of variety, the rejection of the one-size-fits-all mentality. With that in mind, the diversity of NoSQL APIs is a benefit; it allows developers to choose whichever APIs best fit a specific requirement.

Whether or not we accept all of Stonebraker’s claims, the VoltDB rethinking of the traditional RDBMS makes clear that these are exciting times in the world of databases.

Posted in NewSQL, NoSQL

Exploring NoSQL: Couchbase

In February 2011, Membase, Inc. and CouchOne merged to combine the strengths of their two open-source NoSQL projects, Membase and CouchDB. The joint team released Couchbase 1.8 in January 2012 as an upgrade to Membase 1.7. Version 2.0 is now available as a developer preview. Meanwhile, CouchDB lives on as an independent project under Apache.

Membase, developed by several leaders of the Memcached project, maintains protocol compatibility with Memcached, making it possible to plug in Membase as a replacement for Memcached without rewriting application code. Like Memcached, Membase is a key-value store supporting set and get operations. But Membase adds persistence and replication. While Memcached removes items from cache when it runs out of memory, Membase makes room by moving items to disk. And while Memcached stores each item on a single server, Membase replicates items to additional servers, first to memory, and then to disk as needed.

CouchDB is a document-oriented database that stores values as JSON documents. Like Riak, CouchDB is written in Erlang, a language designed for distributed, concurrent, fault-tolerant, soft real-time systems.

The merged result, Couchbase, combines the capabilities of both products, serving as either a key-value store or as a document-oriented database, depending on the use case and values being stored. Like Membase, Couchbase can replace Memcached without a code rewrite, meaning it can function as a key-value store in the manner of Memcached. Indeed, the Couchbase API supports the set and get of arbitrary binary values without forcing values to conform to JSON. And while internally all items get stored as JSON with arbitrary values kept as attachments, this need not matter to a developer. Indeed, there are quite a few use cases in which it would make sense to use Couchbase as a key-value store and ignore its document-oriented capabilities.

For values that do conform to JSON, however, Couchbase provides document-oriented capabilities. Indeed, it shares many features with MongoDB. Couchbase organizes data into buckets, a concept very much akin to a MongoDB collection (and comparable to a relational table minus the schema). Both databases support sharding, replication, automatic failover, and the ability to add capacity without downtime. Both provide rich monitoring capabilities. And both emphasize consistency over high availability, requiring that all writes go to a single master responsible for a segment of the key space so that every read consistently returns the latest value.

And both Couchbase and MongoDB enable developers to retrieve collections of documents using fields other than the primary key, but the manner in which they do so varies significantly. Each has taken a different data retrieval feature from the relational database world as its starting point. MongoDB supports secondary indexes and ad-hoc queries, much like SQL but without joins. Couchbase supports materialized views. A view is very much like a pre-written query. A materialized view stores the sorted results of the query in memory or on disk, ready for fast retrieval. In Couchbase, a developer creates these materialized views using JavaScript map-reduce functions. Couchbase supports updating views either on writes or on reads, the latter approach making sense following bulk updates. While not as easy to define as a MongoDB query, a Couchbase view, just like a view in a relational database, can look dramatically different from the underlying data and may include aggregated values.
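
As a sketch of what this looks like, here is a map function that could back a view counting comments per post (the document fields are hypothetical, and the exact map signature has varied across the 2.0 previews):

    // Map: emit one row per comment document, keyed by its post
    function (doc, meta) {
      if (doc.type == "comment") {
        emit(doc.post_id, 1);
      }
    }

Pairing this map with the built-in _count (or _sum) reduce yields a materialized comment count per post, retrievable by key without touching the underlying documents.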

Because views on large data sets can take a long time to materialize, Couchbase provides for developer views, which run against a random subset of data. When ready, a developer can publish a view to production and have it materialized against an entire data set. Last week, I had the chance to watch Matt Ingenthron, Director of Developer Solutions at Couchbase, and Dustin Sallings, Couchbase’s Chief Architect, demo this feature at a Couchbase user group in San Francisco. It is clearly a critically important feature for large data sets.

Whether the Couchbase or MongoDB model makes the most sense for any application depends on the use case. If an application requires a key-value store in some places and document-oriented features in others, Couchbase might make sense. And if an application built on Memcached now needs a more reliable cache, Couchbase may well be a good fit. But for developers looking for ad-hoc queries and indexing reminiscent of relational databases, MongoDB would feel more comfortable. No matter your preference, the evolution of these two open-source document-oriented NoSQL databases demonstrates that choice is alive and well in the open source community.

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Memcached
Exploring NoSQL: Redis
Exploring NoSQL: Riak

Posted in Couchbase, CouchDB, MongoDB, NoSQL, Uncategorized

Be Careful with Sloppy Quorums

In my recent posts on eventual consistency, I mentioned that R+W>N guarantees consistency. Thanks to commentator senderista for pointing out that this statement does not hold in the case of sloppy quorums.

The Amazon Dynamo article describes the sloppy quorum as follows:

If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions. To remedy this it does not enforce strict quorum membership and instead it uses a “sloppy quorum”; all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.

In other words, the cluster has more than N nodes. And in the case of network partitions, writes could be made to nodes outside of the set of top N preferred nodes. In this case, there would be no guarantee that writes and reads over the N nodes would overlap since the nodes that constitute N are in flux. Therefore, the formula R+W>N has no meaning.

For example, suppose N equals three in a cluster of five nodes (A, B, C, D, and E), and nodes A, B, and C are the top three preferred nodes. Absent any error conditions, writes go to nodes A, B, and C. But if B and C were unavailable for a write, a system using a sloppy quorum would write to D and E instead. In that case, even if the write quorum (W) were 3 and the read quorum (R) were 2, making R+W>N, a read immediately following the write could return data from B and C, which would be inconsistent because only A, D, and E would have the latest value.
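
A toy sketch of that failure mode (my own illustration, not code from any of these systems):

    // N=3, W=3, R=2 in a five-node cluster; B and C were down during the
    // write, so A, D, and E hold the latest value
    var hasLatest = { A: true, B: false, C: false, D: true, E: true };

    // A read quorum of two might happen to land on B and C
    var readSet = ["B", "C"];
    var sawLatest = readSet.some(function (node) { return hasLatest[node]; });
    console.log(sawLatest); // false -- R+W>N held, yet the read is stale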

According to the Amazon Dynamo article, Dynamo mitigates this inconsistency through hinted handoffs. If the system needs to write to nodes D and E instead of B and C, it informs D that its write was meant for B and informs E that its write was meant for C. Nodes D and E keep this information in a temporary store and periodically poll B and C for availability. Once B and C become available, D and E send over the writes.

When balancing the tradeoffs between consistency and availability, it is vital to understand how any particular system handles quorums, whether strict or sloppy.

In some cases, I’ve found the literature unclear on this point. In a post on eventual consistency, Werner Vogels writes “If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency.” While true in a strict quorum system, this statement is not true in the case of Dynamo and not true of systems based on Dynamo that utilize sloppy quorums.

A page in the Riak wiki states that “R + W > N ensures strong consistency in a cluster” and includes a reference to the post by Vogels on eventual consistency. However, a recent Basho posting states that Riak uses sloppy quorums by default, though it uses strict quorums whenever the values of PR and PW are used rather than R and W. Overall, I didn’t find the Riak documentation clear on this important distinction.

Thanks again to the commentator who pointed out my mistake. As I continue my series exploring NoSQL databases, I’ll be more careful to point out where sloppy quorums could affect consistency.

Posted in NoSQL, Riak

Eventual Consistency: How Eventual? How Consistent?

In my last post on Riak, I discussed how an application developer could use the values of N (number of replication nodes), R (read quorum), and W (write quorum) to fine tune the availability, latency, and consistency trade-offs of a distributed database.

The idea of fine tuning consistency with the values of N, W, and R was first made popular in 2007 when Amazon engineers published an article explaining the design principles of Dynamo, a highly available and scalable distributed database used internally at Amazon. The principles popularized by the article were incorporated into three popular open-source NoSQL databases: Riak, Cassandra, and Project Voldemort.

Recall that W+R>N assures consistency. Such a system of overlapping nodes is referred to as a strict quorum, in that every read is guaranteed to return the latest write. [See subsequent post that clarifies this point.] Many developers, however, choose to configure W+R<N for the sake of greater availability or lower latency. Such a system, known as a weak or partial quorum, does not guarantee consistency. But these systems do use various mechanisms to guarantee eventual consistency, meaning that in the absence of additional writes to a key, the key’s value will eventually become consistent across all N nodes. (For a good summary of eventual consistency, see the post on the topic by Werner Vogels, Amazon’s CTO.)

But eventual consistency is an odd guarantee. How long might eventual take? What is the probability of an inconsistent read? Perhaps because the answers to these questions depend on many factors, NoSQL providers do not provide specifics.

Now a group of computer science graduate students at Berkeley is in pursuit of answers. Using data samples from Internet-scale companies and statistical methods (most notably Monte Carlo analysis), the team has put together mathematical models and a simulation tool to determine both average and upper-bound answers to these key questions regarding eventual consistency. They refer to the model as Probabilistically Bounded Staleness (PBS). The model factors in the values of N, W, and R as well as sample latencies for read and write operations. As of yet, the model does not account for nodes entering or leaving the cluster.
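
As a rough sketch of the Monte Carlo idea (a drastic simplification of my own, not the PBS model itself): simulate many write-then-read rounds with random replica propagation delays, and count how often a read misses every replica holding the latest write.

    // Toy Monte Carlo estimate of stale-read probability for given N, W, R.
    // All timings are invented; PBS fits real latency samples instead.
    function staleProbability(N, W, R, readDelay, trials) {
      var stale = 0;
      for (var t = 0; t < trials; t++) {
        // random time for the write to propagate to each of the N replicas
        var arrival = [];
        for (var i = 0; i < N; i++) arrival.push(Math.random() * 100);
        // the write is acknowledged once W replicas have it
        var ackTime = arrival.slice().sort(function (a, b) { return a - b; })[W - 1];
        // a read at ackTime + readDelay contacts R randomly chosen replicas
        var chosen = shuffle(arrival.map(function (_, i) { return i; })).slice(0, R);
        var fresh = chosen.filter(function (i) { return arrival[i] <= ackTime + readDelay; });
        if (fresh.length === 0) stale++;
      }
      return stale / trials;
    }

    function shuffle(a) { // Fisher-Yates
      for (var i = a.length - 1; i > 0; i--) {
        var j = Math.floor(Math.random() * (i + 1));
        var tmp = a[i]; a[i] = a[j]; a[j] = tmp;
      }
      return a;
    }

    console.log(staleProbability(3, 1, 1, 10, 100000)); // estimated stale-read probability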

PBS is a brilliant application of statistics to a computer science problem. I enjoyed hearing Peter Bailis, a member of the research team, describe the research at a Basho-sponsored meet-up this Tuesday. To learn the details of PBS, visit the team’s web site. If your business depends on the tradeoffs of eventual consistency, this research is of tremendous importance.

Posted in Uncategorized

Exploring NoSQL: Fine Tuning CAP with Riak

In previous posts, I explored MongoDB and Redis. MongoDB, a document-oriented database, allows developers to work with JSON documents in interesting ways. Redis offers developers a set of fundamental data structures upon which to build applications. Riak, in contrast, offers only the simplest of interfaces: put a key-value pair and get a value by its key. Values are simple binary strings, of which Riak understands nothing. It cannot operate on these values as JSON objects or sets or hashes. It cannot even increment a scalar numeric value. But Riak does provide developers with the power to fine tune the trade-offs between consistency, availability, and partition tolerance.

To understand the why of Riak, let us review the CAP Theorem, the conjecture that distributed databases must sacrifice one of three desirable characteristics: consistency, availability, or partition tolerance. Partitioning, in this context, means a network failure that prevents some of the nodes in a database from communicating with others even though each node may remain accessible to some clients.  At a large enough scale of distribution, with hundreds or thousands of nodes scattered around the globe, the possibility of partitioning becomes real enough that it must be anticipated. And if partitioning is anticipated, it follows that it may not always be possible to distribute a write across all nodes such that the value remains consistent. And if a write cannot be distributed across all nodes, the database must either reject the write, making the system unavailable, or commit the write to only a subset of nodes, making the system inconsistent.

So scalability demands distribution, and distribution demands trade-offs, trade-offs that developers must understand, control, and accommodate. Riak enables developers to fine tune these trade-offs through three numeric dials: N, W, and R. Each value in Riak will be replicated to N nodes. Each write will be considered a success once the data has been replicated to W nodes. And each read will return successfully after results are returned from R nodes. Adjusting these three values creates data stores of fundamentally different natures.

The difference by which N exceeds W defines how many nodes may become unavailable before a write would fail. The greater this difference, the greater the availability, but the greater the likelihood of inconsistency. Likewise, the difference by which N exceeds R defines how many nodes may become unavailable before a read would fail. The lower R, the greater the availability, the greater the load capacity, the lower the latency, but the greater the chance of inconsistency. If W plus R is greater than N, then the database achieves a form of consistency; because of the overlap, the read will include at least one node holding the latest value. [See subsequent post that clarifies this point.]
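
The arithmetic is simple enough to capture in a few lines (an illustration of the rules just described, not Riak code):

    // Summarize the trade-offs for a given N, W, R configuration
    function describeQuorum(N, W, R) {
      return {
        nodesDownBeforeWritesFail: N - W,
        nodesDownBeforeReadsFail: N - R,
        readsOverlapLatestWrite: W + R > N // strict-quorum consistency (but see the caveat above)
      };
    }

    console.log(describeQuorum(3, 2, 2));
    // { nodesDownBeforeWritesFail: 1, nodesDownBeforeReadsFail: 1, readsOverlapLatestWrite: true }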

If a network partitions such that more than one partition contains at least W servers, it is possible for more than one client to update the same value, creating separate but valid versions. Riak refers to these forked versions as siblings. And while Riak can resolve the conflict in favor of the latest write, it could also be configured to return multiple siblings, leaving the work of resolving the conflict to the client application, which makes sense because the client understands the business logic.

And how does Riak distribute values to N nodes? Riak hashes each key using a consistent hashing algorithm, producing an integer that falls within one of many equally sized partitions of a 160-bit integer space, an addressing scheme Riak refers to as the ring. Each Riak vnode (one physical node hosts multiple vnodes) claims one partition, making it responsible for storing values with key hashes falling within the address range of that partition. Multiple vnodes will claim the same partition, which forms the basis of replication. Each node tracks the vnodes available in the cluster and their respective partitions, making it possible for any node to direct a write across the necessary number of vnodes.
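
To see the mechanics, here is a sketch in the spirit of Riak's ring, using Node.js's crypto module (an illustration only; Riak's actual partition-claim logic is more involved):

    // Map a key onto one of numPartitions equal slices of the 160-bit SHA-1 space
    var crypto = require("crypto");

    function partitionFor(key, numPartitions) {
      var hashHex = crypto.createHash("sha1").update(key).digest("hex"); // 160 bits
      var prefix = parseInt(hashHex.slice(0, 8), 16); // top 32 bits suffice here
      return Math.floor(prefix / (0x100000000 / numPartitions));
    }

    console.log(partitionFor("user:1234", 64)); // the vnodes claiming this partition store the key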

Riak supports three application programming interfaces: a native Erlang interface, a Protocol Buffers interface, and a RESTful HTTP interface. The REST interface provides a really simple means to put and get values. And Basho, the company behind Riak, provides drivers for several popular languages: Erlang, JavaScript, Java, PHP, Python, and Ruby.
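
A sketch of the REST interface against a local node on the default HTTP port 8098 (the bucket and key are hypothetical, and older releases use the /riak/<bucket>/<key> URL scheme shown here):

    # Store a JSON value
    curl -X PUT http://localhost:8098/riak/stocks/GOOG \
         -H "Content-Type: application/json" \
         -d '{"price": 610.50}'

    # Read it back
    curl http://localhost:8098/riak/stocks/GOOG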

To get started setting up a development server and exploring the API, I suggest following along with the Riak fast track tutorial.

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Memcached
Exploring NoSQL: Redis
Exploring NoSQL: Couchbase

Posted in NoSQL, Riak

MongoDB Aggregation Framework: The SQL of NoSQL

No, it’s not really SQL. Its syntax resembles JSON. It returns a collection of JSON documents. But the Aggregation Framework addresses for MongoDB what many have found lacking in non-relational databases: the capability for ad hoc queries that SQL performs so well.

Actually, MongoDB already supports a variety of query mechanisms: queries on object properties, regular expression queries, and complex queries using JavaScript methods. And for those aggregation queries for which a SQL developer would use a group by clause, a developer using MongoDB could use MongoDB’s built-in facilities for MapReduce, a programming pattern for grouping data into buckets and iterating through each bucket.

But JavaScript-based queries require a developer to write a function. And MapReduce requires two functions, one for the map (grouping) and one for the reduce (analysis of the group). While doable, it is quite a bit of work compared to the ease of SQL. And it is this problem that the Aggregation Framework addresses.

The Aggregation Framework provides a declarative syntax for creating query expressions. The declarative statements work together like piped commands in a UNIX shell. The statements very much resemble their SQL counterparts. There is a match statement to identify records for inclusion, a sort statement for sorting, and a group statement for grouping. Grouping supports all of the aggregation functions one would expect: average, first, minimum, maximum. And there are string and date extraction operators similar to SQL’s.

I found the unwind statement particularly interesting. Unwind pulls out elements from an array. MongoDB documents are JSON objects, which may include a hierarchy of child objects, including arrays of objects. The unwind statement returns a copy of the parent document for each element within the child array, but instead of the entire array, each document displays only the value of one array element. The result looks much like you would expect from a SQL join.
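
Here is a sketch of the syntax from the mongo shell (the collection and field names are hypothetical, and since the framework is still pre-release, details may shift):

    // Given documents like:
    //   { title: "Intro to NoSQL", tags: ["mongodb", "couchbase", "riak"] }
    // count how many posts carry each tag
    db.posts.aggregate([
      { $unwind: "$tags" },                             // one copy of the doc per tag
      { $group: { _id: "$tags", count: { $sum: 1 } } }, // group by tag and count
      { $sort: { count: -1 } }                          // most-used tags first
    ]);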

The Aggregation Framework is not yet included in MongoDB, though it is targeted for inclusion in version 2.2 due out in March. Last night at the SF Bay MongoDB Meet-up, Chris Westin of 10gen gave a preview and demo to the crowd. For more information, see his slide deck at http://www.slideshare.net/cwestin63/mongodbs-new-aggregation-framework.

For any organization considering NoSQL databases, the Aggregation Framework will certainly ease the SQL to NoSQL transition.

Posted in MongoDB, NoSQL

Exploring NoSQL: Redis

Redis differs so dramatically from traditional databases that you’d be forgiven for not recognizing the family resemblance. Indeed, it’s possible to run Redis as a key-value memory cache, mimicking the functionality of Memcached, which itself is not a database. Like Memcached, Redis neither indexes nor searches upon values. And like Memcached, Redis may store anything as a value—a string, an integer, a serialized object, a video.

So what makes Redis a database? It supports persistence. Actually, Redis supports two modes of persistence. One mode is snapshotting: at regular time intervals, Redis takes a snapshot of the data in its memory and stores it in an RDB file, a compact, point-in-time representation of the data. The other mode, AOF (append-only file), persists changes to a file either per command or at a regular interval. According to the Redis documentation, persisting per command is slow but reliable. Redis recommends persisting once per second to provide good performance and reasonable reliability (at most, one second's worth of data can be lost). It is possible to turn on one, both, or neither of these persistence options. Using both together most closely matches the persistence of a relational database.
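
Both modes are enabled in redis.conf. A sketch of the relevant directives (the snapshot thresholds shown are the shipped defaults; tune to taste):

    # Snapshotting: write an RDB file if at least 1 key changed in 900 s,
    # 10 keys in 300 s, or 10000 keys in 60 s
    save 900 1
    save 300 10
    save 60 10000

    # Append-only file, synced to disk once per second
    appendonly yes
    appendfsync everysec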

Moreover, Redis does support transactions. Redis executes each command within its own transaction. The increment command, which both retrieves and adds to a value, is guaranteed to operate atomically. No two clients may retrieve and increment a value in such a way that one increment overwrites the other. And Redis supports multi-command transactions. A programmer need only issue a multi command, then the set of commands required in the transaction, and then an exec command, which triggers the running of the commands within a single transaction.
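
For example, a session as one might type it into redis-cli (the keys are hypothetical):

    MULTI
    INCR page:views
    INCR user:42:views
    EXEC

Neither increment takes effect until EXEC runs, at which point both are applied as a single atomic unit.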

Redis also supports replication from a master to one or more slaves. However, unlike MongoDB, Redis does not support the notion of a replication set. If a master fails, the system will not fail over without intervention. The Redis team may include automatic failover in Redis Cluster scheduled for release by the end of 2012. (See http://redis.io/download.)

Unlike MongoDB and Memcached, Redis does not support scalability in an automated fashion. In general, scaling a key-value data store is accomplished through distributing objects across multiple servers based on keys. Memcached automates this functionality through hashing algorithms implemented in client drivers. Redis provides no such automatic key distribution. Application programmers need to take on this work themselves. Again, Redis Cluster may include this capability.

While Redis, like Memcached, neither indexes nor queries on values, it is not quite accurate to say Redis understands nothing about the values stored in its memory. It cannot parse JSON documents in the manner of MongoDB, but Redis provides several data types of its own, all highly useful and familiar to any student of computer science. Indeed, the documentation describes Redis as a “data structures server,” obviously not a phrase dreamed up by a marketing executive.

Each data type includes a set of commands traditionally associated with the data structure. Here is a complete list of the types and a partial list of the commands supported by each. Note that each command takes a key as its first argument.

  • The String data type stores binary data.
    • SET        Associates a value with a key
    • GET        Retrieves the value for a key
    • DEL        Removes a key and its value
    • INCR       Increments a numeric value
    • DECR       Decrements a numeric value
  • The List data type stores a sequence of ordered elements.
    • RPUSH      Adds a value to the end of the list
    • LPUSH      Adds a value to the beginning of the list
    • LRANGE     Retrieves values within a specified range of positions
    • LLEN       Returns the length of the list
    • RPOP       Returns and removes the item at the end of the list
    • LPOP       Returns and removes the item at the beginning of the list
  • The Set data type stores a set of unordered, non-duplicate elements.
    • SADD       Adds an element to the set
    • SREM       Removes an element from the set
    • SUNION     Combines two or more sets
    • SISMEMBER  Checks whether an item is a member of the set
    • SMEMBERS   Returns all members of the set
  • The Sorted Set data type stores a set that is sorted by an associated score.
    • ZADD           Adds an element to the set with a score
    • ZRANGE         Returns elements from beginning to end of a specified range
    • ZRANGEBYSCORE  Returns elements from the set based on score
  • The Hash data type maps string fields to string values.
    • HSET       Sets the value of a field within a hash
    • HGET       Retrieves the value of a field within a hash
    • HDEL       Deletes a field from a hash
    • HKEYS      Retrieves all fields of a hash
    • HVALS      Retrieves all values of a hash

The easiest way to get a sense of the API is to try out the online tutorial at http://try.redis-db.com. Redis, which is free and open source, has really clear documentation at http://redis.io/documentation. And getting Redis up and running on Ubuntu Linux took me just a few minutes (http://redis.io/download).

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Memcached
Exploring NoSQL: Riak
Exploring NoSQL: Couchbase

Posted in NoSQL, Redis

Exploring NoSQL: Memcached

Memcached might seem an odd place to venture in an exploration of NoSQL databases. It is not a database. It provides no persistence. It purges items from memory to free space as needed. And there is no replication. If data gets lost, an application retrieves it anew from a persistent data store.

Rather than a database, Memcached is a distributed memory cache of key-value pairs. Its application programming interface follows the pattern of a hash table. Using a key, programmers set and get values, which could be anything. Yes, Memcached increments and decrements numeric values, but in general it understands nothing about the structure of values stored in its memory. It neither indexes nor searches based on values. To use a value, a programmer must retrieve it via a key and convert it to an object defined in the programming language.

To use Memcached, install and run the service on one or more servers. (I literally had it up and running on Ubuntu in minutes.) And install the driver for your programming language of choice. In code, indicate the Memcached servers in use and begin to set and get key-value pairs. The client code included in the driver distributes your data across servers based on a hashing algorithm. The Memcached servers do not need to communicate with each other. The system is remarkably simple, which explains its appeal.
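
As a sketch with a Node.js driver such as the memcached package (the server addresses are made up, and API details vary from driver to driver):

    // The driver hashes each key to one of the listed servers
    var Memcached = require("memcached");
    var memcached = new Memcached(["cache1.example.com:11211", "cache2.example.com:11211"]);

    // Cache a value for 600 seconds, then read it back
    memcached.set("user:42", JSON.stringify({ name: "Alice" }), 600, function (err) {
      memcached.get("user:42", function (err, data) {
        console.log(JSON.parse(data)); // { name: 'Alice' }
      });
    });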

Memcached users include Twitter, YouTube, Flickr, Craigslist, and WordPress.com, the host of this blog. Indeed, you are now reading words that were likely cached in Memcached.

So why consider Memcached in this exploration of NoSQL? I take this detour because as we explore NoSQL databases, it will be useful to compare their functionality to Memcached, with Memcached serving as an example of extreme simplicity. Hopefully, this point will become more apparent in my next post as I explore Redis.

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Redis
Exploring NoSQL: Riak
Exploring NoSQL: Couchbase

Posted in NoSQL

Harmonic: A MongoDB Success Story

After writing my last post on MongoDB, I attended a meet-up at the Mozilla office in San Francisco to hear the tale of a real company in the process of migrating from Microsoft SQL Server to MongoDB.

The company, Harmonic, sells enterprise software for managing workflows around video. Videos come in, go through checks, conversions, and other processing, and get distributed over multiple channels.  (Ok, vast simplification, but that’s the gist of it.) The architecture consists of a GUI and other management tools on the front-end, a set of services for processing videos on the back-end, and a workflow engine that orchestrates the process. The workflow engine stores its state in a database, and that database had been Microsoft SQL Server.

The marketing staff demanded that engineering reduce complexity for customers, increase scalability, and keep costs low. Nick Vicars-Harris, who manages the Harmonic engineering team, experimented with MongoDB. It took just a few days to tweak the data layer, written in C# and utilizing LINQ, to work with MongoDB rather than SQL Server. According to Vicars-Harris, Harmonic removed code that had been needed for object relational mapping, refactored, and produced more intuitive code. Rather than normalizing workflow state across over twenty tables, Harmonic could now store each job and its related tasks in a single document. In addition to removing complexity, the solution passed the test for scalability.

Harmonic also took advantage of the MongoDB simplified deployment model to create what Vicars-Harris calls smart nodes, nodes that communicate with each other and self-configure, a solution that met the requirements for simplified deployment and maintenance.

After listening to the presentation, I was impressed with the ease of transition from SQL to NoSQL. Clearly, the workflow use case fits in well with document-oriented databases.

Posted in MongoDB, NoSQL, Workflow