Harmonic: A MongoDB Success Story

After writing my last post on MongoDB, I attended a meet-up at the Mozilla office in San Francisco to hear the tale of a real company in the process of migrating from Microsoft SQL Server to MongoDB.

The company, Harmonic, sells enterprise software for managing workflows around video. Videos come in, go through checks, conversions, and other processing, and get distributed over multiple channels.  (Ok, vast simplification, but that’s the gist of it.) The architecture consists of a GUI and other management tools on the front-end, a set of services for processing videos on the back-end, and a workflow engine that orchestrates the process. The workflow engine stores its state in a database, and that database had been Microsoft SQL Server.

The marketing staff demanded that engineering reduce complexity for customers, increase scalability, and keep costs low. Nick Vicars-Harris, who manages the Harmonic engineering team, experimented with MongoDB. It took just a few days to tweak the data layer, written in C# and utilizing LINQ, to work with MongoDB rather than SQL Server. According to Vicars-Harris, Harmonic removed code that had been needed for object relational mapping, refactored, and produced more intuitive code. Rather than normalizing workflow state across over twenty tables, Harmonic could now store each job and its related tasks in a single document. In addition to removing complexity, the solution passed the test for scalability.

Harmonics also took advantage of the MongoDB simplified deployment model to create what Vicars-Harris calls smart nodes, nodes that communicate with each other and self-configure, a solution that met the requirements for simplified deployment and maintenance.

After listening to the presentation, I was impressed with the ease of transition from SQL to NoSQL. Clearly, the workflow use case fits in well with document-oriented databases.

Posted in MongoDB, NoSQL, Workflow | Leave a comment

Exploring NoSQL: MongoDB

MongoDB is one of the most popular of open source NoSQL databases. Supported by 10gen and boasting a long list of deployments including Disney, Craigslist, and SAP, MongoDB has a remarkably simple application programming interface (API) and all the tools necessary for massive scalability.

MongoDB is a document-oriented NoSQL data store, but the word document might mislead. Really, MongoDB stores objects much like an object database. More technically, MongoDB stores BSON documents, which stands for Binary JSON. JSON stands for JavaScript Object Notation, which is a text-based format for storing structured data, much like XML but less verbose, with more colons and curly braces than angle brackets.

MongoDB stores each object as a document, which might be thought akin to a record in the relational world. While there is no schema in MongoDB, that is no defined set of columns, developers would normally store like objects (objects created from the same class) together in a collection, which might be thought of as a table. A database would typically consist of several such collections. And while there are no relationships between collections, a JSON object can itself contain a hierarchy of other JSON objects or fields that link to objects in other collections, so it is possible to model a variety of relationships.

Developers access MongoDB through a driver, which maps a MongoDB document to a familiar construct in the developer’s language of choice. JSON objects, as the name implies, map easily to JavaScript objects. In Python, a document maps to a dictionary. In C#, a document maps to a specially defined class called a BsonDocument. Regardless of the language, the API is quite straight forward.

Object databases remove the grunt work of mapping classes to relational data models. But object databases never caught on, perhaps because of the difficulty of ad hoc queries. MongoDB does provide a variety of query mechanisms. It allows for queries by object properties, queries based on regular expressions, and complex queries using JavaScript methods. Whether any of these approaches satisfies requirements for ad hoc queries depends on the specific application scenario.

MongoDB achieves scalability through sharding, which divides objects in a collection between different servers based on a key. If the developer defines zip code as the shard key, for example, a customer object with a New York City zip code might be stored on a different server than a customer object with a San Francisco zip code. MongoDB handles the work of distributing the data.

Sharding is distinct from replication. Each MongoDB shard can be configured as a replicate set, which provides asynchronous master-slave replication with automatic failover and recovery of member nodes. In production, a replica set generally consists of at least three nodes running on separate machines, one of which serves as an arbiter. The arbiter breaks ties when the cluster needs to elect a new primary node. Drivers automatically detect when a replica set’s primary node changes and begin to send writes to the new primary. The replication process uses logs in much the same way as relational databases. And while non-primary replicas could be used for reads to speed performance, the primary purpose of replication is reliability. (Paragraph updated 3 Feb 2012.)

Reliability brings to mind transactions. Relational databases support transactions across multiple records and tables; MongoDB restricts transactions to single documents. While this might appear overly restrictive, it could be made to work in many scenarios. Recall that a document can include a hierarchy of objects. If an application’s data is modeled such that all of the data requiring all-or-nothing modification resides within one document, the MongoDB approach would suffice.

However, this restriction on transactions calls attention to the design objectives of MongoDB: speed, simplicity, and scalability. Expanding transactions to encompass multiple objects stored across shards would significantly impact performance leading to complex dead-lock problems. Indeed, by default, a save function call to the MongoDB returns immediately to the application without waiting for confirmation that the save was successfully persisted. If a networking or disk failure prevents the write, the application continues without awareness of the error. However, the MongoDB API does provide options for safe writes that wait for a success response. There are even options to specify how many replication slaves must get updated before the save is considered a success. So while speed is the default, reliability is a possibility.

While I mentioned earlier that MongoDB provides drivers for a variety of languages, it holds particular appeal to JavaScript devotees. JSON documents were designed to store JavaScript objects. JavaScript is the MongoDB language for complex queries. And the command line tool for managing MongoDB is built on top of the JavaScript shell. So if you have mastered JavaScript for the coding of dynamic web pages, MongoDB provides an opportunity to expand its use.

I recommend visiting MongoDB.org and trying out the online shell. In a few minutes, you’ll get a sense of the API. Then take a look at the tutorial. And to experiment further, download and install MongoDB for yourself. (I managed to install it on Windows 7 in a few minutes, but somehow got stuck installing the package on Ubuntu.)

Related Posts:
Exploring NoSQL: Memcached
Exploring NoSQL: Redis
Exploring NoSQL: Riak
Exploring NoSQL: Couchbase

Posted in MongoDB, NoSQL | 3 Comments

NoSQL: The Joy is in the Details

Whenever my wife returns excitedly from the mall having bought something new, I respond on reflex: Why do we need that? To which my wife retorts that if it were up to me, humans would still live in caves. Maybe not caves, but we’d still program in C and all applications would run on relational databases. Fortunately, there are geeks out there with greater imagination.

When I first began reading about NoSQL, I ran into the CAP Theorem, according to which a database system can provide only two of three key characteristics: consistency, availability, or partition tolerance. Relational databases offer consistency and availability, but not partition tolerance, namely, the capability of a database system to survive network partitions. This notion of partition tolerance ties into the ability of a system to scale horizontally across many servers, achieving on commodity hardware the massive scalability necessary for Internet giants. In certain scenarios, the gain in scalability makes worthwhile the abandonment of consistency. (For a simplified explanation, see this visual guide. For a heavy computer science treatment, see this proof.)

This initially led me to assume that NoSQL makes sense only for the likes of Facebook and Twitter. The rest of us who seek something less than world domination and who associate consistency with job security may as well stay within the safe and comfortable realm of relational database, which have certainly passed the test of time.

However, I’m starting to question that assumption. Clearly, relational databases still make sense for many applications, especially those requiring strict transactions and complex ad hoc queries. Relational databases will certainly remain the backbone of financial and ERP systems. But I’m now wondering whether NoSQL might fit quite well for many other applications.

When I say NoSQL, however, I’m not really saying anything. Once computer scientists freed themselves from the principles of relational databases, an astounding creativity burst forth. The only thing that NoSQL databases have in common is that they are not relational. So it is not a choice between SQL and NoSQL, but rather a choice between SQL and a wide diversity of other options.

Wikipedia does a good job categorizing the many NoSQL databases now available. But that should just be taken as a starting point. The only way to appreciate the range of choices is to explore each one, looking over its documentation, playing with code, and experimenting. The value of NoSQL is not in the theory, but in the specific character of each NoSQL database.

And so I plan to spend time this year exploring and posting about some of the many NoSQL options out there. I’ve already started a post on MongoDB. Stay tuned for more. And if you have any suggestions for which database I should look into next, please make a comment.

Related Posts:

Exploring NoSQL: MongoDB

Exploring NoSQL: Memcached

Exploring NoSQL: Redis

Posted in NoSQL | 3 Comments

Cloud Foundry Evolves

Cloud Foundry, the open-source Platform-as-a-Service (PaaS) solution from VMware, continues to gain community support and evolve toward a more diverse, enterprise-ready platform. Last night at the Silicon Valley Cloud Computing Group meet up in Palo Alto, VMware engineers and representatives from several community partners spoke of recent progress and future plans.

Building on Cloud Foundry’s extension framework for languages and services, Uhuru Software has added .Net support to Cloud Foundry, enabling .Net developers to create applications in Visual Studio and deploy them directly to a Cloud Foundry-based private or public cloud. AppFog has added PHP support, and ActiveState has added Perl and Python support. Jaspersoft has extended Cloud Foundry with BI support, including user-friendly wizards for building reports and dashboards and direct support for the document-oriented MongoDB as a data source.

Scalr, whose founder, Sebastian Stadil, organizes the cloud computing group, demonstrated tooling for deploying a Cloud Foundry cluster. The graphical, web-based tool builds up a configuration for each server and calls web services to spin up the instances.

And VMware itself continues to make major contributions to Cloud Foundry. Patrick Chanezon and Ramnivas Laddad from VMware demonstrated Micro Cloud Foundry, a Cloud Foundry cluster with all components running on one virtual machine. This capability makes it possible for developers to spin up a PaaS instance on their laptop, deploy an application, and debug. Using an Eclipse plugin, Laddad gave a demo of debugging a cloud application that mirrored the experience of debugging traditional applications.

During a closing panel, VMware and partner representatives clarified the distinction between CloudFoundry.org and CloudFoundry.com. CloudFoundry.org hosts the open-source software development project, which enables organizations to run their own private or public PaaS cloud. This codebase will grow to support multiple languages and services. CloudFoundry.com is a public instance of Cloud Foundry run by VMware. Like other hosted instances of Cloud Foundry, CloudFoundry.com supports only a subset of the languages and services provided for by the open-source Cloud Foundry code. Despite the expansion of the code base, hosting providers must limit their offerings to services that they have the operational expertise to support.

In light of the rapid growth and expanding ecosystem, Jeremy Voorhis, a senior engineer at AppFog, suggested that VMware create an independent governance body to direct the future development of Cloud Foundry and to mediate potential conflicts between contributors. A few meet-up participants supported the suggestion. While all agreed that VMware has done a fabulous job of starting the project and building an ecosystem, those who raised the suggestion were concerned that conflicts were inevitable and that it would be better to build up a governance system in preparation. Representatives from VMware responded that they did not oppose the idea but did not consider governance a priority given the platform’s early stage of development.

The meet up made one thing clear, that extensiblity (see my post from last May) has made Cloud Foundry into a dynamic platform that has caught the attention of the open-source community.

Posted in Cloud Computing, Cloud Foundry | 1 Comment

Extreme Transaction Processing in SOA

On the LinkedIn Mountain View campus, Anirudh Pandit, a senior architect at Oracle, spoke last night at the SVForum’s Software Architecture and Platform SIG on the topic of Extreme Transaction Processing in SOA. Pandit proposed an architectural pattern for improving SOA performance and reviewed several case studies in which the pattern dramatically improved performance in enterprise SOA implementations.

While no longer generating the buzz that it once enjoyed, SOA (Services Oriented Architecture) remains an important integration strategy at large enterprises, perhaps now enjoying more real success than it did at the peak of it is hype cycle. Though implementations often prove costly and difficult, ultimately SOA provides the most conceptually compelling means to cut through the complexity and diversity of enterprise systems—systems encumbered by a mix of legacy and web-based applications, of behind-the-firewall systems with the external systems of business partners, and of applications accumulated through long histories of mergers and acquisitions.

Despite its conceptual advantages, Pandit pointed out that many SOA implementations suffer from poor performance and a lack of scalability. As messages move from service to service through an orchestration, the work of serializing and deserializing XML messages creates processing bottlenecks and maintaining transactions via relational database and data persistence creates disk IO bottlenecks.

Pandit proposed a solution based on two changes to the traditional SOA architecture. First, rather than passing XML messages with repeated serializations and deserializations, pass a token that each service may use to retrieve the message. Second, instead of persisting the message as it moves from service to service, cache the message to memory and assure message integrity by synchronizing the cache across multiple machines, preferably across machines in different data centers.

The solution, Pandit explained, does not work in all circumstances. By avoiding persistence to disk, the solution might fail to meet compliance requirements. By synchronizing the cache across many machines and data centers, the solution might introduce network bottlenecks and latency issues. But where it does fit, Pandit concluded, the solution overcomes performance problems in a plug-and-play manner without a need for the costly redesign of services.

While most recent conversations on scalability revolve around NoSQL, Pandit’s presentation was a reminder that the intelligent use of caching remains a viable option.

Stay tuned to the Software Architecture and Platform SIG for future events and the slide deck of Pandit’s presentation (not yet available at time of writing). The next meeting will be on the use of Hadoop at LinkedIn.

Posted in Architecture, SOA, SVForum | Leave a comment

Five Bad Reasons Not to Adopt Agile

While the advantages of agile appear obvious, I observe that many IT departments stick stubbornly to waterfall. Despite failure after failure, IT managers blame the people—staff members, vendors, users—rather than the process.

Waterfall organizes software projects into distinct sequential stages: requirements, design, coding, integration, testing. Despite its common sense appeal, waterfall projects almost always lead to unproductive conflicts. Most users cannot visualize a system specified in a requirements document. Rather, users discover the true requirements only late in the project, usually during user acceptance testing, at which point accommodating the request means revising the design, reworking the implementation, and re-testing.

In fact, users should change their minds. We all do as we learn. As users begin using a system, they begin to see more clearly how it could improve their jobs. We should welcome this learning. But under the waterfall approach, such discovery gets labeled as scope creep, it undermines the process, causes cost overruns, and leads to a frenzy of finger pointing.

Agile methods throw out the assumption that a complex system can be understood and fully designed and planned for up front. While there are a great diversity of agile methods, they almost all tackle complexity by breaking projects into small—one week to a month—iterations, each one of which delivers working software of value to a customer. Users get to work with and provide feedback on the system sooner rather than later.

So why do so many IT organizations insist on crashing over waterfalls again and again? Here are the most frequent refrains I hear from waterfall proponents:

“We need a fixed-fee contract to avoid risk.”

Fixed-fee contracts force projects into a waterfall approach that makes change cost prohibitive. While fixed-fee contracts in theory transfer risk from client to vendor, these contracts in reality increase the risk of a final product that meets specification but fails to achieve the business objectives. A far simpler alternative would be a time and material contract that requires renewal at the start of each iteration. The customer reduces risk by agreeing to less cost at each signing and by receiving working software sooner rather than later, which both achieves a quicker return on investment and assures the project remains in line with expectations.

“What’s this going to cost?”

To make the business case for a project, it is essential to nail down costs, which the waterfall method promises for an entire project as early as the requirements stage. And while I’m all for business cases, making decisions based on bogus data makes little sense. What we need is an agile business case, one that covers a single iteration and rests on meaningful cost and revenue data.

“I’m a PM and I need a schedule to manage to.”

Wedded to the illusion of predictability, many PMs identify their discipline with large, complex schedules, full of interdependent sequences of tasks stretching out over months, schedules so easy to create in MS Project but that so rarely work out in reality. These schedules, together with requirement documents the size of a New York City phone book, end up as just more road kill of the waterfall process.

 “Should we buy or build? We need a decision.”

Agile eschews big designs up front in favor of designs that evolve organically over time. But IT departments must often make certain big design decisions up front, such as whether to buy or build a system or component.

And, unfortunately, these sorts of decisions fall outside of the well-known agile frameworks such as Scrum or XP, which focus exclusively on software development. So it might be assumed that once a big up front decision is required, that agile no longer applies. Indeed, it does make sense that more requirements and more design is needed up front to avoid committing dollars to the wrong vendor.

Regardless, I’d argue that there are ways to keep a project agile even if certain decisions must be made up front.

1)      Keep in mind that there is no need to collect every requirement, just those needed for the decision.

2)      Take advantage of demo versions and commit a few early iterations to build out rapid prototypes on different products.

3)      Consider open source. Low cost means low commitment.

And once a decision to buy or build has been made, implementation can proceed according to agile principles. Most enterprise software requires substantial configuration and customization that can benefit as much as pure software development from agile approaches.

“Let’s just get on with the next project.”

Perhaps because of the pain and strained relationships that accompany failed waterfall projects, few organizations analyze the process that led them to failure. Everybody flees the sinking ship and moves onto the next project so fast that nobody reflects on the root causes of the disaster. I would think that if enough time were spent studying these root causes, it would be determined that the problem was the process, not the people.

Thoughts?

If you know of other objections to agile, I’d like to hear them. And if you have had success in introducing agile into IT organizations, please share your experiences.

Posted in Agile | 3 Comments

A Cloud Curriculum

Erik Bansleben, Program Development Director at the University of Washington, invited me and several other bloggers to participate in a discussion last Friday on UW’s new certificate program in cloud computing. Aimed at students with programming experience and a basic knowledge of networking, the program covers a range of topics from the economics of cloud computing to big data in the cloud.

I had a mixed impression of the curriculum. It looked a bit heavy on big data, an important cloud use case that would perhaps be better suited to a certificate program of its own. And the curriculum looked a bit light on platform-as-a-service and open source platforms for managing clouds such as OpenStack. Since the program is aimed at programmers, perhaps I’d organize it around the challenges of architecting applications to benefit from cloud computing.

But that is nitpicking. It is certainly important for universities to update course offerings based on current trends. And UW has put together an impressive panel of industry experts to guide the program.

Clearly, it’s difficult to put together a program around cloud computing, since there is so much debate around what constitutes cloud computing. And given that just about every tech firm out there has repositioned its products as cloud offerings, it would seem very difficult to achieve sufficient breath without giving up the opportunity for rigorous depth.

But any interesting new direction in the tech industry would pose such challenges. Indeed, perhaps it is these difficulties that make it worth tackling the topic.

For more information, see the UW website at http://www.pce.uw.edu/certificates/cloud-computing.html.

Posted in Cloud Computing | 1 Comment