My Node.js Misperceptions

After recently studying and playing with Node.js, I realized that I had somehow picked up a couple of misperceptions that were wildly off base:

  • Node.js is solely a platform for web development similar to one of the other popular server-side web development platforms, perhaps PHP or Ruby on Rails.
  • Node.js is innovative only in that it brings JavaScript to the server.

To clarify misperception one, while Node.js provides powerful support for web development, Node.js can be used to build any manner of application. Node.js applications may access file systems, networks, databases, and child processes. Just as with Python and Ruby scripts, you can run Node.js scripts from a command line and can pipe streams back and forth to other applications in the Unix style.

My second misperception turned out to be doubly wrong.

First, Node.js does not break new ground in bringing JavaScript to the server. Back in 1997, Netscape started the Rhino project, a JavaScript engine implemented in Java and capable of running outside of the browser. Even Microsoft’s ASP web framework supported server-side JavaScript, a fact that I had long ago forgotten. And to top it off, Node.js does not actually execute JavaScript itself; for that, it relies on Google’s V8 JavaScript engine.

Second, Node.js is indeed very innovative, but its innovation lies not in its server-side implementation of JavaScript, but in how it handles concurrency—the situation wherein new requests arrive before earlier requests have completed.

Of course, this is an old problem, and all modern operating systems offer a built-in solution—multi-threading. A multi-threaded server makes a system call to generate a new thread for each request that it receives. That thread handles the request to completion while other threads handle other requests. The OS manages all of these threads, sharing out slices of CPU cycles to each as it is ready.

Obviously, multi-threading still works as it has for decades, but it does have its problems when it comes to massively scalable web applications. Managing threads is not free. Each thread has its own register values, program counter, and call stack. As the number of threads increases, the OS spends more and more CPU cycles managing the threads rather than running them. And in web applications, threads spend the vast majority of their time waiting for IO, whether that be calls to files, databases, or network services.

Node.js takes a radically different approach, avoiding the need for OS threads by simply refusing to wait. Rather than making blocking IO calls, wherein the thread stalls waiting for the call to return, almost all IO calls in Node.js are asynchronous, wherein the thread continues without waiting for the call to return. In order to handle the returned data, code in Node.js passes callback functions to each asynchronous IO call. An event loop implemented within Node.js keeps track of these IO requests and calls the callback when the IO becomes available. Managing the event loop costs less than managing multiple threads, as it only requires tracking events and callbacks rather than entire call stacks.

Each Node.js process is single-threaded, but Node.js can be scaled out to multiple processes and multiple machines, just as traditional multi-threaded servers can.

One might also argue that Node.js as a platform for server-side web development is innovative in its lack of abstraction; the Node.js programmer handles HTTP requests by forming HTTP responses (what the HTTP protocol is all about), rather than creating pages (PHP, JSP, ASP.NET) or writing models, views, and controllers (Ruby on Rails). Personally, I prefer lighter frameworks, or at least frameworks that do not force me into certain patterns, so I certainly find Node.js appealing despite my initial misperceptions.

Posted in Node.js

JavaScript Templating as a Cross-Stack Presentation/View Layer

At last night’s Front End Developers United meetup in Mountain View, Jeff Harrell of PayPal and Veena Basavaraj of LinkedIn spoke about the use of Dust.js within their organizations. LinkedIn and PayPal chose Dust.js from among the many JavaScript templating frameworks for reasons that Veena describes in her post Client-side Templating Throwdown. But the really interesting point that both speakers made was not so much about Dust.js itself as about how standardizing on a JavaScript templating framework across different stacks accelerated development cycles.

The presentation layer in web development generally relies on some sort of templating, which mixes HTML markup with coding instructions. Traditionally, this is done through a server-side technology such as PHP, JSP, or ASP.NET. More modern server-side MVC frameworks such as Ruby on Rails use templating in the view layer, typically ERB or Haml.

With the rise of single-page applications and JavaScript MV* frameworks, JavaScript templating frameworks are growing in popularity. When using AJAX without an MV* framework, one would make an AJAX call to retrieve data from the server in JSON format, insert it into a template, and display the resulting HTML. When using an MV* framework (and the observer pattern), updating a model automatically causes associated views to render using the updated model and a template. One of the most popular MV* frameworks, Backbone.js, uses Underscore.js templating by default but could work just as well with others. (Veena’s post referenced above lists many of the available JavaScript templating frameworks.)
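At its core, what all of these templating frameworks do is substitute values from a JSON object into placeholders in markup. The toy sketch below shows only that core idea; real frameworks like Dust.js add sections, filters, partials, and compiled templates, and the `{name}` placeholder syntax here is merely illustrative.

```javascript
// A toy template renderer: replace {key} placeholders with values
// from a data object, leaving unknown placeholders untouched.
function render(template, data) {
  return template.replace(/\{(\w+)\}/g, function (match, key) {
    return key in data ? String(data[key]) : match;
  });
}

const html = render('<li>{name} ({company})</li>',
                    { name: 'Veena', company: 'LinkedIn' });
// html === '<li>Veena (LinkedIn)</li>'
```

Because the template is just a string and the data is just JSON, the same template can in principle be rendered anywhere JavaScript runs, which is precisely the point the speakers made.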

In large organizations that have been around for some time, there are bound to be multiple platforms in use, some representing a future ideal state, others representing legacy code. PayPal runs on C++ and Java and is moving increasingly toward Node.js. And, of course, most web applications use JavaScript in the browser to provide user interactivity. What this means is that an organization could end up using several templating frameworks. Similar content gets rendered using different templates, a situation that risks inconsistency in the user experience.

One possible answer to this dilemma would be to standardize on a single stack, but that is not possible, for a number of reasons. Neither PayPal nor LinkedIn can run all of their code in the browser, nor can they run none of it there: for security reasons, not all code can run as JavaScript in the browser; for user-experience reasons, some code must run in the browser. Moreover, migrating a complex code base from one platform to another is a major undertaking.

Instead, PayPal and LinkedIn chose to use Dust.js as a common templating framework across all stacks. Directly using Dust.js in the browser or within Node.js poses no problem as it’s a JavaScript framework. From Java code, Dust.js can be accessed through Rhino, a Java-based implementation of JavaScript. And C++ code can access Dust.js through Google’s V8 JavaScript engine. Both Rhino and V8 can be run on the server. What this means is that the entire code base can make use of the same templates, assuring a more consistent user experience and more rapid development cycles.

Since only JavaScript runs in the browser, only a JavaScript-based templating framework could have achieved this unity.

Thanks to Veena and Jeff for the great presentations and to Seth McLaughlin of LinkedIn for organizing the event.

Posted in Dust.js, JavaScript

Will NoSQL Equal No DBA in Enterprise IT?

During the recent CouchConf in San Francisco, Frank Weigel, a Couchbase product manager, touted the benefits of schemaless databases: without a schema, developers may add fields without going through a DBA. The audience of developers seemed pleased, but I wondered what a DBA might think.

Based on my experience, the roles of DBAs in enterprise IT vary. Some engage in database development. Others focus exclusively on operations. But DBAs generally own production databases; all production schema changes go through them. This control would go away under NoSQL. In the schemaless world of NoSQL, developers have more power and more responsibility. Through their code, developers control the schema, access rights, and data integrity.

What is left for DBAs? There is always operations: monitoring performance and adjusting resources as needed. But these are general system administration tasks, not necessarily requiring experts trained in relational databases. So will NoSQL put DBAs out of work? Will the DBAs, with their cubicle walls covered in Oracle certifications, stand up against this invader?

Let’s not get ahead of ourselves. I’ve not seen any evidence to suggest that NoSQL is taking hold in enterprise IT. With the exception of large, customer-facing systems, enterprises do not need the scalability promised by NoSQL. And many key enterprise systems require transactions that span records, a feature lacking in NoSQL systems. While document-oriented and graph databases might fit well for certain use cases, the value proposition for NoSQL in the enterprise has yet to become compelling.

And despite the disadvantages of fixed schemas for developers, the fixed schemas of relational systems have a larger value within enterprises. Ideally, these schemas define in one location the rules of data integrity, which makes it easier to audit all changes.

While I could foresee conflicts between developers and DBAs over NoSQL in enterprises, that does not need to be the case. Most enterprise systems will continue to run on relational databases. The DBA role of managing these systems will not go away anytime soon. If enterprises do see any value in NoSQL for certain use cases, they should define policies around which sorts of applications might use this new technology. NoSQL, as many have commented, may well stand for Not Only SQL, in which case it can coexist with DBAs within the confines of corporate policies and procedures.

Posted in NoSQL

Thoughts on CouchConf: Data Modeling for Flow

Last Friday, I had the pleasure of attending CouchConf in San Francisco to learn about the new and exciting features of Couchbase 2.0. With the new release, Couchbase becomes a full document-oriented database. Meanwhile, Couchbase keeps its focus on performance, reporting some very impressive performance statistics comparing Couchbase to other NoSQL databases. Of all the exciting presentations at the conference, I most enjoyed the session on data modeling by Chris Anderson, Chief Architect for Mobile at Couchbase.

Document-oriented databases such as Couchbase are schemaless, making data modeling more flexible. While in the relational world there is generally one right answer for how to divide data across tables to achieve fifth normal form, in the document-oriented world there are fewer clear cut rules. For example, whether blog comments get stored with the post or in separate documents depends more on expected usage than any inherent constraints of the system or rigid design rules.
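The blog-comments choice can be sketched as two alternative document shapes. If comments are mostly read together with their post, embedding them saves a second lookup; if they are unbounded or updated heavily, separate documents keyed back to the post work better. The field names here are illustrative, not a Couchbase convention.

```javascript
// Option 1: comments embedded in the post document — one read
// fetches everything needed to display the page.
const postWithComments = {
  type: 'post',
  title: 'Data Modeling for Flow',
  comments: [
    { author: 'alice', text: 'Great talk!' }
  ]
};

// Option 2: comments as separate documents that reference the post
// by key — posts stay small and comments can grow independently.
const post = { type: 'post', id: 'post:42', title: 'Data Modeling for Flow' };
const comment = {
  type: 'comment',
  post_id: 'post:42',
  author: 'alice',
  text: 'Great talk!'
};
```

Neither shape is "normalized" or "denormalized" by rule; the expected read and write patterns decide.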

Without the schema constraints of the relational model, developers become free to think more about data flow. Chris Anderson uses phrases such as “emergent schema” and “runtime-driven schema” to describe this new emphasis. He concludes “thinking about throughput, latency, update and read patterns is the new data modeling.”

In my experience, emphasizing data flow, or data in motion as opposed to data at rest, makes developers better at understanding user needs. In my early career, when playing the role of business analyst, I went too far in allowing the relational model to shape my thinking about business requirements. With a relational schema always in the back of my mind, I focused my questions on whether data relations were one-to-one, one-to-many, or many-to-many. While I did write use cases, normalization rules shaped my thoughts.

Then several years ago, I worked with a colleague who focused more on process. He’d start conversations with “…now walk me through your process.” And that led us to a more fruitful discovery. At some point, I’d ask questions to better understand the data model, but we were in a better place when I started to ask those questions.

Of course, developers have always thought about use cases and update and read patterns. But if Chris Anderson is right and document databases make it easier to focus on data in motion rather than data at rest, then I think we’ll do a better job of thinking about user needs.

Posted in Couchbase, NoSQL


This Monday at the Silicon Valley NewSQL meetup in Mountain View, Michael Stonebraker took turns bashing both the established relational databases (Oracle, DB2, SQL Server, PostgreSQL) and the NoSQL newcomers (MongoDB, Cassandra, Riak), proposing a third alternative, VoltDB, a NewSQL database.

Stonebraker—a leading researcher in the field of database design, former UC Berkeley professor and now MIT professor, winner of the IEEE John von Neumann Medal, and a designer of PostgreSQL—argued that the established databases have become legacy bloatware incapable of scaling to modern requirements without complete redesign. According to Stonebraker’s research, these systems, all of which follow a similar design, use only a small percentage of CPU cycles (about 4%) on useful work. The bulk of CPU cycles go to overhead, divided fairly evenly into four categories of about 24% each:

  • Managing the buffer pool of disk pages cached in memory
  • Multi-row locking for transactions
  • Latching memory objects such as b-trees to prevent corruption in a multi-threaded environment
  • Write-ahead logging to disk

The NoSQL databases, according to Stonebraker, solve these problems, but they do so by jettisoning SQL and ACID. Giving up SQL, Stonebraker argued, makes no sense. The SQL standard has proven itself as a time-saving high-level language that successfully depends on compilation to generate low-level commands. Going backwards to row-level commands and unique APIs for each database, Stonebraker claimed, is comparable to giving up C for assembler.

Stonebraker also argued against giving up ACID, a requirement (or potential requirement) for almost all applications. If a database does not provide ACID, application developers will need to write this complex code themselves.

Stonebraker proposed instead his product, VoltDB, a relational database that supports ACID and most of the SQL standard. VoltDB avoids the overhead of buffer management by keeping all data in memory. It avoids the overhead of row locking and memory object latching by using a single thread per partition. Only one thread touches memory objects, and transactions run sequentially on the one thread. And instead of write-ahead logging of data, VoltDB takes periodic snapshots of the database and logs only commands, which is faster but still capable of rebuilding the database from disk in case of failure. (See the VoltDB Technical Overview for more details.)
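The single-thread-per-partition idea can be modeled in a few lines. In this toy sketch, each partition owns its data and runs transactions to completion one at a time, so no locks or latches are needed within a partition. (This is an illustration of the concurrency model only; VoltDB itself executes precompiled stored procedures, not ad-hoc callbacks.)

```javascript
// A toy partition: one "thread", transactions run strictly in order.
class Partition {
  constructor() {
    this.data = new Map();
  }
  // Because only this partition's single thread ever touches this.data,
  // a transaction sees no concurrent access and needs no locking.
  runTransaction(txn) {
    return txn(this.data);
  }
}

const p = new Partition();
p.runTransaction(function (data) { data.set('balance', 100); });
const balance = p.runTransaction(function (data) {
  return data.get('balance');
});
// balance === 100
```

The cost of this simplicity, as the next paragraph notes, is that anything touching two partitions falls outside this fast path.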

Like most of the NoSQL databases, VoltDB supports scalability across commodity hardware by sharding data based on keys. According to Stonebraker, the choice of key is critical to performance, as joins and transactions that cross partitions degrade performance, a problem that cannot be solved even by eliminating the overhead of a traditional RDBMS. VoltDB makes scaling possible, but application developers must still give careful thought to how to partition data so that most operations touch only a single partition.

One could argue that this latter admission proves the NoSQL case against relational databases, namely that a database supporting ACID cannot scale. VoltDB scales only as long as transactions do not cross partitions. In a sense, VoltDB can be thought of as many small, fast databases that support ACID or one large database that supports ACID but does not scale. In other words, VoltDB does not solve the CAP dilemma.

Certainly, VoltDB will make sense for certain use cases, where there is a need for lightning speed and transactional integrity, where data can be sharded into largely autonomous partitions, and where VoltDB’s only partial implementation of the SQL standard fulfills requirements. But VoltDB will not replace traditional RDBMSs anytime soon, as it lacks much of the functionality that enterprises expect, bloatware though that might be.

Nor will VoltDB eliminate the demand for NoSQL, because many organizations will find a NoSQL database out there that fits well with their specific requirements. If all you need is a key-value store, why not choose a database that specializes in this function? If your data takes the shape of a graph, why not choose a database tailor-made for the purpose?

Moreover, Stonebraker overstates the case against NoSQL APIs. Yes, SQL is a proven high-level language for data access, but it does not always fit well with the object models used by application developers, which is why organizations have turned to object-relational mapping solutions. In many ways, objects map more intuitively to document-oriented databases than to SQL. Besides, a central tenet of NoSQL is the embrace of variety, the rejection of the one-size-fits-all mentality. With that in mind, the diversity of NoSQL APIs is a benefit; it allows developers to choose the API or APIs that best fit a specific requirement.

Whether or not we accept all of Stonebraker’s claims, the VoltDB retake on the traditional RDBMS makes clear that these are exciting times in the world of databases.

Posted in NewSQL, NoSQL

Exploring NoSQL: Couchbase

In February 2011, Membase, Inc. and CouchOne merged to combine the strengths of their two open-source NoSQL projects, Membase and CouchDB. The joint team released Couchbase 1.8 in January 2012 as an upgrade to Membase 1.7. Version 2.0 is now available as a developer preview. Meanwhile, CouchDB lives on as an independent project under Apache.

Membase, developed by several leaders of the Memcached project, maintains protocol compatibility with Memcached, making it possible to plug in Membase as a replacement for Memcached without rewriting application code. Like Memcached, Membase is a key-value store supporting set and get operations. But Membase adds persistence and replication. While Memcached removes items from cache when it runs out of memory, Membase makes room by moving items to disk. And while Memcached stores each item on a single server, Membase replicates items to additional servers, first to memory, and then to disk as needed.

CouchDB is a document-oriented database that stores values as JSON documents. Like Riak, CouchDB is written in Erlang, a language designed for distributed, concurrent, fault-tolerant, soft real-time systems.

The merged result, Couchbase, combines the capabilities of both products, serving as either a key-value store or as a document-oriented database, depending on the use case and values being stored. Like Membase, Couchbase can replace Memcached without a code rewrite, meaning it can function as a key-value store in the manner of Memcached. Indeed, the Couchbase API supports the set and get of arbitrary binary values without forcing values to conform to JSON. And while internally all items get stored as JSON with arbitrary values kept as attachments, this need not matter to a developer. Indeed, there are quite a few use cases in which it would make sense to use Couchbase as a key-value store and ignore its document-oriented capabilities.

For values that do conform to JSON, however, Couchbase provides document-oriented capabilities. Indeed, it shares many features with MongoDB. Couchbase organizes data into buckets, a concept very much akin to a MongoDB collection (and comparable to a relational table minus the schema). Both databases support sharding, replication, automatic failover, and the ability to add capacity without downtime. Both provide rich monitoring capabilities. And both emphasize consistency over high availability, requiring that all writes go to a single master responsible for a segment of the key space so that every read consistently returns the latest value.

And both Couchbase and MongoDB enable developers to retrieve collections of documents using fields other than the primary key, but the manner in which they do so varies significantly. Each has taken a different data retrieval feature from the relational database world as its starting point. MongoDB supports secondary indexes and ad-hoc queries, much like SQL but without joins. Couchbase supports materialized views. A view is very much like a pre-written query. A materialized view stores the sorted results of the query in memory or on disk, ready for fast retrieval. In Couchbase, a developer creates these materialized views using JavaScript map-reduce functions. Couchbase supports updating views either on writes or reads, the latter approach making sense following bulk updates. While not as easy to define as a MongoDB query, a Couchbase view, just like a view in a relational database, can look dramatically different from the underlying data and may include aggregated values.
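A toy sketch makes the view mechanism concrete: a JavaScript map function emits key/value pairs from each document, and the sorted results are stored for fast lookup. The emit-based map function mirrors the shape of real Couchbase view code; the indexing machinery below is simplified for illustration and is not the Couchbase implementation.

```javascript
// Build a "materialized view": run the map function over every
// document, collect emitted rows, and keep them sorted by key.
function buildView(docs, map) {
  const rows = [];
  function emit(key, value) { rows.push({ key: key, value: value }); }
  docs.forEach(function (doc) { map(doc, emit); });
  rows.sort(function (a, b) {
    return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
  });
  return rows;
}

const docs = [
  { type: 'comment', author: 'bob' },
  { type: 'comment', author: 'alice' },
  { type: 'post', author: 'carol' }
];

// Index comments by author — in effect, a pre-written query.
const view = buildView(docs, function (doc, emit) {
  if (doc.type === 'comment') emit(doc.author, 1);
});
// view → [{ key: 'alice', value: 1 }, { key: 'bob', value: 1 }]
```

Because the rows are precomputed and sorted, reading the view is cheap; the cost is paid when the view is updated, which is exactly the write-time/read-time trade-off mentioned above.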

Because views on large data sets can take a long time to materialize, Couchbase provides for developer views, which run against a random subset of data. When ready, a developer can publish a view to production and have it materialized against an entire data set. Last week, I had the chance to watch Matt Ingenthron, Director of Developer Solutions at Couchbase, and Dustin Sallings, Couchbase’s Chief Architect, demo this feature at a Couchbase user group in San Francisco. It is clearly a critically important feature for large data sets.

Whether the Couchbase or MongoDB model makes the most sense for any application depends on the use case. If an application requires a key-value store in some places and document-oriented features in others, Couchbase might make sense. And if an application built on Memcached now needs a more reliable cache, Couchbase may well make a good fit. But for developers looking for ad-hoc queries and indexing reminiscent of relational databases, MongoDB would feel more comfortable. No matter your preference, the evolution of these two open-source document-oriented NoSQL databases demonstrates that choice is alive and well in the open source community.

Related Posts:
Exploring NoSQL: MongoDB
Exploring NoSQL: Memcached
Exploring NoSQL: Redis
Exploring NoSQL: Riak

Posted in Couchbase, CouchDB, MongoDB, NoSQL, Uncategorized

Be Careful with Sloppy Quorums

In my last posts on eventual consistency, I mentioned that R+W>N guarantees consistency. Thanks to commentator senderista for pointing out that this statement does not hold in the case of sloppy quorums.

The Amazon Dynamo article describes the sloppy quorum as follows:

If Dynamo used a traditional quorum approach it would be unavailable during server failures and network partitions, and would have reduced durability even under the simplest of failure conditions. To remedy this it does not enforce strict quorum membership and instead it uses a “sloppy quorum”; all read and write operations are performed on the first N healthy nodes from the preference list, which may not always be the first N nodes encountered while walking the consistent hashing ring.

In other words, the cluster has more than N nodes. And in the case of network partitions, writes could be made to nodes outside of the set of top N preferred nodes. In this case, there would be no guarantee that writes and reads over the N nodes would overlap since the nodes that constitute N are in flux. Therefore, the formula R+W>N has no meaning.

For example, suppose N equals three in a cluster of five nodes (A, B, C, D, and E), and nodes A, B, and C are the top three preferred nodes. Absent any failures, writes will go to nodes A, B, and C. But if B and C were not available for a write, then a system using a sloppy quorum would write to D and E instead. If this were to happen, then even if the write quorum (W) were 3 and the read quorum (R) were 2, making R+W>N, a read immediately following this write could return data from B and C, which would be inconsistent because only A, D, and E would have the latest value.

According to the Amazon Dynamo article, Dynamo mitigates this inconsistency through hinted handoffs. If the system needs to write to nodes D and E instead of B and C, it informs D that its write was meant for B and informs E that its write was meant for C. Nodes D and E keep this information in a temporary store and periodically poll B and C for availability. Once B and C become available, D and E send over the writes.

When balancing the tradeoffs between consistency and availability, it is vital to understand how any particular system handles quorums, whether strict or sloppy.

In some cases, I’ve found the literature unclear on this point. In a post on eventual consistency, Werner Vogels writes “If W+R > N, then the write set and the read set always overlap and one can guarantee strong consistency.” While true in a strict quorum system, this statement is not true in the case of Dynamo and not true of systems based on Dynamo that utilize sloppy quorums.

A page in the Riak wiki states that “R + W > N ensures strong consistency in a cluster” and includes a reference to the post by Vogels on eventual consistency. However, a recent Basho posting states that Riak uses sloppy quorums by default, though it uses strict quorums whenever the values of PR and PW are used rather than R and W. Overall, I didn’t find the Riak documentation clear on this important distinction.

Thanks again to the commentator who pointed out my mistake. As I continue my series exploring NoSQL databases, I’ll be more careful to point out where sloppy quorums could affect consistency.

Posted in NoSQL, Riak