On Tuesday night I drove to Santa Clara for the Big Data Camp, learned more about Hadoop, and even ran into a few Dell colleagues. Thanks to Dave Nielsen for organizing the camp and to Ken Krugler for his great overview of Hadoop.
While the phrase “big data” lacks precision, it matters to a growing number of enterprises. With data flooding in from web sites, mobile devices, and ever-present sensors, processing it all and deriving business value from it becomes steadily more difficult. Data becomes big when it exceeds the storage and processing capabilities of a single computer. An enterprise could spend lots of money on high-end, specialized hardware finely tuned for a specific database, but at some point that hardware will become either inadequate or too expensive.
Once data exceeds the capabilities of a single computer, we face the problem of distributing its storage and processing. To solve this problem, big data innovators are using Hadoop, an open source platform capable of scaling across thousands of nodes of commodity hardware. Hadoop includes both a distributed file system (HDFS) and a mechanism for distributing batch processes (MapReduce). It is scalable, reliable, fault tolerant, and simple in its design.
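To make the batch-processing half a bit more concrete, here is a minimal sketch of the map, shuffle, and reduce pattern that Hadoop's MapReduce spreads across many nodes. It runs in a single Python process and uses no Hadoop API at all; the function names (map_phase, shuffle, reduce_phase, word_count) are my own, chosen only for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, roughly what the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all the counts collected for one word into a total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence, in one process (illustration only)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big cluster", "data everywhere"]))
```

In a real cluster the map calls would run on the nodes that hold each block of the input, and the grouped keys would be partitioned across many reducers; the sketch only shows the shape of the computation.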
Many enterprises, however, need to process big data only periodically, perhaps daily or weekly. When running a big data batch job, an organization might want to distribute the load over hundreds or thousands of nodes to ensure completion within a time window. But when the batch job completes, those nodes may not be needed for Hadoop until the time comes for the next batch job.
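As a back-of-the-envelope illustration of that sizing question, the sketch below estimates how many nodes a batch job needs to finish inside its window. It assumes perfectly linear scaling, which real jobs rarely achieve, and the function name and the throughput figure are hypothetical, not measurements from any actual cluster.

```python
import math


def nodes_needed(total_gb, gb_per_node_hour, window_hours):
    """Smallest node count that finishes inside the window, assuming linear scaling."""
    if min(total_gb, gb_per_node_hour, window_hours) <= 0:
        raise ValueError("all inputs must be positive")
    return math.ceil(total_gb / (gb_per_node_hour * window_hours))


if __name__ == "__main__":
    # Hypothetical job: 50,000 GB of input, 25 GB per node per hour, 4-hour window.
    print(nodes_needed(50_000, 25, 4))
```

Treat the result as a lower bound: coordination overhead, stragglers, and failed tasks all push the real number higher.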
This use case, the dynamic expansion and contraction in the need for computing resources, fits well with the capabilities of cloud computing, whether private, public, or hybrid. (In this case, I mean cloud as in Infrastructure-as-a-Service.) A cloud infrastructure could spin up the nodes when needed and free the resources when that need goes away. In public cloud computing, where you pay for what you use, freeing those resources directly lowers cost.
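A rough sketch of the cost argument, under loudly stated assumptions: the price per node-hour is made up, a month is approximated as 730 hours, and startup time, storage, and data-transfer charges are ignored. Prices are kept in whole cents so the arithmetic stays exact. The function names are my own.

```python
HOURS_PER_MONTH = 730  # approximation: 8,760 hours per year divided by 12


def on_demand_cost_cents(nodes, hours_per_run, runs_per_month, cents_per_node_hour):
    """Monthly cost when nodes exist only while the batch job is running."""
    return nodes * hours_per_run * runs_per_month * cents_per_node_hour


def always_on_cost_cents(nodes, cents_per_node_hour):
    """Monthly cost when the same nodes are kept running around the clock."""
    return nodes * HOURS_PER_MONTH * cents_per_node_hour


if __name__ == "__main__":
    # Hypothetical workload: 500 nodes, one 4-hour run per day, 10 cents per node-hour.
    elastic = on_demand_cost_cents(500, 4, 30, 10)
    fixed = always_on_cost_cents(500, 10)
    print(f"on demand: ${elastic / 100:,.2f}  always on: ${fixed / 100:,.2f}")
```

The gap narrows as the cluster spends more of the month busy, which is why the elastic model suits periodic batch jobs in particular.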
During his presentation, Ken Krugler suggested that we think of Hadoop as a distributed operating system, in that Hadoop enables many computers to store and process data as if they formed a single machine. I’d add that cloud computing (virtualization, automation, and the processes that enable an agile infrastructure) may be needed to complete this analogy. With it, the distributed operating system not only manages distributed resources but does so with maximum efficiency across a multitude of use cases.
Extra Credit Reading (Dell Resources on Hadoop):
Philippe Julio, Hadoop Architecture, http://www.slideshare.net/PhilippeJulio/hadoop-architecture
Joey Jablonski, Hadoop in the Enterprise, http://www.slideshare.net/jrjablo/hadoop-in-the-enterprise
Aurelian Dumitru, Hadoop Chief Architect, Dell’s Big Data Solutions including Hadoop Overview, http://www.youtube.com/watch?v=OTjX4FZ8u2s
That’s me on the left; Aurelian Dumitru, Dell’s Hadoop Chief Architect, in the center; and Barton George, Director of Marketing for Dell’s Web & Tech vertical, on the right. Thanks to DJ Cline for the photo. See http://bit.ly/iHJJIl for more photos of the Big Data Camp.