Tuesday night I drove to Santa Clara for the Big Data Camp, learned more about Hadoop, and even ran into a few Dell colleagues. Thanks to Dave Nielson for organizing the camp and to Ken Krugler for his great overview of Hadoop.
While the phrase big data lacks precision, it is of growing importance to ever more enterprises. With data flooding in from web sites, mobile devices, and ever-present sensors, processing it all and deriving business value from it becomes ever more difficult. Data becomes big when it exceeds the storage and processing capabilities of a single computer. While an enterprise could spend lots of money on high-end, specialized hardware finely tuned for a specific database, at some point that hardware will become either inadequate or too expensive.
Once data exceeds the capabilities of a single computer, we face the problem of distributing its storage and processing. To solve this problem, big data innovators are using Hadoop, an open source platform capable of scaling across thousands of nodes of commodity hardware. Hadoop includes both a distributed file system (HDFS) and a framework for distributing batch processing across the cluster (MapReduce). It is scalable, reliable, fault-tolerant, and simple in its design.
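For concreteness, here is the classic word-count job in roughly the form it ships with the Hadoop distribution: the mapper emits a (word, 1) pair for every word in its input split, the framework shuffles those pairs across the cluster, and the reducer sums the counts for each word. The input and output paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every word in its input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, it would be submitted with something like hadoop jar wordcount.jar WordCount /input /output, and the same code runs unchanged whether the cluster has three nodes or three thousand.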
Many enterprises, however, would need to process big data only periodically, perhaps daily or weekly. When running a big data batch job, an organization might want to distribute the load over hundreds or thousands of nodes to ensure completion within a time window. But once the batch job completes, those nodes may not be needed for Hadoop until the next batch job comes due.
This use case—the dynamic expansion and contraction in the need for computing resources—fits well with the capabilities of cloud computing, whether private, public, or hybrid. (In this case, I mean cloud as in Infrastructure-as-a-Service.) A cloud infrastructure could spin up the nodes when needed and free the resources when that need goes away. In public cloud computing, freeing those resources translates directly into lower cost.
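As a rough sketch of what that elasticity can look like in practice, the snippet below uses the AWS SDK for Java to ask Amazon Elastic MapReduce for a temporary Hadoop cluster, run a single batch job, and release the nodes when it finishes. The credentials, bucket names, jar path, and instance counts are made-up placeholders, and the exact API details may vary by SDK version.

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;

public class PeriodicBatchJob {
  public static void main(String[] args) {
    // Placeholder credentials; a real deployment would load these securely.
    AmazonElasticMapReduceClient emr =
        new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    // Describe the Hadoop job to run once the cluster is up (hypothetical jar and paths).
    StepConfig step = new StepConfig()
        .withName("nightly-batch")
        .withHadoopJarStep(new HadoopJarStepConfig()
            .withJar("s3://my-bucket/jobs/batch-job.jar")
            .withArgs("s3://my-bucket/input/", "s3://my-bucket/output/"));

    // Request a cluster sized for the batch window; keepJobFlowAliveWhenNoSteps(false)
    // tells EMR to terminate the nodes as soon as the step completes.
    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("periodic-big-data-job")
        .withSteps(step)
        .withInstances(new JobFlowInstancesConfig()
            .withInstanceCount(50)
            .withMasterInstanceType("m1.large")
            .withSlaveInstanceType("m1.large")
            .withKeepJobFlowAliveWhenNoSteps(false));

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}
```

A private or hybrid cloud could do the same thing with its own provisioning APIs; the point is that the cluster exists only for the duration of the batch window.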
During his presentation, Ken Krugler suggested that we think of Hadoop as a distributed operating system, in that Hadoop enables many computers to store and process data as if they formed a single machine. I’d add that cloud computing—the virtualization, automation, and processes that enable an agile infrastructure—may be needed to complete the analogy: a distributed operating system that not only manages distributed resources but does so with maximum efficiency across a multitude of use cases.
Extra Credit Reading (Dell Resources on Hadoop):
Philippe Julio, Hadoop Architecture, http://www.slideshare.net/PhilippeJulio/hadoop-architecture
Joey Jablonski, Hadoop in the Enterprise, http://www.slideshare.net/jrjablo/hadoop-in-the-enterprise
Aurelian Dumitru (Dell’s Hadoop Chief Architect), Dell’s Big Data Solutions including Hadoop Overview, http://www.youtube.com/watch?v=OTjX4FZ8u2s
That’s me on the left; Aurelian Dumitru, Dell’s Hadoop Chief Architect, in the center; and Barton George, Director of Marketing for Dell’s Web & Tech vertical, on the right. Thanks to DJ Cline for the photo. See http://bit.ly/iHJJIl for more photos of the Big Data Camp.
Mr. Downey,
I found your blog and wanted to see if you might be interested in writing about a Certificate program at the University of Washington on the topic of Cloud Computing. It’s a program available both in the classroom and online, and we’re trying to get the word out among key bloggers in this community.
Would you be interested in participating in a conference call with other bloggers to learn more about the program? As program director, I would participate along with some faculty and/or board members so you could learn more about the offering during the call. We can also provide you with some content that you might find helpful in writing a piece about the program. Lastly, if there are colleagues or fellow bloggers you would recommend we include, please feel free to let us know.
Thanks very much for considering this request.
Erik
Erik Bansleben, Ph.D.
Program Development Director, Academic Programs
UW Professional & Continuing Education
ebansleben@pce.uw.edu
206-221-6243