Michael E. Driscoll, founder of Dataspora, a data consulting firm in San Francisco, gave a presentation on big data to the SDForum Business Intelligence SIG on June 15, 2010 at the SAP campus in Palo Alto. Big data matters now, Driscoll told the audience, as we live in an “age of data”: not only do we humans generate massive quantities of data, but we have surrounded ourselves with sensors that continuously emit data to the cloud.
Driscoll divides data sets into three categories: small data sets of under 10GB that fit in the memory of a single machine and can be manipulated with Excel, medium data sets ranging from 10GB to 1TB that fit on the hard disk of a single server and can be manipulated with traditional relational databases, and big data sets exceeding 1TB that are most efficiently manipulated with distributed databases.
A data scientist, according to Driscoll, must understand the nature of data, the tools with which to manipulate it, and the questions to ask of it. The key skills of the data scientist are munging (because “the world of data is messy”), visualizing (because data must tell a story to be useful), and modeling (because what we care about are patterns).
Driscoll revealed to the BI SIG the nine secrets of the successful data scientist:
1. Choose the right tool. Do not use a chainsaw to cut butter.
2. Compress everything. Reduce your big data to medium size whenever possible.
3. Split data into smaller partitions, analyze each partition, and combine the results.
4. Start with samples.
5. Use statistics.
6. Copy from others. Take advantage of open source.
7. Use charts expressively. Do not limit yourself to the standard pie and bar charts.
8. Color with care. “Color can enhance or insult.” Used correctly, hues and shades add dimensions. Used incorrectly, colors confuse the eyes.
9. Tell a story. To be meaningful, data must communicate.
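The third secret (split the data, analyze the partitions, combine the results) can be sketched in a few lines of Python. This is a minimal illustration and not code from Driscoll's talk: the function name `split_apply_combine`, the `region` and `minutes` fields, and the sample records are all made up for the example.

```python
from collections import defaultdict
from statistics import mean


def split_apply_combine(records, key, value, func):
    """Group records by `key`, apply `func` to each group's values, collect the results."""
    # Split: partition the values by the grouping key.
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record[value])
    # Apply and combine: compute one result per partition and gather them in a dict.
    return {group: func(values) for group, values in partitions.items()}


# Hypothetical call records, invented for illustration only.
calls = [
    {"region": "north", "minutes": 4},
    {"region": "south", "minutes": 10},
    {"region": "north", "minutes": 6},
    {"region": "south", "minutes": 2},
]

avg_minutes = split_apply_combine(calls, "region", "minutes", mean)
print(avg_minutes)
```

Because each partition is analyzed independently, the apply step could in principle be spread across several machines, which is one reason the pattern suits data sets too large for a single server.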
After revealing his secrets, Driscoll recounted a recent success story. A large telecom had an important question to ask of its big data: why do its cell phone customers switch carriers? The data covered millions of customers and billions of calls. The telecom’s initial hypothesis was that customers left because of poor call quality, but it turned out that the number of dropped calls did not correlate significantly with customer churn. Driscoll discovered a far more predictive factor. We and the people we call frequently form social networks, which Driscoll modeled as a call graph. When one member of a social network switches carriers, the churn among connected nodes increases 700%. The telecom now has a predictive model with which to target marketing.
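To make the call-graph idea concrete, here is a rough sketch that compares churn among customers who have a churned contact with churn among those who do not. It is not Driscoll's model, whose details the talk summary does not give. The function name `churn_rate_by_exposure`, the customer names, and the graph are hypothetical, and the sketch ignores the order in which customers left, which a real predictive model would need to account for.

```python
def churn_rate_by_exposure(call_graph, churned):
    """Return (churn rate among customers with a churned contact, churn rate among the rest)."""
    exposed, unexposed = [], []
    for customer, contacts in call_graph.items():
        # A customer is "exposed" if at least one frequent contact has churned.
        has_churned_contact = any(contact in churned for contact in contacts)
        group = exposed if has_churned_contact else unexposed
        group.append(customer in churned)

    def rate(flags):
        # Fraction of True flags; 0.0 for an empty group.
        return sum(flags) / len(flags) if flags else 0.0

    return rate(exposed), rate(unexposed)


# Hypothetical call graph: each customer maps to the set of people they call frequently.
call_graph = {
    "ann": {"bob", "cat"},
    "bob": {"ann"},
    "cat": {"ann", "dan"},
    "dan": {"cat"},
    "eve": {"fay"},
    "fay": {"eve"},
}
churned = {"ann", "bob"}  # made-up set of customers who switched carriers

exposed_rate, unexposed_rate = churn_rate_by_exposure(call_graph, churned)
print(f"exposed: {exposed_rate:.2f}, unexposed: {unexposed_rate:.2f}")
```

A large gap between the two rates on real data would be the kind of signal the talk describes, though a toy graph of six customers shows only the mechanics.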
The data scientist, Driscoll concluded, should take data and create products, transforming data into business value.