by Zachariah Sharek, PhD, Director of Strategy and Innovation, CivicScience
A recent article in the Financial Times (FT), “Big data: are we making a big mistake?” (March 28, 2014), paints a scathing picture of the real-world applications of Big Data and the results it can (and cannot) deliver. The article has caused quite a buzz within the market research and technology industries over the past few weeks. Since I work for a company that uses data science to mine and analyze real-time and historical online consumer research data, and to avoid being tarred by the same brush, I want to address some of the points the article raises and offer my own take on the real value of Big Data.
The FT article presents four claims made by proponents of Big Data; briefly summarized, these claims are:
- Big data analysis produces uncannily accurate results
- Big data can capture an entire population, rendering statistical sampling unnecessary
- Big data correlations render concerns about causality obsolete
- Big data results can be interpreted without the use of statistical models
Data analysis gives us many wonderful benefits, especially when properly applied to huge datasets, but I don’t believe that any of the above-cited claims would be endorsed by any serious statistician. In fact, in the business where I work, we do our best to dissuade clients from believing such bold and unrealistic claims. However, the devil is in the details, especially when dealing with statistics, and there are nuances in these particular claims that are useful to explore.
Analysis can produce accurate results, and Big Data analysis can be extremely accurate. It can also be incredibly inaccurate. I think much of the confusion about the accuracy of Big Data stems from confusion about how analytical models are deployed. Consider Google Flu Trends, which the FT uses as an example of a Big Data model that failed miserably. How much of a failure was it, truly? When Flu Trends failed, it over-predicted the prevalence of the flu; in statistical parlance, it was biased toward predicting flu. From a policy standpoint, this bias is not necessarily undesirable. If we had to assign costs to errors, we might place more weight on the cost of under-predicting the flu, perhaps because the deaths and lost workdays that follow an unanticipated outbreak cost far more than the consequences of over-predicting one. Thus, the most effective model (measured financially) might not be the most accurate model.
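To make the cost argument concrete, here is a minimal sketch in Python, using entirely hypothetical case counts and cost weights (none of this is Flu Trends or CDC data), of scoring two forecasts under asymmetric error costs, where under-predicting an outbreak is penalized ten times as heavily as over-predicting one:

```python
# A minimal sketch with hypothetical weekly case counts (in thousands) and
# hypothetical cost weights; not Flu Trends or CDC data.

def cost_weighted_error(predicted, actual, under_cost=10.0, over_cost=1.0):
    """Total cost of a forecast when under-prediction is penalized more heavily."""
    total = 0.0
    for p, a in zip(predicted, actual):
        gap = a - p
        if gap > 0:                      # forecast came in below actual flu activity
            total += under_cost * gap
        else:                            # forecast came in at or above actual activity
            total += over_cost * (-gap)
    return total

actual      = [10, 20, 45, 60, 40]
biased_high = [12, 25, 55, 70, 48]       # over-predicts, as Flu Trends did
biased_low  = [ 8, 15, 35, 50, 32]       # under-predicts by the same absolute amounts

print(cost_weighted_error(biased_high, actual))   # 35.0  (cheap false alarms)
print(cost_weighted_error(biased_low, actual))    # 350.0 (costly missed cases)
```

Both forecasts miss by the same absolute amount, but the one biased toward predicting flu comes out far cheaper under this cost structure, which is the sense in which a less “accurate” model can be the more effective one.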
Further, the article contrasts Flu Trends with the CDC’s own, much slower data. Why should we treat these as two separate models? The fields of statistics and machine learning have long shown that the best models are often ensembles of several models, aggregated in a way that minimizes error. If we combined these two sources into a single model, we could build a system that alerts us to dramatic outbreaks weeks ahead of time, at the cost of an increased number of false alarms.
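As a rough illustration of this ensemble idea, here is a toy Python sketch, again with hypothetical numbers, that calibrates a fast, biased nowcast against slower surveillance data on past weeks and then applies the correction to the newest week, where only the fast signal is available:

```python
# A toy sketch with hypothetical numbers: calibrate the fast, biased signal
# against the slower, more reliable one on past weeks, then apply that
# correction to the newest week, for which only the fast signal exists so far.
import numpy as np

fast_past = np.array([12.0, 26.0, 58.0, 72.0])   # e.g. a search-based nowcast, biased high
slow_past = np.array([10.0, 20.0, 45.0, 60.0])   # e.g. reported surveillance, arrives weeks later

# Fit a simple linear correction: slow ~ a * fast + b
a, b = np.polyfit(fast_past, slow_past, deg=1)

fast_now = 50.0                                  # the only signal available this week
corrected_now = a * fast_now + b
print(round(corrected_now, 1))                   # an early estimate, debiased toward the slow source
```

A real system would be far more sophisticated, but even this crude combination keeps the speed of the fast source while inheriting some of the reliability of the slow one.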
Other Google products have inured us to using tools that are still in beta, and we should extend the same allowance to data models. Flu Trends can be adjusted and redeployed to better discriminate between searches for articles about the flu in general and searches by people seeking relief from flu symptoms.
The second claim addresses the need for statistical sampling. While some Big Data applications can gather information about an entire population, most marketing-focused Big Data applications are forced to collect samples, albeit sometimes extremely large ones. The problem here is not, as the article puts it, “making old statistical sampling techniques obsolete,” but rather the issue of statistical significance. Almost everyone who knows a little bit about statistics, when confronted with the results of a sample or survey, is sure to ask, “But is it statistically significant?” The trouble with this question in a Big Data world is that the results are almost always statistically significant, because classical statistics calculates significance partly on the basis of sample size. As sample sizes grow, even tiny differences between groups become statistically significant. The more apropos question is about effect size: how large is the actual difference between groups? For example, men are a few inches taller than women on average, i.e. the effect size of being male is a few inches. If everyone were 100 feet taller, those few inches of difference, while still statistically significant, would be practically insignificant.
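A quick simulation makes the point. This is illustrative only, with synthetic data rather than anything from a real survey, but it shows a difference of two hundredths of a standard deviation becoming wildly “significant” at a sample of a million per group:

```python
# Illustrative only: synthetic data with a true difference of 0.02 standard deviations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000
group_a = rng.normal(loc=0.00, scale=1.0, size=n)
group_b = rng.normal(loc=0.02, scale=1.0, size=n)

t_stat, p_value = stats.ttest_ind(group_a, group_b)

# Cohen's d: the mean difference expressed in standard-deviation units (an effect size)
pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
cohens_d = (group_b.mean() - group_a.mean()) / pooled_sd

print(f"p-value:   {p_value:.2e}")   # essentially zero: highly "significant"
print(f"Cohen's d: {cohens_d:.3f}")  # about 0.02: negligible in practical terms
```

The p-value is vanishingly small, yet a Cohen’s d of about 0.02 tells us the difference is trivial in practical terms.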
The third claim is that since causality is hard and correlations give us everything we need, we can ignore issues of causality. Causality is indeed hard; it can be extremely difficult to figure out what caused something to happen, especially when dealing with marketing questions.
A simple way to differentiate between correlation and causation: correlations tell us the “who” and “what,” whereas causality tells us the “why” and “how.” Both are extremely useful to marketing, but correlations are much less costly to obtain. For example, it is much easier to determine that consumers who prefer one product over another are also more (or less) likely to watch ten or more hours of television a week than it is to determine why they prefer that product. Furthermore, the why and how are often harder to describe precisely, whereas “whos” and “whats” are much easier to articulate. I suspect this ease of access and articulation is why correlations are so overused and misused.
Finally, the article asks how we should interpret the results of Big Data analyses. Are statistical models necessary? One particular aspect of the need for models is the problem of multiple comparisons. Usually, a result is considered significant if the chance of it occurring under the null hypothesis is less than 5%. This means that, on average, roughly one out of every 20 comparisons will appear statistically significant purely by chance, even when there is no real effect. This is a manageable problem in traditional research, where a couple of dozen comparisons might be made; it is a huge problem with a Big Data dataset, where thousands of comparisons can be made. Compounding the problem, almost all of those comparisons can come back statistically significant.
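Here is an illustrative simulation (synthetic noise, nothing more) of how this plays out: run a few thousand tests on data with no real differences at all, and roughly 5% of them still come back “significant” at p < 0.05:

```python
# Illustrative only: thousands of tests on pure noise, with no real differences anywhere.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_tests, n_per_group = 5000, 200

false_positives = 0
for _ in range(n_tests):
    a = rng.normal(size=n_per_group)          # both groups drawn from the same distribution
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(false_positives)                        # roughly 250 of 5,000, i.e. about 5%, by chance alone
```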
There are techniques for addressing the multiple comparisons problem. For example, at CivicScience where I work, we use what are called False Discovery Rate techniques to detect and reject spurious statistically significant results (specifically, we use the Benjamini-Hochberg procedure) and then rank the results by effect size measures (specifically, we use Tschuprow’s T).
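For readers who want to see the mechanics, below is a textbook-style sketch of these two steps in Python. It is not CivicScience’s internal code, and the p-values and crosstab are hypothetical:

```python
# A textbook-style sketch, not CivicScience's internal implementation;
# the p-values and the crosstab below are hypothetical.
import numpy as np
from scipy import stats

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of hypotheses to reject while controlling the false discovery rate."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m     # step-up thresholds i/m * alpha
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()               # largest rank meeting its threshold
        reject[order[:k + 1]] = True                 # reject it and every smaller p-value
    return reject

def tschuprow_t(table):
    """Tschuprow's T effect size for an r x c contingency table of observed counts."""
    table = np.asarray(table, dtype=float)
    chi2, _, _, _ = stats.chi2_contingency(table, correction=False)
    n = table.sum()
    r, c = table.shape
    return np.sqrt(chi2 / (n * np.sqrt((r - 1) * (c - 1))))

p_vals = [0.001, 0.009, 0.04, 0.20, 0.65]
print(benjamini_hochberg(p_vals))       # [ True  True False False False]

crosstab = [[30, 70], [55, 45]]         # hypothetical preference-by-behavior crosstab
print(round(tschuprow_t(crosstab), 3))
```

The Benjamini-Hochberg step controls the expected share of false discoveries among the results we keep, and Tschuprow’s T then orders the survivors by how strong the association actually is.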
Big Data has a lot of potential to transform business, but it is important to understand what it can and cannot do. Applying it takes skill, finesse, and patience to avoid the errors and unrealistic expectations described above.
Zachariah Sharek is a behavioral decision scientist and the Director of Strategy and Innovation at CivicScience. He holds a PhD in Organizational Behavior from the Tepper School of Business at Carnegie Mellon University.