June 27, 2017

Benefits of Skepticism: Big Data

Big data is the future of design. Big data is the future of marketing. Big data is empirical. Big data is going to make up for the fallibility of the human mind.

Though there are a lot of potential applications for utilizing this data, it is important to look at it for what it is: a big pile of information we’re still trying to figure out how to sort.

Big data is the term for large data sets, “typically consisting of billions or trillions of records, that are so vast and complex that they require new and powerful computational resources to process.”[1] Often, this data is accumulated through computational processes: algorithms, machine learning, etc.

It feels significant because it is an enormous amount of information that seems to sidestep the need for research methods and the design of a research study. If you can just access and process the data, the thinking goes, you have your research right there.

There are, however, many problems with this usage of big data, and with our utopian view of it. Big data, much like other kinds of data sets, can be incredibly biased. It can be misinterpreted. Its quality varies. It is not as sound or as reliable as we would like it to be.

There are four things we need to consider when talking about big data:

1. The cum hoc ergo propter hoc fallacy

Latin for “with this, therefore because of this.” The phrase names a logical fallacy about correlation. If two variables are correlated, we are often tempted to assume that one caused the other. The vast majority of assumptions made using big data are based on correlation: that one thing causes the other, or is in some way related to the other.

For example:

“A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.”[1]

Just because two variables are correlated does not necessarily mean one caused the other. Though that is the case some of the time, it is important to understand that it is not the case all of the time.
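To make the point concrete, here is a minimal sketch in Python. It assumes two entirely made-up series (the names `series_a` and `series_b`, the slopes, and the noise levels are all hypothetical) that have nothing to do with each other except that both decline over time. The raw correlation comes out high simply because of the shared trend. Once we look at period-to-period changes instead, the relationship should largely disappear.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def differences(xs):
    """Change from one time step to the next (removes a linear trend)."""
    return [b - a for a, b in zip(xs, xs[1:])]


rng = random.Random(0)  # fixed seed so the run is repeatable
steps = range(60)       # hypothetical time steps, e.g. months

# Two invented series: each declines over time with its own independent noise.
series_a = [100 - 1.0 * t + rng.gauss(0, 3) for t in steps]
series_b = [50 - 0.5 * t + rng.gauss(0, 2) for t in steps]

raw_r = pearson(series_a, series_b)
diff_r = pearson(differences(series_a), differences(series_b))

print(f"correlation of raw series:      {raw_r:.2f}")
print(f"correlation of step-to-step changes: {diff_r:.2f}")
```

Differencing is only one way to check whether a shared trend is doing all the work, and it does not prove or disprove causation. It just shows how easily a strong correlation can appear between unrelated things.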

We used Google Correlate, a tool built on the same methodology as Google Flu Trends, and looked up a few random words. Google Correlate “finds search patterns which correspond with real-world trends” according to its site. Out of curiosity, we looked up “robots,” which correlates with a variety of things, but our favorite is the phrase “being a girl.”

So, at some point in 2005, a bunch of people were Googling the phrase “robots” and the phrase “being a girl.” If we were to make a cum hoc ergo propter hoc assumption about these variables, we would say that there is something about robots that is like being a girl. Or that being a girl caused us to think about robots.

It’s important to look beyond your data to add context. In early March 2005, the wonderfully whimsical children’s movie Robots (starring the voices of Ewan McGregor, Mel Brooks, and Robin Williams) hit theaters. Around the same time, Gap aired a commercial featuring Sarah Jessica Parker singing a song called “I Enjoy Being a Girl.”

It’s a silly example, but it is something to consider. We can’t assume that correlation is causality, especially not without research and context outside of the dataset.

2. Recency Bias

If 90% of the world’s data was created in the last few years, we have an inherent recency bias in our data.

Recency bias is “the tendency to assume that future events will closely resemble recent experience. It’s a version of what is also known as the availability heuristic: the tendency to base your thinking disproportionately on whatever comes most easily to mind. It’s also a universal psychological attribute.”

The present moment is always the largest dataset, having a greater influence on our research than anything in the past. Thus, if we’re looking at big data for something predictive, something to tell us how things will be in the future, we need to know what is significant in our present data. We need to wash away what isn’t significant. We also need to include the past. We cannot determine our future based on what has happened in the last couple of years alone.
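A small arithmetic sketch of how this plays out, written in Python. Every number here is invented for illustration: we assume the volume of records doubles each year, and that some measured value (the hypothetical `value_by_year`) only shifted in the last two years. Pooling all records together lets the recent years dominate, while averaging year by year gives the past an equal say.

```python
# Hypothetical yearly record counts (in thousands), doubling each year.
records_by_year = {
    2010: 1, 2011: 2, 2012: 4, 2013: 8,
    2014: 16, 2015: 32, 2016: 64, 2017: 128,
}

# Hypothetical measured value per year: stable for six years, then a jump.
value_by_year = {
    2010: 10, 2011: 10, 2012: 10, 2013: 10,
    2014: 10, 2015: 10, 2016: 20, 2017: 20,
}

total_records = sum(records_by_year.values())

# Share of all records that come from the two most recent years.
recent_share = (records_by_year[2016] + records_by_year[2017]) / total_records

# Pooled mean: every record counts equally, so big recent years dominate.
pooled_mean = sum(
    records_by_year[y] * value_by_year[y] for y in records_by_year
) / total_records

# Per-year mean: every year counts equally, regardless of its volume.
per_year_mean = sum(value_by_year.values()) / len(value_by_year)

print(f"share of records from the last two years: {recent_share:.0%}")
print(f"pooled mean:   {pooled_mean:.1f}")
print(f"per-year mean: {per_year_mean:.1f}")
```

Neither average is the "right" one. The point is that the choice of weighting is a research decision, and a raw pile of data makes that decision for us unless we look.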

3. Confirmation Bias

Another very human psychological attribute that affects our data is confirmation bias. Confirmation bias is the “seeking or interpreting of evidence in ways that are partial to existing beliefs, expectations, or a hypothesis in hand.” This, much like recency bias, is a universal psychological characteristic. This is something everyone does, whether they are aware of it or not.

“Once we have formed a view, we embrace information that confirms that view while ignoring, or rejecting, information that casts doubt on it. Confirmation bias suggests that we don’t perceive circumstances objectively. We pick out those bits of data that make us feel good because they confirm our prejudices. Thus, we may become prisoners of our assumptions.”

The issue here is that we are coming to the data with questions. Because big data is far too large for it to yield one result, like a designed research study might, we approach the data with a question. That question, presumably, has an answer. We, as people, have an assumption of what that answer is going to be, and tend to look for data that confirms our assumptions.

This is just a truth of human psychology. We all naturally create linkages between the things we want to believe and whatever evidence exists that would confirm those beliefs. However, our general inability to think critically about data, especially when it’s giving us the answer we want, becomes problematic.

4. Data Quality

Data, in the past, was a result of research. Now, the majority of our data comes from private companies that are collecting it without a designed study or a specific goal. It is simply being dug up and piled somewhere. Because of this, it’s hard to tell what data we’re missing. We don’t have a good sense of what we have, let alone what the gaps in the information are.

Research is designed for a reason: to work toward an empirical and well-rounded set of data, to know where it comes from, to be aware of its accuracies and faults. When we use random data, we don’t account for what we’re missing or for what its faults are.

For example:

“…consider the Twitter data generated by Hurricane Sandy, more than 20 million tweets between October 27 and November 1. A fascinating study combining Sandy-related Twitter and Foursquare data produced some expected findings (grocery shopping peaks the night before the storm) and some surprising ones (nightlife picked up the day after — presumably when cabin fever strikes). But these data don’t represent the whole picture. The greatest number of tweets about Sandy came from Manhattan. This makes sense given the city’s high level of smartphone ownership and Twitter use, but it creates the illusion that Manhattan was the hub of the disaster. Very few messages originated from more severely affected locations, such as Breezy Point, Coney Island and Rockaway. As extended power blackouts drained batteries and limited cellular access, even fewer tweets came from the worst hit areas. In fact, there was much more going on outside the privileged, urban experience of Sandy that Twitter data failed to convey, especially in aggregate. We can think of this as a “signal problem”: Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.”
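The “signal problem” in the passage above can be sketched with a few lines of Python. The areas and numbers below are entirely hypothetical and are not taken from the Sandy study. We simply assume that the hardest-hit area is also the one least able to post, then compare each area's share of posts with its share of people actually affected.

```python
# Hypothetical areas: (name, residents affected, percent able to post online).
# All figures are invented for illustration only.
areas = [
    ("area_a", 1_000, 60),  # lightly affected, well connected
    ("area_b", 5_000, 5),   # heavily affected, mostly offline
    ("area_c", 3_000, 10),  # moderately affected, poorly connected
]

# Posts we would observe if only connected residents can post.
posts = {name: affected * pct // 100 for name, affected, pct in areas}
affected_counts = {name: affected for name, affected, _ in areas}

total_posts = sum(posts.values())
total_affected = sum(affected_counts.values())

for name in posts:
    post_share = posts[name] / total_posts
    affected_share = affected_counts[name] / total_affected
    print(f"{name}: {post_share:.0%} of posts, {affected_share:.0%} of people affected")

# The area that dominates the data versus the area that was hit hardest.
loudest = max(posts, key=posts.get)
hardest_hit = max(affected_counts, key=affected_counts.get)
print(f"most posts: {loudest}, most affected: {hardest_hit}")
```

In this toy setup the area with the most posts is the one with the fewest people affected. Anyone reading only the post counts would draw the wrong map of the event.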

Big data is not sorted through or thought about critically. Because it’s simply gathered and stockpiled, it’s full of holes, inaccuracies, and misleading correlations. We must learn to read and scrutinize our data thoroughly. It is a form of literacy we have not developed because our society has an overarching belief that computation is somehow beyond human fallibility. We forget that the data is curated by algorithms we wrote, and is made up of our own information.

Overall, this is not to say big data isn’t a valuable resource. The potential for its application in a variety of fields is significant. We do, however, need to develop these literacies. We need to be skeptical about our data, where it comes from, and what it’s telling us.

– – – – –

[1] Gary Marcus & Ernest Davis, The New York Times.