Microsoft Research's Kate Crawford has written a terrific blog post for HBR about big data. In it, she discusses the hype surrounding big data and the hidden biases we must be aware of when analyzing large data sets:
Data and data sets are not objective; they are creations of human design. We give numbers their voice, draw inferences from them, and define their meaning through our interpretations. Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
Crawford offers some terrific examples of biases in data sets. For instance, she explains how Twitter data from Hurricane Sandy presents a distorted view of the storm. Why? As the storm progressed, people in the hardest-hit areas ran out of battery power on their cellphones, and thus they stopped tweeting. Folks in Manhattan, where the storm was significant but not as devastating, generated much more Twitter activity. Moreover, people in the lowest income groups are underrepresented on Twitter, because many do not own smartphones. As she writes:

We can think of this as a "signal problem": Data are assumed to accurately reflect the social world, but there are significant gaps, with little or no signal coming from particular communities.
The lesson is clear. Begin your big-data project by asking: How was the data collected? Which populations are overrepresented? Which are underrepresented? Beyond that, you should ask: Who collected and assembled the data set? Do they have an agenda? Are they biased in any way? Often, the biggest bias in big data has nothing to do with access to technology or underrepresented populations. Instead, the most significant bias lies in the mind of the person assembling the data, whose agenda clouds the process of data collection.