The Lyons Den: How Do You Measure Data Accuracy?
May 30, 2013
This week, eXelate unveiled the second white paper in our Smart Data series, “Accurate Data Is Smart Data.” We believe that Big Data has yet to fulfill its promise of providing a clear and consistent business advantage. To address this, we argue that the industry needs to evolve from Big Data to Smart Data – data that drives business value because it is accurate, actionable, and agile. The accuracy white paper explains why accuracy is important, how to gauge it, and finally, what it takes to implement data accuracy.
Which brings us to today’s topic, which builds on the white paper: How do you measure accuracy?
In measuring accuracy, it seems to me that many confuse accuracy and precision. Some, for example, might argue that a consensus approach which polls data providers is the best way to judge data accuracy. This approach essentially claims that the attributes with the highest consensus across data providers are the most accurate. We categorically reject this simplistic view as it equates agreement – or precision – with accuracy. In fact, checking multiple data sources against one another may create a kind of confirmation bias (the tendency for people to believe data that supports what they already believe to be true).
This can happen when multiple vendors use the same technique to collect data. Take income data, for example. If several data vendors assign the mean income for a zip code to every individual user in it, they will all agree for a given household, but they will all be wrong for most households. And, interestingly, if five data vendors applied the zip code mean to a specific user and one supplied that user’s actual income, the actual income would be discounted as inaccurate. Correlated errors across data sources are often misinterpreted as accuracy. That is, everyone agreeing doesn’t necessarily make it right; it may just be that many data sources are wrong for the same reason – and that is why the consensus approach breaks down. Precision is not accuracy.
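To make the income example concrete, here is a minimal sketch of how a consensus check can reward a shared mistake. The household incomes, the household IDs, and the helper names (`vendor_reports`, `consensus_value`) are all hypothetical and chosen only for illustration; they are not eXelate data or an eXelate method.

```python
from collections import Counter

# Hypothetical household incomes in a single zip code (illustrative values only).
actual_income = {
    "hh1": 30_000,
    "hh2": 45_000,
    "hh3": 52_000,
    "hh4": 61_000,
    "hh5": 250_000,
}
zip_mean = sum(actual_income.values()) / len(actual_income)


def vendor_reports(household):
    """Five vendors assign the zip-code mean; a sixth reports the actual income."""
    return [zip_mean] * 5 + [actual_income[household]]


def consensus_value(reports):
    """Return the value reported by the most vendors (the 'consensus')."""
    return Counter(reports).most_common(1)[0][0]


for household, truth in actual_income.items():
    consensus = consensus_value(vendor_reports(household))
    error = abs(consensus - truth)
    print(f"{household}: consensus={consensus:,.0f} truth={truth:,} error={error:,.0f}")
```

In this toy setup the consensus is the zip-code mean for every household, so the one vendor holding the true figure is outvoted each time, even though the agreed-upon value is off by tens of thousands of dollars.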
We therefore believe that online data must be validated with gold-standard, independent third-party sources, and that these sources must contain registered (user-level) audience demography. Validating eXelate data against third-party sources such as comScore and Nielsen allows us to verify that our data achieves the greatest possible balance of scale and accuracy, without degrading it by applying overly loose inclusion criteria.
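There are a couple of very straightforward ways to validate the accuracy of data against a gold standard. For binary (either/or) outcomes, a confusion matrix is probably the most common approach. Let’s look at the simple case of gender. The chart below represents a simplified confusion matrix which allows us to judge the accuracy of our data:

| | Gold standard: male | Gold standard: female |
| eXelate believes: male | Accurate (male) | Inaccurate (actually female) |
| eXelate believes: female | Inaccurate (actually male) | Accurate (female) |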
The above chart reads as follows. The rows represent what eXelate believes about a user. The columns are the “gold standard” against which we are being evaluated. So, from the perspective of correctly identifying males, there are four potential outcomes:
- Accurate (male) – technically known as “true positive” – means that both eXelate and the “gold standard” know the user to be male; or, we got it right.
- Accurate (female) – or “true negative” – means that both eXelate and the “gold standard” know the user to be female, so again we got it right (it’s a “true negative” because from the perspective of identifying males, we’re technically agreeing that this user is not male).
- Inaccurate (actually male) – or “false negative” – means that eXelate believes the user to be female, but they are actually male; or, we identified this user incorrectly.
- Inaccurate (actually female) – or “false positive” – means that eXelate believes the user to be male, but they are actually female; wrong again.
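The four outcomes listed here can be tallied mechanically once each user has both a predicted label and a gold-standard label. The following is a minimal sketch; the function name `confusion_counts` and the six sample labels are assumptions made for illustration, not eXelate’s actual tooling or data.

```python
from collections import Counter


def confusion_counts(predicted, gold, positive="male"):
    """Tally TP, FP, FN and TN for one positive class from paired labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    counts = Counter()
    for p, g in zip(predicted, gold):
        if p == positive and g == positive:
            counts["TP"] += 1  # we said male, gold standard says male
        elif p == positive:
            counts["FP"] += 1  # we said male, gold standard says female
        elif g == positive:
            counts["FN"] += 1  # we said female, gold standard says male
        else:
            counts["TN"] += 1  # we said female, gold standard says female
    return {key: counts[key] for key in ("TP", "FP", "FN", "TN")}


# Hypothetical labels for six users.
predicted = ["male", "male", "female", "female", "male", "female"]
gold = ["male", "female", "male", "female", "male", "female"]
print(confusion_counts(predicted, gold))
# {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}
```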
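In the above, with TP, TN, FP, and FN denoting the counts of true positives, true negatives, false positives, and false negatives, accuracy is defined as:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$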
So, if you saw the following results:
your accuracy would be:
meaning that, overall, you were right 81% of the time.
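To underscore the fact that precision is not equal to accuracy, even in technical terms, we can note that precision for males, with TP the number of true positives and FP the number of false positives, is defined as:

$$\text{precision} = \frac{TP}{TP + FP}$$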
meaning that if you showed an ad to users that you thought were males, you would be on target 83% of the time.
These equations, as well as something called recall (of males, how many did we reach?), are KPIs eXelate employs to continuously evaluate and improve our data.
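As a rough sketch of how these three KPIs relate, the snippet below computes them from the four confusion-matrix counts using the standard definitions. The counts are hypothetical: they were picked only so that the output matches the 81% accuracy and 83% precision quoted above, and they are not the original results.

```python
def accuracy(tp, fp, fn, tn):
    """Share of all users that were labeled correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def precision(tp, fp):
    """Of the users we labeled male, the share that really are male."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of the users who really are male, the share we labeled male."""
    return tp / (tp + fn)


# Hypothetical counts for 1,000 users (assumed, not taken from the white paper).
tp, fp, fn, tn = 415, 85, 105, 395

print(f"accuracy:  {accuracy(tp, fp, fn, tn):.0%}")  # 81%
print(f"precision: {precision(tp, fp):.0%}")         # 83%
print(f"recall:    {recall(tp, fn):.0%}")            # 80%
```

Note that the three numbers differ even though they come from the same four counts, which is the technical sense in which precision is not accuracy.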
So, to summarize, accuracy matters! Everyone in our business needs to understand what it is and what it is not. Accuracy needs to be calculated using accepted methods against gold standards. And those that fail to give it the attention it deserves do so at their own risk!
We’ll have much more to say on Smart Data and I encourage you to read our Smart Data series of white papers and continue to follow us @eXelate.