The Lyons Den: How Do You Measure Data Accuracy?

This week, eXelate unveiled the second white paper in our Smart Data series, Accurate Data Is Smart Data. We believe that Big Data has yet to fulfill its promise of providing a clear and consistent business advantage. To address this, we argue that the industry needs to evolve from Big Data to Smart Data – data that drives business value because it is accurate, actionable, and agile. The accuracy white paper explains why accuracy is important, how to gauge it, and finally, what it takes to implement data accuracy.


Which brings us to today’s topic, one that builds on the white paper: how do you measure accuracy?

In measuring accuracy, it seems to me that many confuse accuracy and precision. Some, for example, might argue that a consensus approach that polls data providers is the best way to judge data accuracy. This approach essentially claims that the attributes with the highest consensus across data providers are the most accurate. We categorically reject this simplistic view because it equates agreement – or precision – with accuracy. In fact, checking multiple data sources against one another may create a kind of confirmation bias (the tendency to believe data that supports what we already believe to be true).

This can happen when multiple vendors use the same technique to collect data. Take income data, for example. If several data vendors are supplying the mean income for a zip code to individual users, each will agree with the others for a given household, but each will be wrong for most households. And, interestingly, if five data vendors applied the zip code mean to a specific user and one supplied the actual income for the same user, the actual income would be discounted as inaccurate. Serial correlation in data streams is often misinterpreted as accuracy. That is, everyone agreeing doesn’t necessarily make it right; it may just be that many data sources are wrong for the same reason – and that is why the consensus approach breaks down. Precision is not accuracy.
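To make this concrete, here is a minimal Python sketch of the income example, using purely hypothetical household incomes: five "vendors" report the zip-code mean and agree perfectly with one another, yet they miss most households badly, while the lone vendor reporting actual income would be outvoted by the consensus.

```python
import statistics

# Hypothetical household incomes for one zip code (illustrative numbers only)
actual_incomes = {"user_a": 42_000, "user_b": 55_000, "user_c": 61_000,
                  "user_d": 150_000, "user_e": 38_000}

zip_mean = statistics.mean(actual_incomes.values())  # what vendors 1-5 all report

for user, actual in actual_incomes.items():
    consensus_estimate = zip_mean   # the five vendors agree exactly (high precision)
    error = abs(consensus_estimate - actual)
    print(f"{user}: consensus={consensus_estimate:,.0f}, "
          f"actual={actual:,.0f}, consensus error={error:,.0f}")

# The consensus vendors agree perfectly with one another, yet their shared
# estimate is wrong for most households (low accuracy). The one vendor
# reporting the true user-level income would lose the "vote."
```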

We therefore believe that online data must be validated against gold-standard, independent third-party sources, and that these sources must contain registered (user-level) audience demography. Validating eXelate data against third-party sources such as comScore and Nielsen allows us to verify that our data achieves the greatest possible balance of scale and accuracy, without degrading our data by applying overly loose qualifying criteria.

There are a couple of very straightforward ways to validate the accuracy of data against a gold standard. For binary (either/or) outcomes, a confusion matrix is probably the most common approach. Let’s look at the simple case of gender. The chart below represents a simplified confusion matrix that allows us to judge the accuracy of our data:

                      Gold standard: Male            Gold standard: Female
  eXelate: Male       Accurate (male)                Inaccurate (actually female)
  eXelate: Female     Inaccurate (actually male)     Accurate (female)

The above chart reads as follows.  The rows represent what eXelate believes about a user. The columns are the “gold standard” against which we are being evaluated. So, from the perspective of correctly identifying males, there are four potential outcomes:

  • Accurate (male) – technically known as “true positive” – means that both eXelate and the “gold standard” know the user to be male; or, we got it right.
  • Accurate (female) – or “true negative” – means that both eXelate and the “gold standard” know the user to be female, so again we got it right (it’s a “true negative” because from the perspective of identifying males, we’re technically agreeing that this user is not male).
  • Inaccurate (actually male) – or “false negative” – means that eXelate believes the user to be female, but they are actually male; or, we identified this user incorrectly.
  • Inaccurate (actually female) – or “false positive” – means that eXelate believes the user to be male, but they are actually female; wrong again.

In the above, accuracy is defined as,

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

So, if you saw the following results:

[Example confusion matrix with illustrative counts]

your accuracy would be:

Accuracy = (True Positives + True Negatives) / (All Outcomes) = 81%

meaning that, overall, you were right 81% of the time.

To underscore the fact that precision is not equal to accuracy, even in technical terms, we can note that precision for males is defined as:

Precision (male) = True Positives / (True Positives + False Positives) = 83%

meaning that if you showed an ad to users that you thought were males, you would be on target 83% of the time.

These equations, as well as something called recall (of all actual males, how many did we reach? True Positives / (True Positives + False Negatives)), are KPIs eXelate employs to continuously evaluate and improve our data.
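For readers who prefer code, here is a minimal Python sketch of how accuracy, precision, and recall fall out of a confusion matrix. The counts below are hypothetical – they are simply one set of numbers consistent with the 81% accuracy and 83% precision quoted above, not the actual figures behind those results.

```python
# Hypothetical confusion-matrix counts for the male/female example
# (one possible set of counts matching 81% accuracy and 83% precision).
tp = 83   # we said male,   gold standard says male   (accurate, male)
fp = 17   # we said male,   gold standard says female (inaccurate, actually female)
fn = 21   # we said female, gold standard says male   (inaccurate, actually male)
tn = 79   # we said female, gold standard says female (accurate, female)

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # share of all calls we got right
precision = tp / (tp + fp)                   # of users we called male, how many were
recall    = tp / (tp + fn)                   # of actual males, how many we reached

print(f"accuracy:  {accuracy:.0%}")   # 81%
print(f"precision: {precision:.0%}")  # 83%
print(f"recall:    {recall:.0%}")     # ~80%
```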

So, to summarize, accuracy matters!  Everyone in our business needs to understand what it is and what it is not. Accuracy needs to be calculated using accepted methods against gold standards. And those that fail to give it the attention it deserves do so at their own risk!

We’ll have much more to say on Smart Data and I encourage you to read our Smart Data series of white papers and continue to follow us @eXelate.

The Lyons Den: Removing Sample Bias via Association Rules

Welcome to the inaugural issue of The Lyons Den, a monthly blog dedicated to Big (or rather, Smart) Data and the analytics challenges facing our industry. Within each of these posts, I will take up the ‘why,’ ‘how,’ and ‘what’ of the many challenges that face us, all the while letting the reader see a little bit under the analytical hood here at eXelate.

Just a little about myself – I am an analytic solutions and marketing technology specialist.  As SVP of Analytics here at eXelate, I am responsible for leading the vision and execution of the company’s data science and marketing analytics activities.  Prior to eXelate, I was the VP of Analytics and Business Intelligence at x+1, where my team strove to maximize website profitability via analytics and real-time decisioning.  And before x+1, I spent over a decade in web and marketing analytics at Harte-Hanks, a large marketing service provider.  The rest of my background is here.

And so, without further ado, today’s topic – Removing Sample Bias via Association Rules.

If you’re like us, 80% to 90% of your analytical challenges ultimately derive from the rows (the sample) and only about 10% to 20% from the columns (the attributes). Within ad tech, there can be many nefarious causes of a suboptimal sample, such as the presence of bots or some sort of impression or seed bias. Here, we’re going to look at one particular type of bias: geographic sampling bias.

Traditional approaches for uncovering such biases have long existed, many of them graphical in nature, but we have found that association rules are particularly useful when attempting to find relationships among thousands of variables. In fact, they have proven so useful that even when we release models built through other methods, such as regularized regression, we still build association rule sets just to help identify sample bias.

Before going into an example, let me define association rules:

Association rules have been used for quite some time, and although they are starting to make a strong comeback, they are considered somewhat old-school. Like many, I first encountered association rules years back in the retail business, in the attempt to find commonalities between products (or sets of items that appear together). As an example, let’s presume we find that 17% (known as the confidence) of internet users who have purchased men’s shoes and men’s ties online also purchase men’s belts. If, say, 1% of all transactions contain this particular item set (shoes, ties, and belts together), then we can also say that it has 1% support. Note that association rule learning, unlike some other methods, generally does not take into account the sequence of events within item sets. And, of course, ‘item sets’ are not necessarily limited to retail POS items. They can contain user agent information, geo-demographics, user interests, or just about any attribute or user action under investigation.
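As a minimal sketch of those two definitions, here is how support and confidence can be computed for the shoes-and-ties rule over a handful of made-up transactions (the item names and counts are purely illustrative):

```python
# Toy transactions (made-up data) to illustrate support and confidence.
transactions = [
    {"mens_shoes", "mens_ties", "mens_belts"},
    {"mens_shoes", "mens_ties"},
    {"mens_shoes", "socks"},
    {"mens_ties", "mens_belts"},
    {"socks", "sneakers"},
    {"mens_shoes", "mens_ties", "mens_belts", "socks"},
]

antecedent = {"mens_shoes", "mens_ties"}
consequent = {"mens_belts"}
full_itemset = antecedent | consequent

n_total = len(transactions)
n_antecedent = sum(antecedent <= t for t in transactions)  # contain shoes AND ties
n_full = sum(full_itemset <= t for t in transactions)      # contain shoes, ties AND belts

support = n_full / n_total          # share of all transactions with the full item set
confidence = n_full / n_antecedent  # of shoe-and-tie buyers, share who also bought belts

print(f"support = {support:.2f}, confidence = {confidence:.2f}")
```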

Now, let me give a rather specific example of how association rules might be used to find sample bias:

eXelate was tasked with building a nationwide predictive model of who might be more likely to purchase a certain branded, dairy-based consumer packaged good. As an initial check, we developed association rules that immediately made it clear to us that our sample, or specifically our dependent variable (or seed), was geographically biased toward the extended Chicagoland area, which we were then able to confirm independently. The team then removed geography from the independent variable set (or predictors) and, interestingly, a number of surrogate attributes for living in Chicagoland ‘popped’ – the most obvious of which were travel to and from Chicago and being a Chicago sports fan. But a number of less obvious variables, or combinations of variables, that essentially identified a user as a Chicagoan also proved significant, such as working within the property and casualty industry while accessing the internet via certain ISPs. As this issue came up again and again across multiple clients and projects, we have since created an automated, association-rule-based system for detecting geographic bias in our data as soon as we encounter it.
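As a rough illustration of that kind of check (this is not eXelate’s production system; the user records and thresholds below are hypothetical), one can scan rules whose consequent is membership in the seed and flag geographic antecedents whose confidence far exceeds the seed’s base rate:

```python
# Hypothetical user records: a set of attribute tokens plus an in_seed flag
# marking membership in the dependent-variable seed list.
users = [
    ({"geo=chicago", "sports=chicago_fan", "ind=property_casualty"}, True),
    ({"geo=chicago", "travel=ord"}, True),
    ({"geo=chicago", "ind=retail"}, True),
    ({"geo=new_york", "travel=jfk"}, False),
    ({"geo=los_angeles", "ind=entertainment"}, False),
    ({"geo=new_york", "sports=ny_fan"}, True),
    ({"geo=dallas", "ind=property_casualty"}, False),
    ({"geo=chicago", "ind=finance"}, True),
]

base_rate = sum(in_seed for _, in_seed in users) / len(users)

# For every geographic attribute, compute the confidence of the rule
# {geo} -> in_seed and compare it (lift) against the seed's base rate.
geos = {a for attrs, _ in users for a in attrs if a.startswith("geo=")}
for geo in sorted(geos):
    with_geo = [in_seed for attrs, in_seed in users if geo in attrs]
    confidence = sum(with_geo) / len(with_geo)
    lift = confidence / base_rate
    flag = "  <-- possible geographic bias" if lift > 1.3 and len(with_geo) >= 3 else ""
    print(f"{geo}: n={len(with_geo)}, confidence={confidence:.2f}, lift={lift:.2f}{flag}")
```

On this toy data, geo=chicago is flagged because nearly every Chicago user sits in the seed, which is exactly the kind of overrepresentation described above.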

So, why is this important? Well, somewhat paradoxically, had we not removed these ‘highly predictive’ geography-based variables, we would have directed our client to target exactly the wrong people. Roughly speaking, because the seed data was so heavily skewed toward the Chicagoland area, this would have been the (incorrect) rank order for targeting:

  • People living in Chicago with attributes similar to those that already preferred the brand (a good target audience, for sure)
  • People living in Chicago with attributes similar to those that in fact didn’t prefer the brand (the wrong audience; the very fact that they lived in Chicagoland was the most predictive attribute for being in the seed list)
  • And then, well down the list and easily ignored, people not living in Chicago with attributes similar to those that already preferred the brand (that is, the users we actually wanted)

As an aside, geographic underrepresentation can present problems as well. We had, for example, a very small sample of converters within Wisconsin, and so the fact that a few of them had, perhaps randomly, liked this product made Wisconsin look like an excellent target audience. But the sample was far too small to be reliable, so we needed to account for this as well.

Once we removed the geography-based variables completely, a very logical and practical model emerged, including highly predictive prior shopping behaviors as well as a select set of demographic and interest data. And now, before any of our models are released, they are automatically and thoroughly checked for geographic over- and underrepresentation, which we believe makes them far more robust.

So to wrap up a bit:

  • Association rules are a method for finding commonalities among attributes contained within item sets, and they are measured by their support and confidence.
  • Why – They are a critical tool to an analytics professional in order to sanitize a data set prior to driving conclusions.
  • How – There are a number of different algorithms for developing association rules. We’ve tested a number of these and have settled on apriori, though we have automated and modified how this algorithm is applied in significant ways (perhaps a topic for a later post); a minimal usage sketch follows this list.
  • What – An example of utilizing an Association Rule is removing the geographic bias to help ensure that highly predictive and robust models can be created.

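For those who want to experiment, here is a minimal sketch using the open-source mlxtend implementation of apriori. This is not eXelate’s modified pipeline; the item sets, support, and confidence thresholds are illustrative only.

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Illustrative item sets only; real inputs could mix retail items,
# geo-demographics, user agents, interests, or seed-membership flags.
transactions = [
    ["mens_shoes", "mens_ties", "mens_belts"],
    ["mens_shoes", "mens_ties"],
    ["mens_shoes", "socks"],
    ["mens_ties", "mens_belts"],
    ["geo_chicago", "seed_member", "sports_chicago_fan"],
    ["geo_chicago", "seed_member", "travel_ord"],
]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent item sets, then derive rules with their support and confidence.
frequent = apriori(onehot, min_support=0.15, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```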
What types of association rules do you utilize? Let me know and let’s continue the conversation.
Thanks,

Kevin