The Lyons Den: Removing Sample Bias via Association Rules
April 29, 2013
Welcome to the inaugural issue of The Lyons Den, a monthly blog dedicated to Big (or rather, Smart) Data and the analytics challenges facing our industry. In each post, I will take up the ‘why,’ ‘how,’ and ‘what’ of the many challenges that face us, all the while letting the reader see a little bit under the analytical hood here at eXelate.
Just a little about myself – I am an analytic solutions and marketing technology specialist. As SVP of Analytics here at eXelate, I am responsible for leading the vision and execution of the company’s data science and marketing analytics activities. Prior to eXelate, I was the VP of Analytics and Business Intelligence at x+1, where my team strove to maximize website profitability via analytics and real-time decisioning. And before x+1, I spent over a decade in web and marketing analytics at Harte-Hanks, a large marketing service provider. The rest of my background is here.
And so, without further ado, today’s topic – Removing Sample Bias via Association Rules.
If you’re like us, 80% to 90% of your analytical challenges ultimately derive from the rows (the sample) and only about 10% to 20% from the columns (the attributes). Within ad tech, there can be many nefarious causes for a suboptimal sample, such as the presence of bots or some sort of impression or seed bias. Here, we’re going to look at one particular type of bias: geographical sampling bias.
Traditional approaches to uncovering such biases have long existed, many of them graphical in nature, but we have found association rules particularly useful when looking for relationships among thousands of variables. Association rules have proven very useful, and we often find such models to be particularly robust; even when we are releasing models built through other methods, such as regularized regression, we still build association rule sets just to help identify sample bias.
Before going into an example, let me define association rules:
Association rules have been used for quite some time, and although they are starting to make a strong comeback, they are considered somewhat old-school. Like many, I first encountered association rules years back in the retail business, in the attempt to find commonalities between products (or sets of items that appear together). As an example, let’s presume we find that 17% of internet users who have purchased men’s shoes and men’s ties online have also purchased men’s belts; that 17% is known as the confidence of the rule. If, say, 1% of all transactions contain this particular item set (shoes, ties and belts together), then we can also say that it has a 1% support. Note that association rule learning generally does not take into account the sequence of events within item sets, unlike some other methods. And, of course, ‘item sets’ are not limited to retail POS items. They can contain user agent information, geo-demographics, user interests or just about any attribute or user action under investigation.
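To make the two measures concrete, here is a minimal Python sketch of support and confidence. The function names and the five toy transactions are made up for illustration and are not eXelate data or code.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing `antecedent`, the fraction that
    also contain `consequent`."""
    antecedent = frozenset(antecedent)
    both = antecedent | frozenset(consequent)
    n_antecedent = sum(1 for t in transactions if antecedent <= t)
    if n_antecedent == 0:
        return 0.0  # the rule never fires, so report zero confidence
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_antecedent


# Hypothetical toy transactions (illustration only).
transactions = [
    frozenset({"shoes", "tie", "belt"}),
    frozenset({"shoes", "tie"}),
    frozenset({"shoes", "socks"}),
    frozenset({"tie", "belt"}),
    frozenset({"shoes", "tie", "belt", "socks"}),
]

# Support of the full item set: 2 of 5 transactions.
rule_support = support(transactions, {"shoes", "tie", "belt"})
# Confidence of {shoes, tie} -> {belt}: 2 of the 3 shoes-and-tie baskets.
rule_confidence = confidence(transactions, {"shoes", "tie"}, {"belt"})
```

Note that the order of items inside a transaction plays no role here, which matches the point above about sequence being ignored.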
Now, let me give a rather specific example of how association rules might be used to find sample bias:
eXelate was tasked with building a nationwide predictive model of who might be more likely to purchase a certain branded dairy-based consumer packaged good. As an initial check, we developed association rules that immediately made it clear that our sample, or specifically our dependent variable (the seed), was geographically biased toward the extended Chicagoland area, which we were then able to confirm independently. The team then removed geography from the independent variable set (the predictors) and, interestingly, a number of surrogate attributes for living in Chicagoland ‘popped’ – the most obvious of which were travel to and from Chicago and being a Chicago sports fan. But a number of less obvious variables, or combinations of variables, that essentially identified a user as a Chicagoan also proved significant, such as working within the property and casualty industry while accessing the internet via certain ISPs. As this issue came up again and again across multiple clients and projects, we have since created an automated system that uses association rules to detect geographical bias in our data as soon as we encounter it.
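A rough sketch of what such a geographic check might look like: treat each region as the left-hand side of a rule "region implies seed membership" and compare the rule's confidence with the overall seed rate. That ratio is usually called lift, a standard association-rule measure this post does not otherwise discuss. The data, the 1.5 threshold and the function name below are assumptions for illustration, not a description of our production system.

```python
from collections import Counter


def region_seed_rules(users):
    """For each region, score the rule {region} -> {in seed}.

    `users` is a list of (region, in_seed) pairs.
    Returns {region: (support, confidence, lift)}.
    """
    n = len(users)
    base_rate = sum(1 for _, in_seed in users if in_seed) / n
    region_counts = Counter(region for region, _ in users)
    seed_counts = Counter(region for region, in_seed in users if in_seed)

    rules = {}
    for region, count in region_counts.items():
        both = seed_counts[region]          # users in region AND in seed
        conf = both / count                 # P(seed | region)
        lift = conf / base_rate if base_rate > 0 else 0.0
        rules[region] = (both / n, conf, lift)
    return rules


# Hypothetical sample: the seed is heavily concentrated in one region.
users = (
    [("Chicago", True)] * 30 + [("Chicago", False)] * 20
    + [("New York", True)] * 5 + [("New York", False)] * 45
)

rules = region_seed_rules(users)

# Assumed threshold: flag regions whose seed rate is at least 1.5x the base rate.
LIFT_THRESHOLD = 1.5
flagged = sorted(r for r, (_, _, lift) in rules.items() if lift >= LIFT_THRESHOLD)
```

With this toy sample only Chicago is flagged. The same scoring can be pointed at non-geographic attributes to look for surrogates of location.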
So, why is this important? Well, somewhat paradoxically, had we not removed these ‘highly-predictive’ geographic-based variables, we would have directed our client to target the exact wrong people. Roughly speaking, as the seed data was so overrepresented with the Chicagoland area, this would have been the (incorrect) rank order for targeting:
- People living in Chicago with attributes similar to those that already preferred the brand (a good target audience, for sure)
- People living in Chicago with attributes similar to those that in fact didn’t prefer the brand (the wrong audience; the very fact that they lived in Chicagoland was the most predictive attribute for being within the seed list)
- And then, well down the list and easily ignored, people not living in Chicago with attributes similar to those that already preferred the brand (that is, the users we actually wanted)
As an aside, geographical underrepresentation can present problems as well. We had, for example, a very small sample of converters within Wisconsin and so the fact that a few of them had perhaps randomly liked this product made Wisconsin look like an excellent target audience. But, the sample was far too small to be reliable, so we needed to account for this as well.
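One common way to guard against this kind of small-sample mirage is to rank regions by a conservative lower bound on their conversion rate rather than by the raw rate. The sketch below uses the Wilson score interval. This is an assumption chosen for illustration, not necessarily the adjustment we applied, and the counts are invented.

```python
import math


def wilson_lower_bound(successes, n, z=1.96):
    """Lower end of the Wilson score interval for a proportion.

    z=1.96 corresponds to roughly a 95% interval. Returns 0.0 when n == 0.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin) / denom


# Hypothetical counts: (converters, users observed) per region.
regions = {
    "Wisconsin": (2, 3),       # tiny sample, high raw rate
    "Illinois": (400, 1000),   # large sample, lower raw rate
}

raw_rates = {r: s / n for r, (s, n) in regions.items()}
lower_bounds = {r: wilson_lower_bound(s, n) for r, (s, n) in regions.items()}

# Ranking by raw rate puts Wisconsin first; ranking by lower bound does not.
best_by_raw = max(raw_rates, key=raw_rates.get)
best_by_bound = max(lower_bounds, key=lower_bounds.get)
```

A simpler alternative with a similar effect is to require a minimum number of observations (a minimum support count) before a region is scored at all.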
Once we removed the geographic-based variables completely, a very logical and practical model emerged, including highly predictive prior shopping behaviors as well as a select set of demographic and interest data. And now, before any of our models are released, they are automatically and thoroughly checked for geographic over- and underrepresentation, which we believe makes them far more robust.
So to wrap up a bit:
- Association rules are a method for finding commonalities between attributes, contained within item sets, and are measured by their support and confidence.
- Why – They are a critical tool for an analytics professional who needs to sanitize a data set prior to drawing conclusions.
- How – There are a number of different algorithms for developing association rules. We’ve tested a number of these and have settled on Apriori, though we have automated and modified how this algorithm is applied in significant ways (perhaps a topic for a later post).
- What – An example of utilizing an Association Rule is removing the geographic bias to help ensure that highly predictive and robust models can be created.
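For readers who have not seen Apriori before, here is a minimal textbook-style sketch of its frequent-item-set step in plain Python. It is not our modified, automated version; the function name, the toy transactions and the 0.4 threshold are illustrative assumptions. The key idea is that an item set can only be frequent if all of its subsets are frequent, so candidates are grown one item at a time and pruned early.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every item set whose support is
    at least `min_support` (a fraction between 0 and 1)."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def supp(candidate):
        return sum(1 for t in transactions if candidate <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        candidate = frozenset([item])
        s = supp(candidate)
        if s >= min_support:
            current[candidate] = s

    frequent = dict(current)
    k = 2
    while current:
        prev = list(current)
        # Join step: merge frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
        # Prune step: drop candidates with any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count step: keep candidates that meet the support threshold.
        current = {}
        for c in candidates:
            s = supp(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions (illustration only).
transactions = [
    {"shoes", "tie", "belt"},
    {"shoes", "tie"},
    {"shoes", "socks"},
    {"tie", "belt"},
    {"shoes", "tie", "belt", "socks"},
]

frequent = apriori(transactions, min_support=0.4)
```

Rules are then read off the frequent item sets by splitting each one into a left-hand and right-hand side and keeping the splits whose confidence is high enough.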
What types of association rules do you utilize? Let me know and let’s continue the conversation.