The Lyons Den: How Do You Measure Data Accuracy?

This week, eXelate unveiled the second white paper in our Smart Data series, Accurate Data is Smart Data. We believe that Big Data has yet to fulfill its promise of providing a clear and consistent business advantage. To address this, we argue that the industry needs to evolve from Big Data to Smart Data – data that drives business value because it is accurate, actionable, and agile. The accuracy white paper answers why accuracy is important, how to gauge it, and finally, what it takes to implement data accuracy.


That brings us to today’s topic, which builds on the white paper: how do you measure accuracy?

In measuring accuracy, it seems to me that many confuse accuracy with precision. Some, for example, might argue that a consensus approach that polls data providers is the best way to judge data accuracy. This approach essentially claims that the attributes with the highest consensus across data providers are the most accurate. We categorically reject this simplistic view because it equates agreement – or precision – with accuracy. In fact, checking multiple data sources against one another may create a kind of confirmation bias (the tendency to believe data that supports what we already believe to be true).

This can happen when multiple vendors use the same technique to collect data. Take income data, for example. If several data vendors supply the mean income for a zip code to individual users, they will all agree for a given household, but they will all be wrong for most households. And, interestingly, if five data vendors applied the zip code mean to a specific user and one supplied that user’s actual income, the actual income would be discounted as inaccurate. Correlation across data streams is often misinterpreted as accuracy. That is, everyone agreeing doesn’t necessarily make it right; it may just be that many data sources are wrong for the same reason – and that is why the consensus approach breaks down. Precision is not accuracy.
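To make that concrete, here is a minimal, purely illustrative Python sketch (synthetic incomes and hypothetical vendor behavior, not real data) of how five vendors echoing the same zip-code mean agree with one another while being substantially wrong for a large share of individual households:

```python
import random

random.seed(42)

# Synthetic household incomes for one hypothetical zip code (illustrative only).
incomes = [random.lognormvariate(11, 0.5) for _ in range(1000)]
zip_mean = sum(incomes) / len(incomes)

agree_but_wrong = 0
for actual in incomes:
    # Five hypothetical vendors all impute the zip-code mean for this household;
    # a sixth reports the household's actual income.
    consensus_estimate = zip_mean
    lone_estimate = actual

    # A naive consensus check sides with the five agreeing vendors, even though
    # their shared estimate is off by more than 25% for many households.
    if abs(consensus_estimate - actual) / actual > 0.25:
        agree_but_wrong += 1

share = agree_but_wrong / len(incomes)
print(f"Households where the 'consensus' income is off by more than 25%: {share:.0%}")
```

The lone vendor reporting the household’s actual income would disagree with the other five nearly every time, yet it is the only one that is accurate.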

We therefore believe that online data must be validated against gold-standard, independent third-party sources, and that these sources must contain registered (user-level) audience demographics. Validating eXelate data against third-party sources such as comScore and Nielsen allows us to verify that our data achieves the greatest possible balance of scale and accuracy, without degrading our data by including overly loose criteria.

There are a couple of very straightforward ways to validate the accuracy of data against a gold standard. For binary (either/or) outcomes, a confusion matrix is probably the most common approach. Let’s look at the simple case of gender. The chart below represents a simplified confusion matrix that allows us to judge the accuracy of our data:

                          Gold standard: Male                            Gold standard: Female
eXelate believes: Male    Accurate (male) – true positive                Inaccurate (actually female) – false positive
eXelate believes: Female  Inaccurate (actually male) – false negative    Accurate (female) – true negative

The above chart reads as follows.  The rows represent what eXelate believes about a user. The columns are the “gold standard” against which we are being evaluated. So, from the perspective of correctly identifying males, there are four potential outcomes:

  • Accurate (male) – technically known as a “true positive” – means that both eXelate and the “gold standard” know the user to be male; we got it right.
  • Accurate (female) – or “true negative” – means that both eXelate and the “gold standard” know the user to be female, so again we got it right (it’s a “true negative” because, from the perspective of identifying males, we’re agreeing that this user is not male).
  • Inaccurate (actually male) – or “false negative” – means that eXelate believes the user to be female, but they are actually male; we identified this user incorrectly.
  • Inaccurate (actually female) – or “false positive” – means that eXelate believes the user to be male, but they are actually female; wrong again.

In the above, accuracy is defined as,

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

So, if you saw the following results:

[Table of sample confusion-matrix counts]

your accuracy would be:

Accuracy = (True Positives + True Negatives) / Total = 81%

meaning that, overall, you were right 81% of the time.

To underscore the fact that precision is not equal to accuracy, even in technical terms, we can note that precision for males is defined as:

Precision (males) = True Positives / (True Positives + False Positives) = 83%

meaning that if you showed an ad to users that you thought were males, you would be on target  83% of the time.

These formulas, as well as a third called recall (of the actual males, how many did we reach?), are KPIs eXelate employs to continuously evaluate and improve our data.
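To show how these KPIs fit together, here is a minimal Python sketch with hypothetical confusion-matrix counts (chosen only so the arithmetic lines up with the 81% and 83% figures above; they are not actual eXelate results):

```python
# Hypothetical confusion-matrix counts for the gender example (illustrative only).
true_positives = 450   # we said male, gold standard says male
true_negatives = 360   # we said female, gold standard says female
false_positives = 90   # we said male, gold standard says female
false_negatives = 100  # we said female, gold standard says male

total = true_positives + true_negatives + false_positives + false_negatives

# Accuracy: how often we were right overall.
accuracy = (true_positives + true_negatives) / total

# Precision (for males): of the users we called male, how many actually were.
precision = true_positives / (true_positives + false_positives)

# Recall (for males): of the actual males, how many we correctly identified.
recall = true_positives / (true_positives + false_negatives)

print(f"Accuracy:  {accuracy:.1%}")
print(f"Precision: {precision:.1%}")
print(f"Recall:    {recall:.1%}")
```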

So, to summarize: accuracy matters! Everyone in our business needs to understand what it is and what it is not. Accuracy needs to be calculated using accepted methods against gold standards. And those who fail to give it the attention it deserves do so at their own risk!

We’ll have much more to say on Smart Data and I encourage you to read our Smart Data series of white papers and continue to follow us @eXelate.

The Lyons Den: Removing Sample Bias via Association Rules

Welcome to the inaugural issue of The Lyons Den, a monthly blog dedicated to Big (or rather, Smart) Data and the analytics challenges facing our industry. Within each of these blogs, I will take up the ‘why,’ ‘how,’ and ‘what’ of the many challenges that face us, all the while letting the reader see a little bit under the analytical hood here at eXelate.

Just a little about myself – I am an analytic solutions and marketing technology specialist.  As SVP of Analytics here at eXelate, I am responsible for leading the vision and execution of the company’s data science and marketing analytics activities.  Prior to eXelate, I was the VP of Analytics and Business Intelligence at x+1, where my team strove to maximize website profitability via analytics and real-time decisioning.  And before x+1, I spent over a decade in web and marketing analytics at Harte-Hanks, a large marketing service provider.  The rest of my background is here.

And so, without further ado, today’s topic – Removing Sample Bias via Association Rules.

If you’re like us, 80% to 90% of your analytical challenges ultimately derive from the rows (the sample) and only about 10% to 20% from the columns (the attributes). Within ad tech, there can be many nefarious causes of a suboptimal sample, such as the presence of bots or some sort of impression or seed bias. Here, we’re going to look at one particular type of bias: geographic sampling bias.

Traditional approaches to uncovering such biases have long existed, many of them graphical in nature, but we have found that association rules are particularly useful when attempting to find relationships among thousands of variables. They have proven so useful, in fact, that even when we release models built with other methods, such as regularized regression, we still build association rule sets just to help identify sample bias.

Before going into an example, let me define association rules:

Association rules have been used for quite some time, and although they are starting to make a strong comeback, they are considered somewhat old-school. Like many, I first encountered association rules years back in the retail business, in an attempt to find commonalities between products (or sets of items that appear together). As an example, let’s presume we find that 17% (known as the confidence) of internet users who have purchased men’s shoes and men’s ties online are also purchasers of men’s belts. If, say, 1% of all transactions contain this particular item set, then we can also say that it has 1% support. Note that association rule learning generally does not take into account the sequence of events within item sets, unlike some other methods. And, of course, ‘item sets’ are not necessarily limited to retail POS items. They can contain user agent information, geo-demographics, user interests, or just about any attribute or user action under investigation.
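To make support and confidence concrete, here is a minimal Python sketch over a handful of made-up transactions (the toy numbers below are not the 17% / 1% figures from the example above):

```python
# Toy transaction data (illustrative only): each set is one user's purchases.
transactions = [
    {"mens_shoes", "mens_ties", "mens_belts"},
    {"mens_shoes", "mens_ties"},
    {"mens_shoes", "socks"},
    {"mens_ties", "mens_belts"},
    {"socks"},
    {"mens_shoes", "mens_ties", "mens_belts", "socks"},
]

antecedent = {"mens_shoes", "mens_ties"}
consequent = {"mens_belts"}
itemset = antecedent | consequent

n = len(transactions)
antecedent_count = sum(1 for t in transactions if antecedent <= t)
itemset_count = sum(1 for t in transactions if itemset <= t)

# Support: share of all transactions containing the full item set.
support = itemset_count / n

# Confidence: of the transactions containing the antecedent (shoes and ties),
# the share that also contain the consequent (belts).
confidence = itemset_count / antecedent_count

print(f"Support:    {support:.1%}")
print(f"Confidence: {confidence:.1%}")
```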

Now, let me give a rather specific example of how association rules might be used to find sample bias:

eXelate was tasked with building a nation-wide predictive model of who might be more likely to purchase a certain branded dairy-based consumer packaged good. As an initial check, we developed association rules that immediately made it clear to us that our sample, or specifically our dependent variable (or seed), was geographically biased toward the extended Chicagoland area, which we were then able to confirm independently. The team then removed geography from the independent variable set (or predictors) and, interestingly, a number of surrogate attributes for living in Chicagoland ‘popped’ – the most obvious of which were travel to and from Chicago and being a Chicago sports fan. But a number of less obvious variables, or combinations of variables, also proved significant that essentially identified a user as a Chicagoan, such as working within the property and casualty industry while accessing the internet via certain ISPs. As this issue came up again and again across multiple clients and projects, we have since built an automated, association-rule-based system for detecting geographic bias in our data as soon as we encounter it.
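Our production pipeline is more involved than I can show here, but the following minimal sketch (with made-up records and thresholds) illustrates the basic idea: score simple rules of the form “attribute → in seed” and flag geographic attributes whose confidence or lift is far above the base rate.

```python
from collections import Counter

# Assumed toy records (illustrative only): each user is (set of attributes, in_seed flag).
users = [
    ({"geo:chicago", "sports:bears_fan"}, True),
    ({"geo:chicago", "industry:property_casualty"}, True),
    ({"geo:chicago"}, True),
    ({"geo:new_york", "sports:knicks_fan"}, False),
    ({"geo:los_angeles"}, False),
    ({"geo:chicago", "travel:ord"}, False),
    ({"geo:dallas"}, False),
    ({"geo:new_york"}, False),
]

seed_rate = sum(in_seed for _, in_seed in users) / len(users)

attr_total = Counter()
attr_in_seed = Counter()
for attrs, in_seed in users:
    for a in attrs:
        attr_total[a] += 1
        attr_in_seed[a] += in_seed

# Confidence of the rule "attribute -> in seed", and its lift over the base rate.
for attr in sorted(attr_total):
    confidence = attr_in_seed[attr] / attr_total[attr]
    lift = confidence / seed_rate
    if attr.startswith("geo:") and lift > 1.5:
        print(f"Possible geographic bias: {attr} "
              f"(confidence {confidence:.0%}, lift {lift:.1f}x)")
```

A real run would also mine multi-attribute rules, but even this single-attribute pass surfaces the kind of Chicagoland skew described above.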

So, why is this important? Well, somewhat paradoxically, had we not removed these ‘highly predictive’ geography-based variables, we would have directed our client to target exactly the wrong people. Roughly speaking, because the seed data was so overrepresented with the Chicagoland area, this would have been the (incorrect) rank order for targeting:

  • People living in Chicago with attributes similar to those that already preferred the brand (a good target audience, for sure)
  • People living in Chicago with attributes similar to those that in fact didn’t prefer the brand (the wrong audience; the very fact that they lived in Chicagoland was the most predictive attribute for being in the seed list)
  • And then, well down the list and easily ignored, people not living in Chicago with attributes similar to those that already preferred the brand (that is, the users we actually wanted)

As an aside, geographic underrepresentation can present problems as well. We had, for example, a very small sample of converters within Wisconsin, and so the fact that a few of them had, perhaps randomly, liked this product made Wisconsin look like an excellent target audience. But the sample was far too small to be reliable, so we needed to account for this as well.

Once we removed the geography-based variables completely, a very logical and practical model emerged, including highly predictive prior shopping behaviors as well as a select set of demographic and interest data. And now, before any of our models are released, they are automatically and thoroughly checked for geographic over- and underrepresentation, which we believe makes them far more robust.

So to wrap up a bit:

  • Association rules are a method for finding commonalities between attributes contained within item sets, and they are measured by their support and confidence.
  • Why – They are a critical tool to an analytics professional in order to sanitize a data set prior to driving conclusions.
  • How – There are a number of different algorithms for developing association rules. We’ve tested a number of these and have settled on apriori, though we have automated and modified how this algorithm is applied in significant ways (perhaps a topic for a later post); a minimal open-source sketch follows this list.
  • What – An example of utilizing association rules is removing geographic bias to help ensure that highly predictive and robust models can be created.
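For readers who want to experiment, here is a minimal sketch using one widely available open-source implementation of apriori (the Python mlxtend package); it is not our production system, and the transactions below are made up:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up "item sets": any mix of purchases, geo-demographics, and interests.
transactions = [
    ["geo:chicago", "interest:sports", "purchase:dairy_brand"],
    ["geo:chicago", "purchase:dairy_brand"],
    ["geo:new_york", "interest:travel"],
    ["geo:chicago", "interest:sports"],
    ["geo:los_angeles", "purchase:dairy_brand"],
]

# One-hot encode the transactions, then mine frequent item sets and rules.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

frequent_itemsets = apriori(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)

# Inspect the rules by lift to spot suspiciously strong geographic antecedents.
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]]
      .sort_values("lift", ascending=False))
```

Sorting the resulting rules by lift is a quick way to spot antecedents – geographic or otherwise – that are suspiciously tied to the outcome you care about.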

What types of association rules do you utilize? Let me know and let’s continue the conversation.
Thanks,

Kevin

More Data or Better Algorithms? Not So Fast.

As Seen In: All Things D

It’s always a pleasure to follow a stimulating debate between great thinkers within our industry. Recently, Rocket Fuel’s CTO Mark Torrance wrote “Better Algorithms Beat More Data — And Here’s Why” in direct response to BlueKai CEO Omar Tawakol’s piece, “More Data Beats Better Algorithms — Or Does It?” For his part, Mr. Torrance takes Mr. Tawakol to task for asserting that more data trumps algorithms, but that’s not quite Mr. Tawakol’s point. In fact, the bulk of his article explores the importance of having an algorithm that connects disparate data points, giving them enhanced meaning and usefulness through better context.

Still, I think we can all safely agree that both data and algorithms are absolutely necessary to complete any analytical project. But, regardless, the real point of any successful analytics project is to help an organization achieve a specific business goal. In that light, I hope we can also agree that marketing success is actually driven by four primary considerations: business acumen, data, algorithms and operations. Let’s take a look at each.

Business Acumen

By getting caught up in the data and the math, we can easily forget that analytics projects live and die on business knowledge. Analytics projects, therefore, must always begin with a clear business goal. What does this campaign seek to accomplish? What activities or actions does the marketer wish to encourage or measure? What does the organization already know about key prospects? And what pitfalls will marketers need to anticipate and avoid?

The answers to such questions will influence the other three analytics drivers. For example, digital media optimization models require a “dependent” variable (data), which is often expressed in terms of converters and non-converters. Naturally, the stated business goal will drive which users are deemed “converters.” If a mismatch exists between the goal and that definition, the campaign may very well fail.

Data and Algorithms

I linked these two drivers together to emphasize the point that data and algorithms must be used in tandem. In truth, data scientists spend much of their day employing and refining algorithms. Yet I can’t quite accept Mr. Torrance’s example concerning how best to select a marriage partner if the goal is to produce tall and healthy children. The simple algorithm, he says, might be to marry the first suitor who’s over six feet tall. You could add more data, such as a threshold for strength, to get better results, he says. But for best results, a better algorithm is what’s needed. He writes:

“Measure the height of the first third of the people I see, and marry the next person who is taller than all of them. This algorithm improvement has a good chance of delivering a better result than just using more data with a simple algorithm.”

I am not so sure I agree. Without a doubt, a perfect algorithm that considers height alone will select one of the tallest suitors in the community. But is that sufficient to achieve the stated goal of tall and healthy children? What if that marriage partner happens to have a transmittable genetic condition, or is horribly grumpy? Wouldn’t knowing more about the partner (i.e. more data types) lead to a healthier life for all involved?
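For readers who want to poke at this themselves, here is a minimal Python sketch of the two selection rules being contrasted, run on synthetic heights; the pool size, height distribution, and six-foot cutoff are all assumptions you can change to see how sensitive each rule is.

```python
import random

random.seed(0)

def first_over_six_feet(heights):
    """Simple rule: pick the first suitor taller than 72 inches (6 feet)."""
    for h in heights:
        if h > 72:
            return h
    return heights[-1]  # settle for the last suitor if none qualify

def observe_then_pick(heights):
    """Quoted rule: watch the first third, then pick the next suitor taller than all of them."""
    cutoff = len(heights) // 3
    benchmark = max(heights[:cutoff])
    for h in heights[cutoff:]:
        if h > benchmark:
            return h
    return heights[-1]  # settle for the last suitor if none beat the benchmark

trials = 10_000
simple_total = 0.0
better_total = 0.0
for _ in range(trials):
    # Synthetic pool of 30 suitors with heights around 69 inches (illustrative only).
    heights = [random.gauss(69, 3) for _ in range(30)]
    simple_total += first_over_six_feet(heights)
    better_total += observe_then_pick(heights)

print(f"Average height, simple rule:   {simple_total / trials:.1f} in")
print(f"Average height, observe-first: {better_total / trials:.1f} in")
```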

Of course, we can’t assume that the more data you have, the better off you’ll be. Useful information that you didn’t have before (orthogonal data, in analytics speak) trumps unlimited data for the simple reason that not all data are created equal. Here’s an example of how that’s the case:

Let’s say you’re building a model that will help you find likely prospects for a new luxury sedan. Now let’s say your analytics model begins with one known input: household income. Given a choice between additional data and a better algorithm, which should you choose? Additional information – say, purchase intent – will tell you something you didn’t know before. It will therefore be useful to know if a user is interested in purchasing high-end vehicles. But, given a choice between a better algorithm and adding new data such as individual assets under management, the better algorithm may be the best approach. Purchase intent represents new information and insight directly relevant to the business goal. But another measure of affluence? Not so much.
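One practical way to test that intuition is to measure the incremental lift each candidate variable adds to an existing model. Here is a minimal sketch on synthetic data using scikit-learn cross-validation; the feature names and effect sizes are invented purely to illustrate the workflow of comparing a redundant affluence measure against a genuinely new signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(7)
n = 5000

# Synthetic features (illustrative only).
household_income = rng.normal(0, 1, n)
assets_under_mgmt = household_income + rng.normal(0, 0.3, n)  # redundant: tracks income closely
purchase_intent = rng.normal(0, 1, n)                         # orthogonal: genuinely new signal

# Synthetic "bought the luxury sedan" outcome driven by income and intent.
logit = 1.2 * household_income + 1.5 * purchase_intent - 1.0
bought = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

def mean_auc(*features):
    """Cross-validated AUC of a logistic regression on the given feature columns."""
    X = np.column_stack(features)
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, bought, cv=5, scoring="roc_auc").mean()

print("Income only:               ", round(mean_auc(household_income), 3))
print("Income + assets under mgmt:", round(mean_auc(household_income, assets_under_mgmt), 3))
print("Income + purchase intent:  ", round(mean_auc(household_income, purchase_intent), 3))
```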

Operations

All the best data and algorithms will be for naught if analytics isn’t fully embedded and widely distributed within appropriate marketing systems, so that the analytics can be directly leveraged whenever and wherever it’s most needed. At eXelate, we recognize that platform flexibility is a critical component for realizing digital media success via cross channel marketing execution.

One last point I’d like to make has to do with the value of contextual and behavioral data. In his article, Mr. Torrance writes, “At Rocket Fuel, we’re big believers in the power of algorithms. This is because data, no matter how rich or augmented, is still a mostly static representation of customer interest and intent.” As far as I can tell, Rocket Fuel therefore sees all data as contextual data, limited to the time of the marketing event.

While it’s of course possible to ignore the time component of data, doing so throws away vital information that an appropriate algorithm could easily digest. In a very simple example, treating someone as an “auto intender” or not is far less powerful than tracking a user’s behavior over time to better understand how their intents and interests evolve. That is, unlike contextual data, behavioral data is by its very nature directly linked to longer-term patterns of user behavior. As a result, behavioral data are far better at driving long-term value across a variety of campaigns, especially branding efforts that address personal aspirations.

In a 2011 article entitled “Consumers Are People Too…,” I argued that when asking which data are better, behavioral or contextual, the only right answer is both. And therefore, I reject any absolutes, because marketers clearly will benefit from a variety of data types.

In conclusion, if you’re asked to pick between more data or a new and better algorithm, which should you choose? The proper response is that business acumen, varied information, great algorithms and operations are all critical to the success of an audience model.

eXelate Index: Spike in iPhone Intenders in Late August and Early October

It’s no surprise that consumers are swayed whenever a major product or company has breaking news, and Apple is no exception. In fact, Apple, as one of the top players in the technology world, seems to be the premier driver of this trend. With the anticipation of what most assumed would be the iPhone 5 over the summer, techies were anxious to get their hands on Apple’s newest addition, and even more so whenever any news mentioning the phone was released.

On August 31st, in an incident reminiscent of a similar problem a year earlier, before the iPhone 4 was released, an Apple employee left a top-secret prototype of the newest iPhone in a bar in San Francisco. Once the news broke, the number of Apple iPhone intenders spiked. Afterwards, the intenders stayed relatively steady as they waited for news of the newest phone’s actual release, to be announced sometime in late September or early October. Another spike in intenders occurred on October 4th, when Apple officially announced the newest iPhone at its special media event – not the 5 as anticipated, but the 4S.

The spike in intenders continued and rose even higher the following day, after the untimely death of Apple CEO and visionary Steve Jobs, whom we in the digital media business greatly admired and to whom we owe a great deal of gratitude.


eXelate Index: Consumers intending to purchase Mini or SmartCar are more likely to purchase Fiat

Fiat reintroduced its brand to the United States in March with the 500, a subcompact that will remain its main offering even when it brings in the Alfa Romeo brand and larger cars. Although these lines may not be seen in the US until 2013, the plan includes opening about 25 more Fiat dealerships in the US by the end of this year, in addition to the 124 that are already open. This will give the Fiat 500 an opportunity to further grow its presence in the US market.

At eXelate, our analytics reported that people who are Mini, SmartCar, and Scion ‘Intenders’ are far more likely than average to also be Fiat Intenders. That is, people interested in and searching for any of these brands will more than likely look at Fiat as a purchase option as well. This is due to a couple of different reasons.

People who are Scion, SmartCar, and Mini Intenders (as well as Honda, Toyota, and Nissan Intenders) have socio-economic characteristics similar to those of Fiat Intenders. Intenders of Asian and European cars have relatively similar socio-economic characteristics, and both are relatively different from intenders of US makes.

Depending upon required scale and desired ROI for Fiat, a variety of Nielsen PRIZM codes might be selected for targeting purposes. Some of the best examples include:

  • PRIZM 04 – Young Digerati
  • PRIZM 16 – Bohemian Mix
  • PRIZM 07 – Money and Brains
  • PRIZM 31 – Urban Achievers

While avoiding PRIZM codes such as:

  • PRIZM 57 – Old Milltowns
  • PRIZM 37 – Mayberry-ville
  • PRIZM 20 – Fast-Track Families
  • PRIZM 50 – Kid Country, USA
  • PRIZM 51 – Shotguns and Pickups

eXelate Index: Interest in eReaders and TouchPads Spikes in August

Wikipedia recently updated the HP TouchPad article as a result of the August 18, 2011 announcement that, less than 7 weeks after the TouchPad was launched in the United States, Hewlett-Packard would discontinue all current hardware devices running webOS.

The article states:

“The HP TouchPad, if it were less expensive, could be an extremely strong, if slightly less polished, alternative to the iPad. But like other recently-released high-profile Android tablets, it’s determined to take on the champ. And just like those Android tablets, it’s hard to recommend over an iPad at the same price.”[8]

On August 16, 2011, it was reported that Best Buy had sold only 25,000 of the 270,000 devices in its inventory and was refusing to pay HP for the rest.[33] In Europe, the TouchPad was estimated to have sold 12,000 units in its first month of release, with sales slowing significantly in August. Industry commentators suggested that the lack of apps for the platform was hindering sales.[34]

On August 18, 2011, HP announced in a press release that it would discontinue all webOS devices and was considering spinning off its personal computer unit. HP stated that it would “continue to explore options for webOS”.[35] In addition to disappointing sales, poor hardware performance may have been another reason for HP management’s decision to discontinue the TouchPad.

On August 19, 2011, HP allowed retailers to sell all remaining stock at extremely low prices. In the USA, the price was $99 for the 16GB model and $149 for the 32GB model.[9] Large numbers of buyers acquired the TouchPad at “fire sale” prices.[36] Most brick-and-mortar retailers reportedly sold out their entire inventories within hours on the morning of August 20.[37] Online retailers, including Barnes and Noble, Amazon, and Best Buy, faced massive backlash from angry customers when they offered the tablet on their sites at $99 on August 22, sold out their inventories in record time (in the case of Barnes and Noble, in less than an hour), and were forced to cancel many subsequent orders.[38]

On August 22, 2011, a fire sale similar to those in the U.S. was held in Australian Harvey Norman stores. Although the event was not advertised and staff at many stores were not informed until midday, stores in several states were sold out within an hour.[39][40] Similar sales also started in the UK, with several stores reducing prices at 6pm to match the US (£89 for the 16GB and £115 for the 32GB). Most sold out in minutes.

According to the eXelate Index data, we saw a spike too: there was a clear jump in interest in eReaders (including TouchPads) during the dates mentioned.

eXelate Index: US Debt Crisis – Online Interest in Credit Card Offers Continues to Increase

As the nation remains in the grips of a debt crisis and as the prospects for a double-dip recession continue to worsen, it looks as though consumers are increasingly interested in their own personal debt as well.

Online interest levels in credit card offers have increased steadily and dramatically, more than doubling in just the last seven days.

August 2011 – Online Credit Card Interest