October 25, 2009

Simpson's Paradox and a Data Quality problem

One of my favourite writers on matters of data quality is David Loshin. (You know who the other one is, Frank!) David blogs, in good company, over at the excellent DataFlux Community of Experts (http://www.dataflux.com/dfblog/) - be sure to subscribe to the feed.

Back in September, David published an interesting blog post on applying Pareto's principle to data cleansing and other systemic improvements. As he summarizes it: "there is some point where the incremental value you get is not worth the investment ... the level of effort to get incremental improvements is greater than the value generated by having the improvement."

This reminded me of a paradox which I used recently to persuade a customer to tackle some data quality issues. The paradox in question is Simpson's Paradox and, perhaps because it reminds me of Simpson's Hospital in Edinburgh, I always explain it first in medical terms. Here goes . . .

When comparing results of a difficult operation, Hospital A has a 75% survival rate and Hospital B has a 90% survival rate. Which is the better hospital? Which would you choose? It could well be Hospital A.

Truth is, we don't have the data to make a decision. Hospital A may be in a poorer part of town, with patients in worse general health and presenting with more advanced symptoms. Hospital B, on the other hand, may have well-insured patients, benefitting from good health and regular screening. The surgeons in Hospital A may in fact, by every measure, be better than those in Hospital B, yet Hospital A could still have a lower overall survival rate, simply because it takes on more of the difficult cases.
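To make the paradox concrete, here is a small Python sketch. The patient counts are invented purely for illustration (the only figures taken from the example above are the 75% and 90% overall rates), and the split into "mild" and "severe" cases is an assumed stand-in for the patients' general health. It shows one way Hospital A can do better than Hospital B on both kinds of case and still end up with the lower overall rate.

```python
from fractions import Fraction

# Hypothetical counts, chosen only to reproduce the 75% and 90% headline rates.
# Each entry is (survivors, patients) for one hospital and one case severity.
outcomes = {
    "A": {"mild": (190, 200), "severe": (560, 800)},
    "B": {"mild": (846, 900), "severe": (54, 100)},
}


def stratum_rate(hospital, severity):
    """Survival rate for one hospital within one severity group."""
    survived, patients = outcomes[hospital][severity]
    return Fraction(survived, patients)


def overall_rate(hospital):
    """Survival rate for one hospital across all of its patients."""
    survived = sum(s for s, _ in outcomes[hospital].values())
    patients = sum(n for _, n in outcomes[hospital].values())
    return Fraction(survived, patients)


for hospital in outcomes:
    mild = float(stratum_rate(hospital, "mild"))
    severe = float(stratum_rate(hospital, "severe"))
    total = float(overall_rate(hospital))
    print(f"Hospital {hospital}: mild {mild:.0%}, severe {severe:.0%}, overall {total:.0%}")
```

With these made-up numbers, Hospital A wins in each severity group, but because most of its patients are severe cases, its combined figure is dragged below Hospital B's.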

How does this apply to data quality? Well, try a thought experiment where you replace surgery with a data quality process, and the health of the patients with the initial quality of your data.

In the case of my customer, faced with a limited budget, they had to choose between two different data cleansing initiatives. They were a long-standing supplier of building components in the Midwest and had a well-established B2B customer list. However, that customer list was riddled with inaccuracies. Faced with a changing market, they were sure they had to improve the impact and accuracy of their B2B direct mail. A vendor offered them a low-cost solution for cleaning up their mailing lists, promising a remarkably high degree of accuracy. However, they had another problem: their product database was also outdated, with many discontinued products and categorizations that were no longer in line with industry practices.

The decision of which data to tackle was difficult, however, because improving the product catalog would be a tough job. Think of all the products available in Home Depot, and all their possible categorizations by product type (hardware, lighting), project type (bathroom, kitchen), supplier and so on. Moreover, there were few tools to help. Perhaps getting the entire catalog up to high quality would be impossible on their budget and timescale. Address cleansing, on the other hand, held out the promise of a high "survival rate." In fact, the promised 95% accuracy was a very tempting number.

Together with the customer, we considered the options, and I explained Simpson's paradox. It is not an exact parallel, but it helps to illuminate these issues. A high-quality mailing list backed by a poor-quality product catalog would make good execution on leads and sales difficult. A moderate-quality mailing list, backed by an improved catalog, would enable better execution, but there would still be some overspend on mailings and contacts. Nevertheless, improving the product catalog to 80% accuracy (it was bad!) could prove to be a better investment than improving the mailing list to 95% accuracy.
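A rough back-of-the-envelope sketch shows why the lower headline number can still be the better buy. Everything here beyond the 95% and 80% targets is an assumption made for illustration: the starting accuracies (85% for the mailing list, 50% for the catalog) are hypothetical, and so is the simplifying model that a mailing only leads to a clean sale when both the address and the product record are right, with the two kinds of error independent of each other.

```python
def usable_lead_rate(mailing_accuracy, catalog_accuracy):
    """Share of mailings that both reach the customer and reference a valid
    product, under the assumed model that the two error sources are independent."""
    return mailing_accuracy * catalog_accuracy


# Hypothetical starting points; the post does not give the real figures.
BASELINE_MAILING = 0.85
BASELINE_CATALOG = 0.50

# Option 1: cleanse the mailing list to the promised 95%, leave the catalog alone.
clean_mailing_list = usable_lead_rate(0.95, BASELINE_CATALOG)

# Option 2: leave the mailing list alone, improve the catalog to 80%.
improve_catalog = usable_lead_rate(BASELINE_MAILING, 0.80)

print(f"Cleanse mailing list: {clean_mailing_list:.1%} of mailings usable")
print(f"Improve catalog:      {improve_catalog:.1%} of mailings usable")
```

Under these assumed numbers the catalog work comes out ahead, because it lifts the weaker of the two links. With different starting points the answer could go the other way, which is exactly why the headline accuracy alone is not enough to decide.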

In the end, a healthy order for a discontinued product proved to be the deciding factor. The customer is now upgrading their product catalog.

Posted by Donald Farmer at October 25, 2009 11:15 PM
