January 30, 2007
Data Quality and Data Stupidity
"But this isn't about data quality, it's about data stupidity!" It was difficult to disagree. A customer had narrowly avoided dispatching several tons of architectural goods from Seattle to Portland, Maine rather than Portland, Oregon. The address cleansing system had done its best, and only the truck driver (I would prefer to say lorry, but must localize) noticed at the last moment. I had been telling the client for some time that his creaking data quality system was not adequate and that data quality is a lifecycle, a process, a metier, a way of life, a calling ... but for him it was just a function between processing orders and scheduling deliveries.
Here, on the b-eye network, some time ago, Claudia Imhoff was blogging about data quality. I liked this observation: "While there certainly is useful technology to help with data quality, so much of the assurance part is still heavily dependent on the human being (in this case, usually a business person) eyeballing the cleaned up data to verify its 'quality'."
Our truck driver did the eyeballing in this case, and good that he did. The costs of data stupidity can be high indeed, especially when businesses are so large that they cannot eyeball every order, or every action. Take the case of the Halifax Bank of Scotland. Stephanie MacLaughlin asked them for a copy of her bank statement. She received the bank statements of 75,000 customers, delivered in 5 packages to her doorstep. HBOS said the incident was "isolated." I should hope so too. They might have added that it was stupid in the extreme.
(Readers of my post about Scottish names and addresses are perhaps wondering if the mistake is understandable. I assure you, dear reader, that in a country of 5 million, there are not nearly 75,000 Stephanie MacLaughlins.)
Of course, many businesses have quite careful rules in place for validating and catching small errors. However, in the cases described here, the errors were actually so great (each in its own way) that I doubt anyone had thought of putting a rule in place to catch them. The HBOS error went unchecked until poor Ms MacLaughlin received her mail. And there was no business rule preventing my customer from shipping to Portland, Maine, just a truck driver who, quite reasonably, did not think he could get there and back from Seattle in time for dinner.
The problem is how to check for all the various forms of stupidity that can arise. Often we can only do so retroactively. Indeed the mistakes are, as the bank said, isolated. Yet they are critically serious when they occur. I have no doubt that HBOS will now have a business rule implemented somewhere, anywhere, to check for such an issue again.
One approach I have been trying out with another customer - a car dealership - is to use a data mining model to check orders before shipment. The mining APIs in SQL Server 2005 are fast enough to be called on an order by order basis, and can be embedded into apps. The business simply builds a clustering model from its existing records, and new orders are checked to see if they fit the known clusters of order types. If they do not, they can be flagged and reviewed. The system is excellent for capturing stupidity. In fact, running it against historical records, it correctly identified over 90% of orders that had to be subsequently cancelled, returned or corrected due to errors, including orders for 1000s of parts instead of 10s, and many over- and under-billings.
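A minimal sketch of the idea in Python, not the SQL Server mining API itself. It assumes each order is reduced to two features, quantity and amount, both on a log scale; it uses a tiny hand-rolled k-means in place of a real mining model; and the sample history, the function names and the `slack` factor are all hypothetical.

```python
import math

def kmeans(points, k, iterations=20):
    """Tiny k-means. Seeding with evenly spaced points is an assumption
    made for determinism; it requires k <= len(points)."""
    step = max(1, len(points) // k)
    centroids = [points[i * step] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def features(quantity, amount):
    # Log scale, so "1000s instead of 10s" shows up as a large distance.
    return (math.log10(quantity), math.log10(amount))

def fit(history, k=2, slack=2.0):
    """Build clusters from past orders and derive a review limit.
    `slack` is a hypothetical tuning knob, not a recommended value."""
    points = [features(q, a) for q, a in history]
    centroids = kmeans(points, k)
    radius = max(min(math.dist(p, c) for c in centroids) for p in points)
    return centroids, slack * radius

def needs_review(order, centroids, limit):
    """Flag an order that sits far from every known cluster."""
    p = features(*order)
    return min(math.dist(p, c) for c in centroids) > limit

# Hypothetical history: (quantity, amount) for parts orders, then vehicle orders.
HISTORY = [(10, 250), (12, 300), (8, 200), (15, 370),
           (1, 24000), (1, 31000), (2, 52000), (1, 27500)]

centroids, limit = fit(HISTORY)
for order in [(11, 280), (1000, 25000)]:
    print(order, "review" if needs_review(order, centroids, limit) else "ok")
```

An order that fits no known cluster is only flagged for a person to look at, never rejected outright, which keeps the human eyeballing in the loop.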
I think this idea has legs. Perhaps my next book should be "Data Quality: The Stupidity Dimension."
Posted by Donald Farmer at 12:00 AM | Comments (0)
January 16, 2007
Surnames and nicknames, Mr. Trump, an unusual phonebook, and data quality.
How much more can I cover in one post?
Jill Dyche's blog nearly always makes me smile. Her latest - about her surname and the problems of customer recognition - raised some familiar (and familial) issues for me. There is a BI point to this, but be prepared for diversions on the way.
My own family are largely from the Isle of Lewis, off the Scottish coast. In our district, the population mostly speaks Gaelic, and the large extended families result in a small number of surnames: MacLeod, MacKay, Morrison etc. First names are also regular: Donald, Angus, Murdo, Iain for men; Mary, Catriona, Margaret for women, although many women (my own mother included) have feminized male names, such as Donaldina, Angusina, Murdina, even Torquilina in extreme cases.
First diversion. In the USA most people on first acquaintance want to shorten my name to Don. I'm not a Don, I'm a Donald, and never think of myself otherwise. The Donald - Mr. Trump - is similarly resistant to the short form. That's no surprise to me for his mother, Mary MacLeod, was from Lewis, too. Why do we resist the short form? Well, it simply does not work in Gaelic. In Domhnall the mhn is almost silent - being pronounced more like Doh-al. So the familiar form is generally Dolaidh - pronounced Dolly. I cannot see Dolly Trump saying "You're fired!" with quite the same authority, somehow.
Back to our business intelligence point. Given the small number of first and last names, knowing someone's name on the Isle of Lewis may not help you find them at all. Searching for Donald MacLeod in Lewis using the UK online phone book returned 7 pages of results. Perhaps we could narrow it down by address. This would help, would it not, if delivering a package? Well, it might, except that in the countryside, houses are often numbered in the order they were built, rather than by their geographic order. So knowing Donald MacLeod lives at number 17 may still not help at all in finding the right person.
Of course, the community long ago found an answer to this - before the invention of the phone or the phone book. Most everyone, in addition to their name, has a patronymic or matronymic name that identifies their family. I am often referred to as Domhnall Dhomhnaill Bhain, (Donald of white-haired Donald) after my grandfather. However, even with this, some disambiguation may be required. As a result, many people - probably most men - have nicknames. I have friends known as Donald "Rufus" Murray and Iain "Pluto" MacIver. Now we're getting down to a system that can identify individuals more accurately.
Second diversion. Nicknames are often stuck to us when we are children, but some kids arrive at school without them. In such cases, with perhaps 5 Donald MacLeods in the class, the teachers feel a need for them, even if families don't. There used to be a newscaster in the UK known as Donnie B MacLeod. The B stood for nothing, except that on his first day at school, the teacher named his pupils Donnie A, Donnie B, Donnie C. Not very imaginative, but the name stuck for life.
Back to the identity problem. So, if the only way to disambiguate someone is to use their nickname or patronymic, how do you find them in the phone book? The answer is: create your own phone book, listing people by their nicknames. It's a wonderful publication, and it's available here:
http://www.c-e-n.org/merchandise_2.htm
Now, to be fair, some people don't need nicknames. My father was not from the island, so Donald Farmer stands out, as would Donald Trump. And when an enterprising Asian shopkeeper, Wali Muhammed, arrived from Pakistan in the 60s, his mail was most likely delivered correctly.
The point, which I have to make to clients surprisingly often, is that identifying individuals is not simply a matter of finding a high similarity between two records, but of having a high confidence that the matched record really is the right one.
At the simplest, in SQL Server Integration Services, we have a string matching algorithm returning scores of both similarity and confidence. Donald MacLeod at number 19 may be matched with an incoming record with a very high similarity, but could really be a quite different Donald MacLeod at number 10: similarity is high, but confidence is low. However, Willie Mahamed may be matched with Wali Muhammed with a relatively low similarity, but a high confidence that we have found the right person.
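The distinction can be sketched in a few lines of Python. This is not the Integration Services algorithm, whose internals are not described here. As an assumption for illustration, similarity is the standard library's `difflib` ratio on the name alone, and confidence is simply the margin between the best candidate and the runner-up. The reference records and the `best_match` helper are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical reference records; matching on the name alone is the
# "naive approach" with too few attributes.
CUSTOMERS = [
    {"id": 1, "name": "Donald MacLeod", "house": 19},
    {"id": 2, "name": "Donald MacLeod", "house": 10},
    {"id": 3, "name": "Wali Muhammed", "house": 4},
]

def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(incoming_name, reference):
    """Return (record, similarity, confidence).

    Confidence here is an assumed stand-in: how far the best candidate
    is ahead of the second best. Needs at least two reference records.
    """
    scored = sorted(
        ((similarity(incoming_name, r["name"]), r) for r in reference),
        key=lambda pair: pair[0],
        reverse=True,
    )
    (top_score, top), (runner_up_score, _) = scored[0], scored[1]
    return top, top_score, top_score - runner_up_score

for name in ["Donald MacLeod", "Willie Mahamed"]:
    record, sim, conf = best_match(name, CUSTOMERS)
    print(f"{name!r} -> id {record['id']}: similarity={sim:.2f}, confidence={conf:.2f}")
```

With two identical Donald MacLeods on file, the exact-name match scores perfectly on similarity yet has no margin over its rival, while the misspelled Wali Muhammed scores lower on similarity but has no serious competitor.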
Naturally, real cases should be more complete, and therefore less error prone. But again and again, I find companies, otherwise quite serious about their customer records and CRM approach, using a small number of attributes, and a naive approach to matching.
And with that, I must go and send an email to Pluto.
Posted by Donald Farmer at 9:30 PM | Comments (0)
January 4, 2007
The Shock of the New
What a start to the New Year! Few things could have made me smile more than to see the Intelligent Enterprise Readers' Choice Awards. Microsoft overall had a great showing, but for me the highlight was to see Microsoft taking the "Best ETL Software" gong. Let me tell you why it was especially sweet ...
About 6 years ago, I moved from startups to work at Microsoft, in the SQL Server BI group. The change was fascinating for many reasons - many good and some ... well, let's just say they were still fascinating. One thing was enjoyable and difficult in equal measure: I now had to deal with the somewhat fixed expectations of thousands of vocal users about what our product should do and how we should do it. The installed user base - especially one as large as Microsoft's - can overwhelm you with requests and suggestions. In fact it would be easy to be merely reactive and spend entire development cycles polishing scratches and filling dents as customers point them out.
However, at some point, if you want to move the software, the business, and the customers along significantly, you have to take the plunge and make some radical changes. But you know that doing so will cause some pain to the existing, loyal and even enthusiastic users.
In SQL Server Integration Services we faced this problem in bucketloads. The previous product, DTS, was lightweight but smart, and hugely popular. However, it was also very limited in its capacity to tackle the increasing demands placed on it by existing and potential customers. A complete, ground-up, no-line-of-code-left-standing rewrite was in order. And, as it turned out, the market needs forced architectural changes that made upgrading from the previous version almost impossible.
Naturally, we tackled the resulting issues on a technical level to some extent - but more importantly we had to tell a compelling story to users that the changes and their pain were worth it.
In that light, I can look to the award as an endorsement of the decisions we made, and the astonishing commitment of the team that drove the product along. Especially so, as the award comes from users who have to live and work with the software on a daily basis and therefore represent both the base we had to move, and the basis for our next versions. It's a great start to the New Year for the team, and for me personally a good time to look back and reflect on the long road, and all the various detours and diversions that we passed on the way.
Posted by Donald Farmer at 4:30 PM | Comments (0)
