

January 16, 2007

Surnames and nicknames, Mr. Trump, an unusual phonebook, and data quality.

How much more can I cover in one post?

Jill Dyche's blog nearly always makes me smile. Her latest - about her surname and the problems of customer recognition - raised some familiar (and familial) issues for me. There is a BI point to this, but be prepared for diversions on the way.

My own family are largely from the Isle of Lewis, off the Scottish coast. In our district, the population mostly speaks Gaelic, and the large extended families result in a small number of surnames: MacLeod, MacKay, Morrison etc. First names are also regular: Donald, Angus, Murdo, Iain for men; Mary, Catriona, Margaret for women, although many women (my own mother included) have feminized male names, such as Donaldina, Angusina, Murdina, even Torquilina in extreme cases.

First diversion. In the USA most people on first acquaintance want to shorten my name to Don. I'm not a Don, I'm a Donald, and never think of myself otherwise. The Donald - Mr. Trump - is similarly resistant to the short form. That's no surprise to me, for his mother, Mary MacLeod, was from Lewis, too. Why do we resist the short form? Well, it simply does not work in Gaelic. In Domhnall the mhn is almost silent, so the name is pronounced more like Doh-al. So the familiar form is generally Dolaidh - pronounced Dolly. I cannot see Dolly Trump saying "You're fired!" with quite the same authority, somehow.

Back to our business intelligence point. Given the small number of first and last names, knowing someone's name on the Isle of Lewis may not help you find them at all. Searching for Donald MacLeod in Lewis using the UK online phone book returned 7 pages of results. Perhaps we could narrow it down by address. This would help, would it not, if delivering a package? Well, it might, except that in the countryside, houses are often numbered in the order they were built, rather than by their geographic order. So knowing Donald MacLeod lives at number 17 may still not help at all in finding the right person.

Of course, the community long ago found an answer to this - before the invention of the phone or the phone book. Most everyone, in addition to their name, has a patronymic or matronymic name that identifies their family. I am often referred to as Domhnall Dhomhnaill Bhain (Donald of white-haired Donald) after my grandfather. However, even with this, some disambiguation may be required. As a result, many people - probably most men - have nicknames. I have friends known as Donald "Rufus" Murray and Iain "Pluto" MacIver. Now we're getting down to a system that can identify individuals more accurately.

Second diversion. Nicknames are often stuck to us as children, but some kids arrive at school without them. In such cases, with perhaps five Donald MacLeods in the class, the teachers feel a need of them, even if families don't. There used to be a newscaster in the UK known as Donnie B MacLeod. The B stood for nothing, except that on his first day at school, the teacher named his pupils Donnie A, Donnie B, Donnie C. Not very imaginative, but the name stuck for life.

Back to the identity problem. So, if the only way to disambiguate someone is to use their nickname or patronymic, how do you find them in the phone book? The answer is: create your own phone book, listing people by their nicknames. It's a wonderful publication, and it's available here:
http://www.c-e-n.org/merchandise_2.htm

Now, to be fair, some people don't need nicknames. My father was not from the island, so Donald Farmer stands out, as would Donald Trump. And when an enterprising Asian shopkeeper, Wali Muhammed, arrived from Pakistan in the 60s, his mail was most likely delivered correctly.

The point, which I have to make to clients surprisingly often, is that identifying individuals is not simply a matter of finding a high similarity between two records, but of having a high confidence that the matched record really is the right one.

At its simplest, in SQL Server Integration Services, we have a string matching algorithm returning scores of both similarity and confidence. Donald MacLeod at number 19 may be matched with an incoming record with a very high similarity, but could really be a quite different Donald MacLeod at number 10: similarity is high, but confidence is low. However, Willie Mahamed may be matched with Wali Muhammed with a relatively low similarity, but a high confidence that we have found the right person.
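
To make the distinction concrete, here is a minimal Python sketch. It is not the Integration Services algorithm: similarity comes from the standard library's difflib.SequenceMatcher, and "confidence" is my own stand-in, defined here as the gap between the best similarity score and the runner-up. The REFERENCE list, the house numbers and the best_match function are all hypothetical, made up for the illustration.

```python
from difflib import SequenceMatcher

# Hypothetical reference data: (name, house number). Not real records.
REFERENCE = [
    ("Donald MacLeod", 10),
    ("Donald MacLeod", 17),
    ("Donald MacLeod", 19),
    ("Wali Muhammed", 3),
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(incoming_name, reference):
    """Return (best record, similarity, confidence) for an incoming name.

    Confidence is an assumption of this sketch: the margin between the
    best score and the next best. Many equally good candidates give a
    margin of zero, however high the similarity.
    """
    scored = sorted(
        ((similarity(incoming_name, name), (name, number))
         for name, number in reference),
        key=lambda pair: pair[0],
        reverse=True,
    )
    best_score, best_record = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    return best_record, best_score, best_score - runner_up


if __name__ == "__main__":
    for incoming in ("Donald MacLeod", "Willie Mahamed"):
        record, sim, conf = best_match(incoming, REFERENCE)
        print(f"{incoming!r} -> {record}: "
              f"similarity={sim:.2f}, confidence={conf:.2f}")
```

With this toy data, the incoming Donald MacLeod matches perfectly on similarity but with no confidence at all, because three records tie for first place. Willie Mahamed scores lower on similarity, but nothing else in the list comes close, so the margin is wide.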

Naturally, real cases should be more complete, and therefore less error-prone. But again and again, I find companies, otherwise quite serious about their customer records and CRM approach, using a small number of attributes and a naive approach to matching.

And with that, I must go and send an email to Pluto.

Posted by Donald Farmer at January 16, 2007 9:30 PM
