July 13, 2006
Who SAYS my data is bad? Some premises of Data Quality
I'm a classifier. Many of us are in the world of BI. Much of my thinking involves breaking subjects into their typologies and taxonomies in order to understand them better. (And then building them up into broad generalizations again in order to market them, but that's a different matter.) I like being able to name things. I can admire a woodland walk for its beauty, but I enjoy it even more if I know that the little white flower is a Few-Flowered Leek, or that the Lesser Celandines are beginning to turn colour.
So, being passionate about data quality, I am naturally interested in classifying data quality issues. Nor am I alone - many papers have been written detailing what errors can occur in data, and the myriad ways in which records and datasets can have poor quality. One of the most effective and most popular approaches has been to talk about dimensions of data quality - accuracy, consistency, timeliness, completeness etc. And within these dimensions we can talk about the different types of, say, inaccuracy: invalid values, incorrect representation and so on.
I'm interested in classifying these issues from another angle. I can summarize the approach (whence the title of this post) as a question: who says this value is invalid? What does it mean to say that Redmond, WA, 98072 is inaccurate, or that Dave is not a short form of Christina for the purposes of record matching?
I think of this approach as being about premises of data quality - not just the ways in which data manifest poor quality, but the bases on which we determine what quality itself means in such cases.
As I said, I'm a classifier, so I have been doing some classifying of these data quality premises, and have defined four that I find particularly useful. (I'm also terribly affected, so they are named with Greek derivations. There is an advantage to that, in that it is possible to tag each premise with a very descriptive name.) Here they are ...
1. Orthothetic data quality
Think of orthodox. Orthothetic data quality is concerned with external, prescriptive rules. For example, the correct zip code for my US address is prescribed by the US Postal Service. They determine the rule and they can change the rule.
2. Allothetic data quality
Allo- means other, and I use allothetic to describe data quality rules that come from outside your control, but which are not prescribed by anyone. For example, you need to be aware when householding that as a general rule Bob is a short form for Robert, but there is no prescriptive Federal Bureau of Nicknames (at least I hope not - who knows these days! In France they used to keep a list of acceptable names for registering babies at the Maison de Ville).
3. Nomothetic data quality
Nomo- typically prefixes forms relating to general laws, especially emergent laws. So nomothetic quality is where data is judged against general rules or emergent patterns - but the rules are within your control. For example, at Microsoft, all employees have a Standard Title. That is a rule that we can control: no one externally (not even the European Union) tells us it must be so. On the other hand, a nomothetic pattern could be used in healthcare to determine that the 14-year-old patient receiving heart medication may be a mis-keyed entry, but the 41-year-old on the same drug is probably a valid entry.
4. Idiothetic data quality
Nothing to do with idiots entering data. Well, maybe in some cases. Idiothetic simply means that the quality of your data is premised on a rule that applies at the grain of a single object. For example, take our Standard Title for Microsoft employees. It is a nomothetic rule (applying across entire data sets) that employees have such a designation: it is an idiothetic rule that Group Program Manager is a valid Standard Title, while Chieftain of Data Integration is not.
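To make the difference between the last two premises concrete, here is a minimal Python sketch of the Standard Title example. Everything in it is an assumption made for illustration: the record layout, the field name standard_title, the employee names, and the VALID_TITLES list (which is certainly not Microsoft's real list).

```python
# Hypothetical reference list, invented for this sketch.
VALID_TITLES = {"Group Program Manager", "Software Design Engineer"}


def nomothetic_violations(employees):
    """Rule across the whole data set: every employee has a Standard Title."""
    return [e["name"] for e in employees if not e.get("standard_title")]


def idiothetic_violations(employees):
    """Rule at the grain of a single value: a title that is present must be a valid one."""
    return [
        e["name"]
        for e in employees
        if e.get("standard_title") and e["standard_title"] not in VALID_TITLES
    ]


# Made-up records for illustration only.
employees = [
    {"name": "Ann", "standard_title": "Group Program Manager"},
    {"name": "Raj", "standard_title": "Chieftain of Data Integration"},
    {"name": "Lee", "standard_title": None},
]

print(nomothetic_violations(employees))  # ['Lee']
print(idiothetic_violations(employees))  # ['Raj']
```

The two checks fail on different records for different reasons: Lee breaks the rule that a title must exist, while Raj has a title that is simply not on the list.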
You can see that these premises are to an extent composable: it is possible for data quality to have multiple premises. But in general, you'll find that the order in which I give them here is hierarchical. For example, it could be regarded as nomothetic that every customer has a zip code, and idiothetic that such a code is in the format nnnnn-nnnn. However, the orthothetic rule most likely wins out - being a proper US post code is more important than meeting the rule some DBA codes into the field, because it is the postal service and not the DBA that has authority in this domain.
This is an important consideration when looking at toolsets and techniques for your data quality processes. I may well be able to validate the format of a postcode column idiothetically using format strings in my field definition, but perhaps I would be better off with an orthothetic tool that validates against a full set of prescribed rules.
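As a rough sketch of that trade-off, the Python below runs a format check first and then a lookup against the authority's data. It is only an illustration under stated assumptions: POSTAL_REFERENCE is a tiny hypothetical stand-in for whatever reference data the postal service actually publishes, and the function names are my own.

```python
import re

# Format rule of the kind a DBA might code into a field: nnnnn or nnnnn-nnnn.
ZIP_FORMAT = re.compile(r"\d{5}(-\d{4})?")

# Hypothetical stand-in for the postal authority's reference data.
POSTAL_REFERENCE = {"98052": ("Redmond", "WA")}


def format_ok(zip_code):
    """Idiothetic check: does the value have the expected shape?"""
    return ZIP_FORMAT.fullmatch(zip_code) is not None


def authority_ok(city, state, zip_code):
    """Orthothetic check: does the reference data assign this ZIP to this city?"""
    return POSTAL_REFERENCE.get(zip_code[:5]) == (city, state)


def validate(city, state, zip_code):
    # A well-formed code can still fail, because the authority's rule wins.
    if not format_ok(zip_code):
        return "bad format"
    if not authority_ok(city, state, zip_code):
        return "well-formed but not confirmed by the authority"
    return "ok"


print(validate("Redmond", "WA", "9805"))        # bad format
print(validate("Redmond", "WA", "98072"))       # well-formed but not confirmed by the authority
print(validate("Redmond", "WA", "98052-1234"))  # ok
```

The format check is cheap and catches keying errors, but only the lookup can tell you that a perfectly shaped code is still the wrong one for the address.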
I'll write more about this approach in the future - I think it has a lot of potential. Meanwhile, it would be great to hear your comments. Does this sound promising as a new way of thinking about quality issues?
Posted by Donald Farmer at 11:45 AM
