August 21, 2006
Ontology - a dimension of data quality
In a previous post I described some premises of data quality that I use to classify data quality issues.
The premises were that data quality could be:
Orthothetic - determined by prescribed external rules, such as zipcodes or ISBN numbers.
Allothetic - determined by external rules, but not prescribed, such as nicknames.
Nomothetic - determined by internal rules and patterns, such as the requirement for every employee to have a standard title.
Idiothetic - determined by a rule applied to a single object, for example: that a given employee title is from a standard set.
I have been using these premises with the more common classification of data quality dimensions. There are numerous versions of the dimensions of data quality, but mostly I deal with the following 5 dimensions that, for me, cover most ground: accuracy, consistency, security, timeliness, and completeness.
Recently I have been working with a 6th dimension - ontology. So what does this mean?
Let's say a package arrives at Microsoft, addressed to "Ronald Farmer." Is it intended for me, Donald Farmer? If so, then the data is inaccurate - its quality is poor on the accuracy dimension. If it is addressed to "D. Farmer" it may be incomplete and ambiguous, and so on.
But there is another possibility - it could be that "Ronald Farmer" is indeed the intended addressee, but he does not work at Microsoft. Or perhaps the package is addressed to "Donald Duck" because some smarty-pants has filled out the web form just to get past the registration screen. (How many webcasts requiring registration are attended by Mickey Mouse?)
In such cases, it can be useful to think of these data quality issues not as flaws in accuracy or any of the other dimensions already described, but as flaws in ontology: the entity being referred to does not exist. A representation of an entity that does not exist is subtly but importantly different from an inaccurate representation of an entity that does exist.
Here's another example. I live at 15336 164th Avenue. That's a valid address and the plat exists. 15337 in the same street may also be parsed as a valid address, but the plat does not exist.
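The address example can be sketched in code as two separate checks: one for format, one for existence. This is only an illustration; the pattern, the reference set, and the function names are assumptions of mine, and a real reference set would come from something like a parcel registry.

```python
import re

# Hypothetical reference set of addresses known to exist.
KNOWN_ADDRESSES = {"15336 164th Avenue"}

# Hypothetical format rule: house number, ordinal street number, street type.
ADDRESS_PATTERN = re.compile(r"^\d+ \d+(st|nd|rd|th) (Avenue|Street)$")


def is_well_formed(address: str) -> bool:
    """Format check: does the string parse as an address?"""
    return ADDRESS_PATTERN.match(address) is not None


def exists(address: str) -> bool:
    """Ontology check: does the address refer to a known entity?"""
    return address in KNOWN_ADDRESSES


def classify(address: str) -> str:
    """Report the first check an address fails, or 'ok' if it passes both."""
    if not is_well_formed(address):
        return "malformed"
    if not exists(address):
        return "well-formed but no such entity"
    return "ok"
```

Here `classify("15337 164th Avenue")` passes the format check and fails only the existence check, which is the case the ontology dimension is meant to catch.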
When applying data quality rules it may be useful to test for ontological correctness, especially when checking otherwise seemingly valid data. Ontology tests could also be useful to avoid over-cleaning data when removing approximate duplicates. Many of you will do this already by testing to see if a record exactly matches a known entry, and only using fuzzy deduplication on the unknown entries - that is a form of testing data on the ontology dimension.
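As a rough sketch of that pattern, the following tests for an exact match against known entries first and falls back to fuzzy matching only for unknown values. The known names, the 0.85 threshold, and the function names are all assumptions for illustration; the similarity measure is Python's standard `difflib.SequenceMatcher`, chosen only because it needs no extra libraries.

```python
from difflib import SequenceMatcher

# Hypothetical list of known entities. Both names are real, distinct people,
# so neither should be "cleaned" into the other.
KNOWN_NAMES = ("Donald Farmer", "Ronald Farmer")


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve(name: str, known=KNOWN_NAMES, threshold: float = 0.85):
    """Return (resolved_name, how), where how is 'exact', 'fuzzy' or 'unknown'."""
    # Ontology test: an exact match to a known entity is left alone.
    if name in known:
        return name, "exact"
    # Only unknown values are candidates for approximate matching.
    best = max(known, key=lambda k: similarity(name, k), default=None)
    if best is not None and similarity(name, best) >= threshold:
        return best, "fuzzy"
    return name, "unknown"
```

With this ordering, "Ronald Farmer" is kept as a distinct known entity rather than merged into "Donald Farmer", while a misspelling such as "Donald Farmr" is still matched approximately.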
My thoughts around Ontology as a quality dimension were sparked off by a discussion I had with Denise Draper. Denise is the new manager of the Integration Services team in SQL Server. Her background is in EII - remember Nimble? - but I am sure you'll hear a lot more of her in the BI space too. And I am certainly looking forward to ever more insightful conversations. Welcome aboard, Denise.
Posted by Donald Farmer at 10:15 PM
