BeyeBLOGS | BeyeBLOGS Home | Get Your Own Blog

« Stresses On Our Data | Main | Profile the Target »

May 4, 2006

Data Modeling and Data Quality, A Circular Relationship

May 3, 2006

Here is a common scenario: a client has 3 different data sources they want to merge into one data warehouse. They profiled all three data sources to see what they actually had, what was blank, what was partially populated, what were the formats, how did data conform to business rules, etc. In one of the columns of the input data sources, a column of part numbers, the client discovered over 15 different patterns of part numbers. The dominant pattern of NS2NAS2N is what they expected to find. To decipher this pattern an N is a numeric character, an S is a special character (/,-,), an A is an alpha character, and a number preceding an N, S, or A indicates the length of that sequence. So the actual part number for NS2NAS2N would look like 7-67A-90.

The client, however, did not expect to find 14 other patterns of part numbers, like 10N, AN8N, or just 3N. An analysis of these anomalies showed that three different families of products (home energy systems, aircraft engines, and power plants) were stored in the same table. Each family of product had its own subset of allowable part number patterns. The mishmash of products and parts effectively made the data in the table unusable for its original purpose of allowing service managers to collect data on aircraft engines so they could devise service contracts and rates that reflected the true failure rates of the parts.

If the table had been imported into the data warehouse the service managers' problems would have persisted. The fact is, the individual records collected in the table for the service managers were correct, but one table of product data had been provided for all the service managers, not just the aircraft maintenance managers.

So here we have the start of our circular relationship: good data plus a bad data model equates to unusable data. Fixing the problem, once it was discovered was simple enough. They created three separate tables, and even more after that, for each group of service managers to use for future data collection. The client could not achieve the level of data quality they needed for developing service contracts until the data model evolved to support it.

The converse is true. Take PW/102,4. The PW stands for the manufacturer code, the 102 is the major subassembly, and 4 is the revision number. The problem is the data is not parsed out or standardized so it could not be used for searching and sorting within the service contract system. Because it was combined together in one field, none of the individual components can be corrected if inaccurate. When the data is parsed out into its individual components each component can be stored in its own column. Here is the other part of the circle, bad data plus an adequate data model equals unusable data.

It is because of this relationship that I tell people that data modeling depends on data quality, and data quality depends on data modeling. The two are the Yin and Yang of data management.

Click to Read More

Share: del.icio.us Digg Furl ma.gnolia Netscape Newsvine reddit StumbleUpon Yahoo MyWeb  

Posted by Frank Dravis at May 4, 2006 1:09 AM

Comments

Post a comment




Remember Me?