BeyeBLOGS | BeyeBLOGS Home | Get Your Own Blog

« Data Quality - A Family Affair | Main | Fun with Recursive SQL (Part 1) »

December 10, 2009


If a man begins with certainties, he shall end in doubts;
But if he will be content to begin with doubts,
He shall end in certainties.
[Francis Bacon 1561-1626]

When I was learning to program in C and studying algorithms, the assert() assertion macro was one of my favorite debugging tools. Assert can be used to validate that something isn't going wrong that could send your program into left field during the execution of some procedure. For instance, a balanced binary search tree should never be more than log2(n) levels deep (or something similar to that based on the exact insertion algorithm), where n is the number of items in the tree. After a new item is inserted in the tree, you can assert(tree.depth() == log2(tree.count())). If that assertion fails, then you know the tree isn't staying balanced and the search performance guaranteed by a balanced tree isn't valid any more.

If that's too much computer science for you, hold on and see where this is going. There's relevance to this idea beyond low-level programming and computer science theory.

I've been in many conversations with data warehouse sponsors that focused on the question of "how are you sure that the data in the warehouse loads correctly every night?" One of the better ways I've found to approach this kind of data integrity assurance is to think about what kinds of assertions can be found throughout the batch ETL processes that I create.

For this example, suppose a somewhat traditional sort of ETL process that happens in the following steps:

1. Copy or extract raw data from source system
2. Detect changes from last pull
3. Lookup surrogate keys and other translations
4. Apply deletes (as soft-deletes with setting exp_date = current_date())
5. Apply inserts
6. Apply updates

For the rest of this post, see the original at Sharpening Stones

Posted by Paul Boal at December 10, 2009 12:30 PM


Post a comment

Remember Me?