« Salient Features of Data Modeling techniques | Main
September 24, 2008
Understanding Dimensional model in terms of Five Normal Forms in Relational Database Theory
1 INTRODUCTION
“The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory…The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.”[Kent]. Through the course of this paper treatment of relational model by [Kent] is followed closely and in some cases reproduced verbatim and for brevity will not be individually acknowledged. All arguments relating to dimensional theory belong to the author.
Dimensional model is one of the two competing database design methodologies widely accepted in industry for developing decision support databases (a.ka. Enterprise Data warehouses, Data marts) that base themselves off the data collected as byproduct of business processes. The other design methodology in use is the 3NF relational model. The papers aims to show that dimensional model are in fact in 3NF. It further shows that you end up with dimensional model if you model analytical process of the company instead of business process and follow database theory guidelines. There are other differences between the competing theories – where do you change data from business process representation to analytical representation, in EDW, in Data Marts or in Reports logic.
There is always the argument that a dimensional model doesn’t have to be in 3NF. The thrust of 3NF argument is the removal of possible update or multiple insert anomalies. These don’t exist in dimensional model since the data is primarily read only with the exception of SCD2 dimension record. Even that occurrence can be thought of as a physical construct. Logically dimensional history can be stored as inserts only.
2 Dimensional Model
Dimensional Modeling is a form of data modeling that creates data structures suited for analytical purpose. Designers go through the following steps to create a dimensional model.
0) Define the space you are going to work in i.e. next 123
1) Identify dimensions as entities holding business variables. Abstract out to reduce the number of variables. i.e. Product and service are particular case of same ‘offering’
2) Ensure your dimensions are orthogonal - i.e. there are no relationships between dimensions. Only relationships allowed are hierarchies within dimensions, and relationships between attributes (e.g. code, desc)
3) Further abstract your dimensions at entity level to reduce dimensionality. It is OK if some attributes are empty. You are not trying to reduce space here.
4) Now you have defined an n dimensional space with multiple scales for the problem. Place the data points in the space.
5) Define measures that represent the observed data at the dimensional hierarchy level of data points.
6) Define the relationship between measures at higher levels of hierarchy and lower / observed levels of hierarchy. These could be simple addition of values in each of the dimensional cells or much more complex formulae.
The dimensional model when implemented in a relational database becomes a star schema. The dimensions and fact tables are implemented as tables with the foreign keys referencing the dimensions in a fact table. The set of all foreign keys in fact table relating to dimension tables becomes the primary key of fact table. The measures are implemented as attributes in fact table. In a variation of this called the snowflake schema the dimensions are normalized so that each level in the hierarchy is implemented as separate table.
The purpose of the next few sections is to prove that dimensional model implemented as snowflake schema doesn’t violate the relational database theory. Latter sessions will discuss why the models end up different as one of hallmarks of relational database design is its repeatability – all designers should end up with same logical model independently.
3 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same number of fields. This is also true in case of Dimensional model. Dimensional Model as a matter of definition similar to relational model doesn’t deal with variable field records.
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.
Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form.
3.1 Second Normal Form
Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following Fact record:
------------------------------------------
| s_partkey | s_Warehousekey | QUANTITY |
====================----------------------
The key here consists of the s_partkey and s_Warehousekey fields together. They are both surrogate keys. Surrogate keys are machine generated internal system keys that introduce uniqueness to the record when data from multiple systems is merged together. E.g. when two different systems containing partID are merged, the same part ID can refer to different part. Surrogate PartID uniquely identifies each part. This construct is the key of the new data structures.
By definition a single fact or a measure (e.g. Quantity) in the fact table related to the cell defined by the intersection of all the dimensions. At the simplest level the fact table follows 2 NF. There are some popular deviations from this - One is the presence of multiple records in the fact table per combination of keys. This violates the single valued fact principle and is a violation of 2NF but is not necessary for a dimensional model.
Consider the following star schema. The additional entities on top and bottom are called dimensions.
--------------------------------------
| s_partkey | Partname | Part Desc |
============-------------------------
|
|
-------------------------------------------
| s_partkey | s_Warehousekey | QUANTITY |
=============================--------------
|
|
------------------------------------------------------
| s_Warehousekey | WarehouseName | Warehouse Location|
=================-------------------------------------
As represented above the dimensions are in 2 NF. However if hierarchies exist where parts are part of a sub category which is part of a sub category a simple star schema violates 2NF. The snowflake representation of the dimensional model normalizes the dimensions and hence doesn’t violate 2 NF.
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE.
To satisfy third normal form, the record shown above should be decomposed into the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
A snowflake model of the dimensions will be normalized in exactly the fashion showed above. Per se there is no need for a fact attribute to be related to another fact attribute. Each fact attribute contains information about the entire composite key only. Hence dimensional model can be in 3NF.
To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else. A dimensional model is (using snowflake schema and ensuring no that fact attributes are related to set of dimension keys only) also in 3NF.
4 FOURTH AND FIFTH NORMAL FORMS
Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). We look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father. In dimensional model the relationship between dimension and fact is one to may i.e. single valued fact about the fact table but multivalued about the dimension.
In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key.
4.1 Fourth Normal Form
Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.
Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
In dimensional model a multivalued fact is not allowed within a dimension entity. Employee, Skill and Language are all made into dimensions and a factless fact entity is used to represent their intersection. Even while normalizing the dimension in snowflake schema, the direction is always from table closest to fact (representing ‘many’ side of one to many) to the table further away. So only hierarchies are possible not child tables.
A fact table doesn’t have multivalued facts. Each intersection of dimensions is related to only one set of fact attributes /measures.
4.2 Fifth Normal Form
Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy.
Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.
Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint between different fields of a record. I.e. between dimensions or between a sestet of dimensions and the measure. Such constraints are not allowed in dimensional modeling.
5 Difference between the competing data warehouse methodologies?
We have shown that the dimensional model as defined in section 2 is compatible with the fifth normal form not just third. However it is practical occurrence that 3NF data model (espoused by Bill Inman), as basis of structure of Enterprise data warehouse is so different from dimensional model espoused by Ralph Kimball.
The basis of their difference can be described differently. Should the best way of modeling an EDW be by integrating the data collected by applications supporting business processes, in a model designed to support business processes or in a model to support analytical processes built from scratch. The lack of support in this new model for business processes, removes some data and relationships not needed and hence results in possibility to structure it differently, but still on conformity with the database theory.
When business processes are shoe horned into the dimensional structure its leads to an incomprehensible spiders web, proliferation of fact and dimension tables. It is much preferable to store the data in 3NF in such a case where there is a business need to store business process data for future use. Its only when analytical processes / requirements are well understood is a dimensional model apt.
These are a number of proper misconceptions of dimensional model especially among advocates of 3 NF. They are more due to improper understanding / improper use of dimensional model. Rather than fundamental issues with the technique. Most of the issues raised are issues with denormalization and the resultant lack of data integrity and ambiguity. These are moot since we have shown dimensional model is 5 NF. Couple of other interesting issues raised are regarding recursive relationships and ragged hierarchies. Since dimensions can be represented as normalized tables, the same techniques used in relational modeling for business processes can be used here too. Another issue raised is the narrow focus of the dimensional model. Again a business process relational model suffers from same issue. It depends on the ability of designer to abstract dimensions just as a relational modeler has to abstract business processes to support change.
The real reason why 3NF models are more accurate / extensible than usual dimensional models is the large availability of highly skilled processionals who can model business processes into relational model. By contrast usual dimensional models are denormalized business process tables by inept designers.
6. Building a dimensional model from scratch, using relational theory.
Imagine a business process an actor1 interacts with actor 2.n and conductions a business transaction that in itself can be made of many pieces. This is easily modeled in relational theory. However the complexity is that a set of such business processes need to be modeled and it needs some effort to come up with a proper relational database design to support all of them at once.
A similar example for an analytical process would be – in what way I would optimize my business to increase sales (i.e. a measure). There will be a number of such requirements. My levers in optimizing my business (for all requirements) are var1, var2,var3 etc. If a variable is not needed for particular requirement, I should be able to ignore it.
1. A starting point for the record will be a list of variables var1,……varn followed by a list of measures mes1…mesn. This is already in 1NF.
2. To get this into 2NF collect groups of related var1. …varn attributes entity1….EntityN such that each of the variables in the group is related to unique key defined for that group. It follows that the original record will have the combination of all unique keys of entity1 ….entityn along with mes1…mesn.
3. Ensure that the composite key of original record (as modified) will have each of the unique keys of entities are not related to each other by transforming the entitity1…..entityn such that they are not related to each other. Nowhere does relational theory expect decomposition to be the only way to achieve normalization. The process followed here is transformation into different orthogonal space. Lets call them dim1….dimn. Lets call the collection of keys& measures as fact table.
4. Normalize dim1….dimn. This results in dimensional model built completely using relational database theory.
7 CONCLUSION
I have presented here the argument that the dimensional model is consistent with 3NF and is in fact in 5 NF, and that a dimensional model can be derived only from using relational database theory when you model analytical processes of the company. The only addition is the use of dimensional transformation instead of record decomposition to normalize data.
8 ACKNOWLEDGMENTS
The author acknowledges the contribution of colleagues in numerous discussions over the years on relational modeling. The author is aware of but doesn’t know the details of another attempt to build a dimensional model from scratch using relational database theory.
9 REFERENCES
1. William Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26(2), Feb. 1983, 120-125. and references therein.
2. THE BENEFITS OF BEING NORMAL, 13 reasons why normalized base tables help your business by Tom Russell and Rob Armstrong. Teradata Magazine.
Posted by Vijay at September 24, 2008 9:45 PM
