September 24, 2008
Understanding Dimensional model in terms of Five Normal Forms in Relational Database Theory
1 INTRODUCTION
“The normal forms defined in relational database theory represent guidelines for record design. The guidelines corresponding to first through fifth normal forms are presented here, in terms that do not require an understanding of relational theory…The normalization rules are designed to prevent update anomalies and data inconsistencies. With respect to performance tradeoffs, these guidelines are biased toward the assumption that all non-key fields will be updated frequently. They tend to penalize retrieval, since data which may have been retrievable from one record in an unnormalized design may have to be retrieved from several records in the normalized form. There is no obligation to fully normalize all records when actual performance requirements are taken into account.”[Kent]. Through the course of this paper treatment of relational model by [Kent] is followed closely and in some cases reproduced verbatim and for brevity will not be individually acknowledged. All arguments relating to dimensional theory belong to the author.
Dimensional model is one of the two competing database design methodologies widely accepted in industry for developing decision support databases (a.ka. Enterprise Data warehouses, Data marts) that base themselves off the data collected as byproduct of business processes. The other design methodology in use is the 3NF relational model. The papers aims to show that dimensional model are in fact in 3NF. It further shows that you end up with dimensional model if you model analytical process of the company instead of business process and follow database theory guidelines. There are other differences between the competing theories – where do you change data from business process representation to analytical representation, in EDW, in Data Marts or in Reports logic.
There is always the argument that a dimensional model doesn’t have to be in 3NF. The thrust of 3NF argument is the removal of possible update or multiple insert anomalies. These don’t exist in dimensional model since the data is primarily read only with the exception of SCD2 dimension record. Even that occurrence can be thought of as a physical construct. Logically dimensional history can be stored as inserts only.
2 Dimensional Model
Dimensional Modeling is a form of data modeling that creates data structures suited for analytical purpose. Designers go through the following steps to create a dimensional model.
0) Define the space you are going to work in i.e. next 123
1) Identify dimensions as entities holding business variables. Abstract out to reduce the number of variables. i.e. Product and service are particular case of same ‘offering’
2) Ensure your dimensions are orthogonal - i.e. there are no relationships between dimensions. Only relationships allowed are hierarchies within dimensions, and relationships between attributes (e.g. code, desc)
3) Further abstract your dimensions at entity level to reduce dimensionality. It is OK if some attributes are empty. You are not trying to reduce space here.
4) Now you have defined an n dimensional space with multiple scales for the problem. Place the data points in the space.
5) Define measures that represent the observed data at the dimensional hierarchy level of data points.
6) Define the relationship between measures at higher levels of hierarchy and lower / observed levels of hierarchy. These could be simple addition of values in each of the dimensional cells or much more complex formulae.
The dimensional model when implemented in a relational database becomes a star schema. The dimensions and fact tables are implemented as tables with the foreign keys referencing the dimensions in a fact table. The set of all foreign keys in fact table relating to dimension tables becomes the primary key of fact table. The measures are implemented as attributes in fact table. In a variation of this called the snowflake schema the dimensions are normalized so that each level in the hierarchy is implemented as separate table.
The purpose of the next few sections is to prove that dimensional model implemented as snowflake schema doesn’t violate the relational database theory. Latter sessions will discuss why the models end up different as one of hallmarks of relational database design is its repeatability – all designers should end up with same logical model independently.
3 FIRST NORMAL FORM
First normal form [1] deals with the "shape" of a record type.
Under first normal form, all occurrences of a record type must contain the same number of fields. This is also true in case of Dimensional model. Dimensional Model as a matter of definition similar to relational model doesn’t deal with variable field records.
3 SECOND AND THIRD NORMAL FORMS
Second and third normal forms [2, 3, 7] deal with the relationship between non-key and key fields.
Under second and third normal forms, a non-key field must provide a fact about the key, us the whole key, and nothing but the key. In addition, the record must satisfy first normal form.
3.1 Second Normal Form
Second normal form is violated when a non-key field is a fact about a subset of a key. It is only relevant when the key is composite, i.e., consists of several fields. Consider the following Fact record:
------------------------------------------
| s_partkey | s_Warehousekey | QUANTITY |
====================----------------------
The key here consists of the s_partkey and s_Warehousekey fields together. They are both surrogate keys. Surrogate keys are machine generated internal system keys that introduce uniqueness to the record when data from multiple systems is merged together. E.g. when two different systems containing partID are merged, the same part ID can refer to different part. Surrogate PartID uniquely identifies each part. This construct is the key of the new data structures.
By definition a single fact or a measure (e.g. Quantity) in the fact table related to the cell defined by the intersection of all the dimensions. At the simplest level the fact table follows 2 NF. There are some popular deviations from this - One is the presence of multiple records in the fact table per combination of keys. This violates the single valued fact principle and is a violation of 2NF but is not necessary for a dimensional model.
Consider the following star schema. The additional entities on top and bottom are called dimensions.
--------------------------------------
| s_partkey | Partname | Part Desc |
============-------------------------
|
|
-------------------------------------------
| s_partkey | s_Warehousekey | QUANTITY |
=============================--------------
|
|
------------------------------------------------------
| s_Warehousekey | WarehouseName | Warehouse Location|
=================-------------------------------------
As represented above the dimensions are in 2 NF. However if hierarchies exist where parts are part of a sub category which is part of a sub category a simple star schema violates 2NF. The snowflake representation of the dimensional model normalizes the dimensions and hence doesn’t violate 2 NF.
3.2 Third Normal Form
Third normal form is violated when a non-key field is a fact about another non-key field, as in
------------------------------------
| EMPLOYEE | DEPARTMENT | LOCATION |
============------------------------
The EMPLOYEE field is the key. If each department is located in one place, then the LOCATION field is a fact about the DEPARTMENT -- in addition to being a fact about the EMPLOYEE.
To satisfy third normal form, the record shown above should be decomposed into the two records:
------------------------- -------------------------
| EMPLOYEE | DEPARTMENT | | DEPARTMENT | LOCATION |
============------------- ==============-----------
A snowflake model of the dimensions will be normalized in exactly the fashion showed above. Per se there is no need for a fact attribute to be related to another fact attribute. Each fact attribute contains information about the entire composite key only. Hence dimensional model can be in 3NF.
To summarize, a record is in second and third normal forms if every field is either part of the key or provides a (single-valued) fact about exactly the whole key and nothing else. A dimensional model is (using snowflake schema and ensuring no that fact attributes are related to set of dimension keys only) also in 3NF.
4 FOURTH AND FIFTH NORMAL FORMS
Fourth [5] and fifth [6] normal forms deal with multi-valued facts. The multi-valued fact may correspond to a many-to-many relationship, as with employees and skills, or to a many-to-one relationship, as with the children of an employee (assuming only one parent is an employee). We look at the many-to-one relationship between children and fathers as a single-valued fact about a child but a multi-valued fact about a father. In dimensional model the relationship between dimension and fact is one to may i.e. single valued fact about the fact table but multivalued about the dimension.
In a sense, fourth and fifth normal forms are also about composite keys. These normal forms attempt to minimize the number of fields involved in a composite key.
4.1 Fourth Normal Form
Under fourth normal form, a record type should not contain two or more independent multi-valued facts about an entity. In addition, the record must satisfy third normal form.
Consider employees, skills, and languages, where an employee may have several skills and several languages. We have here two many-to-many relationships, one between employees and skills, and one between employees and languages. Under fourth normal form, these two relationships should not be represented in a single record such as
-------------------------------
| EMPLOYEE | SKILL | LANGUAGE |
===============================
Instead, they should be represented in the two records
-------------------- -----------------------
| EMPLOYEE | SKILL | | EMPLOYEE | LANGUAGE |
==================== =======================
In dimensional model a multivalued fact is not allowed within a dimension entity. Employee, Skill and Language are all made into dimensions and a factless fact entity is used to represent their intersection. Even while normalizing the dimension in snowflake schema, the direction is always from table closest to fact (representing ‘many’ side of one to many) to the table further away. So only hierarchies are possible not child tables.
A fact table doesn’t have multivalued facts. Each intersection of dimensions is related to only one set of fact attributes /measures.
4.2 Fifth Normal Form
Fifth normal form deals with cases where information can be reconstructed from smaller pieces of information that can be maintained with less redundancy.
Roughly speaking, we may say that a record type is in fifth normal form when its information content cannot be reconstructed from several smaller record types, i.e., from record types each having fewer fields than the original record. The case where all the smaller records have the same key is excluded. If a record type can only be decomposed into smaller records which all have the same key, then the record type is considered to be in fifth normal form without decomposition. A record type in fifth normal form is also in fourth, third, second, and first normal forms.
Fifth normal form does not differ from fourth normal form unless there exists a symmetric constraint between different fields of a record. I.e. between dimensions or between a sestet of dimensions and the measure. Such constraints are not allowed in dimensional modeling.
5 Difference between the competing data warehouse methodologies?
We have shown that the dimensional model as defined in section 2 is compatible with the fifth normal form not just third. However it is practical occurrence that 3NF data model (espoused by Bill Inman), as basis of structure of Enterprise data warehouse is so different from dimensional model espoused by Ralph Kimball.
The basis of their difference can be described differently. Should the best way of modeling an EDW be by integrating the data collected by applications supporting business processes, in a model designed to support business processes or in a model to support analytical processes built from scratch. The lack of support in this new model for business processes, removes some data and relationships not needed and hence results in possibility to structure it differently, but still on conformity with the database theory.
When business processes are shoe horned into the dimensional structure its leads to an incomprehensible spiders web, proliferation of fact and dimension tables. It is much preferable to store the data in 3NF in such a case where there is a business need to store business process data for future use. Its only when analytical processes / requirements are well understood is a dimensional model apt.
These are a number of proper misconceptions of dimensional model especially among advocates of 3 NF. They are more due to improper understanding / improper use of dimensional model. Rather than fundamental issues with the technique. Most of the issues raised are issues with denormalization and the resultant lack of data integrity and ambiguity. These are moot since we have shown dimensional model is 5 NF. Couple of other interesting issues raised are regarding recursive relationships and ragged hierarchies. Since dimensions can be represented as normalized tables, the same techniques used in relational modeling for business processes can be used here too. Another issue raised is the narrow focus of the dimensional model. Again a business process relational model suffers from same issue. It depends on the ability of designer to abstract dimensions just as a relational modeler has to abstract business processes to support change.
The real reason why 3NF models are more accurate / extensible than usual dimensional models is the large availability of highly skilled processionals who can model business processes into relational model. By contrast usual dimensional models are denormalized business process tables by inept designers.
6. Building a dimensional model from scratch, using relational theory.
Imagine a business process an actor1 interacts with actor 2.n and conductions a business transaction that in itself can be made of many pieces. This is easily modeled in relational theory. However the complexity is that a set of such business processes need to be modeled and it needs some effort to come up with a proper relational database design to support all of them at once.
A similar example for an analytical process would be – in what way I would optimize my business to increase sales (i.e. a measure). There will be a number of such requirements. My levers in optimizing my business (for all requirements) are var1, var2,var3 etc. If a variable is not needed for particular requirement, I should be able to ignore it.
1. A starting point for the record will be a list of variables var1,……varn followed by a list of measures mes1…mesn. This is already in 1NF.
2. To get this into 2NF collect groups of related var1. …varn attributes entity1….EntityN such that each of the variables in the group is related to unique key defined for that group. It follows that the original record will have the combination of all unique keys of entity1 ….entityn along with mes1…mesn.
3. Ensure that the composite key of original record (as modified) will have each of the unique keys of entities are not related to each other by transforming the entitity1…..entityn such that they are not related to each other. Nowhere does relational theory expect decomposition to be the only way to achieve normalization. The process followed here is transformation into different orthogonal space. Lets call them dim1….dimn. Lets call the collection of keys& measures as fact table.
4. Normalize dim1….dimn. This results in dimensional model built completely using relational database theory.
7 CONCLUSION
I have presented here the argument that the dimensional model is consistent with 3NF and is in fact in 5 NF, and that a dimensional model can be derived only from using relational database theory when you model analytical processes of the company. The only addition is the use of dimensional transformation instead of record decomposition to normalize data.
8 ACKNOWLEDGMENTS
The author acknowledges the contribution of colleagues in numerous discussions over the years on relational modeling. The author is aware of but doesn’t know the details of another attempt to build a dimensional model from scratch using relational database theory.
9 REFERENCES
1. William Kent, "A Simple Guide to Five Normal Forms in Relational Database Theory", Communications of the ACM 26(2), Feb. 1983, 120-125. and references therein.
2. THE BENEFITS OF BEING NORMAL, 13 reasons why normalized base tables help your business by Tom Russell and Rob Armstrong. Teradata Magazine.
Posted by Vijay at 9:45 PM
September 5, 2008
Salient Features of Data Modeling techniques
Relational: Normalization, Data Application Independence. One type of relation between entities
Object Model: Extensability, reuse, encapsulation. Conditional, Inheritance, association and other kinds of relationships between objects.
Dimensional: Precision, Understandability, Division of variables into measures / facts (Measures of business) and dimensions (parameters of Business Performance). One to many relationships between dimensions and facts. No relationships between dimensions. Hierarchhies within dimensions.
ORM: Mapping Objects onto relational space. Object relationships are built and maintained in application layer
ODM: Mapping Objects onto dimensional space. Multiple dimensional objects per dimension and multiple fact objects are all related to each other, and relationships implemented in reporting layer.
Posted by Vijay at 6:45 AM
My Attempt at creating a hundred dimension data warehouse (Sarcastic)
To actually build a hundred dimension data warehouse, I got myself to think like an OO programmer. I usually don’t work like this. I intend this to prove to my audience that bad design choices lead to complicated data warehouses. Again my refrain is if you want to write programs use java, and programming constructs. Don’t mix data warehousing and programming paradigms. If you do so you will end up not with Dimensional Model but with ODM – Object Dimensional model.
Vision: Create a data warehouse to store FIX data for analysis.
Data Sources: FIX message Logs. Reference data stored in applications.
Business Analysis: Based on understanding of FIX protocol. There are messages components and fields. Fields are used in messages, components or independently. There are 100 messages. Taking a cursory look and removing all messages that look like not important, and making a fact table per message I end up with 64 fact tables. (Wow! almost the same number in a messed up data warehouse in a major stock exchange). The list is at the end of this blog.
Finding dimensions is a little bit more complex. There are nearly 1800 fields and 163 components. The trick is to find the fields and components used by the 64 fact tables below.
F_ResendRequest
F_Reject
F_SequenceReset
F_Logout
F_IOI
F_Advertisement
F_ExecutionReport
F_OrderCancelReject
F_Logon
F_News
F_Email
F_NewOrderSingle
F_NewOrderList
F_OrderCancelRequest
F_OrderCancelReplaceRequest
F_OrderStatusRequest
F_AllocationInstruction
F_ListCancelRequest
F_ListExecute
F_ListStatusRequest
F_ListStatus
F_AllocationInstructionAck
F_DontKnowTradeDK
F_QuoteRequest
F_Quote
F_SettlementInstructions
F_MarketDataRequest
F_MarketDataSnapshotFullRefresh
F_MarketDataIncrementalRefresh
F_MarketDataRequestReject
F_QuoteCancel
F_QuoteStatusRequest
F_MassQuoteAcknowledgement
F_SecurityDefinitionRequest
F_SecurityDefinition
F_SecurityStatusRequest
F_SecurityStatus
F_TradingSessionStatusRequest
F_TradingSessionStatus
F_MassQuote
F_BusinessMessageReject
F_BidRequest
F_BidResponse
F_ListStrikePrice
F_XML_non_FIX
F_RegistrationInstructions
F_RegistrationInstructionsResponse
F_OrderMassCancelRequest
F_OrderMassCancelReport
F_NewOrderCross
F_CrossOrderCancelReplaceRequest
F_CrossOrderCancelRequest
F_SecurityTypeRequest
F_SecurityTypes
F_SecurityListRequest
F_SecurityList
F_DerivativeSecurityListRequest
F_DerivativeSecurityList
F_NewOrderMultileg
F_MultilegOrderCancelReplace
F_TradeCaptureReportRequest
F_TradeCaptureReport
F_OrderMassStatusRequest
F_QuoteRequestReject
F_RFQRequest
F_QuoteStatusReport
F_QuoteResponse
F_Confirmation
F_PositionMaintenanceRequest
F_PositionMaintenanceReport
F_RequestForPositions
F_RequestForPositionsAck
F_PositionReport
F_TradeCaptureReportRequestAck
F_TradeCaptureReportAck
F_AllocationReport
F_AllocationReportAck
F_Confirmation_Ack
F_SettlementInstructionRequest
F_AssignmentReport
F_CollateralRequest
F_CollateralAssignment
F_CollateralResponse
F_CollateralReport
F_CollateralInquiry
F_NetworkCounterpartySystemStatusRequest
F_NetworkCounterpartySystemStatusResponse
F_UserRequest
F_UserResponse
F_CollateralInquiryAck
F_ConfirmationRequest
F_ContraryIntentionReport
F_SecurityDefinitionUpdateReport
F_SecurityListUpdateReport
F_AdjustedPositionReport
F_AllocationInstructionAlert
F_ExecutionAcknowledgement
F_TradingSessionList
F_TradingSessionListRequest
I see a pattern in the fields. For some fields there is ID and Type. Browsing through the fields these look like the fields I’d like to make a report by. So I make these as dimensions. There are 163 of them. This looks erringly like the data warehouse I saw at a major stock exchange. Here is the dimension list.
D_AccountType
D_AdjustmentType
D_AdvTransType
D_AllocAccountType
D_AllocIntermedReqType
D_AllocLinkType
D_AllocNoOrdersType
D_AllocReportType
D_AllocSettlInstType
D_AllocTransType
D_AllocType
D_ApplReportType
D_ApplReqType
D_ApplResponseType
D_BasisPxType
D_BenchmarkPriceType
D_BidDescriptorType
D_BidRequestTransType
D_BidTradeType
D_BidType
D_BookingType
D_CPRegType
D_CollApplType
D_CollAsgnRespType
D_CollAsgnTransType
D_CommType
D_ConfirmTransType
D_ConfirmType
D_ContAmtType
D_ContingencyType
D_CrossType
D_CxlType
D_DeliveryType
D_DerivativeEventType
D_DerivativeInstrAttribType
D_DerivativeInstrumentPartySubIDType
D_DerivativeSecurityListRequestType
D_DerivativeSecuritySubType
D_DerivativeSecurityType
D_DeskType
D_DiscretionLimitType
D_DiscretionMoveType
D_DiscretionOffsetType
D_DlvyInstType
D_EmailType
D_EventType
D_ExecPriceType
D_ExecTransType
D_ExecType
D_ExpirationQtyType
D_IOITransType
D_IndividualAllocType
D_InstrAttribType
D_InstrumentPartySubIDType
D_LegBenchmarkPriceType
D_LegPriceType
D_LegRepoCollateralSecurityType
D_LegSecuritySubType
D_LegSecurityType
D_LegSettlType
D_LegStipulationType
D_LegSwapType
D_LiquidityIndType
D_ListExecInstType
D_ListStatusType
D_LotType
D_MDBookType
D_MDEntryType
D_MDFeedType
D_MDOriginType
D_MDQuoteType
D_MDSecSizeType
D_MDSubBookType
D_MDUpdateType
D_MassActionType
D_MassCancelRequestType
D_MassStatusReqType
D_MatchType
D_MiscFeeType
D_MsgType
D_MultiLegReportingType
D_Nested2PartySubIDType
D_Nested3PartySubIDType
D_Nested4PartySubIDType
D_NestedInstrAttribType
D_NestedPartySubIDType
D_NetworkRequestType
D_NetworkStatusResponseType
D_OrdType
D_OwnerType
D_OwnershipType
D_PartySubIDType
D_PegLimitType
D_PegMoveType
D_PegOffsetType
D_PegPriceType
D_PosAmtType
D_PosReqType
D_PosTransType
D_PosType
D_PriceLimitType
D_PriceType
D_QtyType
D_QuantityType
D_QuoteCancelType
D_QuotePriceType
D_QuoteRequestType
D_QuoteRespType
D_QuoteType
D_RefMsgType
D_RegistAcctType
D_RegistTransType
D_RepoCollateralSecurityType
D_RespondentType
D_ResponseTransportType
D_RootPartySubIDType
D_RoutingType
D_SecondaryPriceLimitType
D_SecondaryTrdType
D_SecurityListRequestType
D_SecurityRequestType
D_SecurityResponseType
D_SecuritySubType
D_SecurityType
D_SettlDeliveryType
D_SettlInstTransType
D_SettlObligTransType
D_SettlPartySubIDType
D_SettlPriceType
D_SettlType
D_SideMultiLegReportingType
D_SideTrdRegTimestampType
D_StandInstDbType
D_StatsType
D_StipulationType
D_StrategyParameterType
D_SubscriptionRequestType
D_TaxAdvantageType
D_TerminationType
D_TickRuleType
D_TradeReportTransType
D_TradeReportType
D_TradeRequestType
D_TrdRegTimestampType
D_TrdSubType
D_TrdType
D_TriggerOrderType
D_TriggerPriceType
D_TriggerType
D_UnderlyingCPRegType
D_UnderlyingCashType
D_UnderlyingLegSecuritySubType
D_UnderlyingLegSecurityType
D_UnderlyingRepoCollateralSecurityType
D_UnderlyingSecuritySubType
D_UnderlyingSecurityType
D_UnderlyingSettlPriceType
D_UnderlyingSettlementType
D_UnderlyingStipType
D_UndlyInstrumentPartySubIDType
D_UserRequestType
D_YieldRedemptionPriceType
D_YieldType
There you have it – a 150-dimension 60-fact table monster badly designed. No doubt you will have to write huge programs to analyze this data.
Doing it the right way is much harder and will take much longer and will need an understanding of the business. I need to get paid to do that, but I leave you with final thought. I have reduced a similar 150 dimension 50 monster at a major stock exchange to a hypothetical model with <15 dimensions and 3 fact tables. That should be an indicator of where you need to be.
Posted by Vijay at 6:45 AM
