<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
<channel>
<title>DataSense</title>
<link>http://www.beyeblogs.com/DataSense/</link>
<description>What is Data Warehousing &amp; how do I learn it? I explain to them we use all the same tools that you do but differently. That’s when I coined the term Data Sense. It describes the essence of Data Warehousing and separates Data Warehousing from rest of Programming. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precession, accuracy and meaning. View Vijay Nadendla&apos;s profile

</description>
<language>en</language>
<copyright>Copyright 2011</copyright>
<lastBuildDate>Fri, 05 Sep 2008 06:45:00 -0700</lastBuildDate>
<generator>http://www.movabletype.org/?v=3.33</generator>
<docs>http://blogs.law.harvard.edu/tech/rss</docs> 


<item>
<title>Salient Features of Data Modeling techniques</title>
<description><![CDATA[<p>Relational: Normalization, Data Application Independence. One type of relation between entities<br />
Object Model: Extensability, reuse, encapsulation. Conditional, Inheritance, association and other kinds of relationships between objects.<br />
Dimensional: Precision, Understandability, Division of variables into measures / facts (Measures of business) and dimensions (parameters of Business Performance). One to many relationships between dimensions and facts. No relationships between dimensions. Hierarchhies within dimensions.<br />
ORM: Mapping Objects onto relational space. Object relationships are built and maintained in application layer<br />
ODM: Mapping Objects onto dimensional space. Multiple dimensional objects per dimension and multiple fact objects are all related to each other, and relationships  implemented in reporting layer.</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/09/salient_features_of_data_model.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/09/salient_features_of_data_model.php</guid>
<category></category>
<pubDate>Fri, 05 Sep 2008 06:45:00 -0700</pubDate>
</item>

<item>
<title>My Attempt at creating a hundred dimension data warehouse (Sarcastic)</title>
<description><![CDATA[<p><br />
To actually build a hundred   dimension data warehouse, I got myself to think like an OO programmer. I usually don’t work like this. I intend this to prove to my audience that bad design choices lead to complicated data warehouses. Again my refrain is if you want to write programs use java, and programming constructs. Don’t mix data warehousing and programming paradigms. If you do so you will end up not with Dimensional Model but with ODM – Object Dimensional model. <br />
Vision: Create a data warehouse to store FIX data for analysis. <br />
Data Sources: FIX message Logs. Reference data stored in applications. <br />
Business Analysis: Based on understanding of FIX protocol. There are messages components and fields. Fields are used in messages, components or independently. There are 100  messages. Taking a cursory look and removing all messages that look like not important, and making a fact table per message I end up with 64 fact tables. (Wow! almost the same number in a messed up data warehouse in a major stock exchange). The list is at the end of this blog. <br />
Finding dimensions is a little bit more complex. There are nearly 1800 fields and 163 components. The trick is to find the fields and components used by the 64 fact tables below. <br />
F_ResendRequest <br />
F_Reject <br />
F_SequenceReset <br />
F_Logout <br />
F_IOI <br />
F_Advertisement <br />
F_ExecutionReport <br />
F_OrderCancelReject <br />
F_Logon <br />
F_News <br />
F_Email <br />
F_NewOrderSingle <br />
F_NewOrderList <br />
F_OrderCancelRequest <br />
F_OrderCancelReplaceRequest <br />
F_OrderStatusRequest <br />
F_AllocationInstruction <br />
F_ListCancelRequest <br />
F_ListExecute <br />
F_ListStatusRequest <br />
F_ListStatus <br />
F_AllocationInstructionAck <br />
F_DontKnowTradeDK <br />
F_QuoteRequest <br />
F_Quote <br />
F_SettlementInstructions <br />
F_MarketDataRequest <br />
F_MarketDataSnapshotFullRefresh <br />
F_MarketDataIncrementalRefresh <br />
F_MarketDataRequestReject <br />
F_QuoteCancel <br />
F_QuoteStatusRequest <br />
F_MassQuoteAcknowledgement <br />
F_SecurityDefinitionRequest <br />
F_SecurityDefinition <br />
F_SecurityStatusRequest <br />
F_SecurityStatus <br />
F_TradingSessionStatusRequest <br />
F_TradingSessionStatus <br />
F_MassQuote <br />
F_BusinessMessageReject <br />
F_BidRequest <br />
F_BidResponse <br />
F_ListStrikePrice <br />
F_XML_non_FIX <br />
F_RegistrationInstructions <br />
F_RegistrationInstructionsResponse <br />
F_OrderMassCancelRequest <br />
F_OrderMassCancelReport <br />
F_NewOrderCross <br />
F_CrossOrderCancelReplaceRequest <br />
F_CrossOrderCancelRequest <br />
F_SecurityTypeRequest <br />
F_SecurityTypes <br />
F_SecurityListRequest <br />
F_SecurityList <br />
F_DerivativeSecurityListRequest <br />
F_DerivativeSecurityList <br />
F_NewOrderMultileg <br />
F_MultilegOrderCancelReplace <br />
F_TradeCaptureReportRequest <br />
F_TradeCaptureReport <br />
F_OrderMassStatusRequest <br />
F_QuoteRequestReject <br />
F_RFQRequest <br />
F_QuoteStatusReport <br />
F_QuoteResponse <br />
F_Confirmation <br />
F_PositionMaintenanceRequest <br />
F_PositionMaintenanceReport <br />
F_RequestForPositions <br />
F_RequestForPositionsAck <br />
F_PositionReport <br />
F_TradeCaptureReportRequestAck <br />
F_TradeCaptureReportAck <br />
F_AllocationReport <br />
F_AllocationReportAck <br />
F_Confirmation_Ack <br />
F_SettlementInstructionRequest <br />
F_AssignmentReport <br />
F_CollateralRequest <br />
F_CollateralAssignment <br />
F_CollateralResponse <br />
F_CollateralReport <br />
F_CollateralInquiry <br />
F_NetworkCounterpartySystemStatusRequest <br />
F_NetworkCounterpartySystemStatusResponse <br />
F_UserRequest <br />
F_UserResponse <br />
F_CollateralInquiryAck <br />
F_ConfirmationRequest <br />
F_ContraryIntentionReport <br />
F_SecurityDefinitionUpdateReport <br />
F_SecurityListUpdateReport <br />
F_AdjustedPositionReport <br />
F_AllocationInstructionAlert <br />
F_ExecutionAcknowledgement <br />
F_TradingSessionList <br />
F_TradingSessionListRequest </p>

<p>I see a pattern in the fields. For some fields there is ID and Type. Browsing through the fields these look like the fields I’d like to make a report by. So I make these as dimensions. There are 163 of them. This looks erringly like the data warehouse I saw at a major stock exchange. Here is the dimension list. </p>

<p>D_AccountType <br />
D_AdjustmentType <br />
D_AdvTransType <br />
D_AllocAccountType <br />
D_AllocIntermedReqType <br />
D_AllocLinkType <br />
D_AllocNoOrdersType <br />
D_AllocReportType <br />
D_AllocSettlInstType <br />
D_AllocTransType <br />
D_AllocType <br />
D_ApplReportType <br />
D_ApplReqType <br />
D_ApplResponseType <br />
D_BasisPxType <br />
D_BenchmarkPriceType <br />
D_BidDescriptorType <br />
D_BidRequestTransType <br />
D_BidTradeType <br />
D_BidType <br />
D_BookingType <br />
D_CPRegType <br />
D_CollApplType <br />
D_CollAsgnRespType <br />
D_CollAsgnTransType <br />
D_CommType <br />
D_ConfirmTransType <br />
D_ConfirmType <br />
D_ContAmtType <br />
D_ContingencyType <br />
D_CrossType <br />
D_CxlType <br />
D_DeliveryType <br />
D_DerivativeEventType <br />
D_DerivativeInstrAttribType <br />
D_DerivativeInstrumentPartySubIDType <br />
D_DerivativeSecurityListRequestType <br />
D_DerivativeSecuritySubType <br />
D_DerivativeSecurityType <br />
D_DeskType <br />
D_DiscretionLimitType <br />
D_DiscretionMoveType <br />
D_DiscretionOffsetType <br />
D_DlvyInstType <br />
D_EmailType <br />
D_EventType <br />
D_ExecPriceType <br />
D_ExecTransType <br />
D_ExecType <br />
D_ExpirationQtyType <br />
D_IOITransType <br />
D_IndividualAllocType <br />
D_InstrAttribType <br />
D_InstrumentPartySubIDType <br />
D_LegBenchmarkPriceType <br />
D_LegPriceType <br />
D_LegRepoCollateralSecurityType <br />
D_LegSecuritySubType <br />
D_LegSecurityType <br />
D_LegSettlType <br />
D_LegStipulationType <br />
D_LegSwapType <br />
D_LiquidityIndType <br />
D_ListExecInstType <br />
D_ListStatusType <br />
D_LotType <br />
D_MDBookType <br />
D_MDEntryType <br />
D_MDFeedType <br />
D_MDOriginType <br />
D_MDQuoteType <br />
D_MDSecSizeType <br />
D_MDSubBookType <br />
D_MDUpdateType <br />
D_MassActionType <br />
D_MassCancelRequestType <br />
D_MassStatusReqType <br />
D_MatchType <br />
D_MiscFeeType <br />
D_MsgType <br />
D_MultiLegReportingType <br />
D_Nested2PartySubIDType <br />
D_Nested3PartySubIDType <br />
D_Nested4PartySubIDType <br />
D_NestedInstrAttribType <br />
D_NestedPartySubIDType <br />
D_NetworkRequestType <br />
D_NetworkStatusResponseType <br />
D_OrdType <br />
D_OwnerType <br />
D_OwnershipType <br />
D_PartySubIDType <br />
D_PegLimitType <br />
D_PegMoveType <br />
D_PegOffsetType <br />
D_PegPriceType <br />
D_PosAmtType <br />
D_PosReqType <br />
D_PosTransType <br />
D_PosType <br />
D_PriceLimitType <br />
D_PriceType <br />
D_QtyType <br />
D_QuantityType <br />
D_QuoteCancelType <br />
D_QuotePriceType <br />
D_QuoteRequestType <br />
D_QuoteRespType <br />
D_QuoteType <br />
D_RefMsgType <br />
D_RegistAcctType <br />
D_RegistTransType <br />
D_RepoCollateralSecurityType <br />
D_RespondentType <br />
D_ResponseTransportType <br />
D_RootPartySubIDType <br />
D_RoutingType <br />
D_SecondaryPriceLimitType <br />
D_SecondaryTrdType <br />
D_SecurityListRequestType <br />
D_SecurityRequestType <br />
D_SecurityResponseType <br />
D_SecuritySubType <br />
D_SecurityType <br />
D_SettlDeliveryType <br />
D_SettlInstTransType <br />
D_SettlObligTransType <br />
D_SettlPartySubIDType <br />
D_SettlPriceType <br />
D_SettlType <br />
D_SideMultiLegReportingType <br />
D_SideTrdRegTimestampType <br />
D_StandInstDbType <br />
D_StatsType <br />
D_StipulationType <br />
D_StrategyParameterType <br />
D_SubscriptionRequestType <br />
D_TaxAdvantageType <br />
D_TerminationType <br />
D_TickRuleType <br />
D_TradeReportTransType <br />
D_TradeReportType <br />
D_TradeRequestType <br />
D_TrdRegTimestampType <br />
D_TrdSubType <br />
D_TrdType <br />
D_TriggerOrderType <br />
D_TriggerPriceType <br />
D_TriggerType <br />
D_UnderlyingCPRegType <br />
D_UnderlyingCashType <br />
D_UnderlyingLegSecuritySubType <br />
D_UnderlyingLegSecurityType <br />
D_UnderlyingRepoCollateralSecurityType <br />
D_UnderlyingSecuritySubType <br />
D_UnderlyingSecurityType <br />
D_UnderlyingSettlPriceType <br />
D_UnderlyingSettlementType <br />
D_UnderlyingStipType <br />
D_UndlyInstrumentPartySubIDType <br />
D_UserRequestType <br />
D_YieldRedemptionPriceType <br />
D_YieldType <br />
There you have it – a 150-dimension 60-fact table monster badly designed. No doubt you will have to write huge programs to analyze this data. <br />
Doing it the right way is much harder and will take much longer and will need an understanding of the business. I need to get paid to do that, but I leave you with final thought. I have reduced a similar 150   dimension 50   monster at a major stock exchange to a hypothetical model with <15 dimensions and 3 fact tables. That should be an indicator of where you need to be.</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/09/my_attempt_at_creating_a_hundr.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/09/my_attempt_at_creating_a_hundr.php</guid>
<category></category>
<pubDate>Fri, 05 Sep 2008 06:45:00 -0700</pubDate>
</item>

<item>
<title>How and why are hundred dimension data warehouses designed? Is it the Birth of Object Dimensional Modeling.</title>
<description><![CDATA[<p>As follow up on my previous blogs on the need for and process to design a simple dimensional model that can satisfy the analysis requirements, I have investigated how and why are 100 dimension monsters built. I acknolwedge conversations with colleagues on this subject. </p>

<p>Why<br />
1) Sponsors of data warehouse want every single attribute from all data sources be stored in the data warehouse. This is the big bang approach - build once for all future. Users also don't want to be told a particuler attribute is not present in the data warehouse after its built. So safest approach is to store all.</p>

<p>2) Users are used to the data elements as they worked with during the business processes. Abstraction into new variables with take away the familiarity and there is reluctance.</p>

<p>3) Designers were never exposed to variable reduction using abstraction and hence don't consider it as an option. Designers have an improper understanding of what a dimension is - they slap dim infront of any source table name and assume its dimension of data warehouse. Further they perform horizontal partitioning of data (very OO)for performance reasons. </p>

<p>How:</p>

<p>1) Source databases' data models are copied into one structure. Each source (or a small set of sources) becomes a subject area.</p>

<p>2) Tables that completely (or very nearly overlap) between data sources are merged.</p>

<p>3) The purpose of each table is investigated. If it is a transaction - it becomes a fact table (2 or more tables can be merged). If it is a reference / lookup / business process and such non transaction table it become a dimension. Conform dimensions are set up between subject areas.</p>

<p>4) Other tables / data are merged into the identified tact / dimension  tables. SCD properties are identified and metadata added. Load process metadata process tables / attributes are added.</p>

<p>This process reminds me of the creation of a collage and the end effect is very much the same. A very fragmented, but tightly knit data structure.</p>

<p>As my colleagues mentioned this is the bottom up approach - my interpretation of bottom up approach is different and I'll make it subject of future blog. </p>

<p>My Comments:</p>

<p>1) I look at this as data / database consolidation project not a data warehousing project. </p>

<p>2) I am not sure if there is any advantage of moving to dimensional modeling in such cases. Why not use 3NF in cases where you need to build a complete data repository for the enterprise. I am not sure 100 dimensional data web is more understandable, accurate or faster than 3 NF model. It fact it is the opposite.</p>

<p>3) Some techniques smell similar to ORM. And given the fact that most sources are already in Object models; are we looking at birth of ODM - Object Dimensional Mapping? </p>

<p>4) Data access software and tools should take into account the above structures. Significant ammount of code needs to be written to analyze data in above scenario. This looks very much like stove pipe analysis scenario. Doing stove pipe analysis and maintaining code in MDX is non trivial for example.</p>

<p>5) What does a dimension mean in above schema? What does a fact table mean? What is the business-dimensional space created? Infact in above schema other than the way the history is tracked; is there any difference between dimension and fact tables. So does a dimension mean - history tracked smaller table, and a fact a non history tracked bigger / transaction table?</p>

<p>I guess I should take a 100 dimension data model as example and recast it into much simpler format and present both here as an example. Given the non graphical nature of the blog that may be difficult - but look forward for it.<br />
Vijay</p>

<p><br />
</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/how_and_why_are_hundred_dimens.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/how_and_why_are_hundred_dimens.php</guid>
<category></category>
<pubDate>Sat, 30 Aug 2008 04:30:00 -0700</pubDate>
</item>

<item>
<title>Bad dimension choice complicates measures and trend analysis</title>
<description><![CDATA[<p>This is oft repeated question - why are 100 dimension star schema / cubes bad. I'll give you a simple mathematical example of how bad dimension choice complicates interpretation of measures and trend analysis.</p>

<p>Consider the natural coordinate system we are all used to - x and y. Distance from origin to a point (x,y) is sqrt(x**2 y**2). It is perfectly symetrical and simple.</p>

<p>Now consider an artificial non orthogonal coordinate system with an additional dimension defined by a=X**2*y, b = x*y**2, c = x*y. The distance of a point to origin (((a b)/c)**2)- 2c. This is not symetrical in a,b and c and difficult to interpret. of course as you add more dimensions its harder.</p>

<p>A special coordinate system like (r,theta,phi) is symmetrical and simpler for distance for origing - but its difficult in general case - e.g. define a cube.</p>

<p>The availability and ability of a tool to support 100 non orthogonal dimensions, and the skill of the cubeDBA to optimize it may get you results not much slower than with a simplest cube, but the meaning of measures, and interpretation of trends become much much slower in such a system, leading to simple reports requiring mdx programs being run with cube used as filesystem.</p>

<p><br />
</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/bad_dimension_choice_complicat.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/bad_dimension_choice_complicat.php</guid>
<category></category>
<pubDate>Thu, 28 Aug 2008 05:15:00 -0700</pubDate>
</item>

<item>
<title>Microsoft BI and CubeFilesystem</title>
<description><![CDATA[<p>This is my real experience analyzing several Microsoft BI implementations – especially in Financial Institutions. </p>

<p>Programmers have been used to using file system as a place where they store and retrieve data in raw form, and they do significant logic in the code in (c  , Java etc) on top of it to get the required functionality. When given a new toy, MDX they used it exactly the same way – as a filesystem where they can build the logic on; instead of business-dimensional space where they should build the logic in! It is the structure of the dimensional space, and in its various measurements, and in definition of scales, in logic that is embedded in moving measures from scale to scale and in defining measures when all you have is data points that they should be concentrating on!</p>

<p>Not that its wrong. This approach though inefficient is a perfectly accurate way of accomplishing the task. But why do you need to write a program in MDX. You may as well use Java, and aggregate data structures landed to disk if that is how you want to proceed.</p>

<p>Update: Aug 21</p>

<p>The last few blogs have led to some emails from friends asking me if what they are doing isn't apt use of the tool what then is proper use of the tool. I refered them to by previous blogs but promissed them to summarize them here</p>

<p>0) Define the space you are going to work in ie next 123<br />
1) Identify dimensions as entities holding business variables. Is the problem I am looking to measure / model contain this variable as an independent parameter - not is this artifact of data representation. So lookups, data processing artifacts of what ever kind are not dimensions. Next question is are all these variables independent or can they be grouped as abstract entities thereby decreasing my dimensionality - keep in mind world is not 100 dimensional<br />
2) Ensure your dimensions are orthogonal - ie there are no relationships between dimensions. Only relationships allowed are hierarchies within dimensions, and relationships between attributes (e.g. code, desc)<br />
3) Further abstract your dimensions at entity level to reduce dimensionality. It is OK if some attributes are empty. You are not trying to reduce space here.<br />
4) Now you have defined an n dimensional space with multiple scales for the problem. Place the data points in the space.<br />
5) Define measures that represent the observed data at the dimensional hierarchy level of data points.<br />
6) Define the relationship between measures at higher levels of hierarchy and lower / observed levels of hierarchy. These could be simple addition of values in each of the dimensional cells or much more complex formulae.</p>

<p>Now you have defined and placed the data properly in MS BI Cubes. Remember you still need to verify results against data points since error caluculation and propagation is not part of above scheme. You may incorporate it into above but it may be non trivial.</p>

<p>You use MDX to model the data on top of the measures. ie when you need to calulate Mx**2   y  Z**3. Not for finding x, y and z because you have defined your dimensions wrong and need to correct by making coordinate transformations on the fly....or in programming speak you have business logic not incorporated into data structures.</p>

<p>Enjoy. <br />
</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/microsoft_bi_and_cubefilesyste.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/microsoft_bi_and_cubefilesyste.php</guid>
<category></category>
<pubDate>Wed, 20 Aug 2008 23:45:00 -0700</pubDate>
</item>

<item>
<title>Hedge Fund Problem -2</title>
<description><![CDATA[<p>Sometimes an intractable problem seems so only because it is misclassified. When you take a step back, look at big picture and think it is no longer an issue. </p>

<p>Case to point is a data problem that some hedge fund companies mentioned - how do you define an slowly changing dimension to take care of the issue where there are multiple identification scheme for the same security and each transaction uses one more more of these schemes. These schemes have non trivial many-many relationship with<br />
each other. This is very much a back office problem.</p>

<p>But for Market analysis does it matter? Would storing it as a junk dimension be OK. Can we run association algorithmns on it to group records with different set of security identifiers together into one record as at some level they represent same security. Or in other words would the information contained in grouped security identifiers be enough for analysis. Hmm once we think it became a simple Junk dimension   Data Mining problem.</p>

<p>Where the mind is free........</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/hedge_fund_problem_2.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/hedge_fund_problem_2.php</guid>
<category></category>
<pubDate>Tue, 19 Aug 2008 21:15:00 -0700</pubDate>
</item>

<item>
<title>The Hedge Fund Problem</title>
<description><![CDATA[<p>I have talked to a few hedge funds over last couple of months and they face complex cubes when implementing Microsoft BI solutions without a deep understanding of dimensional modeling. They are looking for MDX experts to help them dig out of the hole, not recognizing that it's a design issue. <br />
Here is the simplified version of the problem. A Hedge fund owns m(n) quantities of n number of securities. Every day they can sell or buy a few more - both the quantities of security held (i.e.  100 of IBM), or number of various securities (IBM, MSFT, HLTH) may change. The values of the securities themselves may change (lets simplify by assuming mark to market).  They  get a feed of daily transactions that were closed. They need to make daily (or periodic) Profit and Loss estimate. This could influence market communications and security transaction decisions. I assume accounting may make estimates of those. But you maybe looking to Build a SSAS cube in addition to study P&L by various dimensions to improve your business, or to build an executive dashboard.<br />
Please read the last two blogs to understand the underpinnings of Dimensional modeling. I'll build on those to explain how we model the above problem. First let me explain how it was modeled wrong in several places I am aware of.<br />
Take the incoming purchases / sales as transactions / facts. Identify dimensions (made mistakes here but that's beside the point).  Build and SSAS cube out of those transactions. Now start writing MDX programs that deduce P&L based on those transactions. Its not that they didn't understand the problem. They understand how to calculate P&L based on the data in hand. But they are using Cubes as filesystem and Writing MDX programs on it.  If you want to do that use Java. Why struggle with MDX. It's not that the business is complex.  It’s just the design is wrong.  The measures you are storing at base level of cube are wrong.<br />
What does your car speedometer show you? It's the speed of the car at that instant. Not acceleration not velocity - just the speed.  Imagine of it showed acceleration and you constantly have to calculate speed in the back of your mind by adding (or subtracting) from initial speed of 0. <br />
So what should be right measures in Hedge Fund P&L Business-Dimensional Space.  That depends on the understanding of the problem. Don't confuse sales transactions with Business events in Business-Dimensional Space. What we are trying to analyze in P&L is the Holdings - not the sales.  Sales are just changes in the holdings i.e. a derived measure / calculation in the cube. Moving measures to reflect Holdings, not sales will solve the complexity issues with the P&L Cubes.  You no longer have to traverse several cells in sales space to deduce P&L. P&L is simple calculation from differences in measures at cell A and B. This will not increase the size of the cube. The size of the cube is dictated by cardinality of the dimensions.  It’s just the ETL that needs to reflect the logic of deducing holdings from initial holdings and sales. </p>

<p>And since I am often asked "There is no dimension that can fix the above problem". A particular facet may be fixed but hack is a hack and the problem shows up elsewhere - in needing complex MDX or unknown data accuracy.<br />
</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/the_hedge_fund_p.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/the_hedge_fund_p.php</guid>
<category></category>
<pubDate>Fri, 15 Aug 2008 15:15:00 -0700</pubDate>
</item>

<item>
<title>Facts, measures, aggregations and cubes .....</title>
<description><![CDATA[<p>In my last blog post I quickly skipped from facts to measures to aggregations.  Let me make this technique clearer.</p>

<p>A business transaction can be represented by a point in business-dimensional space. The point has properties assigned to it based on the business transaction. In BI a single point is not of much use. You are more looking at trend as an indicator of what has happened and forecast what will happen and plan for it. One method of finding trends out of individual events with properties in multi dimensional space is to model the business  with parameters and fit the model to the data to get the parameters.  Another alternate way is to define statistical quantities in small units of the dimensions (instead of points). These statistical quantities are measures. They can be as simple as sum or distinct count or much more complex like EBIDA.   Going from individual events to statistical quantities not only reduces the size of data but also its complexity – several hundred stock prices' trend is encapsulated into one DJIA number that signifies the entire market.</p>

<p>Now the units of dimension are not universal or fixed. Infact several nested units can be defined for same dimension. At different scale you may notice different trends in the business as discovered in Data. NASDAQ may be trending down but NASDAQ 100 may be trending up. Or commodities Index may be trending up while DJIA is crashing. However it is non trivial to define and measure statistical quantities from one unit of dimension to another – or in other words move measures from one level in dimension hierarchy to another. Such techniques are called aggregations.</p>

<p>Its also possible to define aggregations when the dimensional space is projected into a space with fewer dimensions - e.g. You take location out of consideration and analyze business performance against other dimensions. Again this is non trivial to define and project measures onto such projections</p>

<p>There are always a large number of projects that can be defined in a multi dimensional space. If you go from 6 dimensional to 4 dimensional space there are 15 different projects possible. All of them may not have business relevance.  A business may still have 5-20 relevant dimensional projects - also called aggregate tables of various reductions (In number of dimensions).  Still it is too complex to maintain and defeats the purpose of data warehousing as a data deliverable to end users.</p>

<p>Solving both the dimension projection, and Dimensional Scale change problems is the invention of data cubes or multi dimensional databases. Multidimensional databases store data with measures already calculated (or capable of automatically calculating) against all dimensional scales and dimensional projects - ie hierarchies and slicing and dicing. They further provide the capability of drilling through - or go from larger to smaller  dimensional scale or even to individual business transactions. </p>

<p>Its every cool if you know what you are doing. If not you can get burnt very fast and not know about it - too many companies are discovering that their MDX programmers are building Data Gyms of unknown accuracy instead of Cubes. That will be subject of future blog.<br />
</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/facts_measures_aggregations_an.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/facts_measures_aggregations_an.php</guid>
<category></category>
<pubDate>Thu, 14 Aug 2008 22:15:00 -0700</pubDate>
</item>

<item>
<title>Ensuring Business Integrity in Data Warehouses</title>
<description><![CDATA[<p>Why do we build data warehouses?</p>

<p>A)	To streamline, consolidate and make repeatable information generation and delivery.<br />
B)	To facilitate speedy, accurate and repeatable exploratory data analysis by non technical users which still has Business Integrity<br />
C)	To streamline, consolidate and make repeatable statistical data analysis efforts</p>

<p>In A) I have control over the information and hence can ensure the Business Integrity. In C) I am working with experts in statistical analysis and am just providing them a platform (data & technical) for their analysis. </p>

<p>The troublesome piece is B). This is where the innovation of Ralph Kimball comes in. He has developed a structure for data that is simple without sacrificing accuracy. This dimensional model assigns a transaction to one and only one point in multi dimensional space. The dimensions themselves are modeled after Cartesian system – orthogonal with each dimension having a hierarchy of scales defined. A point in multi dimensional space can have multiple events. Properties of these events are represented as measures at that point. These measures can be aggregated (simple sum, count or more complex functions) to define derived measures at each scale. The calculation of measures of a scale from events or measures at lower scale are non trivial and needs to be handled carefully.</p>

<p>It is the data warehouse architect's goal to fit the data into the scheme described above. While extensions and innovations are always welcome they must be weighed against the requirement to ensure Business Integrity of exploratory data analysis without having to hire a statistician to look over their shoulder or in case of Data Gyms the programmer's shoulder. <br />
</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/08/ensuring_business_integrity_in.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/08/ensuring_business_integrity_in.php</guid>
<category></category>
<pubDate>Wed, 13 Aug 2008 20:45:00 -0700</pubDate>
</item>

<item>
<title>Data Warehousing the Cloud.</title>
<description><![CDATA[<p><br />
Cloud Computing is the in thing today. Cloud computing is what its name says – to be able to build functional capability without overall structure – or with over all structure handled by someone else (black box computing may be apt in this case). This is different from outsourcing hardware to a data center. It is an approach to providing business functionality that aims to be faster and cheaper.</p>

<p>Properly structured enterprise application development will need use case analysis of business, design of a relational data model (the solution to legacy data cloud problem proposed by Codd),  creation of object layer on top of relational model, and structured screens that lead user through the data entry / selection process. This structured approach ensures the data integrity, functional integrity and hence the business Integrity. </p>

<p>However it is also expensive – not just financially. To understand this look at what happened when structured GUI became Screen Cloud (Web). The innovation and usage skyrocketed while the cost dipped sharply when screen flowing into one another became screens with random links from page to many other pages. However the cloud concept was mostly limited to content and the applications were still structured. The Software and Hardware infrastructure was still structured. </p>

<p>The revolution in inception (if it comes to pass) is in the way IT hardware / software is provisioned and how that is stored to run applications and store and retrieve data. The Infrastructure cloud replaces the IT department with a black box that takes credit card and gives you incremental hardware and software resources to build your applications on. The infrastructure cloud also supports a data cloud i.e. another black box which will replace your Data Architect and DBA. All database maintenance work is automatically taken care of in cloud. The data model will be made non existent using hyper tables that support name value pairs instead of rows and columns. All the left is business logic that needs to be coded by use case in applications. </p>

<p>This is not a pipe dream. This is where we could end up in couple of year’s time. Or NOT.  And this doesn’t have to be a failure. Most business can go back to being strict with the use cases they support in their business processes and put strict controls on the data integrity. I.e. if you call the business and if you spell your name wrong too bad you are not in our system and will not be supported. There will be no search functionality to rescue you. Since there is no structured data model in data cloud there is a possibility of different representation scheme ie variables and their relationships for different use cases even within a single application.</p>

<p>In summary – applications with a few use cases can be built and supported very easily and cheaply on the cloud. </p>

<p>Where does that leave data warehousing? We basically investigate, understand and merge data from different sources with different representation schemes (i.e. data models) into one specially designed representation scheme to support analytic efforts. We will likely continue to do that – this time integrating data from different use cases into a coherent analytic data model. That increases scope tremendously but its somewhat offset by using the cloud itself to build data warehouses assuming that the cloud has a special color (dark clouds?) to support different characteristics of Data Warehousing Loads. The ETL tools we use today have to change to support new requirements. Those changes will wait for market to firm itself. </p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/05/data_warehousing_the_cloud.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/05/data_warehousing_the_cloud.php</guid>
<category></category>
<pubDate>Fri, 16 May 2008 13:45:00 -0700</pubDate>
</item>

<item>
<title>How many Dimensions should a data warehouse have?</title>
<description><![CDATA[<p>Of course that will depend on the subject and requirements. However I refuse to believe that a business is / can be analyzed by 100   or even 25   dimensions. There are not simply that many dimensions.</p>

<p>Then why do I see models with 150   dimensions (e.g. in a big stock exchange). What is the mistake that the data warehouse architect made? </p>

<p>Analyzing the dimensions I find they are not dimensions and lookups fitted into dimensional model. A lookup is a code that has a description attached to it. Basically the origin is the transactional systems where there is a need to abbreviate and minimize the transaction length. I would prefer to convert them entirely from codes to proper names. The codes and descriptions are not dimensions.</p>

<p>This got the stock exchange model to less than 15 dimensions easily manageable and understandable. However it was too late to make the change 2 years and 10 million into the project.</p>

<p>The minute you hear the number of dimensions to be a large number - question your data warehouse architect and get a second opinion.</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/05/how_many_dimensions_should_a_d.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/05/how_many_dimensions_should_a_d.php</guid>
<category></category>
<pubDate>Sat, 03 May 2008 19:30:00 -0700</pubDate>
</item>

<item>
<title>Top 5 Reasons Why Data Warehouses Fail</title>
<description><![CDATA[<p><strong>Lack of Executive sponsor</strong><br />
A data warehouse touches so many areas deeply, requires work from so many people and opens so many cans of worms in so many miscoded applications that it requires strong executive leadership to workthrough or workaround the turf battles. <br />
<strong>Confusion between Data Warehousing tools and Data Warehouse</strong><br />
Although vendors will tell otherwise Data Warehousing tools are neither necessary nor sufficient to implement a data warehouse. They are in very nice to have category, making the technical details of developing a data warehouse easier and making it less likely to fail due to bad programming for an OTHERWISE sucessful effort. Tools don't turn around a badly understood / designed data warehouse by themselves. <br />
<strong>Confusion between OLTP Application and Data Warehouse Development</strong><br />
Although the tools used are same - databases, programming languages, Hardware software and network infrastructure - they are used differently. Every aspect of IT from Hardware / Software infrastructure to Design, Development and QA is done with massive data flows and need for data precision, accuracy and meaning. <br />
<strong>Work within Corporate IT Procedures</strong><br />
Big Corporations are setup with inflexable IT procedures, built over years of managing IT design, development and Operations within the company. It is very difficult to get them to think through and realize that some of these don't apply to Data Warehousing. Working around their procedures creates additional cost and risk to the project.<br />
<strong>Violate KeepItSimpleS***** Principle</strong><br />
Once the project gets off the ground in a company starved off information for years, there is a tendancy to make the project comprehensive, even working in future requirements and source applications still in design phase. A sucessful data warehouse must have a vision and the ability to reject all requirements that don't fall within that vision. Else the project becomes too vague and too big to succeed.</p>

<p>ZEROTH law of Data warehousing:<br />
Hire the right people. They will take you through ...</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/03/top_5_reasons_why_data_warehou.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/03/top_5_reasons_why_data_warehou.php</guid>
<category></category>
<pubDate>Sun, 23 Mar 2008 17:45:00 -0700</pubDate>
</item>

<item>
<title>Data Warehousing and Corporate IT Practices</title>
<description><![CDATA[<p>Big Corporations have streamlined IT practices to enable them to support applications in cost effective manner. many of these make sense for OLTP applications and are most likely designed so. However they will make BI development difficult. <br />
e.g.:<br />
BI tasks are more data intensive per transaction compared to OLTP tasks. A BI developer will need the ability to create tables on the fly, to take subset of data to analyze trends or even issues with data. An OLTP developer will likely not need this ability. Corporate policies that prevent such ability impede on the ability of BI developer to diagnose and fix problems. There are work arounds - but it is never a simple matter to backup a terabyte of data, move it across network, find a machine to handle it, and work off it. No one has such powerful machines laying around just in case....</p>]]></description>
<link>http://www.beyeblogs.com/DataSense/archive/2008/03/data_warehousing_and_corporate.php</link>
<guid>http://www.beyeblogs.com/DataSense/archive/2008/03/data_warehousing_and_corporate.php</guid>
<category></category>
<pubDate>Fri, 14 Mar 2008 16:30:00 -0700</pubDate>
</item>


</channel>
</rss>
