January 30, 2007
Will Web Services Degrade the Relational Idea?
Sometimes things are so obvious that we never think to actually utter them. It can be an enlightening experience to try to tease these things out of our various mental models. It can just as likely be sort of upsetting too, but I'm confining this to discussions about technology.
For instance, what is the big deal about data integration? If I have a ton of data in SAP R/3 and I want to use it somewhere else (and by "use" that could mean extract and transform, or just read, or something in between), and that data, no matter how obscure, is in a relational database, why can't I just read it?
That previous paragraph is a rhetorical question, of course.
One of the reasons it is so hard to explain to people (clients) why this is so difficult is that data IN a relational database is not necessarily modeled that way. What does that mean? It means that if your application program uses the relational database as a big bag of persistent data, but carries all of the relevant logic within its application code, it isn't, strictly speaking, relational.
When relational databases first emerged for business applications, many IT shops were urged by their vendor (IBM) to move things to DB2. Application vendors in the pre-ERP days, like Walker Interactive, did crazy things like moving the entire code block of the chart of accounts into a single, concatenated text field. This made it impossible to select or join on anything; one had to read the transactions sequentially and parse them in COBOL code to do anything with the system. This is the opposite of the desired effect of a relational database, where the data is supposed to be arranged in a logical schema (well, a physical representation of one), and at least some semantics are understandable, such as ACCOUNT_NUMBER meaning account number and DIVISION meaning division.
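To make the contrast concrete, here is a minimal sketch (hypothetical table names and data, with Python's built-in sqlite3 standing in for the database) of what the concatenated code-block approach costs you:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Relational modeling: each element of the chart of accounts is its own column,
# so the database can select, filter, and join on it directly.
cur.execute("CREATE TABLE gl_relational (account_number TEXT, division TEXT, amount REAL)")
cur.executemany("INSERT INTO gl_relational VALUES (?, ?, ?)",
                [("4000", "EAST", 125.0), ("4000", "WEST", 75.0), ("5000", "EAST", 50.0)])

# "Bag of bytes" modeling: the whole code block is one concatenated field.
# The database cannot filter on division; only the application knows the layout.
cur.execute("CREATE TABLE gl_concat (code_block TEXT, amount REAL)")
cur.executemany("INSERT INTO gl_concat VALUES (?, ?)",
                [("4000EAST", 125.0), ("4000WEST", 75.0), ("5000EAST", 50.0)])

# Relational: one declarative query answers the question.
east_total = cur.execute(
    "SELECT SUM(amount) FROM gl_relational WHERE division = 'EAST'").fetchone()[0]

# Concatenated: read every row and parse positionally, as the COBOL programs did.
east_total_parsed = sum(amount for code_block, amount in
                        cur.execute("SELECT code_block, amount FROM gl_concat")
                        if code_block[4:8] == "EAST")

print(east_total, east_total_parsed)  # same answer, but only one used the database
```

Both paths produce 175.0, but the second one works only because the application hard-codes the positional layout of the field, which is exactly the problem.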
But with the massive functional reach of something like SAP R/3, the database is used as a dumb container and the real data is first buried under a few layers of indirection (because the application code is object-oriented, and some data is class data and some is object data, though that is a wild oversimplification).
R/3 also compresses what should be some tens of thousands of tables into a few thousand by combining many unlike tables into one. Learning all the tricks and hidden hard and soft pointers is possible, but that doesn't take into consideration the amount of customization that is routinely applied before one of these systems goes live.
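A rough sketch of the pattern, with entirely hypothetical names and record layouts (the real R/3 pool and cluster tables are far more involved): many logical tables share one physical table, and only application code knows how to take them apart.

```python
# Hypothetical "pooled" table: many unlike logical tables squeezed into one
# physical table. The database sees only generic columns; the real structure
# lives in application code that knows each record's layout.
pooled_rows = [
    {"tabname": "CUSTOMER", "varkey": "C001", "vardata": "Acme Corp|NY"},
    {"tabname": "VENDOR",   "varkey": "V017", "vardata": "Smith Supply|NJ|NET30"},
    {"tabname": "CUSTOMER", "varkey": "C002", "vardata": "Globex|CA"},
]

# Each logical table needs its own decoder -- a "soft pointer" into the blob.
LAYOUTS = {
    "CUSTOMER": ["name", "state"],
    "VENDOR":   ["name", "state", "terms"],
}

def read_logical_table(rows, tabname):
    """Reconstruct one logical table from the pooled physical table."""
    fields = LAYOUTS[tabname]
    return [dict(zip(fields, r["vardata"].split("|")), key=r["varkey"])
            for r in rows if r["tabname"] == tabname]

customers = read_logical_table(pooled_rows, "CUSTOMER")
print(customers)
```

Nothing in the database itself tells an outside integrator that CUSTOMER records even exist, let alone what their fields mean; and any customization that changes a layout silently breaks every decoder written against the old one.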
But this raises a very important question – what's wrong with that? Why in the world would you want to try to grab raw data from something as complicated as R/3? Wouldn't you rather rely on its own internal services to handle requests and serve up exactly what you asked for, without ever having to worry about things changing? In fact, in a web services world, isn't that exactly where we want to be?
Something to think about…
Didn't Anyone Notice Incorrect Reports Before MDM?
The current issue of TDWI's What Works (Volume 22) includes a sponsored survey report on master data management, which prompted the thoughts below.
First of all, I want to say that Phil is a serious professional, I like him personally and have a lot of respect for his work, and he is an excellent analyst. But, like all analysts, myself included, some or all of our research is funded by vendors. In Phil's case, this research was funded by Actuate, ASG, BusinessObjects, DataFlux, Informatica, SAP, Sunopsis and Teradata. At a certain level, the analyst is attempting to uncover the sentiments in the market toward certain ideas and trends, but the sponsors are expecting a document that validates their current marketing of some concept and/or product. This is because buyers mostly operate in herd mentality, and a favorable report from a respected analyst is a useful tool. It's a difficult position for an analyst. Because of this balancing act, it's important to learn how to read these survey results.
Survey design is a very complicated process, a craft really, if you want to uncover the nuance of certain phenomena. But surveys designed for marketing purposes are not very rigorous. Part of it has to do with how you select the sample, and part of it has to do with how you ask the questions. I haven't examined the methodology of the full report, but it's likely that TDWI members were polled and responses self-selected through a web-based survey tool, so bias is significant. But what is more interesting is the questions themselves. For example:
"Has your organization suffered problems due to poor master data?" 83% said yes. That's like asking if you've failed to plan sufficiently for your retirement. Without some quantification, who would ever answer no to a question like that? A better question would have been, "Is your organization impeded in meeting its goals in a measurable way due to poor or lacking master data?"
So from the beginning, the impression is made that 83% of companies feel they have suffered problems due to poor master data. This defines the market. The number will be (and already has been) quoted over and over. Now the report goes on to list the categories of problems they've had, such as Inaccurate Reporting, Arguments Over Data, Data Governance problems, etc. These questions are so open-ended and soft they are like asking people if they ever wished they had more money, were taller or had more time for leisure activities. In fact, it's amazing the response rates weren't higher than 70-80%.
Dissecting the answers on benefits from MDM, there were 741 respondents, but only 54% claimed to realize benefits, or roughly 400 of the total. Yet when the kinds of benefits are charted, the denominator for the response is that 400, not 741, inflating and distorting the impact of each benefit. In other words, we are led to believe that 76% of the respondents achieved improvements in data quality when in fact, it was only about 40%. Other soft questions, those that are designed to get a desired response, are:
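The arithmetic behind that distortion, using the figures as cited here:

```python
respondents = 741
realized_benefits = round(respondents * 0.54)   # ~400 claimed any benefit at all

# The chart reports "improvements in data quality" against the smaller
# denominator (only those who claimed benefits)...
dq_rate_reported = 0.76
dq_count = dq_rate_reported * realized_benefits

# ...but measured against ALL respondents, the rate is much lower.
dq_of_all = dq_count / respondents
print(f"{dq_of_all:.0%}")  # roughly 41% of all respondents, not 76%
```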
Improvements in Data Quality: How much of an improvement would it take to answer yes to this question? Is a 1% improvement enough, even if it cost $5,000,000 to implement? A better question would have been, "Can you measure the effect of improvements in data quality as a result of this initiative?"
Better decision making: Who can actually measure this, how much better, and what state was decision making in before the initiative? I wouldn't substitute another question here because I don't think this can be answered objectively.
Easier auditing of informationís origin: Again, is it just fractionally better or a lot better?
But the one question that really got my attention was "Accurate Reporting." This is tied with Improvements in Data Quality as the most frequently cited benefit (about 30%), but it raises the question: how inaccurate was the reporting before MDM? After all, these companies didn't crawl out from under a rock; they are TDWI members and have presumably been around Data Warehousing for a number of years. They may have won awards for their efforts, presented case studies of their successes at conferences, who knows? Were their reports inaccurate last year or the year before? If we went back in the literature and checked on them, would they have mentioned that their reports were inaccurate and they were just waiting for a solution, such as MDM, for which they didn't have a name yet?
Data Warehousing has been around for 20 years and accurate, integrated data is the heart of the solution. How is it that the leading proponents of Data Warehousing, especially the visible and vocal ones, failed to notice for 18 years that everyone's reports were inaccurate until MDM came along? I know the answer will be that MDM is broader than Data Warehousing in its reach, that it is designed for the enterprise, not just analytics, but this raises one big question – isn't reporting part of Data Warehousing? If reports were inaccurate before MDM, then Data Warehousing didn't do its job and you should take the experts and gurus to task on this one. Unless I'm mistaken, the lion's share of output from data warehouses has always been, and continues to be, reports in one form or another. If they've been inaccurate, what does that say about existing data warehouses, data warehouse products and data warehouse methodologies?
On the other hand, for an industry to promote a new solution (MDM) that solves a problem (inaccurate reporting) that was supposedly already solved by prior approaches (data warehousing) is troublesome. In 2003, TDWI awarded special recognition in eight categories to companies that achieved excellence in data warehousing. You can read about it here: http://www.dw-institute.com/display.aspx?ID=6652
The names of the judges are also listed. Was Inaccurate Reporting overlooked as a criterion? With 81% of today's respondents reporting it as a problem, it seems pretty likely that at least some of these award recipients were experiencing it in 2003. Did everyone miss it?
I think the issues cited with the survey are important, but not dangerous. The sponsors of the study are satisfied that they have some ammunition for their sales force and the analyst has raised the awareness of an important, emerging issue. Few will come away remembering which benefits ranked in which order and for what amounts. Rather, they will have the impression that MDM can solve some of these problems for some people, if not necessarily for them. That's not so bad.
November 9, 2006
Privacy and Syndicated Data
Speaking to Data Warehousing, BI, CDI/MDM or other data management groups is an essential part of what I do, but it is always refreshing to get out of the basement once in a while to enjoy the opportunity to address groups of people outside of our little industry.
I've had the privilege every year for the past decade to be invited to come and talk about technology to the Society of Actuaries. Their annual conference is attended by literally thousands of people, though the technology section is a smaller subset. Between regulatory, professional and risk management issues, technology is not the highest priority. Nonetheless, they are a wonderful group of intelligent, informed people, many of whom suffer from data dysfunction of the most severe kind since a large part of their work is analytical.
One of the speakers was a statistician who presented material about his firm's use of commercially available data that could be used for underwriting, say, group health insurance. This really disturbed me. For example, when individuals fill out health questionnaires as part of the application process for small group insurance (and remember, the engine of economic value creation is small business), there are strict, regulatory limits on the questions that can be asked. But apparently, for as little as $0.10 per person, according to this fellow, any underwriter can purchase up to 3500 bits of information about anyone (in the US, that is, where privacy is a debatable issue).
So, even though a health insurance underwriter can't discriminate based on gender, you could, theoretically, determine the gender makeup of the group by delving into the food the person eats, the car they drive (like a minivan), or the gym they belong to. It gets worse, much worse. When he showed a table of "diseases" he could predict based on this lifestyle data, the number one "disease" was ------ pregnancy!
I don't know about you, but I think this is going too far.
To me, filling out a health questionnaire means that that is the extent of the health information that will be revealed. In point of fact, that isn't true, because they may ask for an attending physician's statement or go to the MIB (Medical Information Bureau) for more data, but both of those are indicated as possibilities in the fine print. What really irks me about this topic is that there is no explicit notice that data about food purchases (fast food, diet food, vegetarian, gourmet), self improvement (health/fitness, dieting/weight loss), fitness activities (aerobics, running, walking, tennis, golf), physical inactivity (television time, computer time, board games, stamp collecting), stress indicators (financial problems, family size and status, occupation), tobacco preferences, alcohol consumption, travel, vehicle type, etc. is going to be used. It just seems sneaky to me.
I asked the audience how they felt about this, and I was sort of shocked that no one had an objection. I probed a little more and asked how they would feel if someone was snooping around their grocery store purchases to evaluate them. One finally spoke and said, "I personally opt out of anything like that so there is a minimal amount of data about me."
"So you don't find it a little hypocritical that you opt out yourself, but you would use this data to evaluate others," I asked?
I got a shrug and a half-hearted, "There's lots of hypocrisy in the world."
That isn't a good enough answer. Now, health insurers will tell you this is a $1.2 trillion problem, but how is denying people health insurance based on the car they drive or their likelihood to, heaven forbid, have children, going to solve the problem? The problem with healthcare is the cost, and the health insurers created the problem by indemnifying every outrageous cost until one day their monster came home to roost.
For more information about this see http://www.contingencies.org/janfeb06/what_you_eat_0106.asp
I'd like to get some comments on this (the privacy issue, not health insurance), either here or to me directly at firstname.lastname@example.org