April 28, 2008
Issue 19: The Compress Engine - The Netezza Philosophy
"‘To be is to do.’ - Immanuel Kant; ‘To do is to be.’ - Jean-Paul Sartre; ‘Do-be-do-be-do’ - Frank Sinatra" - Kurt Vonnegut, Jr. (Nov 1922 - Apr 2007)
In the news today: the Compress Engine
In 1783 Immanuel Kant wrote, "David Hume woke me up from my dogmatic slumbers," and revolutionized the way humanity thinks about metaphysics. Almost 220 years later, Netezza set out to achieve a similar goal: to redefine analytics. When the first NPS® data warehouse appliance was introduced, the market released itself from yet another dogmatic slumber and realized that there is a different, better way to do data warehousing; a way without compromise, a way without limits. Netezza has helped to reenergize the data warehouse market by creating and leading the data warehouse appliance category.
- "Every time you turn around you see another industry that's facing a tidal wave of data and they need to understand what this data is saying. Many of them have data volumes in this range that they haven't been able to afford to analyze, as much as they'd like to. ... [Netezza] can deliver that analytic capability, and at a very attractive price." - Richard Winter, Winter Corporation, from Netezza will scale its appliance to petabyte range, InfoWorld (January 2008)
- "This is what Netezza has done in the data warehousing market: it has totally changed the way that we think about data warehousing... So the bottom line is not just that Netezza’s entry into the market was a black swan event but that that event has not ceased to unfold." - from Netezza: a black swan by Philip Howard, Bloor Group (October 2007)
- "Appliances are here to stay and are revolutionizing the data warehouse industry." - from Business Analytics Appliances Are Here to Stay, by Dan Vesset, IDC (June 2006)
- "The term data warehouse appliance was coined by Netezza, and this vendor has blazed a trail by proving the concept and educating the market." - from Defining the Data Warehouse Appliance, by Philip Russom, TDWI (August 2005)
Since 2002, Netezza has been repeatedly breaking the latency barrier and challenging the boundaries of data analytics. Since our first release, we have been continuously refuting the alleged mutual dependencies that became the building blocks of the industry’s dogmatic misconceptions; namely the expensive nature of performance, the necessary complexity of the analytics architecture and the unavoidable limits of scalability. With today’s announcement of the Compress Engine, Netezza disproves yet another myth: the inverse relationship between data compression and query performance.
The architectures of traditional data warehouses, steeped in a legacy of serving OLTP applications, were not designed to handle the ever-growing amounts of data combined with larger and more complex user workloads and shrinking data latency requirements that characterize the modern enterprise. Regulatory compliance, electronic commerce and the need to process and analyze all data in a matter of seconds have pushed the capabilities of traditional data warehouse systems to their limits. In reaction to the data capacity pressures, vendors introduced compression; not as an enhancement but as a compromise solution that allows for further data growth at the cost of processing performance.
Traditional compression approaches, used by several competing data warehouse vendors, typically degrade performance in exchange for the compression effect. Netezza’s addition to the FPGA-Accelerated Streaming Technology (FAST) Engines framework, the Compress Engine, uses its innovative streaming architecture™ not only to increase the system’s storage capacity by 2-4X but also to boost overall streaming query performance by a factor of about 2X (a 100% increase). All this is achieved without requiring any tuning or administration; in fact, enabling Compress Engine on the Netezza appliance is a software-only upgrade.
It’s actually really cool technology, obviously something we love to rave about. Late last year, I wrote about FAST Engines in this blog. We’ll use that as a starting point and dig a level deeper into how Compress Engine works. I’m sure it will tickle the fancy of the geek in you!

The NPS system employs a patent-pending method for compiling (yes, compiling) columnar data in all the tables of the database as it is being written to disk, e.g. during load, insert or update operations. The process converts row-based data into column streams that are independently compiled to replace the original data in the columns with a stream of "instruction sets" for the FPGA. The "instructions" themselves are much smaller in size than the data they replace, resulting in a highly compressed data stream emerging from the process.
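To make the idea of replacing column data with smaller "instructions" concrete, here is a minimal sketch in Python. It is not Netezza's compilation method, which is not described in this post; it substitutes plain run-length encoding, and the `REPEAT` instruction, the function names and the sample column are all hypothetical, chosen only for illustration.

```python
from typing import List, Tuple

# Hypothetical instruction format (an assumption, not Netezza's):
# ("REPEAT", value, count) means "emit `value` `count` times".
Instruction = Tuple[str, int, int]


def compile_column(values: List[int]) -> List[Instruction]:
    """Encode one column of integers as a list of REPEAT instructions."""
    program: List[Instruction] = []
    for v in values:
        if program and program[-1][1] == v:
            # Same value as the current run: extend the run by one.
            op, val, count = program[-1]
            program[-1] = (op, val, count + 1)
        else:
            # New value: start a new run of length one.
            program.append(("REPEAT", v, 1))
    return program


def run_program(program: List[Instruction]) -> List[int]:
    """Apply the instructions in order to restore the original column."""
    out: List[int] = []
    for op, val, count in program:
        if op != "REPEAT":
            raise ValueError(f"unknown instruction: {op}")
        out.extend([val] * count)
    return out


# A made-up column with long runs of repeated values.
status_column = [7, 7, 7, 7, 3, 3, 9, 9, 9, 9, 9]
program = compile_column(status_column)
restored = run_program(program)

print(len(status_column), "values ->", len(program), "instructions")
print("round trip ok:", restored == status_column)
```

In this toy case eleven values shrink to three instructions because the column repeats itself, which is the kind of redundancy that tends to show up within a single database column far more than across a whole row.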
While the compression occurs on columnar data because of the inherent compressibility within database columns, the compressed data is reassembled in rows before being written to disk. Row-wise storage of tables avoids the data scan complexity associated with columnar stores and ensures that scanned data can be efficiently parsed and processed without the need to reconstitute it from multiple sources. The compressed data uses disk much more efficiently and increases the data density of NPS systems by 2-4X - in some cases substantially higher - allowing customers to scale their NPS data warehouse systems into the hundreds of terabytes of user data.
But if the NPS system’s data compression and scale brought the system’s performance to its knees or severely limited performance speedup due to compression (as it does on many of those other systems), it wouldn’t be so great, would it? The beauty of the Netezza way of providing data compression is that not only does it have no negative impact on performance, but it actually increases query performance by up to 100%!
As the compressed data is read off the disk, it is passed through the Compress Engine, which applies the instructions embedded in the data stream to restore it to its original form. Our compilation algorithm ensures that this decompression process can be performed entirely in silicon, at wire speeds. Each physical block scanned from the disk can mushroom into 2 to 4 or more times its size in memory without incurring any overhead in processing time; i.e., 2 to 3 times more data is scanned in the same amount of time without any increase in system hardware! Our internal benchmark testing, reflecting real customer configurations and workloads, has shown an overall 2.2X increase in streaming query performance through the use of Compress Engine.
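The arithmetic behind that claim can be sketched in a few lines of Python. This is a simplified model, assuming that the disk read rate is the only bottleneck and that decompression adds no time at all; the function name and the 60 MB/s disk rate are hypothetical values picked for illustration, not Netezza specifications.

```python
def effective_scan_rate(disk_mb_per_s: float, compression_ratio: float) -> float:
    """Uncompressed MB/s delivered when decompression keeps pace with the disk."""
    if disk_mb_per_s <= 0:
        raise ValueError("disk rate must be positive")
    if compression_ratio < 1:
        raise ValueError("compression ratio must be at least 1")
    # Every compressed megabyte read from disk expands to `compression_ratio`
    # megabytes of table data at no extra time cost (the model's assumption).
    return disk_mb_per_s * compression_ratio


DISK_MB_PER_S = 60.0  # hypothetical raw disk read rate

rates = {ratio: effective_scan_rate(DISK_MB_PER_S, ratio) for ratio in (1.0, 2.0, 4.0)}
for ratio, rate in rates.items():
    print(f"{ratio:.0f}x compression -> {rate:.0f} MB/s of table data scanned")
```

The 2.2X figure quoted above is a measured benchmark result, not an output of this model.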
This software-only enhancement, enabled by our unique architecture, is only the beginning. As we continue to develop our platform, we are investigating further enhancements to the Compress Engine or the addition of new FAST engine(s), aimed at directly increasing streaming performance on the NPS system.
Our philosophy and aim is to continue to shake the industry out of its dogmatic slumbers by extending the price/performance advantages of our products, showing that there’s a different way to do data warehousing and advanced analytics: one where performance and scalability are the result of neither expense nor complexity, where you can get more performance from compression, where you do have the power to question everything™ ...
Posted by Phil Francisco at 8:30 AM
April 21, 2008
Issue 18: Teradata's "Me-too" Model 2500 – welcome to the Data Warehouse Appliance club ...finally
"Imitation is the sincerest of flattery." — Charles Caleb Colton (1780-1832), from his Lacon, Vol. I, published in 1820
Welcome to the Data Warehouse Appliance club - another validation of an important, growing market segment
Well, well, well! "Only" eight years after Netezza coined the term and invented the market segment, Teradata today finally officially entered the Data Warehouse Appliance market. Though it’s a bit late, and certainly behind a number of other vendors, perhaps today’s entry will put an end to Teradata’s vacillating over whether they 'invented' the concept or not, were an appliance or not, or whatever. In the past couple of years, it seems Teradata spokespeople have gone out of their way to say their product was simultaneously a data warehouse appliance and absolutely not one, even booking appearances on panels of data warehouse appliance "vendors". Certainly their announcement is another validation that the role of Data Warehouse Appliances is an important and growing one, not only in the current market but for the future as well.
Derivative Marketing and a "Repackaged, Warmed-over" Product?
Teradata is positioning this new product as being "simple, powerful and cost-effective", which to our way of thinking sounds more than a little derivative of Netezza’s long-standing value proposition, "Performance, Value and Simplicity", but I’ll leave it to the reader to decide. To our reading, the Teradata announcement sounds like just another larger vendor’s "repackaging" response to the competition. As with others before them, such as IBM and Oracle, it appears that with the 2500 model Teradata has done nothing more than cobble together a collection of elements from the company’s model 5500 systems, repackaged and sold as an appliance.
Powerful. Um, How’s That Again?
And while anyone who is serious about the appliance segment of the data warehouse market (like Netezza) has focused on delivering systems that can scale to highly complex, enterprise-wide, high performance systems, we think the 2500 will struggle to deliver even modest performance for just 6 TB in a single equipment rack. While Teradata is quoting just over 6 TB of user capacity per two nodes in this new system, let’s remember that they have been advising customers for the past year not to put more than 1.5 TB against each of those same dual-core CPU nodes. Which is it? Is the 2500 underpowered for its 6 TB data capacity per dual-node rack, or has Teradata been advising its model 5500 customers to pay at least 2X too much for their data warehouse systems for the past year?
Time will tell whether Teradata has made other compromises to the 2500 model in an attempt to limit its impact on its flagship products (5500 and the new 5550). Beyond its underpowered nodes, have they sacrificed anything else like workload management or system availability, or even the system's ability to handle highly-interactive, operational applications? As the days and weeks help raise the shroud covering the model 2500 further, we’ll know more. For now though, it just feels like "me-too" imitation.
Posted by Phil Francisco at 9:30 AM
January 15, 2008
Issue 17: Information Arbitrage - When big data plus big math pays off
by Jit Saxena, Netezza Chairman and CEO
"If you are a Big Dog and you are not persuaded by data, then in God we trust...but everyone else, bring data." - Jane E. Shaw, retired Chairman and CEO of Aerogen, Inc. and current member of the Intel Board of Directors (quoted from PowerSpeaking Inc.)
More and more companies recognize the power of analytics as part of their competitive strategy. But most solutions only provide a glimpse of what can be achieved. What is the potential impact when performance barriers fall away? In this post, I’d like to explore the possibilities and introduce a few examples of companies leveraging the intelligence in their data in new and unexpected ways. After all, competing is good, but winning is better.
In finance, the term arbitrage refers to the ability to find and exploit market disparities (hedging strategies monitoring currency or securities fluctuations being prime examples). Most arbitrage opportunities are very time-sensitive - you have to recognize value in an overlooked stock then swoop in to buy it before others take notice, get the same idea and drive up the price. On Wall Street, an arbitrage virtuoso, able to consistently spot untapped potential that others miss, is worth his or her weight in gold.
Leaping through Tiny Windows
The term Information Arbitrage has many similarities to its finance equivalent, and it’s a good way to think about the impact that analytics can have on a company or even an entire industry. Information arbitrage is about finding game-changing intelligence buried in vast, unappreciated data assets, and exploiting it to leap ahead of the competition. Like a financial investor, the Information Arbitrager takes advantage of an opportunity before the window slams shut (which can be very fast indeed).
Companies in certain industries make particularly good arbitrage candidates. These are companies dealing with "Big Data" - tera-scale or even peta-scale databases, and a constant flood of incoming data. Telecommunications, eBusiness, RFID retail applications and online advertising are a few segments that come to mind. Often the operational data is changing very quickly, and key insights are only found at a very granular level. Now suppose this kind of analysis normally takes hours or days, and one company can suddenly do it in minutes, seconds or even sub-seconds. As Netezza customers well know, this kind of intelligence disparity can have dramatic implications, both for that company and its market.
For example, telecommunications is a high-volume, low-margin business. Constant changes in network utilization demand real-time decisions about rating and pricing structures for an operator to stay competitive. By running pricing scenarios against billions of call data records, and by examining individual customers to determine their current calling patterns and preferences, iBasis, a major telco provider, knows exactly what options to offer each customer. In contrast, competitors might only see that customer as part of a larger segment measured at some time in the past, and come up short with their offers and pricing.
Something Has to Give... Or Does It?
There are several challenges to Big Data analytics that make arbitrage opportunities hard to pursue. Predictive modeling, optimization and other analytic applications are much more processor intensive than the SQL queries used in standard business intelligence applications. When complex algorithms and gargantuan databases converge with real-time business demands, something usually has to give.
Many companies find they are unable to fully exploit their growing data holdings, and have to make do with sampling or high-level summaries rather than the complete, granular data they often want to examine. But using partial or high-level data can be dangerous; even the most powerful algorithms can suggest spurious or meaningless conclusions when they are applied to insufficient data. Companies may also lose hours offloading data from the data warehouse to an external cluster of processors to run the analysis. With all these approaches, the result is an incomplete solution that provides just a hint of the possibilities of analytics, because that’s all the current technology is capable of delivering.
Consider the problem of optimization, for example. Optimization solutions play a key role in helping companies target the right customers, make the right offers, determine manufacturing volumes or accurately price products to take full advantage of market conditions while minimizing expenses. Depending on the problem being addressed, an accurate optimization solution needs to account for many variables and constraints such as products, branches, budget, time, contact channels, offer history, market segmentation and privacy preferences, to name a few.
Due to the multiple permutations and combinations among the different elements, even a simplified optimization model limited to only a month of data, a thousand customers and ten different offers results in an astronomical solution search space of 2 to the power of 10,000. Just to put things in perspective, the number of atoms in the observable universe is about 10^81, just a few more variables away.
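The scale of those numbers is easy to check directly. The short Python sketch below assumes the simplest reading of the model above, one yes/no decision per customer-offer pair, which is an interpretation rather than something the post spells out; the variable names are illustrative.

```python
import math

customers = 1_000
offers = 10

# Assumption: one yes/no decision per (customer, offer) pair.
binary_decisions = customers * offers        # 10,000 decisions
atoms_in_universe = 10 ** 81                 # estimate quoted in the post

# Number of decimal digits in 2**10,000, computed from the logarithm.
digits_in_search_space = math.floor(binary_decisions * math.log10(2)) + 1

# Smallest n for which 2**n exceeds the atom estimate.
decisions_to_exceed_atoms = atoms_in_universe.bit_length()

print("binary decisions:", binary_decisions)
print("digits in 2**10,000:", digits_in_search_space)
print("decisions needed to exceed 10**81:", decisions_to_exceed_atoms)
```

Under this reading, a search space of 2 to the power of 10,000 is a number with more than three thousand digits, while the atom estimate is passed after only a few hundred yes/no decisions.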
The "Big Math" at the heart of this kind of analysis pushes most processing technology to its limits and beyond. As the number of variables and constraints increases linearly, the search space grows exponentially, and the underlying problems are often NP-complete. As a result, companies are forced to compromise in the thoroughness of the analysis and/or the response time they are willing to tolerate. Most optimization efforts look at small snapshots of the total data available (for example, only the last month’s data), and make use of a range of techniques such as Linear, Dynamic and Integer Programming, Lagrange Multipliers and Cluster Analysis that reduce the level of complexity in various ways, all in an attempt to reach an actionable result in a realistic timeframe. But even with these approaches, companies are faced with costly infrastructure requirements, incomplete views of their data and lengthy response times resulting in stale data or missed arbitrage opportunities.
But what if you could bypass the existing performance limitations and get crucial intelligence much faster than before? For example, what if a database marketing company could use complex algorithms to get accurate optimization results days before the market could adjust? Or a retail franchise could precisely adjust the prices of thousands of products daily for each of its stores? Or a credit card company could run customer scoring algorithms one hundred times faster than its competitors? Or a financial services firm could run real-time Monte Carlo simulations on terabytes of data to manage risk? What impact could advantages like these have on a business? It’s fair to say the difference would be game-changing, providing a major competitive advantage and the ability to enter new markets previously out of reach.
These capabilities are not just marketing fantasies or future visions - they’re in use today.
Big Math, Big Data and No Barriers
Making these Information Arbitrage opportunities possible is precisely what Netezza does. Our streaming analytic appliances are built for running complex mathematical models on huge data sets, with results in a fraction of the time required by other technologies. Sophisticated analytic applications run "on stream" in the data warehouse, against all the records and detail that need to be examined. There’s no need to settle for summary data or aggregations, or ship data to another system for analysis. (We’re also constantly making our appliances better. Our recent doubling of performance is just the latest Netezza breakthrough.)
Through the Netezza Developer Network, we’re helping developers worldwide use the Netezza architecture to create a new generation of analytic applications that were previously impractical, unaffordable or simply impossible. When exploiting an arbitrage opportunity means leveraging Big Data and Big Math, Netezza’s streaming architecture is simply inherently faster and more efficient than other technologies. Of course, our customers already know this - and with appliance simplicity and low purchase price, information arbitrage pays off even more.
The bottom line is: when Big Data meets Big Math, great things become possible for our customers and their businesses, enabling them to:
- Use Information Arbitrage to take advantage of time-sensitive opportunities
- Rapidly run multiple scenarios and sensitivity analyses in near real-time
- Make use of all the available data, all the time, while their competitors are still struggling with reduced visibility from sampled or aggregated data
When the first Netezza appliances burst on the scene in 2002, their ability to query giant databases with unprecedented speed upset a lot of preconceived notions about the limitations of technology and what companies can do with their data. Advanced analytic applications take processing complexity to a much more challenging level, and once again the capabilities of our appliances are revolutionizing the market and capturing the imagination of our customers.
Jit Saxena
Posted by Phil Francisco at 10:00 PM