DATA-Beat: A DATAllegro Blog on Data Warehouse Appliances

May 2008 Archives

May 1, 2008

Columns & Rows

Over the last couple of years, there's been a lot of talk about the advantages of column-oriented databases for data warehousing, with Michael Stonebraker from Vertica being particularly vociferous and bold in his claims that row-based databases are going to be completely replaced. As a vendor of a row-based database, I obviously have a vested interest in refuting his claims, but I'm going to try my best to be even-handed in discussing this issue.

The Claims
If you were new to database technology and read some of Stonebraker's articles, you might be forgiven for thinking that column-oriented databases were a completely new invention and were set to sweep row-oriented databases from the data warehousing market.

He claims that column-oriented databases are 10-50x faster than traditional row-oriented systems and offer significantly higher compression ratios, thereby bringing down the cost. Benchmarks against Oracle are usually put forward to back up these claims.

The Reality
The fact is that column-oriented databases have been around for some time. In the data warehousing market, long-established (but not very successful) examples include Sybase IQ and Sand.

There are some advantages of column-orientation for DW workloads. For example, data compresses slightly better when stored in columns (DATAllegro compresses between 2:1 and 6:1 depending on the content of the rows, whereas column-oriented systems claim 4:1 to 10:1). Also, some queries (i.e. those that only access a few columns) will perform better.

However, in most real-world implementations, these advantages don't make a great deal of difference.

At the end of the day, column orientation is just one approach to limiting the amount of data read for a given query. In effect, it's an extreme form of vertical partitioning of the data. In modern row-oriented systems such as DATAllegro, we use sophisticated horizontal partitioning to limit the number of rows read for each query. We're also working on clever usage of materialized view technology to limit the number of columns we need to read. The end result is very similar performance to that claimed by Stonebraker, i.e. 10 to 50x that of traditional databases such as Oracle.
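To make the vertical-versus-horizontal comparison concrete, here is a rough back-of-the-envelope sketch in Python. Every number in it (row count, column count, column width, partition counts) is an assumption chosen for illustration, not a measurement from DATAllegro, Vertica or any other product, and the function names are hypothetical.

```python
# Illustrative only: all sizes below are made-up assumptions, not measurements.

def full_scan_bytes(rows, cols_total, col_width):
    """Naive row store: every column of every row is read."""
    return rows * cols_total * col_width

def column_store_bytes(rows, cols_needed, col_width):
    """Vertical partitioning: every row, but only the columns the query touches."""
    return rows * cols_needed * col_width

def pruned_row_store_bytes(rows, cols_total, col_width, parts_total, parts_hit):
    """Horizontal partitioning: whole rows, but only in the partitions that match."""
    rows_read = rows * parts_hit // parts_total
    return rows_read * cols_total * col_width

ROWS = 1_200_000_000   # assumed fact table size
COLS_TOTAL = 50        # assumed number of columns, all the same width
COL_WIDTH = 8          # assumed bytes per column value
COLS_NEEDED = 5        # assumed columns referenced by the query
PARTS_TOTAL = 36       # assumed monthly partitions (three years of data)
PARTS_HIT = 3          # assumed partitions matching the query (one quarter)

full = full_scan_bytes(ROWS, COLS_TOTAL, COL_WIDTH)
column = column_store_bytes(ROWS, COLS_NEEDED, COL_WIDTH)
pruned = pruned_row_store_bytes(ROWS, COLS_TOTAL, COL_WIDTH, PARTS_TOTAL, PARTS_HIT)

print(f"full scan:         {full:>15,} bytes")
print(f"column store:      {column:>15,} bytes ({full // column}x less)")
print(f"pruned row store:  {pruned:>15,} bytes ({full // pruned}x less)")
```

Under these assumptions both approaches cut the data read by roughly an order of magnitude. Which one wins in practice depends on how selective the query is on columns versus rows.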

May 19, 2008

Netezza

Netezza was the first disruptive data warehouse appliance vendor to hit the market.

When we started out in 2003, we took the position of saying that we were very similar to Netezza. However, over the last few years, our strategies have diverged a little. I therefore think it might be useful to explain the differences between the two products in an effort to help potential customers make a more informed choice.

Key Differences between Netezza & DATAllegro

Over the course of the next few months, I'll be going into detail on the differences between DATAllegro and Netezza. To kick things off, here's a brief summary:

The above spider chart rates DATAllegro and Netezza against other products in the market. The outer edge of the chart is "best in class". You might think that I've been self-serving in choosing the categories, but this is my blog after all! Let's go through each category so I can explain my reasoning:

Non-Proprietary HW

Netezza's snippet processing units (SPUs) are entirely proprietary, since they consist of a general purpose CPU, an FPGA and a hard disk drive on a custom blade. Netezza has occasionally made the argument that the underlying components are standard, off-the-shelf parts. While that's undoubtedly true, it's also true for pretty much any other piece of proprietary hardware. The only pieces of non-proprietary hardware in a Netezza appliance are the head units (standard HP servers) and the GigE network switches. As a result, Netezza scores very low in this category.

In contrast, a DATAllegro appliance uses completely standard servers (from Dell or Bull), storage (from EMC) and networking (from Cisco and Qlogic). Hence we get a best in class score in this area.

"DATAllegro: This firm's open, hardware-independent approach to the data warehouse appliance is catching on." Intelligent Enterprise, "36 Companies to Watch," January 2008.

Non-Proprietary SW

Again, Netezza fares poorly in this category. While they started out by leveraging the Postgres open source database code, they've effectively rewritten most of the system at this point. The SPUs don't even run a mainstream OS such as Linux (although the head units do). Finally, Netezza’s software that runs on the head units is entirely proprietary.

Footprint

Netezza has basically one product line, with each rack holding 112 SPUs and 12.5TB of user data (with optional compression, this goes up to around 25TB).

DATAllegro takes a slightly different approach, with two different options in our range of data racks. Our DR1530 offers up to 30TB per rack and is therefore slightly better in terms of user data footprint than Netezza. We also offer a DR200 for very large-scale systems that offers 200TB of user data per rack. We're therefore better on this metric.
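As a quick illustration of what these per-rack figures mean for floor space, the sketch below counts the racks needed for a hypothetical 400TB warehouse. The capacities are the ones quoted above, treated as upper bounds; the 400TB size is an assumption, and the calculation ignores head units, landing zones and any other supporting hardware.

```python
import math

# User-data capacity per rack in TB, as quoted in the post (upper bounds).
TB_PER_RACK = {
    "Netezza (uncompressed)": 12.5,
    "Netezza (compressed)": 25.0,
    "DATAllegro DR1530": 30.0,
    "DATAllegro DR200": 200.0,
}

def racks_needed(user_data_tb, tb_per_rack):
    """Whole racks required to hold the given amount of user data."""
    return math.ceil(user_data_tb / tb_per_rack)

WAREHOUSE_TB = 400  # assumed warehouse size, for illustration only

racks = {name: racks_needed(WAREHOUSE_TB, cap) for name, cap in TB_PER_RACK.items()}
for name, count in racks.items():
    print(f"{name:<24} {count:>3} racks")
```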

Watts per TB

Our DR1530 is slightly worse than Netezza on power consumption per TB, but our DR200 is significantly better, so I'll score us even on this point.

Price per TB

The latest information we have is that Netezza is list priced at around $100k per TB, which puts them in the mid-range of comparable offerings. Depending on the data racks used, our list prices are between $8k and $50k per TB, which is clearly substantially cheaper than Netezza.
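For a sense of scale, the following sketch applies the list prices quoted above to a hypothetical 100TB system. The 100TB size is an assumption, the Netezza figure is the approximate one given in the text, and real deals are rarely done at list price.

```python
# List price per TB of user data in USD, as quoted in the post: (low, high).
LIST_PRICE_PER_TB = {
    "Netezza": (100_000, 100_000),   # "around $100k per TB"
    "DATAllegro": (8_000, 50_000),   # depends on the data racks used
}

def list_price_range(user_data_tb, vendor):
    """Return the (low, high) list price in USD for the given capacity."""
    low, high = LIST_PRICE_PER_TB[vendor]
    return user_data_tb * low, user_data_tb * high

SYSTEM_TB = 100  # assumed system size, for illustration only

for vendor in LIST_PRICE_PER_TB:
    low, high = list_price_range(SYSTEM_TB, vendor)
    print(f"{vendor:<12} ${low:,} to ${high:,}")
```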

Install Time

Netezza used to have a significant advantage over us in this area. However, our v3 product, which became available a little more than a year ago, closed the gap and can now be installed in just a few hours.

Physical Design

Physical database design is very straightforward in Netezza. The DBA simply has to decide which column to distribute each table on across the SPUs. As data is loaded, the Netezza system automatically populates zone maps, which are effectively statistics for each 3MB page in the table. The query optimizer makes use of the zone maps when deciding which pages to read in order to satisfy each query.
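The general idea behind zone maps can be sketched in a few lines of Python. This is a toy model, not Netezza's implementation: it assumes the per-page statistics are simply the minimum and maximum of one column, and it uses tiny in-memory lists in place of 3MB pages. All names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ZoneMapEntry:
    page_id: int
    min_value: int
    max_value: int

def build_zone_map(pages):
    """Record min/max of the column for each (non-empty) page as it is loaded."""
    return [ZoneMapEntry(i, min(p), max(p)) for i, p in enumerate(pages)]

def pages_to_read(zone_map, lo, hi):
    """Keep only pages whose [min, max] range overlaps the query range [lo, hi]."""
    return [z.page_id for z in zone_map if z.max_value >= lo and z.min_value <= hi]

# Toy "pages" of a single integer column (assumed data).
pages = [
    [1, 5, 9],
    [10, 12, 19],
    [20, 25, 29],
    [8, 11, 30],
]

zone_map = build_zone_map(pages)
selected = pages_to_read(zone_map, lo=10, hi=19)
print(selected)  # pages 0 and 2 are skipped without being read
```

Pruning of this kind works best when the data arrives roughly sorted on the filtered column, for example by load date. A page with a wide range, such as the last one here, still has to be read.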

There's no doubt that we lag behind Netezza a little in this area, although we are catching up fast and are already ahead of all of the other contenders. Like Netezza, we don’t generally require indexes (although they are available). Also like Netezza, the DBA must choose a distribution column to spread each table across the nodes. In addition, the DBA must decide how to set the multi-level partitioning up on our system. This is usually very straightforward and familiar to any experienced DBA. For example, in most installations, the fact tables will be date ranged and then hash partitioned by the foreign key of the largest dimension table. Other tables will typically just be hashed on their primary key. The process is simple and usually just takes a couple of hours. Changing the partitioning scheme is also generally very straightforward and very fast.
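The fact-table scheme described above (date range first, then hash on a dimension foreign key) might look something like the following sketch. It is only an illustration of the concept: the monthly granularity, the bucket count, the CRC32 hash and the `customer_id` key are all assumptions, not DATAllegro's actual partitioning code or syntax.

```python
import zlib
from datetime import date

HASH_BUCKETS = 8  # assumed number of hash partitions per date range

def month_partition(d: date) -> str:
    """First level: range partition by calendar month (assumed granularity)."""
    return f"{d.year:04d}-{d.month:02d}"

def hash_bucket(key: int, buckets: int = HASH_BUCKETS) -> int:
    """Second level: stable hash of the dimension foreign key."""
    return zlib.crc32(str(key).encode()) % buckets

def place_fact_row(sale_date: date, customer_id: int):
    """Return the (date range, hash bucket) cell a fact row would land in."""
    return month_partition(sale_date), hash_bucket(customer_id)

# Two sales for the same hypothetical customer in the same month share a cell,
# so a query on that customer and month only needs to read one partition.
print(place_fact_row(date(2008, 5, 1), customer_id=42))
print(place_fact_row(date(2008, 5, 19), customer_id=42))
print(place_fact_row(date(2008, 6, 2), customer_id=42))
```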

ETL & BI Compatibility

Both products are essentially the same in this area - i.e. they work with all mainstream BI and ETL tools.

Scalability

This is an area where DATAllegro has a big advantage. Whereas Netezza currently maxes out at 200TB, we already have production systems that exceed 400TB. In addition, our DR200 racks can be used to build data warehouses of more than 10PB of user data at very low cost.

Sequential Performance

Due to the way our compression code works, DATAllegro’s current products are optimized for performance under heavy concurrency. The end result is that we don't use the full power of the platform when running one query at a time. This can be a problem in proofs of concept against Netezza, since the first results people often look at are simple sequential query runs. There's no doubt that Netezza is very strong in this area. However, we don't feel this reflects real-world workloads.

Mixed Workload

In contrast to our performance under simple sequential query runs, our platform performs extremely well under a complex mixed workload with heavy concurrency.

There are several reasons for this:

  1. The compression code in our platform is optimized for use under concurrency.
  2. The InfiniBand backbone in our appliances uses minimal CPU power for data movements, compared with the GbE connections used in rival appliances such as those from Netezza. As a result, there's more CPU power left over for running queries.
  3. InfiniBand can also move data around 10 times faster than GbE, which is a huge advantage for some complex queries.
  4. Our sophisticated use of multi-level partitioning and clever workload management allows us to run a mixture of short queries and long queries very efficiently and with very consistent query times as seen by users.
  5. Netezza's FPGAs run out of silicon real estate at around 16-20 concurrent queries. The end result is less consistency in query run times as the system comes under load and starts queuing.

The end result is that we're consistently beating other platforms when running a complex workload. Even Teradata can't compete with us on this metric.

"One [differentiator where DATAllegro] does well is in situations of mixed workloads, where as well as queries there are concurrent loads and even updates happening to the database." Andy on Enterprise Software, "A Lively Data Warehouse Appliance," February 2008.

Load Speeds

We've experienced consistently faster load performance than Netezza in all recent POCs, especially in near-real-time scenarios.

Distributed DW

In most large enterprises, data warehousing is a distributed problem - business units need to create data marts that meet their own requirements and SLAs while fitting in with data governance and other requirements that cut across the entire enterprise. Until recently, building an effective, large scale, distributed DW to match the shape and needs of the business was impossible, due to the low speed of data movement between centralized hubs and business unit specific spokes.

Last year, DATAllegro introduced its grid technology that provides centralized metadata management for a collection of appliances, together with very high speed, parallel data movement between the appliances. The technology has already been deployed successfully in a number of very large-scale enterprise DW implementations.

No other DW vendor has an answer to this challenge.

"All together, the hub-and-spoke grid approach is a concept that puts DATAllegro on a different playing field than other database vendors. Rather than trying to build the single fastest database system, this approach focuses on building the most effective enterprise data management infrastructure, which is ultimately more important than the single fastest system." Tom Briggs, Full Table Scan, "Getting to Know DATAllegro, Part II," May 2008.

In DBMS Analytics

Netezza stirred up the market last year when it announced the availability of in-database analytics - effectively allowing third-party code to run on the SPUs, thereby taking advantage of the massively parallel architecture.

The amusing part of this is that user-defined functions (UDFs) have been available in most database products for years - even in MPP systems such as DATAllegro. Hence, Netezza's 'innovation' here was more a matter of marketing than of actual technology. Having said that, they have backed up their UDFs with some interesting tools and partnerships, so I'll give them a lead in this area.

Checking it out for yourself

The above analysis is obviously somewhat subjective and yes, I'll admit it, biased. However, there's an easy way to find out how the two products stack up against your requirements and that's to run a proof-of-concept using your own data and queries.

Adding DATAllegro to an existing POC adds very little effort to the overall process from the customer's perspective, since we do all of the work. Also, we've seen Netezza respond to competition from DATAllegro with some very heavy discounts, so you might save yourself a lot of money, no matter whom you choose in the end!

For materials related to concepts discussed in this blog entry, click the following links:

White Paper: Hub-and-Spoke: Getting the Data Warehouse Wheel Rolling
White Paper: Data Warehouse Appliances: The Benefits of an Open, Non-Proprietary Platform
White Paper: Using Grid Technology to Build a Hub-and-spoke Data Warehouse Architecture

May 23, 2008

Netezza's EMC deal

Netezza just announced an OEM deal with EMC whereby they will ship an EMC AX4 storage unit with their appliances. Some analysts are talking about this in similar terms to our relationship with EMC. However, as Curt Monash points out, the arrangement is completely different.

On the technical front, the EMC storage is just used as a staging area for loads, backups and exports. This is the same as our backup and landing zone nodes, so Netezza is really just playing catch-up. In the Netezza architecture, data is still stored on disks that are embedded in each SPU. As a result, they are not taking advantage of EMC's greater reliability and fault tolerance - unlike in DATAllegro’s architecture. In addition, Netezza remains a highly proprietary platform.

Commercially, we understand the relationship is also very different to ours. In our case, EMC reps get commission and quota credit for any EMC storage that we sell (as long as they 'tag' the account appropriately in EMC's CRM system). We don't believe that's the case with Netezza, so there's really no incentive for EMC reps to help them in accounts - especially given the much lower comparative value of the EMC storage in a Netezza appliance.

© 2008 DATAllegro Inc. All rights reserved. DATAllegro is a trademark of DATAllegro, Inc.
All other companies are the trademark or registered trademark of their respective owners.