DATA-Beat A DATAllegro Blog on Data Warehouse Appliances

From Terabytes to Petabytes

August 14, 2008

Lawsuit by Cary A. Jardin against DATAllegro and Stuart O. Frost

As some of you will have seen, a lawsuit was filed yesterday against DATAllegro and me (at least I think it's me, but I don't actually have a middle initial!).

After analyzing the claims, we feel strongly that they're completely without merit and intend to vigorously defend our position. Given the prior art in this area, we're also considering asking the Patent Office to re-examine Jardin's patent.

July 24, 2008

DW Market Consolidation begins - with DATAllegro!

By the time you read this blog, Microsoft will have announced the acquisition of DATAllegro at their financial analyst meeting in Seattle. Read press release here

On a personal level, this marks the successful end of a twenty-year journey building startup companies. My first startup went through a very successful IPO, but didn't end well for me emotionally or financially. After taking a bit of time off to lick my wounds, I found myself trying to find funding for a startup in the 2000-2002 period - which wasn't much fun! But then, just as the VC community started to recover from the Internet 'bubble' in 2003, I came up with the vision for DATAllegro. Since that time, we've raised just under $65m in venture capital and created a hugely successful exit for my investors, my great team and last, but not least, me!

One of the things that pushed me towards the data warehouse market as an attractive target for a startup was its sheer size (over $20bn for hardware and software) and its importance to the major software and systems vendors. Due to a lack of interest from VCs in the Nineties, the database market in general had been pretty stagnant for 10-15 years. However, this was starting to change with the advent of some pretty decent open source DBMSs that we could use as a starting point. When put together, I felt these characteristics made the market ripe for disruption - and ultimately a successful exit via acquisition by one of the major vendors.

Given these market dynamics, it made sense to ensure our architecture was flexible and portable to any of the major database engines out there. This made us a very attractive target for Microsoft.

I started out this blog entry by saying this represents a successful exit for me and my team here at DATAllegro. Actually, that's not quite true. Yes, it's very successful financially and extremely satisfying professionally, but it's not really an exit. I'm intrigued by the opportunity to put the Microsoft brand and access to the market behind the vision, technology and people of DATAllegro. I've also found the people at Microsoft to be very intelligent, thoughtful and great to work with. Over the last few years, it's been incredibly frustrating to have prospects tell us that we have the best technology, vision and people, but that they can't buy from a startup - I think that will change radically under the Microsoft brand! As a result, I'm starting to think that it could be a long term home for me. It will certainly make a nice change from having to raise VC money every few months!

As soon as the acquisition closes, we'll start the work of moving our technology from Ingres & Linux to SQL Server and Windows. Our feasibility studies over the last few months indicate that SQL Server is a significant improvement in terms of performance - especially in key areas such as star joins, I/O throughput and in-memory operations. The engineering team here at DATAllegro is VERY excited about the next version of the product. Curt Monash provides some insightful analysis on the move to SQL Server in his blog as does William McKnight.

It will be interesting to see the impact this acquisition has on the rest of the market. Again, Curt provides some interesting thoughts as does Philip Howard. My guess is that the other incumbents will scramble to respond to Microsoft's pre-emptive strike and that this could lead to a few of the other startups being acquired. The ones left out will find life very hard over the next few years. I'll comment on this area more once things settle down a little.

So far we've had a great response to the acquisition from analysts, customers and the people here at DATAllegro. It will be fascinating to see how all this plays out over the next few years.

May 23, 2008

Netezza's EMC deal

Netezza just announced an OEM deal with EMC whereby they will ship an EMC AX4 storage unit with their appliances. Some analysts are talking about this in similar terms to our relationship with EMC. However, as Curt Monash points out, the arrangement is completely different.

On the technical front, the EMC storage is just used as a staging area for loads, backups and exports. This is the same as our backup and landing zone nodes, so Netezza is really just playing catch-up. In the Netezza architecture, data is still stored on disks that are embedded in each SPU. As a result, they are not taking advantage of EMC's greater reliability and fault tolerance - unlike in DATAllegro’s architecture. In addition, Netezza remains a highly proprietary platform.

Commercially, we understand the relationship is also very different to ours. In our case, EMC reps get commission and quota credit for any EMC storage that we sell (as long as they 'tag' the account appropriately in EMC's CRM system). We don't believe that's the case with Netezza, so there's really no incentive for EMC reps to help them in accounts - especially given the much lower comparative value of the EMC storage in a Netezza appliance.

May 19, 2008

Netezza

Netezza was the first disruptive data warehouse appliance vendor to hit the market.

When we started out in 2003, we took the position of saying that we were very similar to Netezza. However, over the last few years, our strategies have diverged a little. I therefore think it might be useful to explain the differences between the two products in an effort to help potential customers make a more informed choice.

Key Differences between Netezza & DATAllegro

Over the course of the next few months, I'll be going into detail on the differences between DATAllegro and Netezza. To kick things off, here's a brief summary:

The above spider chart rates DATAllegro and Netezza against other products in the market. The outer edge of the chart is "best in class". You might think that I've been self-serving in choosing the categories, but this is my blog after all! Let's go through each category so I can explain my reasoning:

Non-Proprietary HW

Netezza's snippet processing units (SPUs) are entirely proprietary, since they consist of a general purpose CPU, an FPGA and a hard disk drive on a custom blade. Netezza has occasionally made the argument that the underlying components are standard, off-the-shelf parts. While that's undoubtedly true, it's also true for pretty much any other piece of proprietary hardware. The only pieces of non-proprietary hardware in a Netezza appliance are the head units (standard HP servers) and the GigE network switches. As a result, Netezza scores very low in this category.

In contrast, a DATAllegro appliance uses completely standard servers (from Dell or Bull), storage (from EMC) and networking (from Cisco and Qlogic). Hence we get a best in class score in this area.

"DATAllegro: This firm's open, hardware-independent approach to the data warehouse appliance is catching on." Intelligent Enterprise, "36 Companies to Watch," January 2008.

Non-Proprietary SW

Again, Netezza fares poorly in this category. While they started out by leveraging the Postgres open source database code, they've effectively rewritten most of the system at this point. The SPUs don't even run a mainstream OS such as Linux (although the head units do). Finally, Netezza’s software that runs on the head units is entirely proprietary.

Footprint

Netezza has basically one product line, with each rack holding 112 SPUs and 12.5TB of user data (with optional compression, this goes up to around 25TB).

DATAllegro takes a slightly different approach, with two different options in our range of data racks. Our DR1530 offers up to 30TB per rack and is therefore slightly better in terms of user data footprint than Netezza. We also offer a DR200 for very large-scale systems that offers 200TB of user data per rack. We're therefore better on this metric.
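As a rough worked example of what these per-rack figures mean in practice, the short Python sketch below counts the racks needed for a given amount of user data. The per-rack capacities are the ones quoted above; the 200TB target and the helper name `racks_needed` are assumptions chosen purely for illustration.

```python
import math

# User data per rack in TB, taken from the figures quoted in this post.
TB_PER_RACK = {
    "Netezza (uncompressed)": 12.5,
    "Netezza (compressed)": 25.0,
    "DATAllegro DR1530": 30.0,   # "up to" 30TB, so this is a best case
    "DATAllegro DR200": 200.0,
}


def racks_needed(user_data_tb: float, tb_per_rack: float) -> int:
    """Whole racks required to hold the given amount of user data."""
    return math.ceil(user_data_tb / tb_per_rack)


# Hypothetical 200TB warehouse, used only to make the comparison concrete.
for product, capacity in TB_PER_RACK.items():
    print(f"{product}: {racks_needed(200, capacity)} rack(s) for 200TB")
```

The calculation ignores everything except raw user data capacity, so it says nothing about head units, spares or floor layout.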

Watts per TB

Our DR1530 is slightly worse than Netezza on power consumption per TB, but our DR200 is significantly better, so I'll score us even on this point.

Price per TB

The latest information we have is that Netezza is list priced at around $100k per TB, which puts them in the mid-range of comparable offerings. Depending on the data racks used, our list prices are between $8k and $50k per TB, which is clearly substantially cheaper than Netezza.
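To put those list prices side by side, here is a small Python calculation. The per-TB prices are the approximate figures quoted above; the 50TB system size and the function name `list_price_usd` are assumptions for illustration, and real quotes will of course differ from list.

```python
# Approximate list prices per TB of user data, as quoted in this post.
NETEZZA_USD_PER_TB = 100_000
DATALLEGRO_LOW_USD_PER_TB = 8_000
DATALLEGRO_HIGH_USD_PER_TB = 50_000


def list_price_usd(user_data_tb: int, usd_per_tb: int) -> int:
    """List price for a system of the given size, ignoring any discounting."""
    return user_data_tb * usd_per_tb


size_tb = 50  # hypothetical system size
print("Netezza:        ", list_price_usd(size_tb, NETEZZA_USD_PER_TB))
print("DATAllegro low: ", list_price_usd(size_tb, DATALLEGRO_LOW_USD_PER_TB))
print("DATAllegro high:", list_price_usd(size_tb, DATALLEGRO_HIGH_USD_PER_TB))
```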

Install Time

Netezza used to have a significant advantage over us in this area. However, our v3 product, which became available a little more than a year ago, closed the gap and can now be installed in just a few hours.

Physical Design

Physical database design is very straightforward in Netezza. The DBA simply has to decide which column to distribute each table on across the SPUs. As data is loaded, the Netezza system automatically populates zone maps, which are effectively statistics for each 3MB page in the table. The query optimizer makes use of the zone maps when deciding which pages to read in order to satisfy each query.
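To show how per-page statistics can let a query skip pages, here is a minimal Python sketch. It assumes the statistics are simply a minimum and maximum value per page, which is not spelled out above and may not match Netezza's actual implementation. The names `ZoneEntry`, `build_zone_map` and `pages_to_read` are hypothetical, and a page here holds three values rather than 3MB.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ZoneEntry:
    page_no: int
    min_val: int
    max_val: int


def build_zone_map(values: Sequence[int], page_size: int) -> List[ZoneEntry]:
    """Record the min and max of each fixed-size page as data is loaded."""
    entries = []
    for start in range(0, len(values), page_size):
        page = values[start:start + page_size]
        entries.append(ZoneEntry(start // page_size, min(page), max(page)))
    return entries


def pages_to_read(zone_map: List[ZoneEntry], low: int, high: int) -> List[int]:
    """Pages whose [min, max] range overlaps the predicate range [low, high]."""
    return [e.page_no for e in zone_map if e.max_val >= low and e.min_val <= high]


# Dates loaded roughly in order, as is common for a fact table (made-up data).
order_dates = [20080101, 20080102, 20080103,
               20080201, 20080202, 20080203,
               20080301, 20080302, 20080303]
zone_map = build_zone_map(order_dates, page_size=3)

# A query for February only needs the middle page; the other two are skipped.
print(pages_to_read(zone_map, 20080201, 20080229))
```

The benefit depends on the data being loaded in an order that correlates with the filtered column; with randomly ordered values, most pages would overlap most predicates.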

There's no doubt that we lag behind Netezza a little in this area, although we are catching up fast and are already ahead of all of the other contenders. Like Netezza, we don’t generally require indexes (although they are available). Also like Netezza, the DBA must choose a distribution column to spread each table across the nodes. In addition, the DBA must decide how to set the multi-level partitioning up on our system. This is usually very straightforward and familiar to any experienced DBA. For example, in most installations, the fact tables will be date ranged and then hash partitioned by the foreign key of the largest dimension table. Other tables will typically just be hashed on their primary key. The process is simple and usually just takes a couple of hours. Changing the partitioning scheme is also generally very straightforward and very fast.
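The fact table layout described above (date ranged, then hash partitioned on a foreign key) can be sketched in a few lines of Python. This is an illustration of the general technique only, not DATAllegro's code: the monthly range granularity, the CRC32 hash, the bucket count and all function names are assumptions.

```python
import zlib
from datetime import date
from typing import List, Tuple

Partition = Tuple[str, int]  # (month range, hash bucket)


def range_partition(sale_date: date) -> str:
    """First level: one range partition per calendar month (assumed granularity)."""
    return f"{sale_date.year:04d}-{sale_date.month:02d}"


def hash_partition(foreign_key: int, buckets: int) -> int:
    """Second level: stable hash of the foreign key into a fixed number of buckets."""
    return zlib.crc32(str(foreign_key).encode("ascii")) % buckets


def locate(sale_date: date, customer_id: int, buckets: int = 4) -> Partition:
    """Partition that a fact row lands in."""
    return range_partition(sale_date), hash_partition(customer_id, buckets)


def partitions_for_month(months: List[str], month: str, buckets: int) -> List[Partition]:
    """Partitions a query restricted to one month must scan; other months are skipped."""
    return [(m, b) for m in months if m == month for b in range(buckets)]


months = ["2008-03", "2008-04", "2008-05"]
print(locate(date(2008, 5, 19), customer_id=1234))
scanned = partitions_for_month(months, "2008-05", buckets=4)
print(len(scanned), "of", len(months) * 4, "partitions scanned")
```

Hashing on the foreign key of the largest dimension means rows that join to the same dimension key always land in the same bucket, which is what makes that join cheap.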

ETL & BI Compatibility

Both products are essentially the same in this area - i.e. they work with all mainstream BI and ETL tools.

Scalability

This is an area where DATAllegro has a big advantage. Whereas Netezza currently maxes out at 200TB, we already have production systems that exceed 400TB. In addition, our DR200 racks can be used to build data warehouses of more than 10PB of user data at very low cost.

Sequential Performance

Due to the way our compression code works, DATAllegro’s current products are optimized for performance under heavy concurrency. The end result is that we don't use the full power of the platform when running one query at a time. This can be a problem in proofs of concept against Netezza, since the first results people often look at are simple sequential query runs. There's no doubt that Netezza is very strong in this area. However, we don't feel this reflects real-world workloads.

Mixed Workload

In contrast to our performance under simple sequential query runs, our platform performs extremely well under a complex mixed workload with heavy concurrency.

There are several reasons for this:

  1. The compression code in our platform is optimized for use under concurrency.
  2. The InfiniBand backbone in our appliances uses minimal CPU power for data movements, compared with the GbE connections used in rival appliances such as those from Netezza. As a result, there's more CPU power left over for running queries.
  3. InfiniBand can also move data around 10 times faster than GbE, which is a huge advantage for some complex queries.
  4. Our sophisticated use of multi-level partitioning and clever workload management allows us to run a mixture of short queries and long queries very efficiently and with very consistent query times as seen by users.
  5. Netezza's FPGAs run out of silicon real estate at around 16-20 concurrent queries. The end result is less consistency in query run times as the system comes under load and starts queuing.

The end result is that we're consistently beating other platforms when running a complex workload. Even Teradata can't compete with us on this metric.
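To give a feel for what a tenfold difference in interconnect speed means for a query that has to redistribute data between nodes, here is a back-of-the-envelope Python calculation. The link rates are assumptions (nominal 1 Gbit/s for GbE and ten times that for InfiniBand, following the estimate in the list above), the 100GB data volume is made up, and protocol overhead and parallel links are ignored.

```python
GBE_GBIT_PER_S = 1.0          # nominal GbE line rate
INFINIBAND_GBIT_PER_S = 10.0  # assumption: 10x GbE, per the estimate above


def transfer_seconds(data_gb: float, link_gbit_per_s: float) -> float:
    """Time to move data_gb gigabytes over one link, ignoring protocol overhead."""
    return data_gb * 8 / link_gbit_per_s


data_gb = 100  # hypothetical amount of data redistributed by one complex join
print("GbE:       ", transfer_seconds(data_gb, GBE_GBIT_PER_S), "seconds")
print("InfiniBand:", transfer_seconds(data_gb, INFINIBAND_GBIT_PER_S), "seconds")
```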

"One [differentiator where DATAllegro] does well is in situations of mixed workloads, where as well as queries there are concurrent loads and even updates happening to the database." Andy on Enterprise Software, "A Lively Data Warehouse Appliance," February 2008.

Load Speeds

We've experienced consistently faster load performance than Netezza in all recent POCs, especially in near-real-time scenarios.

Distributed DW

In most large enterprises, data warehousing is a distributed problem - business units need to create data marts that meet their own requirements and SLAs while fitting in with data governance and other requirements that cut across the entire enterprise. Until recently, building an effective, large scale, distributed DW to match the shape and needs of the business was impossible, due to the low speed of data movement between centralized hubs and business unit specific spokes.

Last year, DATAllegro introduced its grid technology that provides centralized metadata management for a collection of appliances, together with very high speed, parallel data movement between the appliances. The technology has already been deployed successfully in a number of very large-scale enterprise DW implementations.

No other DW vendor has an answer to this challenge.

"All together, the hub-and-spoke grid approach is a concept that puts DATAllegro on a different playing field than other database vendors. Rather than trying to build the single fastest database system, this approach focuses on building the most effective enterprise data management infrastructure, which is ultimately more important than the single fastest system." Tom Briggs, Full Table Scan, "Getting to Know DATAllegro, Part II," May 2008.

In-DBMS Analytics

Netezza stirred up the market last year when it announced the availability of in-database analytics - effectively allowing third-party code to run on the SPUs, thereby taking advantage of the massively parallel architecture.

The amusing part of this is that user defined functions (UDFs) have been available in most database products for years - even in MPP systems such as DATAllegro. Hence, Netezza's 'innovation' in this area was more in the area of marketing than actual technology. Having said that, they have backed up their UDFs with some interesting tools and partnerships, so I'll give them a lead in this area.

Checking it out for yourself

The above analysis is obviously somewhat subjective and yes, I'll admit it, biased. However, there's an easy way to find out how the two products stack up against your requirements and that's to run a proof-of-concept using your own data and queries.

Adding DATAllegro to an existing POC adds very little effort to the overall process from the customer's perspective, since we do all of the work. Also, we've seen Netezza respond to competition from DATAllegro with some very heavy discounts, so you might save yourself a lot of money, no matter whom you choose in the end!

For materials related to concepts discussed in this blog entry, click the following links:

White Paper: Hub-and-Spoke: Getting the Data Warehouse Wheel Rolling
White Paper: Data Warehouse Appliances: The Benefits of an Open, Non-Proprietary Platform
White Paper: Using Grid Technology to Build a Hub-and-spoke Data Warehouse Architecture

May 1, 2008

Columns & Rows

Over the last couple of years, there's been a lot of talk about the advantages of column-oriented databases for data warehousing, with Michael Stonebraker from Vertica being particularly vociferous and bold in his claims that row-based databases are going to be completely replaced. As a vendor of a row-based database, I obviously have a vested interest in refuting his claims, but I'm going to try my best to be even-handed in discussing this issue.

The Claims
If you were new to database technology and read some of Stonebraker's articles, you might be forgiven for thinking that column-oriented databases were a completely new invention and were set to sweep row-oriented databases from the data warehousing market.

He claims that column-oriented databases are 10-50x faster than traditional row-oriented systems and offer significantly higher compression ratios, thereby bringing down the cost. Benchmarks against Oracle are usually put forward to back up these claims.

The Reality
The fact is that column-oriented databases have been around for some time. In the data warehousing market, long-established (but not very successful) examples include Sybase IQ and Sand.

There are some advantages of column-orientation for DW workloads. For example, data compresses slightly better when stored in columns (DATAllegro compresses between 2:1 and 6:1 depending on the content of the rows, whereas column-oriented systems claim 4:1 to 10:1). Also, some queries (i.e. those that only access a few columns) will perform better.

However, in most real-world implementations, these advantages don't make a great deal of difference.

At the end of the day, column orientation is just one approach to limiting the amount of data read for a given query. In effect, it's an extreme form of vertical partitioning of the data. In modern row-oriented systems such as DATAllegro, we use sophisticated horizontal partitioning to limit the number of rows read for each query. We're also working on clever usage of materialized view technology to limit the number of columns we need to read. The end result is very similar performance to that claimed by Stonebraker, i.e. 10 to 50x that of traditional databases such as Oracle.
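The point that vertical and horizontal partitioning are two routes to the same goal can be made concrete with a simple bytes-read model in Python. Everything here is an assumption for illustration: fixed-width values, evenly sized partitions, no compression, and a made-up table of 12 million rows by 50 columns. The function names are hypothetical.

```python
def full_scan_bytes(rows: int, columns: int, value_bytes: int) -> int:
    """Traditional row store with no partition elimination: read everything."""
    return rows * columns * value_bytes


def column_store_bytes(rows: int, columns_read: int, value_bytes: int) -> int:
    """Column store: read every row, but only the columns the query touches."""
    return rows * columns_read * value_bytes


def partitioned_row_store_bytes(rows: int, columns: int, value_bytes: int,
                                partitions_total: int, partitions_read: int) -> int:
    """Row store with horizontal partitioning: read every column, but only
    the rows in the partitions the query cannot rule out."""
    rows_read = rows * partitions_read // partitions_total
    return rows_read * columns * value_bytes


ROWS, COLUMNS, VALUE_BYTES = 12_000_000, 50, 8

baseline = full_scan_bytes(ROWS, COLUMNS, VALUE_BYTES)
columnar = column_store_bytes(ROWS, columns_read=5, value_bytes=VALUE_BYTES)
partitioned = partitioned_row_store_bytes(ROWS, COLUMNS, VALUE_BYTES,
                                          partitions_total=12, partitions_read=1)

print("full scan:            ", baseline)
print("column store, 5 cols: ", columnar, f"({baseline // columnar}x less)")
print("row store, 1 of 12:   ", partitioned, f"({baseline // partitioned}x less)")
```

Which approach reads less depends entirely on the query: a narrow query over all dates favors the column store, while a wide query over one month favors the partitioned row store.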

April 10, 2008

Who I am and why I'm here.

My name is Stuart Frost. I founded DATAllegro in 2003 and I've been the CEO of the company from the beginning.

As CEOs go, I'm pretty technical and still get heavily involved in specifying the architecture of the product, although I haven't written any of the DATAllegro code (much to the relief of the engineering team).

I have a degree in electronic engineering and started my career as a programmer in the telecoms and defense industries back in England, writing low-level code for such things as phone exchanges and sonar and radar systems. While I didn't know it at the time, I guess this mix of software and hardware was an ideal grounding for what I do now - leading an appliance vendor.

I started my first company, SELECT Software Tools, in 1988 and ran it as CEO & Founder for 10 years, through several rounds of funding and a Nasdaq IPO that brought me to the US. The VC that backed SELECT made a 26x return. After leaving that company in 1998, I took a couple of years off and missed most of the Internet boom. Great timing!

By late 2002, I was looking for my next startup idea. While at SELECT, I'd been involved in several large database design projects (SELECT was a software design tools company), so I started studying the DBMS market to see if there were any disruptive opportunities and quickly started focusing on the data warehousing sector.

The database market in general was a no-go area for VCs through the 1990s. After all, Oracle had won, hadn't they? This started to change with the introduction of a couple of strong open source databases (MySQL and Postgres) and accelerated when Netezza attacked the data warehousing market.

Netezza came to market with an interesting business model and value proposition:

  1. It leveraged an open source DBMS (Postgres) to reduce engineering costs and time to market.
  2. It used an appliance business model to create a tightly integrated software and hardware stack, thereby removing a significant area of complexity for DBAs and system admin staff.
  3. It shifted to sequential I/O from the more typical random I/O generated by the incumbents. This allowed the use of much larger and cheaper SATA disk drives and led to a highly competitive price/performance ratio.

However, there is a significant flaw in Netezza's strategy - in achieving #3, they created a highly proprietary hardware platform and, effectively, a proprietary software platform (with little of Postgres remaining).

Netezza secured its first few customers around the time DATAllegro was being founded. Looking at the Netezza architecture, I realized that there was an opportunity to create a similar value proposition while using a completely non-proprietary platform. Hence, my vision was to create a massively parallel DW appliance with an embedded, off-the-shelf open source DBMS (Ingres) running on Linux and using completely standard servers, networking and storage from major vendors.

DATAllegro

Almost five years after starting DATAllegro, I'm very pleased to see that my vision has become a reality. We now have a highly competitive DW appliance that uses an array of Dell servers (or Bull servers in Continental Europe), Cisco networking and EMC storage.

[Figure: v3 node architecture]

Each server runs a highly tuned copy of the Ingres DBMS on SUSE Linux. Our proprietary software turns these separate databases into a massively parallel, shared-nothing database system that offers incredibly good performance, especially under complex mixed workloads.
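For readers unfamiliar with the shared-nothing idea, the Python sketch below shows the general pattern: each node owns a slice of the rows, runs the same aggregate on its own slice, and a coordinator merges the partial results. It is a single-process simulation using threads, not a description of DATAllegro's software; the row layout, the modulo distribution and all function names are assumptions.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Row = Tuple[int, str, float]  # (customer_id, region, amount)


def distribute(rows: List[Row], node_count: int) -> List[List[Row]]:
    """Spread rows across nodes by the distribution column (customer_id)."""
    nodes: List[List[Row]] = [[] for _ in range(node_count)]
    for row in rows:
        nodes[row[0] % node_count].append(row)
    return nodes


def local_sum_by_region(node_rows: List[Row]) -> Dict[str, float]:
    """Work done independently on one node: aggregate only the rows it owns."""
    totals: Dict[str, float] = defaultdict(float)
    for _, region, amount in node_rows:
        totals[region] += amount
    return dict(totals)


def parallel_sum_by_region(nodes: List[List[Row]]) -> Dict[str, float]:
    """Coordinator: run the same aggregate on every node, then merge the partials."""
    merged: Dict[str, float] = defaultdict(float)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        for partial in pool.map(local_sum_by_region, nodes):
            for region, total in partial.items():
                merged[region] += total
    return dict(merged)


rows = [(1, "east", 10.0), (2, "west", 5.0), (3, "east", 2.5),
        (4, "west", 1.0), (5, "east", 4.0)]
nodes = distribute(rows, node_count=3)
print(parallel_sum_by_region(nodes))
```

Because no node ever needs another node's rows for this query, adding nodes shrinks each slice and the per-node work with it. Queries that join on a column other than the distribution column are harder, since rows have to move between nodes first.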

The appliance model is key to getting great performance. Tuning a large database using traditional approaches is extremely difficult and requires highly skilled DBAs. One of the main problems is the difficulty of understanding and tuning the interface between the DBMS software and the underlying OS and hardware platform. Database vendors such as Oracle and Microsoft have to build their software to run on any hardware. Hence there is a plethora of tuning parameters and options for the DBA and sys admins to set up. In the appliance model, we have the luxury of controlling the entire software and hardware stack from SQL to storage. As a result, we can hide all of the complexity.

Another very important aspect of performance is ensuring sequential reads under a complex workload. Traditional databases do not do a good job in this area - even though some of the management tools might tell you that they are! What we typically see is that the combination of RAID arrays and intervening storage infrastructure conspires to break even large reads by the database into very small reads against each disk. The end result is that most large DW installations have very large arrays of expensive, high-speed disks behind them - and still suffer from poor performance.

Through a lot of trial and error, smart engineering and code changes to the database engine, we've been able to create a platform that sustains sequential reads - even under very high levels of concurrency. This allows us to use relatively low-cost, high-capacity SATA disk drives and therefore to provide a very high price/performance ratio.
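A simple model shows why the size of each physical read matters so much on a spinning disk. In the Python sketch below, every read pays one head-positioning delay and then transfers data at the drive's sequential rate. The 8ms positioning time and 60MB/s transfer rate are assumed round numbers for a SATA drive, not measurements from any DATAllegro system, and the function name is hypothetical.

```python
def effective_mb_per_s(read_size_kb: float,
                       positioning_ms: float = 8.0,
                       sequential_mb_per_s: float = 60.0) -> float:
    """Throughput when every read pays one positioning delay plus transfer time."""
    read_mb = read_size_kb / 1024
    seconds_per_read = positioning_ms / 1000 + read_mb / sequential_mb_per_s
    return read_mb / seconds_per_read


# Small reads are dominated by positioning time; large reads approach the
# drive's sequential rate.
for size_kb in (8, 64, 1024, 16384):
    print(f"{size_kb:>6} KB reads: {effective_mb_per_s(size_kb):.1f} MB/s")
```

Under these assumptions, a disk serving reads of a few kilobytes delivers only a small fraction of its sequential bandwidth, which is consistent with the observation above that large installations end up with very large arrays of fast disks and still perform poorly.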

Exciting Times

It's an exciting time to be involved in the data warehousing market. It's rare to see a $30bn market go through such a rapid transition, with a few powerful incumbents under attack from several fast-moving, innovative disruptors.

In my next few blog entries, I'll be talking about the various players in the market and how I think they fit in and stack up. Don't worry, it won't be the usual self-serving PR blog - I'll be honest and straightforward about how I see the strengths and weaknesses of the various players, including DATAllegro.

© 2008 DATAllegro Inc. All rights reserved. DATAllegro is a trademark of DATAllegro, Inc.
All other companies are the trademark or registered trademark of their respective owners.