tags:

views: 400

answers: 8

We have a 300 GB+ data array we'd like to query as fast as possible. Traditional SQL databases (specifically, SQL Server) cannot handle this volume as effectively as we need (for example, a SELECT with 10-20 conditions in the WHERE clause should complete in under 10 seconds), so I'm investigating other solutions to this problem.

I've been reading about NoSQL and this whole thing looks promising, but I'd prefer to hear from those who have used it in real life.

What can you suggest here?

EDIT to clarify what we're after.

We're a company developing an app through which users can search for tours, book them, and pay for them with their bank cards. This whole thing may well be Russia-specific, so bear with me.

When a user logs on to the site, she is presented with a form similar to this:

[screenshot of the tour search form]

Here, the user selects where she leaves from and where she is going, the dates, the duration, and so on.

After hitting "Search", a request goes to our DB server, which cannot handle such a load: the queries include many different combinations of parameters. Sharding doesn't work well here either.

So what I'm after is some kind of pseudo-database that can do lightning-fast queries.
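
To give a sense of the shape of these queries, here is a hypothetical sketch (the dbo.Tours table and every column name are invented for illustration; the real schema differs):

    -- Hypothetical example: an invented dbo.Tours table and columns,
    -- only to illustrate a search with many WHERE conditions.
    SELECT TOP (50)
           t.TourId, t.HotelName, t.DepartureDate, t.NightsCount, t.Price
    FROM   dbo.Tours AS t
    WHERE  t.DepartureCity      = 'Moscow'
      AND  t.DestinationCountry = 'Turkey'
      AND  t.DepartureDate BETWEEN '20100701' AND '20100715'
      AND  t.NightsCount   BETWEEN 7 AND 10
      AND  t.AdultsCount   = 2
      AND  t.ChildrenCount = 0
      AND  t.MealPlan IN ('AI', 'HB')
      AND  t.Price <= 50000
    ORDER BY t.Price;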

+10  A: 

I'm not sure I would agree that traditional SQL databases cannot handle these volumes. I can query through much larger datasets within those timeframes, but the system has been designed specifically to handle that kind of work and placed on suitable hardware, in particular an I/O subsystem designed to handle large data requests.

Andrew
+1  A: 

That really depends on what clauses you have in your WHERE and what kind of projection you need on your data.

It might be good enough to create the appropriate index on your table.

Also, even an optimal data structure is of no use if you have to read 100 GB per query, as that will take its time too.
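
For example, a covering index over the most commonly filtered columns can let such a query be answered from the index alone; a minimal sketch, assuming a hypothetical dbo.Tours table:

    -- Hypothetical example: index the columns most queries filter on, and
    -- INCLUDE the columns they return, so the base table is never touched.
    CREATE NONCLUSTERED INDEX IX_Tours_Search
    ON dbo.Tours (DepartureCity, DestinationCountry, DepartureDate)
    INCLUDE (NightsCount, Price, HotelName);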

David Schmitt
+1  A: 

NoSQL, as you may have read, is not a relational database.

It is a database that stores key-value pairs, which you can traverse using a proprietary API.

This implies you will need to define the physical layout of the data yourself, as well as do any code optimizations.

I'm quite outdated on this, but several years ago I participated in a BerkeleyDB project dealing with a somewhat smaller but still large volume of data (about 100 GB).

It was perfectly OK for our needs.

Please also note, though it may seem obvious to you, that the queries can be optimized. Could you please post the query you use here?

Quassnoi
+8  A: 

If you want to do ad-hoc queries for reporting or analysis you're probably better off with something that will play nicely with off-the-shelf reporting tools. Otherwise you are likely to find yourself getting dragged off all the time to write little report programs to query the data. This is a strike against NoSQL type databases, but it may or may not be an issue depending on your circumstances.

300GB should not be beyond the capabilities of modern RDBMS platforms, even MS SQL Server. Some other options for large database queries of this type are:

  • See if you can use an SSAS cube and aggregations to mitigate your query performance issues. Usage-based optimisation might get you adequate performance without having to get another database system. SSAS can also be used in shared-nothing configurations, allowing you to stripe your queries across a cluster of relatively cheap servers with direct-attach disks. Look at ProClarity for a front-end if you do go this way.

  • Sybase IQ is an RDBMS platform that uses an underlying data structure optimised for reporting queries. It has the advantage that it plays nicely with a reasonable variety of conventional reporting tools. Several other systems of this type exist, such as Red Brick, Teradata or Greenplum (which uses a modified version of PostgreSQL). The principal strike against these systems is that they are not exactly mass-market items and can be quite expensive.

  • Microsoft has a shared-nothing version of SQL Server in the pipeline, which you might be able to use. However, they've tied it to third-party hardware manufacturers, so you can only get it with dedicated (and therefore expensive) hardware.

  • Look for opportunities to build data marts with aggregated data to reduce the volumes for some of the queries (see the sketch after this list).

  • Look at tuning your hardware. Direct-attach SAS arrays and RAID controllers can push through streaming I/O of the sort used in table scans very quickly. If you partition your tables over a large number of mirrored pairs you can get very fast streaming performance - easily capable of saturating the SAS channels.

    Practically, you're looking at getting 10-20 GB/sec from your I/O subsystem if you want the performance targets you describe, and it is certainly possible to do this without resorting to really exotic hardware.
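
For the data-mart point above, a minimal sketch of a pre-aggregated summary table (all table and column names are hypothetical):

    -- Hypothetical example: pre-aggregate the large fact table by route and
    -- departure date so common queries scan this small table instead.
    -- Rebuild or incrementally refresh it on a schedule (e.g. nightly).
    SELECT DepartureCity,
           DestinationCountry,
           DepartureDate,
           COUNT(*)   AS ToursCount,
           MIN(Price) AS MinPrice,
           AVG(Price) AS AvgPrice
    INTO   dbo.TourSummary
    FROM   dbo.Tours
    GROUP BY DepartureCity, DestinationCountry, DepartureDate;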

ConcernedOfTunbridgeWells
+1  A: 

From what little I understand, traditional RDBMSs are row-based, which optimizes for insertion speed. But retrieval-speed optimization is best achieved with a column-based storage system.

See Column-oriented DBMS for a more thorough explanation than I could give.
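
As a side note (and only if newer software is an option): SQL Server 2012 and later expose this idea as columnstore indexes. A minimal sketch, assuming a hypothetical dbo.Tours table:

    -- Assumes SQL Server 2012+ and a hypothetical dbo.Tours table.
    -- A columnstore index stores each column separately and compressed,
    -- which speeds up scans that read a few columns of many rows.
    -- Note: in SQL Server 2012 a nonclustered columnstore index makes the
    -- table read-only until the index is disabled or dropped.
    CREATE NONCLUSTERED COLUMNSTORE INDEX IX_Tours_ColumnStore
    ON dbo.Tours (DepartureCity, DestinationCountry, DepartureDate,
                  NightsCount, Price);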

Peter M
+4  A: 

A properly set up SQL Server should be able to handle data in the terabytes without having performance problems. I have several friends who manage SQL Server databases that size with no performance issues.

Your problem may be one or more of these:

  • Inadequate server specs
  • Lack of good partitioning (see the sketch after this list)
  • Poor indexing
  • Poor database design
  • Poor query design, including using tools like LINQ, which may generate poorly performing queries for a database that size.
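
On the partitioning point, a minimal sketch of date-based table partitioning in SQL Server (an Enterprise-edition feature; all names and boundary values here are hypothetical):

    -- Hypothetical example: partition a large tours table by departure date
    -- so queries filtered on date only touch the relevant partitions.
    CREATE PARTITION FUNCTION pfTourDate (date)
    AS RANGE RIGHT FOR VALUES ('20100101', '20100701', '20110101');

    CREATE PARTITION SCHEME psTourDate
    AS PARTITION pfTourDate ALL TO ([PRIMARY]);

    CREATE TABLE dbo.ToursPartitioned
    (
        TourId             int          NOT NULL,
        DepartureCity      nvarchar(50) NOT NULL,
        DestinationCountry nvarchar(50) NOT NULL,
        DepartureDate      date         NOT NULL,
        Price              money        NOT NULL
    ) ON psTourDate (DepartureDate);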

The problem assuredly is NOT the ability of SQL Server to handle these loads. If you have a database that size, you need to hire a professional DBA with experience in optimizing large systems.

HLGEM
+1 Definitely needs the staff / personnel to deal with the high end.
Andrew
+2  A: 

I expect a "conventional" database can do what you want, provided you structure your data appropriately for the queries you're doing.

You may find that, in order to generate reports at a respectable speed, you need to summarise your data as it is generated (or loaded, transformed, etc.) and report off the summary data.

The speed of a SELECT is not (in most cases) directly related to the number of conditions in the WHERE clause; it has much more to do with the execution plan and the number of rows examined. There are tools which will analyse this for you.
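
In SQL Server, for instance, you can see how much work a query actually does before deciding what to fix; a minimal sketch (the query itself is hypothetical):

    -- Report logical/physical reads and CPU/elapsed time for each statement.
    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Hypothetical query; in Management Studio, "Include Actual Execution Plan"
    -- shows the plan (index seeks vs. scans) alongside these statistics.
    SELECT COUNT(*)
    FROM   dbo.Tours
    WHERE  DepartureCity = 'Moscow'
      AND  DepartureDate >= '20100701';

    SET STATISTICS IO OFF;
    SET STATISTICS TIME OFF;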

Ultimately, at 300 GB (which is not THAT big) you will probably need to keep some of your data on disk (= slow) at least some of the time, so you want to start reducing the number of I/O operations required. Reducing I/O operations may mean building covering indexes, summary tables, and copies of the data with different clustered indexes. This makes your 300 GB bigger, but who cares.

IO ops are king :)

Clearly doing these things is very expensive in terms of developer time, so you should start by throwing lots of hardware at the problem, and only try to fix it with software once that becomes insufficient. Lots of RAM is a start (but it won't be able to hold more than 10-20% of your data set at current cost-effective levels). Even SSDs are not that expensive these days.

MarkR
A: 

You should try Kickfire, which is a hardware-optimised version of MySQL and will breeze through your query.

http://www.kickfire.com/

PS: I do NOT work for Kickfire.

Zubair