I am working on a project that must store very large datasets and associated reference data. I have never come across a project that required tables quite this large. I have proved that at least one development environment cannot cope at the database tier with the processing required by the complex queries against views that the application layer generates (views with multiple inner and outer joins, grouping, summing and averaging against tables with 90 million rows).

The RDBMS that I have tested against is DB2 on AIX. The dev environment that failed was loaded with 1/20th of the volume that will be processed in production. I am assured that the production hardware is superior to the dev and staging hardware but I just don't believe that it will cope with the sheer volume of data and complexity of queries.

Before the dev environment failed, it was taking in excess of 5 minutes to return a small dataset (several hundred rows) that was produced by a complex query (many joins, lots of grouping, summing and averaging) against the large tables.

My gut feeling is that the db architecture must change so that the aggregations currently provided by the views are performed as part of an off-peak batch process.

Now for my question. I am assured by people who claim to have experience of this sort of thing (which I do not) that my fears are unfounded. Are they? Can a modern RDBMS (SQL Server 2008, Oracle, DB2) cope with the volume and complexity I have described (given an appropriate amount of hardware) or are we in the realm of technologies like Google's BigTable?

I'm hoping for answers from folks who have actually had to work with this sort of volume at a non-theoretical level.

The nature of the data is financial transactions (dates, amounts, geographical locations, businesses) so almost all data types are represented. All the reference data is normalised, hence the multiple joins.

+1  A: 

If that is only 1/20th of your data, you almost surely need to look into more scalable and efficient solutions, such as Google's BigTable. Have a look at NoSQL.

I personally think that MongoDB is an awesome in-between of NoSQL and RDBMS. It isn't relational, but it provides a lot more features than a simple document store.

Earlz
NoSQL isn't more "scalable" than RDBMSes, no matter what the Digg loonies say.
Aaronaught
"many joins, lots of grouping, summing and averaging" <-- Hmm.. I don't think NoSQL solutions provide such features.
Billy ONeal
@Billy, well, relational designs of course will not work in a non-relational database. That's like planning to tackle a big Java project while saying "I'm going to write this line for line like this COBOL reference implementation." It's of course not going to work. He would need a redesign of how the data is stored and how he accesses it.
Earlz
@Earlz: Yes. I just wanted to point out that sure, NoSQL can give you higher performance than SQL can. But if you need the kind of reporting features described here, then you are going to do nothing but shift the bottleneck from the SQL server into your application, which is no help at all. There is no silver bullet. NoSQL is excellent for websites like Digg which don't capitalize on SQL's reporting power, perhaps 80-90% of websites. But for the application the OP needs, NoSQL would be a mistake.
Billy ONeal
@Billy Most of the time an application can be more easily scaled than a database can be, though. For instance, with a web application you just add another server to handle requests; it's trivial to scale.
Earlz
@Earlz: For 80-90% of websites, I agree. But NoSQL would still be a mistake for the OP's application as described.
Billy ONeal
"For instance, a web application, you just add another server to handle requests" -- and where does this server get its data from? There still needs to be a way for all of these servers to share data and possibly state. If you're going to claim that "NoSQL" - which is nothing more than an umbrella term for a disparate array of mostly open-source non-relational databases - is somehow easier to scale than a very expensive, high-end DBMS (DB2), then you'd best have some evidence to back that up. Most RDBMSes can easily be scaled up by simply throwing more hardware at the problem.
Aaronaught
@Aar, if the web application needs to have so much state that you can't just use a table in the database for it, then you're doing it wrong. Also, if you can find comprehensive benchmarks between the two, I'd like to see them. I'm having trouble finding anything proving or disproving either claim.
Earlz
@Earlz: Exactly. You need a database to store the state/data. Adding more web servers does not change that. NoSQL only becomes an interesting solution in terms of scalability when scaling out by adding more servers becomes cheaper and easier than scaling up by adding more memory/CPUs/disks. That might be true for Google and Digg but they are the exception, not the rule. As for benchmarks, you're right, there aren't any, which is exactly why you should be mistrustful of claims of superior scalability.
Aaronaught
And with respect to "relational designs of course will not work in a non-relational database" - building a major business app without a relational database is not a trivial endeavour. It's vaguely like building a massive data warehouse; you have to manually implement every single query and aggregate.
Aaronaught
+1  A: 

In dimensional (Kimball methodology) models in our data warehouse on SQL Server 2005, we regularly have fact tables with that many rows just in a single month partition.

Some things are instant and some things take a while; it depends on the operation, how many stars are being combined, and what's going on.

The same models perform poorly on Teradata, but it is my understanding that if we re-model in 3NF, Teradata parallelization will work a lot better. The Teradata installation is many times more expensive than the SQL Server installation, so it just goes to show how much of a difference modeling and matching your data and processes to the underlying feature set matters.

Without knowing more about your data, how it's currently modeled, and what indexing choices you've made, it's hard to say anything more.

Cade Roux
Thanks. There is definitely room for improvement in our model and that task will fall to someone with experience in that arena. What's important to me is that there are examples of others using a commercial RDBMS to warehouse volumes of data similar to ours. I appreciate the response.
grenade
+2  A: 

I work with a few SQL Server 2008 databases containing tables with rows numbering in the billions. The only real problems we ran into were those of disk space, backup times, etc. Queries were (and still are) always fast, generally in the < 1 sec range, never more than 15-30 secs even with heavy joins, aggregations and so on.

Relational database systems can definitely handle this kind of load, and if one server or disk starts to strain then most high-end databases have partitioning solutions.

You haven't mentioned anything in your question about how the data is indexed, and 9 times out of 10, when I hear complaints about SQL performance, inadequate/nonexistent indexing turns out to be the problem.

The very first thing you should always be doing when you see a slow query is pull up the execution plan. If you see any full index/table scans, row lookups, etc., that indicates inadequate indexing for your query, or a query that's written so as to be unable to take advantage of covering indexes. Inefficient joins (mainly nested loops) tend to be the second most common culprit and it's often possible to fix that with a query rewrite. But without being able to see the plan, this is all just speculation.
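To make the scan-vs-seek distinction concrete, here is a minimal sketch using Python's built-in SQLite as a stand-in for DB2 or SQL Server (the table and index names are invented for illustration). It shows how the plan for an aggregating query changes once a covering index exists:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (id INTEGER PRIMARY KEY, biz_id INTEGER, amount REAL)")

query = "SELECT SUM(amount) FROM txn WHERE biz_id = 7"

def plan(conn, sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(conn, query)  # full table scan of txn
conn.execute("CREATE INDEX idx_txn_biz ON txn (biz_id, amount)")
after = plan(conn, query)   # indexed search; the index covers both columns

print(before)
print(after)
```

Every major RDBMS exposes the same diagnostic: DB2 has the EXPLAIN facility (formatted with `db2exfmt`), SQL Server shows graphical execution plans, and Oracle has `EXPLAIN PLAN`.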

So the basic answer to your question is yes, relational database systems are completely capable of handling this scale, but if you want something more detailed/helpful then you might want to post an example schema / test script, or at least an execution plan for us to look over.

Aaronaught
Thanks. This is exactly the sort of response I was looking for. Real world confidence from practical experience. It gives me hope.
grenade
+2  A: 

It looks like you're calculating the same data over and over again from normalized data. One way to speed up processing in cases like this is to keep SQL with its nice reporting and relationships and consistency and such, and use an OLAP cube which is recalculated every x minutes. Basically you build a big table of denormalized data on a regular basis, which allows quick lookups. The relational data is treated as the master, but the cube allows quick precalculated values to be retrieved from the database at any one point.
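As an illustrative sketch (Python's built-in SQLite here; the schema and table names are invented), the "cube" can be as simple as an aggregate table rebuilt on a schedule from the relational master data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    INSERT INTO txn (region, amount) VALUES
        ('EU', 10.0), ('EU', 20.0), ('US', 5.0), ('US', 15.0), ('US', 30.0);
    -- The "cube": denormalized, precomputed aggregates, rebuilt off-peak.
    CREATE TABLE region_cube (region TEXT PRIMARY KEY, total REAL, avg_amt REAL, n INTEGER);
""")

def refresh_cube(conn):
    # Rebuild the aggregate table; readers then hit this instead of the joins.
    conn.execute("DELETE FROM region_cube")
    conn.execute("""
        INSERT INTO region_cube
        SELECT region, SUM(amount), AVG(amount), COUNT(*) FROM txn GROUP BY region
    """)

refresh_cube(conn)
rows = dict((r, (t, a, n)) for r, t, a, n in conn.execute("SELECT * FROM region_cube"))
print(rows)  # EU: total 30.0, avg 15.0, count 2
```

In practice the rebuild would be scheduled (or done with a real OLAP engine such as SQL Server Analysis Services), but the principle is the same: reads pay for a lookup, not for the joins and aggregation.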

Billy ONeal
A: 

90 million rows at roughly 1 KB per row would be about 90 GB, so your bottleneck is disk. If you need these queries rarely, run them as is.

If you need these queries often, you have to split your data and precompute your grouping, summing and averaging over the part of your data that doesn't change (or hasn't changed since last time).

For example, if you process historical data for the last N years up to and including today, you could process it one month (or week, or day) at a time and store the totals and averages somewhere. Then at query time you only need to reprocess the period that includes today.
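The arithmetic for combining the closed periods with the live one is straightforward as long as you store sums and counts rather than averages (an average of averages would be wrong when period sizes differ). A sketch with made-up numbers:

```python
# Hypothetical precomputed (sum, count) per closed month; computed off-peak,
# never touched again because historical data doesn't change.
closed_months = {
    "2010-01": (120_000.0, 400),
    "2010-02": (90_000.0, 300),
}

def overall(closed, current_amounts):
    # current_amounts: raw rows for the still-changing period (e.g. today)
    s = sum(t for t, _ in closed.values()) + sum(current_amounts)
    n = sum(c for _, c in closed.values()) + len(current_amounts)
    return s, n, s / n

today = [100.0, 200.0, 300.0]
s, n, avg = overall(closed_months, today)
print(s, n, avg)  # total 210600.0 over 703 rows
```

Only the few rows for the current period are aggregated at query time; everything else is a table lookup.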

Some RDBMSes give you some control over when views are updated (at select, at source change, offline). If your complicated grouping, summing and averaging is in fact simple enough for the database to understand correctly, it could, in theory, update a few rows in the view at every insert/update/delete in your source tables in reasonable time.
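In engines without materialized views you can emulate the "at source change" option yourself with triggers that maintain a summary table incrementally. A toy sketch in Python's built-in SQLite (schema and names invented; a real implementation would also need delete/update triggers):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE txn (id INTEGER PRIMARY KEY, region TEXT, amount REAL);
    CREATE TABLE region_totals (region TEXT PRIMARY KEY, total REAL, n INTEGER);
    -- Maintain the summary at source change: each insert into txn
    -- adjusts one row of region_totals instead of re-aggregating everything.
    CREATE TRIGGER txn_ins AFTER INSERT ON txn BEGIN
        INSERT OR IGNORE INTO region_totals VALUES (NEW.region, 0, 0);
        UPDATE region_totals SET total = total + NEW.amount, n = n + 1
            WHERE region = NEW.region;
    END;
""")

conn.executemany("INSERT INTO txn (region, amount) VALUES (?, ?)",
                 [("EU", 10.0), ("EU", 5.0), ("US", 7.5)])
totals = dict((r, (t, n)) for r, t, n in conn.execute("SELECT * FROM region_totals"))
print(totals)  # {'EU': (15.0, 2), 'US': (7.5, 1)}
```

This only stays cheap if each source change touches a bounded number of summary rows; that is exactly the "simple enough for the database to understand" caveat above.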

Dima Tisnek