views: 136

answers: 3
This question is related to another:
http://stackoverflow.com/questions/108445/will-having-multiple-filegroups-help-speed-up-my-database

The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.

Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this gives us enormous performance improvements over referring back to the source data.

The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion), but once our database grows beyond 100,000 tables or so, simple operations like CREATE TABLE and DROP TABLE can slow down dramatically.

We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?

Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, or faster disks). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.

A: 

This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)

You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that aren't evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.

You are going to need a combination of serious data warehousing and data/table partitioning. Depending on how much data you want to keep and archive, you may need to start de-normalizing and flattening the tables.

This would be a pretty good case where contacting Microsoft directly could be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.

hectorsosajr
+1  A: 

Are the tables all different structures? If they all have the same structure, you might get away with a single partitioned table.

If they have different structures, but are just subsets of the same set of dimension columns, you could still store them as partitions of the same table, with NULLs in the non-applicable columns.
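Not the poster's actual schema, but a minimal sketch of what such a combined table might look like (all table and column names here are made up for illustration):

    -- Illustrative only: a single combined results table keyed by RunID.
    -- Dimension columns that a given run type doesn't use are simply left NULL.
    CREATE TABLE dbo.AnalysisResults (
        RunID        int             NOT NULL,
        ResultID     bigint IDENTITY NOT NULL,
        DimensionA   int             NULL,   -- populated only by some run types
        DimensionB   varchar(50)     NULL,   -- populated only by other run types
        MeasureValue decimal(18, 4)  NOT NULL,
        CONSTRAINT PK_AnalysisResults PRIMARY KEY CLUSTERED (RunID, ResultID)
    );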

If this is analytic (derivative pricing computations, perhaps?), you could dump the results of a computation run to flat files and reuse your computations by reloading from the flat files when needed.
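As a rough sketch of that idea -- the table, file, and server names are hypothetical, and the target table is assumed to already exist when reloading -- you could export a run with bcp in native format and pull it back in with BULK INSERT:

    -- From a command prompt, dump one run's results to a flat file:
    --   bcp "SELECT * FROM AnalysisDb.dbo.Run42Results" queryout D:\runs\run42.dat -n -T -S MYSERVER

    -- Later, to recall the saved run:
    BULK INSERT dbo.Run42Results
    FROM 'D:\runs\run42.dat'
    WITH (DATAFILETYPE = 'native', TABLOCK);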

ConcernedOfTunbridgeWells
+1  A: 

Without seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key - a dimensional model may also be relevant here. This table can be partitioned for better performance, which will also allow you to spread the table across multiple filegroups.
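As a sketch of what that could look like on SQL Server 2005 (table partitioning requires Enterprise Edition; the boundary values, filegroup names, and table here are all illustrative, and the filegroups are assumed to already exist):

    -- Illustrative only: partition a combined results table by RunID and map
    -- the partitions onto separate filegroups.
    CREATE PARTITION FUNCTION pfRunID (int)
        AS RANGE RIGHT FOR VALUES (1000, 2000, 3000);

    CREATE PARTITION SCHEME psRunID
        AS PARTITION pfRunID TO (FG1, FG2, FG3, FG4);

    CREATE TABLE dbo.RunResults (
        RunID        int             NOT NULL,
        ResultID     bigint          NOT NULL,
        MeasureValue decimal(18, 4)  NOT NULL,
        CONSTRAINT PK_RunResults PRIMARY KEY CLUSTERED (RunID, ResultID)
    ) ON psRunID (RunID);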

Another possibility is to put each run in its own database and then detach it, only attaching it again as needed (and in read-only form).
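Roughly, and with made-up database and file names, that detach/attach cycle could look like this:

    -- Illustrative only: each run lives in its own database, which is detached
    -- after the run finishes and re-attached read-only when it is needed again.
    EXEC sp_detach_db @dbname = 'Run42';

    -- Later, when that run needs to be recalled:
    CREATE DATABASE Run42
        ON (FILENAME = 'D:\runs\Run42.mdf'),
           (FILENAME = 'D:\runs\Run42_log.ldf')
        FOR ATTACH;

    ALTER DATABASE Run42 SET READ_ONLY;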

CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.

I also recommend talking to Microsoft about your choice of database design.

Cade Roux