I have a couple of databases containing simple data which needs to be imported into a new format schema. I've come up with a flexible schema, but it relies on the critical data of the two older DBs being stored in one table. This table has only a primary key, a foreign key (both ints), a datetime and a decimal field, but adding up the row counts from the two older DBs indicates that the total row count for this new table would be about 200,000,000 rows.
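For illustration, a rough sketch of what I have in mind (all names and the decimal precision here are placeholders, not the real schema):

-- Sketch of the proposed table; names, types and the FK target are placeholders.
CREATE TABLE dbo.Reading
(
    ReadingId   INT IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- primary key
    SupplierId  INT NOT NULL,                            -- foreign key to the supplier/source table
    ReadingDate DATETIME NOT NULL,
    Amount      DECIMAL(9,2) NOT NULL,
    CONSTRAINT FK_Reading_Supplier
        FOREIGN KEY (SupplierId) REFERENCES dbo.Supplier (SupplierId)
);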

How do I go about dealing with this amount of data? It is data stretching back about 10 years and does need to be available. Fortunately, we don't need to pull out even 1% of it when making queries in the future, but it does all need to be accessible.

I've got ideas based around having multiple tables per year, per supplier (of the source data), etc. - or even having one database for each year, with the most recent two years in one DB (which would also contain the stored procs for managing all this).

Any and all help, ideas and suggestions are very much appreciated,

Matt.

+1  A: 

Most importantly, consider profiling your queries and measuring where your actual bottlenecks are (try identifying missing indexes); you might find that you can store everything in a single table, or that buying a few extra hard disks will be enough to get sufficient performance.

Now, for suggestions, have you considered partitioning? You could create partitions per time range, or one partition for the commonly accessed 1% and another for the remaining 99% of the data.

This is roughly equivalent to splitting the tables manually by year or supplier or whatnot, but internally handled by the server.

On the other hand, it might make more sense to actually split the tables into 'current' and 'historical'.
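As a rough illustration (Enterprise Edition only; the boundary dates, filegroup mapping and names below are made up), range partitioning by year looks something like:

-- Partition function: one partition per year (boundary values are illustrative).
CREATE PARTITION FUNCTION pfByYear (DATETIME)
AS RANGE RIGHT FOR VALUES ('2008-01-01', '2009-01-01', '2010-01-01');

-- Map every partition to a filegroup (all to PRIMARY here for simplicity).
CREATE PARTITION SCHEME psByYear
AS PARTITION pfByYear ALL TO ([PRIMARY]);

-- Create (or rebuild) the big table on the partition scheme.
CREATE TABLE dbo.megaTable
(
    id          INT NOT NULL,
    supplierId  INT NOT NULL,
    readingDate DATETIME NOT NULL,
    amount      DECIMAL(9,2) NOT NULL
) ON psByYear (readingDate);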

Another possible size improvement is using an int (like an epoch) instead of a datetime and providing functions to convert from datetime to int, giving queries like

SELECT * FROM megaTable WHERE [datetime] > dbo.dateTimeToEpoch('2010-01-23')

This size saving will probably have a performance cost if you need to do complex datetime queries, although with cubes there is the standard technique of storing an int in YYYYMMDD format instead of an epoch.
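A minimal sketch of such a conversion function (the name matches the query above; the choice of epoch is arbitrary):

-- Seconds since 1970-01-01; fits in an INT until 2038.
CREATE FUNCTION dbo.dateTimeToEpoch (@d DATETIME)
RETURNS INT
AS
BEGIN
    RETURN DATEDIFF(SECOND, '1970-01-01', @d);
    -- For the YYYYMMDD style mentioned above, use instead:
    -- RETURN YEAR(@d) * 10000 + MONTH(@d) * 100 + DAY(@d);
END;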

Vinko Vrsalovic
Do you have any experience or numbers showing that storing ints and converting is more efficient than datetime? Have you considered smalldatetime if seconds aren't important?
tster
In my use case, seconds are important, and an int is half the size of a datetime type. I haven't measured performance, though; I just expect it to be faster, simply because there's less data to move around. I also expect it to be very efficient for simple comparisons (datetime > 'date'). If you have to query a lot on more complex conditions, like hour of day, then you'll likely pay a price for the saved space (as you'll have to call the conversion function on each row).
Vinko Vrsalovic
I wouldn't just assume if I were you. I see no reason why an int greater-than comparison would be faster than a datetime comparison, especially if there is an index for your query, so even with millions of rows it would still only do a handful of comparisons. As for the size, smalldatetime is the same size as an int.
tster
@tster: As I said, I actually did it for space. And, although I see a reason why it might be faster, you are correct in that a) it might have been optimized away and b) I need to prove it. I'll edit to reflect it.
Vinko Vrsalovic
+1  A: 

What's the problem with storing this data in a single table? An enterprise-level SQL server like Microsoft SQL Server 2005 can handle it without much pain.

By the way, do not create tables per year, tables per supplier or other things like that. If you have to store a similar set of items, you need one and only one table. Using multiple tables to store the same type of thing will cause problems, like:

  • Queries will be extremely difficult to write, and performance will suffer if you have to query across multiple tables.

  • The database design will be very difficult to understand (especially since it is not natural to store the same type of items in different places).

  • You will not be able to easily modify your database (maybe it's not a problem in your case), because instead of changing one table, you would have to change every table.

  • It would require automating a bunch of tasks. Let's say you have a table per year. If a new record is inserted at 2011-01-01 00:00:00.001, will a new table be created? Will you check on each insert whether you must create a new table? How would that affect performance? Can you test it easily?

If there is a real, visible separation between "recent" and "old" data (for example, you work daily only with the data saved in the last month, and you have to keep everything older but do not use it), you can build a system with two SQL Servers (installed on different machines). The first, a highly available server, will handle recent data. The second, less available and optimized for writing, will store everything else. Then, on a schedule, a program will move old data from the first one to the second.
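A very rough sketch of such a scheduled move (all names are invented, ArchiveServer is a linked-server placeholder, and in practice you would run this in smaller batches from a SQL Agent job or an external program):

DECLARE @cutoff DATETIME;
SET @cutoff = DATEADD(MONTH, -1, GETDATE());

BEGIN TRANSACTION;

-- Copy rows older than the cutoff to the archive server...
INSERT INTO ArchiveServer.ArchiveDb.dbo.megaTable (id, supplierId, readingDate, amount)
SELECT id, supplierId, readingDate, amount
FROM dbo.megaTable
WHERE readingDate < @cutoff;

-- ...then remove them from the "recent" server.
DELETE FROM dbo.megaTable
WHERE readingDate < @cutoff;

COMMIT TRANSACTION;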

MainMa
I accept that the Enterprise version should do this easily. When you say "without much pain", do you mean it will just do it, or would I have to take certain configuration considerations into account?
Matt W
@MainMa I agree that partitioning may not be required, for the reason given by @tster, but your arguments against it assume he would be rolling his own partitioning scheme. SQL Server has built-in support for partitioning, more so in the Enterprise edition, though partitioned views are available in the other editions.
Martin Smith
@Matt W: it depends on the context. IMHO, 2×10⁸ rows is not huge (see tster's response, which is more detailed). So assuming you have a good server and you do not have to make tens of thousands of queries per second, I don't think you need configuration changes. You can always fill the database with random data and test it to see the *real* result.
MainMa
@Martin Smith: I apologize, maybe my answer was not explicit enough, but I never encouraged him to create his own partitioning scheme. When I wrote my answer, for me there were only two solutions: using one table only, and using two different servers on two different machines (to be able to optimize hardware too). Now, a SQL partitioning scheme handled internally by the server is another way to do things, and it is a good one.
MainMa
+1  A: 

With such a small tuple size (two ints, one datetime, one decimal) I think you will be fine having a single table with all the results in it. SQL Server 2005 does not limit the number of rows in a table.

If you go down this road and run into performance problems, then it is time to look at alternatives. Until then, I would plow ahead.

EDIT: Assuming you are using DECIMAL(9) or smaller, your total tuple size is 21 bytes, which means that you can store the entire table in less than 4 GB of memory. If you have a decent server (8+ GB of memory) and this is the primary memory user, then the table and a secondary index could be stored in memory. This should ensure super-fast queries after a slower warm-up period while the cache is populated.
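The back-of-envelope math behind that, ignoring per-row and per-page overhead:

-- 4 (int PK) + 4 (int FK) + 8 (datetime) + 5 (DECIMAL(9)) = 21 bytes per row
SELECT 21.0 * 200000000 / 1024 / 1024 / 1024 AS approx_gb;  -- ≈ 3.91 GB of raw data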

tster