What's your approach for optimizing large tables (+1M rows) on SQL Server?

views:

2267

answers:

+5 Q:

What's your approach for optimizing large tables (+1M rows) on SQL Server?

I'm importing Brazilian stock market data to a SQL Server database. Right now I have a table with price information from three kind of assets: stocks, options and forwards. I'm still in 2006 data and the table has over half million records. I have more 12 years of data to import so the table will exceed a million records for sure.

Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:

[Stock] [int] NOT NULL
[Date] [smalldatetime] NOT NULL
[Open] [smallmoney] NOT NULL
[High] [smallmoney] NOT NULL
[Low] [smallmoney] NOT NULL
[Close] [smallmoney] NOT NULL
[Trades] [int] NOT NULL
[Quantity] [bigint] NOT NULL
[Volume] [money] NOT NULL

Now, second approach for optimization was to make a clustered index. Actually the primary index is automatically clusted and I made it a compound index with Stock and Date fields. This is unique, I can't have two quote data for the same stock on the same day.

The clusted index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second information true?

Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.

Now for a third approach I'm thinking in maybe splitting the table in three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help or it doesn't matter too much? Right now the table has 50mb of size so it can fit entirely in RAM without much trouble.

Another approach would be using the partition feature of SQL Server. I don't know much about it but I think it's normally used when the tables are large and you can span across multiple disks to reduce I/O latency, am I right? Would partitioning be any helpful in this case? I believe I can partition the newest values (latest years) and oldest values in different tables, The probability of seeking for newest data is higher, and with a small partition it will probably be faster, right?

What would be other good approachs to make this the fastest possible? The mainly select usage of the table will be for seeking a specific range of records from a specific asset, like the latest 3 months of asset X. There will be another usages but this will be the most common, being possible executed by more than 3k users concurrently.

+5 A:

A million records really isn't that big. It does sound like it's taking too long to search though - is the column you're searching against indexed?

As ever, the first port of call should be the SQL profiler and query plan evaluator. Ask SQL Server what it's going to do with the queries you're interested in. I believe you can even ask it to suggest changes such as extra indexes.

I wouldn't start getting into partitioning etc just yet - as you say, it should all comfortably sit in memory at the moment, so I suspect your problem is more likely to be a missing index.

Jon Skeet 2008-12-03 16:32:18

Yes, it's indexed but I forgot an extremely important point: I'm still inserting data, that is probably affecting the search a lot. I know it's not that big but it will be queried very often.

Augusto Radtke 2008-12-03 16:39:31

And have you profiled it yet and checked the query plan?Do you always need to be able to query the most recent data? If not, you may find it best to insert into some unindexed tables, then batch up the inserts during quiet periods.

Jon Skeet 2008-12-03 16:49:45

+1 A:

I work for a school district and we have to track attendance for each student. It's how we make our money. My table that holds the daily attendance mark for each student is currently 38.9 Million records large. I can pull up a single student's attendance very quickly from this. We keep 4 indexes (including the primary key) on this table. Our clustered index is student/date which keeps all the student's records ordered by that. We've taken a hit on inserts to this table with regards to that in the event that an old record for a student is inserted, but it is a worthwhile risk for our purposes.

With regards to select speed, I would certainly take advantage of caching in your circumstance.

Nick DeVore 2008-12-03 16:35:04

+3 A:

Check your execution plan on that query first. Make sure your indexes are being used. I've found that. A million records is not a lot. To give some perspective, we had an inventory table with 30 million rows in it and our entire query which joined tons of tables and did lots of calculations could run in under 200 MS. We found that on a quad proc 64 bit server, we could have signifcantly more records so we never bothered partioning.

You can use SQL Profier to see the execution plan, or just run the query from SQL Management Studio or Query Analyzer.

JoshBerke 2008-12-03 16:35:18

The execution plan shows it's using the clustered index quite fine, but I forgot an extremely important fact, I'm still inserting data! The insert is probably locking the table too often. There is a way we can see this bottleneck?

The execution plan doesn't seems to show anything about lock issues.

Right now this data is only historical, when the importing process is finished the inserts will stop and be much less often. But I will have a larger table for real-time data soon, that will suffer from this constant insert problem and will be bigger than this table. So any approach on optimizing this kind of situation is very welcome.

Augusto Radtke 2008-12-03 16:47:01

Run your select with NOLOCK to skip past any locks. Not that I'd recommend that for production, but you can use it to test for locking issues. Profiler will also show you locks, but it can be a bear to sort out.

Mark Brackett 2008-12-03 17:03:06

How quickly are you inserting? It shouldn't take long to put in 1 million records if they are batched. If you are doing them one at a time, there will be no interference.

le dorfier 2008-12-14 00:02:34

+2 A:

reevaluate the indexes... thats the most important part, the size of the data doesn't really matter, well it does but no entirely for speed purposes.

My recommendation is re build the indexes for that table, make a composite one for the columns you´ll need the most. Now that you have only a few records play with the different indexes otherwise it´ll get quite annoying to try new things once you have all the historical data in the table.

After you do that review your query, make the query plan evaluator your friend, and check if the engine is using the right index.

I just read you last post, theres one thing i don't get, you are quering the table while you insert data? at the same time?. What for? by inserting, you mean one records or hundred thousands? How are you inserting? one by one?

But again the key of this are the indexes, don't mess with partitioning and stuff yet.. specially with a millon records, thats nothing, i have tables with 150 millon records, and returning 40k specific records takes the engine about 1500ms...

Alan FL 2008-12-03 16:52:08

The insert procedure right now is very loose. I'm not bulk inserting, so I believe that's the main problem. It's very nice to read numbers from sizes against select time, I didn't have the measure to know what is fast or not.

Augusto Radtke 2008-12-03 16:58:43

another solution would be to create an historical table for each year, and put all this tables in an historical database, fill all those in and then create the appropriate indexes for them. Once you are done with this you won't have to touch them ever again. Why would you have to keep on inserting data? To query all those tables you just "union all" them :p

The current year table should be very different to this historical tables. For what i understood you are planning to insert records on the go?, i'd plan something different like doing a bulk insert or something similar every now and then along the day. Of course all this depends on what you want to do.

The problems here seems to be in the design. I'd go for a new design. The one you have now for what i understand its not suitable.

Alan FL 2008-12-03 17:03:55

Denormalization is bad advice at this size database, and by year would be the wrong way anyway.

le dorfier 2008-12-14 00:00:46

+1 A:

You've mentioned that your primary key is a compound on (Stock, Date), and clustered. This means the table is organised by Stock and then by Date. Whenever you insert a new row, it has to insert it into the middle of the table, and this can cause the other rows to be pushed out to other pages (page splits).

I would recommend trying to reverse the primary key to (Date, Stock), and adding a non-clustered index on Stock to facilitate quick lookups for a specific Stock. This will allow inserts to always happen at the end of the table (assuming you're inserting in order of date), and won't affect the rest of the table, and lesser chance of page splits.

Jim McLeod 2008-12-03 22:41:42

Wrong about reversing the clustered index - queries are for multiple stocks at a time, not multiple dates at a time. And inserts will be totally negligible at 1 new record per stock per day.

le dorfier 2008-12-14 00:00:00

Actually I'll get much more than 1 new record per stock per day as I'll start storing every stock trade, perhaps this is still a good advise?

Augusto Radtke 2009-01-05 00:37:40

Actually the primary index is automatically clusted and I made it a compound index with Stock and Date fields. This is unique, I can't have two quote data for the same stock on the same day.

The clusted index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second information true?

Indexes in SQL Server are always sorted by column order in index. So an index on [stock,date] will first sort on stock, then within stock on date. An index on [date, stock] will first sort on date, then within date on stock.

When doing a query, you should always include the first column(s) of an index in the WHERE part, else the index cannot be efficiently used.

For your specific problem: If date range queries for stocks are the most common usage, then do the primary key on [date, stock], so the data will be stored sequencially by date on disk and you should get fastest access. Build up other indexes as needed. Do index rebuild/statistics update after inserting lots of new data.

MicSim 2008-12-13 22:58:18

SQL Server (and any other SQL engine) is way past smart enough to reorder your fields in order to match the indexes, so your WHERE clause info is questionable. Stats are updated automatically.

le dorfier 2008-12-13 23:57:47

+4 A:

At 1 million records, I wouldn't consider this a particularly large table needing unusual optimization techniques such as splitting the table up, denormalizing, etc. But those decisions will come when you've tried all the normal means that don't affect your ability to use standard query techniques.

Now, second approach for optimization was to make a clustered index. Actually the primary index is automatically clusted and I made it a compound index with Stock and Date fields. This is unique, I can't have two quote data for the same stock on the same day.

The clusted index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second information true?

It's logically true - the clustered index defines the logical ordering of the records on the disk, which is all you should be concerned about. SQL Server may forego the overhead of sorting within a physical block, but it will still behave as if it did, so it's not significant. Querying for one stock will probably be 1 or 2 page reads in any case; and the optimizer doesn't benefit much from unordered data within a page read.

Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.

Not necessarily significantly. There isn't a linear relationship between table size and query speed. There are usually a lot more considerations that are more important. I wouldn't worry about it in the range you describe. Is that the reason you're concerned? 200 ms would seem to me to be great, enough to get you to the point where your tables are loaded and you can start doing realistic testing, and get a much better idea of real-life performance.

Now for a third approach I'm thinking in maybe splitting the table in three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help or it doesn't matter too much? Right now the table has 50mb of size so it can fit entirely in RAM without much trouble.

No! This kind of optimization is so premature it's probably stillborn.

Another approach would be using the partition feature of SQL Server.

Same comment. You will be able to stick for a long time to strictly logical, fully normalized schema design.

What would be other good approachs to make this the fastest possible?

The best first step is clustering on stock. Insertion speed is of no consequence at all until you are looking at multiple records inserted per second - I don't see anything anywhere near that activity here. This should get you close to maximum efficiency because it will efficiently read every record associated with a stock, and that seems to be your most common index. Any further optimization needs to be accomplished based on testing.

le dorfier 2008-12-13 23:54:54

ansaurus

tags:

views:

answers:

What's your approach for optimizing large tables (+1M rows) on SQL Server?

related questions