I'm importing Brazilian stock market data to a SQL Server database. Right now I have a table with price information from three kind of assets: stocks, options and forwards. I'm still in 2006 data and the table has over half million records. I have more 12 years of data to import so the table will exceed a million records for sure.
Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:
[Stock] [int] NOT NULL [Date] [smalldatetime] NOT NULL [Open] [smallmoney] NOT NULL [High] [smallmoney] NOT NULL [Low] [smallmoney] NOT NULL [Close] [smallmoney] NOT NULL [Trades] [int] NOT NULL [Quantity] [bigint] NOT NULL [Volume] [money] NOT NULL
Now, second approach for optimization was to make a clustered index. Actually the primary index is automatically clusted and I made it a compound index with Stock and Date fields. This is unique, I can't have two quote data for the same stock on the same day.
The clusted index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second information true?
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Now for a third approach I'm thinking in maybe splitting the table in three tables, each for a specific market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help or it doesn't matter too much? Right now the table has 50mb of size so it can fit entirely in RAM without much trouble.
Another approach would be using the partition feature of SQL Server. I don't know much about it but I think it's normally used when the tables are large and you can span across multiple disks to reduce I/O latency, am I right? Would partitioning be any helpful in this case? I believe I can partition the newest values (latest years) and oldest values in different tables, The probability of seeking for newest data is higher, and with a small partition it will probably be faster, right?
What would be other good approachs to make this the fastest possible? The mainly select usage of the table will be for seeking a specific range of records from a specific asset, like the latest 3 months of asset X. There will be another usages but this will be the most common, being possible executed by more than 3k users concurrently.