I have 2 files that I'd like to import into MS SQL. The first file is 2.2 GB and the second file is 24 GB worth of data. (if you are curious: this is a poker related look up table)
Importing them into MS SQL is not a problem. Thanks to SqlBulkCopy I was able to import the first file in just 10 minutes. My problem is, I don't know how the actual table schema should look like to allow me to do some very fast queries. My first naive attempt looks like this:
CREATE TABLE [dbo].[tblFlopHands]( [hand_id] [int] IDENTITY(1,1) NOT NULL, [flop_index] [smallint] NULL, [hand_index] [smallint] NULL, [hs1] [real] NULL, [ppot1] [real] NULL, [hs2] [real] NULL, [ppot2] [real] NULL, [hs3] [real] NULL, [ppot3] [real] NULL, [hs4] [real] NULL, [ppot4] [real] NULL, [hs5] [real] NULL, [ppot5] [real] NULL, [hs6] [real] NULL, [ppot6] [real] NULL, [hs7] [real] NULL, [ppot7] [real] NULL, [hs8] [real] NULL, [ppot8] [real] NULL, [hs9] [real] NULL, [ppot9] [real] NULL, CONSTRAINT [PK_tblFlopHands] PRIMARY KEY CLUSTERED ( [hand_id] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]
The flop index is a value from 1 to 22100 (the first 3 common cards in texas hold'em, 52 choose 3). Each flop index has a hand_index from 1 to 1176 (49 choose 2). So in total there are 25,989,600 rows in this table.
Doing a query with my above "schema" took approx. 25 seconds. After some googling I found out that the SQL server was doing a table scan, which is obviously a bad thing. I ran the "Database Engine Tuning Advisor" and it recommended to create an index on the flop_index column (makes sense). After creating the index, the required disk spaces for the DB exactly doubled up! (plus the log LDF file grew by 2.6 GB) But after the indexing, a query took only a couple of ms.
Now my question is, how should I do it the right way? I've never worked with such massive data, the databases I created before were a joke.
Some things to note: After importing the data into MS SQL there will never ever be an insert or update of the data, just select's. So I'm wondering if I even need a primary key?
EDIT: I'm providing some more info to make my question more clear:
1) I'll never ever use the hand_id. I only put it there because someone told me some long time ago that I should always create a primary key for each table.
2) There will be basically only one query I will use:
SELECT hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3, hs4, ppot4, hs5, ppot5, hs6, ppot6, hs7, ppot7, hs8, ppot8, hs9, ppot9 WHERE flop_index = 1...22100
This query will always return 1176 rows with the data I need.
EDIT2: Just to be more specific: Yes this is static data. I have this data in a binary file. I have written a program to query this file with the data I need in just a few milliseconds. The reason I want this data in a database is that I want to be able to query the data from different computers in my network without the need to copy 25 GB on each computer.
HS means handstrength, it tells you the current hand strength of your hole cards combined with the flop or turn cards. ppot means positive potential, this is the chance that your hand will be ahead once the next common card is dealt. hs1 to 9 is the handstrength against 1 to 9 opponents. Same for ppot. Calculating ppot on the fly is very cpu intensive and takes a couple of minutes to calculate. I want to create a poker analysis program which gives me a list of every possible hole card combiniation on any give flop/turn with their hs/ppot.