views:

255

answers:

5

I have 2 files that I'd like to import into MS SQL. The first file is 2.2 GB and the second file is 24 GB worth of data. (if you are curious: this is a poker related look up table)

Importing them into MS SQL is not a problem. Thanks to SqlBulkCopy I was able to import the first file in just 10 minutes. My problem is, I don't know how the actual table schema should look like to allow me to do some very fast queries. My first naive attempt looks like this:

CREATE TABLE [dbo].[tblFlopHands](
    [hand_id] [int] IDENTITY(1,1) NOT NULL,
    [flop_index] [smallint] NULL,
    [hand_index] [smallint] NULL,
    [hs1] [real] NULL,
    [ppot1] [real] NULL,
    [hs2] [real] NULL,
    [ppot2] [real] NULL,
    [hs3] [real] NULL,
    [ppot3] [real] NULL,
    [hs4] [real] NULL,
    [ppot4] [real] NULL,
    [hs5] [real] NULL,
    [ppot5] [real] NULL,
    [hs6] [real] NULL,
    [ppot6] [real] NULL,
    [hs7] [real] NULL,
    [ppot7] [real] NULL,
    [hs8] [real] NULL,
    [ppot8] [real] NULL,
    [hs9] [real] NULL,
    [ppot9] [real] NULL,
 CONSTRAINT [PK_tblFlopHands] PRIMARY KEY CLUSTERED 
(
    [hand_id] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

The flop index is a value from 1 to 22100 (the first 3 common cards in texas hold'em, 52 choose 3). Each flop index has a hand_index from 1 to 1176 (49 choose 2). So in total there are 25,989,600 rows in this table.

Doing a query with my above "schema" took approx. 25 seconds. After some googling I found out that the SQL server was doing a table scan, which is obviously a bad thing. I ran the "Database Engine Tuning Advisor" and it recommended to create an index on the flop_index column (makes sense). After creating the index, the required disk spaces for the DB exactly doubled up! (plus the log LDF file grew by 2.6 GB) But after the indexing, a query took only a couple of ms.

Now my question is, how should I do it the right way? I've never worked with such massive data, the databases I created before were a joke.

Some things to note: After importing the data into MS SQL there will never ever be an insert or update of the data, just select's. So I'm wondering if I even need a primary key?

EDIT: I'm providing some more info to make my question more clear:

1) I'll never ever use the hand_id. I only put it there because someone told me some long time ago that I should always create a primary key for each table.

2) There will be basically only one query I will use:

SELECT hand_index, hs1, ppot1, hs2, ppot2, hs3, ppot3, hs4, ppot4, hs5, ppot5, hs6, ppot6, hs7, ppot7, hs8, ppot8, hs9, ppot9 WHERE flop_index = 1...22100

This query will always return 1176 rows with the data I need.

EDIT2: Just to be more specific: Yes this is static data. I have this data in a binary file. I have written a program to query this file with the data I need in just a few milliseconds. The reason I want this data in a database is that I want to be able to query the data from different computers in my network without the need to copy 25 GB on each computer.

HS means handstrength, it tells you the current hand strength of your hole cards combined with the flop or turn cards. ppot means positive potential, this is the chance that your hand will be ahead once the next common card is dealt. hs1 to 9 is the handstrength against 1 to 9 opponents. Same for ppot. Calculating ppot on the fly is very cpu intensive and takes a couple of minutes to calculate. I want to create a poker analysis program which gives me a list of every possible hole card combiniation on any give flop/turn with their hs/ppot.

A: 

This is a very common question. When you create indexes it potentially reduces the time required for queries but increases the time required for updates/inserts and also increase the amount of disk space required per record.

You need to decide for each column if the index offers a performance boost for your queries and if it warrants the impact to insert/update performance and disk space utilization.

As an alternative to indexes, you might be able to utilize an OLAP cube. If your query is producing an aggregate or applying computations then you might want to consider performing the query nightly and storing the results in a different table. You can run simpler queries against the smaller table and achieve the same result with less impact on performance.

Mayo
A: 

How you do your indexes and primkeys depends. If you just want to analyze the data and if you're pretty sure subsequent DML commands will only be SELECTs (no INSERTs), then removing the PK should be fine. In fact, the hand_id column is an IDENTITY (auto-increment) column, meaning that SQL Server manages that value anyway (in fact, you can't insert values into that column without going to the extra trouble of switching on IDENTITY_INSERT mode prior to beginning your INSERT statements, IIRC).

Be wary of evolving needs for this database, of course. Should needs change, then you should consider constraints/indexes/keys.

If data mining is a consideration in the future, consider using Microsoft's SSAS (Analysis Services).

UPDATE: After reading mayo's reply, I agree that indexes (purely for speed, not constraint enforcement) are advisable for subsequent queries (recall that indexes speed up read operations but make inserts/updates take longer, typically). Since your goal is to do a single bulk insert followed by SELECT queries, you could do your bulk insert, then add the necessary indexes to your database on columns that are likely candidates in your queries.

Garrett
Actually I won't use the hand_id at all. I created the PK because I've been taught to always create a PK in every table. Also, in my scenario there will never ever be any inserts or updates at all once the data is inserted. Also I will always to a query with the hand_index, so each query will return 1176 rows. So is it normal that after creating an index on the hand_index column that the DB size doubles up? I thought this is odd, but if it works like this then let it be it.
Simon
+1  A: 

To answer your question about needing a primary key - with only the information you provided in the question:

Based on your table schema, you might as well keep it there. If you remove that identity column, you'd also be removing your clustered index. Your clustered index value (4 bytes) is stored as the pointer in each non-clustered index row. By removing that clustered index, you'd be leaving the table as a heap - and SQL will create an 8 byte RID (row identifier) for each row in the table, and use that as the pointer in the non-clustered index instead. So, in your case, based on the schema you've provided in the question - you could potentially INCREASE the size of your non-clustered indexes, and in the end slow them down.

With that all said - based on the queries that you could be running (and their usage patterns) that weren't included in the question - evaluating your clustered index to be something other than an identity column could be in line as well.

Scott Ivey
+1  A: 

Well you could break up the table into smaller tables if for example the hs(X) and ppot(X) need to grow past nine.

This is what you have:

[hand_id] [int] IDENTITY(1,1) NOT NULL,
    [flop_index] [smallint] NULL,
    [hand_index] [smallint] NULL,
    [hs1] [real] NULL,
    [ppot1] [real] NULL,
    etc...

You could break it up into 2 tables (maybe 3 if you need to)

Table hand: (EXAMPLE)
[hand_id] [int] IDENTITY(1,1) NOT NULL,
    [flop_index] [smallint] NULL,
    [hand_index] [smallint] NULL


Table hs_ppot (EXAMPLE)
[hand_id] [int] IDENTITY(1,1) NOT NULL,
[hs] [real] NULL,
    [ppot] [real] NULL

Then you could reference by hand_id in each table. Just a though.

BTW what is hs and ppot?

Phill Pafford
hs means Handstrength and ppot means "Positive Potential"
Simon
ok thanks, not big into poker
Phill Pafford
I'm actually trying to break the data into multiple tables, i'll let you know how it works out. Unfortunately I'm not big into SQL ;)
Simon
A: 

Let me preface my response by saying that putting every possible combination in a database feels wrong. I'll get to why in a minute.

I'd start with a table called Cards. There would be 1 record for every possible card and it would include fields for Suit, Face value, rank and yes, a CardID as a primary key. Also index the suit, and face value.

If you want to table out every possible Hold'em hand, then I would make separate tables for the pocketCards(pocketID, pCardID1, pCardID2), flopCards(flopID, fCardID1, fCardID2, fCardID3) then a table for the TurnAndRiver(turnAndRiverID, turnCardID, riverCardID). Then a Hand table with (handID, pocketID, flopID, turnAndRiverID, handScore).

HandScore would be a calculated field run off of a table or scalar value function.

By separating out those bits, you avoid a great deal of the duplication, but you will still have to worry about card selection and overlap.

Ideally, I would forgo the hand tables and calculate the hand and score in what ever application I was building to consume this data.

Putting too much of your logic in the database may make it hard to adapt when the client asks you to model Omaha or five-card draw for example.

In reguard to your index question, yes, I would use a primary key as that will allow you to quickly reference a specific hand in your code.

Update

In response to the OP's Edit: It sounds like you are using the wrong tool for this task. What is the value of having the data in a Database if you are always going to select the exact same recordset? Examine other options (like a flat XML file, or a static DataSet in your code for example). It will save you the connection time and the overhead of running a server for what is essentially static data.

Rob Allen