views: 271
answers: 4

I have a dataset that contains a string key field and up to 50 keywords associated with that information. Once the data has been inserted into the database there will be very few writes (INSERTs); the workload will be mostly queries for one or more keywords.

I have read "Tagsystems: performance tests", which is MySQL based, and 2NF appears to be a good method for implementing this. However, I was wondering if anyone has experience doing this with SQL Server 2008 and very large datasets.

I am likely to have 1 million key fields initially, each of which could have up to 50 keywords.

Would a structure of

keyfield, keyword1, keyword2, ... , keyword50

be the best solution, or would two tables

keyid
keyfield
    | 1
    |
    | M
keyid
keyword

be a better idea if my queries are mostly going to be looking for results that have one or more keywords?

+2  A: 

As long as you have correct indexes, 50M rows isn't that much. I would just store it as

CREATE TABLE mytable (
    keyfield nvarchar(200),
    keyword nvarchar(200),
    CONSTRAINT PK_mytable PRIMARY KEY(keyfield, keyword)
)

and, of course, index the keyword column. If you never need to get all keywords for a keyfield, you can avoid the extra index by simply reversing the column order in the primary key.
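
For what it's worth, a minimal sketch of that extra index (the index name is just illustrative):

-- With the clustered primary key on (keyfield, keyword), a nonclustered
-- index on keyword alone can still return keyfield, because the clustering
-- key is carried as the row locator in every nonclustered index.
CREATE NONCLUSTERED INDEX IX_mytable_keyword ON mytable (keyword);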

Edit: I should not post when I'm too tired. This is the way.

erikkallen
But I have 50 keywords, not one, unless I've misunderstood your explanation.
gary
Sorry, my bad. Updated now.
erikkallen
+2  A: 

Normalized is probably your better bet, but only a simulated workload will tell for sure. You're comparing 50 increasingly sparse indexes of 1 million rows each vs. 1 index of 50 million rows. I suspect that if I were a genius at MS writing an algorithm to search one index, I would pick up the values I was looking for as I went along, in one pass.

But if there are 50 indexes, I'd have to scan 50 indexes.

Also, in the denormalized schema, the 1st column will have a high-quality index, but the 50th column will have low selectivity and will probably result in scans rather than index lookups.
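
To make the comparison concrete, the wide layout would need something like one nonclustered index per keyword column (the table and index names here are purely illustrative):

-- One index per keyword column on the hypothetical wide table, 50 in total...
CREATE NONCLUSTERED INDEX IX_wide_keyword1 ON widetable (keyword1);
CREATE NONCLUSTERED INDEX IX_wide_keyword2 ON widetable (keyword2);
-- ...and so on up to keyword50, versus a single index on the narrow table:
CREATE NONCLUSTERED INDEX IX_narrow_keyword ON mytable (keyword);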

MatthewMartin
+1 for the comment about selectivity, probably has the biggest impact
Rick
A: 

I can't imagine queries like

SELECT  keyfield FROM mytable
  WHERE keyword1 in (value1, value2, ...)
     OR keyword2 in (value1, value2, ...)
     OR keyword3 in (value1, value2, ...)
     ....
     OR keyword50 in (value1, value2, ...)

Your second option looks much better:

SELECT keyfield FROM mytable WHERE keyword in (value1, value2, ...)

You will want to experiment with indexes to get the best performance, but you will probably want just one index, on the keyword column.
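
If a search ever needs to match all of several keywords rather than any of them, one common pattern against that two-column layout is the following sketch (the literal values are placeholders):

-- Keyfields tagged with every keyword in the search set (3 of them here).
SELECT keyfield
FROM   mytable
WHERE  keyword in ('value1', 'value2', 'value3')
GROUP BY keyfield
HAVING COUNT(DISTINCT keyword) = 3;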

Lucky
+3  A: 

I would normalize a step further.

You should have a table of unique KeyWords with an integer primary key column. Then add an association table that holds KeyField and KeyWordId.

KeyWords
----------
KeyWordId Int Identity(1,1)
KeyWord VarChar(200)

KeyFieldKeyWords
----------------
KeyField Int
KeyWordId Int

With 1 million keyfields having 50 keywords each, that's 50 million rows. There will be a HUGE difference in performance when that 50-million-row table has just 2 columns, each being an integer.
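
A rough T-SQL sketch of that layout (names and sizes are placeholders, and it assumes keyfields are, or map to, integer ids as shown above):

CREATE TABLE KeyWords (
    KeyWordId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    KeyWord   nvarchar(200) NOT NULL UNIQUE
);

CREATE TABLE KeyFieldKeyWords (
    KeyField  int NOT NULL,
    KeyWordId int NOT NULL REFERENCES KeyWords (KeyWordId),
    CONSTRAINT PK_KeyFieldKeyWords PRIMARY KEY (KeyField, KeyWordId)
);

-- A second index with the columns reversed supports keyword-first lookups:
CREATE NONCLUSTERED INDEX IX_KeyFieldKeyWords_KeyWordId
    ON KeyFieldKeyWords (KeyWordId, KeyField);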

G Mastros
This is how I've implemented it, and it appears to be the fastest way to store this sort of data in SQL Server.
gary