views: 271
answers: 4

I have a dataset that contains a string key field and up to 50 keywords associated with that information. Once the data has been inserted into the database there will be very few writes (INSERTs); the workload will be mostly queries for one or more keywords.

I have read "Tagsystems: performance tests", which is MySQL based, and 2NF appears to be a good method for implementing this. However, I was wondering if anyone has experience doing this with SQL Server 2008 and very large datasets.

I am likely to have 1 million key fields initially, each of which could have up to 50 keywords.

Would a structure of

keyfield, keyword1, keyword2, ... , keyword50

be the best solution, or would two tables

keyid
keyfield
    | 1
    |
    | M
keyid
keyword

be a better idea if my queries are mostly going to be looking for results that have one or more keywords?

+2  A: 

As long as you have correct indexes, 50M rows isn't that much. I would just store it as

CREATE TABLE mytable (
    keyfield nvarchar(200),
    keyword nvarchar(200),
    CONSTRAINT PK_mytable PRIMARY KEY(keyfield, keyword)
)

and, of course, index the keyword column. If you never need to get all keywords for a keyfield, you can avoid the extra index by simply reversing the column order in the primary key.
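
For what it's worth, a minimal sketch of that extra index (the index name is just illustrative):

-- With the clustered primary key on (keyfield, keyword), a nonclustered
-- index on keyword alone can still return keyfield, because the clustering
-- key is carried as the row locator in every nonclustered index.
CREATE NONCLUSTERED INDEX IX_mytable_keyword ON mytable (keyword);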

Edit: I should not post when I'm too tired. This is the way.

erikkallen
But I have 50 keywords, not one, unless I've misunderstood your explanation.
gary
Sorry, my bad. Updated now.
erikkallen
+2  A: 

Normalized is probably your better bet, but only a simulated workload will tell for sure. You're comparing 50 increasingly sparse indexes of 1 million rows each vs. 1 index of 50 million rows. I suspect that if I were a genius at MS writing an algorithm to search one index, I would pick up the values I was looking for as I went along, in one pass.

But if there are 50 indexes, I'd have to scan 50 indexes.

Also, in the denormalized schema, the 1st column will have a high-quality index, but the 50th column will have low selectivity and will probably result in scans rather than index lookups.
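
To make the comparison concrete, the wide layout would need something like one nonclustered index per keyword column (the table and index names here are purely illustrative):

-- One index per keyword column on the hypothetical wide table, 50 in total...
CREATE NONCLUSTERED INDEX IX_wide_keyword1 ON widetable (keyword1);
CREATE NONCLUSTERED INDEX IX_wide_keyword2 ON widetable (keyword2);
-- ...and so on up to keyword50, versus a single index on the narrow table:
CREATE NONCLUSTERED INDEX IX_narrow_keyword ON mytable (keyword);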

MatthewMartin
+1 for the comment about selectivity, probably has the biggest impact
Rick
A: 

I can't imagine queries like

SELECT  keyfield FROM mytable
  WHERE keyword1 in (value1, value2, ...)
     OR keyword2 in (value1, value2, ...)
     OR keyword3 in (value1, value2, ...)
     ....
     OR keyword50 in (value1, value2, ...)

Your second option looks much better:

SELECT keyfield FROM mytable WHERE keyword in (value1, value2, ...)

You will want to experiment with indexes to get the best performance, but you will probably want just one index, on the keyword column.
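
If a search ever needs to match all of several keywords rather than any of them, one common pattern against that two-column layout is the following sketch (the literal values are placeholders):

-- Keyfields tagged with every keyword in the search set (3 of them here).
SELECT keyfield
FROM   mytable
WHERE  keyword in ('value1', 'value2', 'value3')
GROUP BY keyfield
HAVING COUNT(DISTINCT keyword) = 3;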

Lucky
+3  A: 

I would normalize a step further.

You should have a table of unique KeyWords with an integer primary key column. Then add an association table that holds KeyField and KeyWordId.

KeyWords
----------
KeyWordId Int Identity(1,1)
KeyWord VarChar(200)

KeyFieldKeyWords
----------------
KeyField Int
KeyWordId Int

With 1 million keyfields having 50 keywords each, that's 50 million rows. There will be a HUGE difference in performance when that 50-million-row table has just 2 columns, each being an integer.
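
A rough T-SQL sketch of that layout (names and sizes are placeholders, and it assumes keyfields are, or map to, integer ids as shown above):

CREATE TABLE KeyWords (
    KeyWordId int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    KeyWord   nvarchar(200) NOT NULL UNIQUE
);

CREATE TABLE KeyFieldKeyWords (
    KeyField  int NOT NULL,
    KeyWordId int NOT NULL REFERENCES KeyWords (KeyWordId),
    CONSTRAINT PK_KeyFieldKeyWords PRIMARY KEY (KeyField, KeyWordId)
);

-- A second index with the columns reversed supports keyword-first lookups:
CREATE NONCLUSTERED INDEX IX_KeyFieldKeyWords_KeyWordId
    ON KeyFieldKeyWords (KeyWordId, KeyField);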

G Mastros
This is how I've implemented it, and it appears to be the fastest way to store this sort of data in SQL Server.
gary