I have a table which indexes the locations of words in a bunch of documents. I want to identify the most common bigrams in the set.

How would you do this in MSSQL 2008? The table has the following structure:

LocationID -> DocID -> WordID -> Location

I have thought about trying some kind of complicated join, but I'm just doing my head in.

Is there a simple way of doing this?

I think I'd better edit this on Monday in order to bump it up in the questions.

Sample Data

LocationID  DocID  WordID  Location
21952       534    27      155
21953       534    109     156
21954       534    4       157
21955       534    45      158
21956       534    37      159
21957       534    110     160
21958       534    70      161
A: 

It's been years since I've written SQL, so my syntax may be a bit off; however, I believe the logic is correct.

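-- The self-join pairs each word with the word at the next position in the same document.
-- index is a reserved word in SQL Server, so the table name is bracketed.
-- SQL Server 2008 has no CONCAT(), so the IDs are cast and joined with +.
SELECT CAST(i.WordID AS varchar(12)) + '|' + CAST(j.WordID AS varchar(12)) AS bigram,
       COUNT(*) AS freq
FROM [index] AS i
JOIN [index] AS j
  ON j.DocID = i.DocID
 AND j.Location = i.Location + 1
-- SQL Server cannot group by a SELECT alias, so group by the underlying columns.
GROUP BY i.WordID, j.WordID
ORDER BY freq DESC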

You can also add the actual word IDs to the select list if that's useful, and add a join to whatever table you've got that dereferences WordID to actual words.
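To check the self-join logic without a SQL Server instance, here is a rough sketch that runs the same idea against an in-memory SQLite database from Python. It is only an illustration under some assumptions: the table name `word_index` is made up (the real table name isn't given), the two extra rows for document 999 are hypothetical and exist only so that one bigram occurs twice, and the query uses SQLite's `||` concatenation rather than T-SQL syntax.

```python
import sqlite3

# Sample rows from the question: (LocationID, DocID, WordID, Location).
SAMPLE_ROWS = [
    (21952, 534, 27, 155),
    (21953, 534, 109, 156),
    (21954, 534, 4, 157),
    (21955, 534, 45, 158),
    (21956, 534, 37, 159),
    (21957, 534, 110, 160),
    (21958, 534, 70, 161),
]

# Hypothetical extra rows (NOT from the question): a second document that
# repeats the word pair 27 -> 109, so one bigram has a count above 1.
EXTRA_ROWS = [
    (90001, 999, 27, 1),
    (90002, 999, 109, 2),
]

# SQLite dialect: || concatenates values into a string.
# word_index is an assumed table name used only for this sketch.
BIGRAM_SQL = """
SELECT i.WordID || '|' || j.WordID AS bigram, COUNT(*) AS freq
FROM word_index AS i
JOIN word_index AS j
  ON j.DocID = i.DocID
 AND j.Location = i.Location + 1
GROUP BY i.WordID, j.WordID
ORDER BY freq DESC, bigram
"""


def count_bigrams(rows):
    """Load rows into an in-memory table and return (bigram, freq) pairs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE word_index ("
            "LocationID INTEGER PRIMARY KEY, DocID INTEGER, "
            "WordID INTEGER, Location INTEGER)"
        )
        conn.executemany("INSERT INTO word_index VALUES (?, ?, ?, ?)", rows)
        return conn.execute(BIGRAM_SQL).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for bigram, freq in count_bigrams(SAMPLE_ROWS + EXTRA_ROWS):
        print(bigram, freq)
```

With the question's seven rows alone, every adjacent pair occurs once, so all six bigrams get a count of 1. Adding the two hypothetical rows makes `27|109` the top result with a count of 2. The `DocID` condition in the join is what stops the last word of one document from pairing with the first word of another.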

Triptych
I would add a separator in the CONCAT; you don't want 12,3 to look the same as 1,23.
Osama ALASSIRY
@Osama - good point - added one in.
Triptych