views:

43

answers:

1

I have a table of items linked to a table of keywords. I want to list all items that share keywords and I want to rank the results so that items that are most similar appear highest. So, I figure items that match all keywords will come highest with items that match only one keyword appearing lower in the list.

Where items match the same number of keywords I want to then rank the results by the frequency of a keyword, so that items that share an unusual keyword will appear higher than items that share a keyword that appears frequently.

Currently my schema is like this (but this could be changed if necessary):

Item    ItemKeywords    Keywords
____    ____________    ________
ID       ItemID          ID
Item    KeywordID       Word

I'm struggling to even start with the sql to get the items that match most keywords. Can anyone help? I'm using Sql Server 2005

EDIT - (added for completeness)

Using Beth's SQL as a pointer I've come up with the following, which I think is working as I want, just got to do a little more testing:

   SELECT IK1.ItemID, 
      IK2.ItemID, 
      COUNT(IK2.KeywordID) AS KeywordCount, 
      MIN(WordFrequency) AS WordFrequency
    FROM ItemKeywords IK1
    INNER JOIN ItemKeywords IK2 
        ON IK1.KeywordID = IK2.KeywordID --Keywords match
        AND IK1.ItemID < IK2.ItemID --Eliminate pairs (Bill & Ted but not Ted and Bill or Bill and Bill)
    INNER JOIN (SELECT KeywordID, COUNT(*) WordFrequency 
                FROM dbo.ItemKeywords 
                GROUP BY KeywordID) AS KF 
        ON KF.KeywordID = IK2.KeywordID --Join to sub-query to get frequency of word
    GROUP BY IK1.ItemID, IK2.ItemID
    ORDER BY COUNT(IK2.KeywordID) DESC, MIN(WordFrequency) ASC  --Order by most matching keywords then by  most unusual word

I've altered the syntax slightly to avoid the sub-query (doing everything on the join), I've changed IK1.ItemID <> IK2.ItemID to IK1.ItemID < IK2.ItemID to eliminate pairs, and I've added the second order by item to take into account the frequency that a word is used too. But essentially Beth answered the question so I will mark it as the answer (many thanks)

+2  A: 
Select keywordID, count(itemID) as cnt_item
from itemKeywords
group by keywordID
order by count(itemID) desc

OK, understanding better what you don't want, try joining 2 instances of the itemKeyword table on the keywordID field where the itemID doesn't match:

select ik.itemID, ik1.itemID as itemID1, ik.keywordID
from itemkeyword ik inner join itemkeyword ik1 on ik.keywordID = ik1.keywordID
where ik.itemID <> ik1.itemID

then, with that set, count the number of keywordIDs and return the list in descending order:

select itemID, itemID1, count(keywordID) as count_kwd
from 
(    select ik.itemID, ik1.itemID as itemID1, ik.keywordID
    from itemkeyword ik inner join itemkeyword ik1 on ik.keywordID = ik1.keywordID
    where ik.itemID <> ik1.itemID
) t
group by itemID, itemID1
order by count(keywordID) desc
Beth
Thanks Beth, but this lists the keyword ids with the ones linked to the most items at the top. I want the items, with the ones most similar to other items at the top.
Colin
PS you forgot to put in "GROUP BY KeywordID"
Colin
Yup, that edit's the approach I would be taking, I think.
Matt Gibson