I have a table of items linked to a table of keywords. I want to list all items that share keywords and I want to rank the results so that items that are most similar appear highest. So, I figure items that match all keywords will come highest with items that match only one keyword appearing lower in the list.
Where items match the same number of keywords I want to then rank the results by the frequency of a keyword, so that items that share an unusual keyword will appear higher than items that share a keyword that appears frequently.
Currently my schema is like this (but this could be changed if necessary):
Item ItemKeywords Keywords
____ ____________ ________
ID ItemID ID
Item KeywordID Word
I'm struggling to even start with the sql to get the items that match most keywords. Can anyone help? I'm using Sql Server 2005
EDIT - (added for completeness)
Using Beth's SQL as a pointer I've come up with the following, which I think is working as I want, just got to do a little more testing:
SELECT IK1.ItemID,
IK2.ItemID,
COUNT(IK2.KeywordID) AS KeywordCount,
MIN(WordFrequency) AS WordFrequency
FROM ItemKeywords IK1
INNER JOIN ItemKeywords IK2
ON IK1.KeywordID = IK2.KeywordID --Keywords match
AND IK1.ItemID < IK2.ItemID --Eliminate pairs (Bill & Ted but not Ted and Bill or Bill and Bill)
INNER JOIN (SELECT KeywordID, COUNT(*) WordFrequency
FROM dbo.ItemKeywords
GROUP BY KeywordID) AS KF
ON KF.KeywordID = IK2.KeywordID --Join to sub-query to get frequency of word
GROUP BY IK1.ItemID, IK2.ItemID
ORDER BY COUNT(IK2.KeywordID) DESC, MIN(WordFrequency) ASC --Order by most matching keywords then by most unusual word
I've altered the syntax slightly to avoid the sub-query (doing everything on the join), I've changed IK1.ItemID <> IK2.ItemID
to IK1.ItemID < IK2.ItemID
to eliminate pairs, and I've added the second order by item to take into account the frequency that a word is used too. But essentially Beth answered the question so I will mark it as the answer (many thanks)