views:

235

answers:

2

I'm looking for the optimal solution for keyword matching between different records in the database. It's a classic problem, I've found similar questions, but nothing concretely.

I've done it with full text searches, joins and subqueries, temp tables, ... so i'd really like to see how you guys are solving such a common problem.

So, let's say I have two tables; Products and Keywords and they are linked with the third table, Products_Keywords in a classic many-to-many relationship.

If I show one Product record on the page and would like to show top n related products, what would be the best option?

We should take into account that records might share several keywords and this fact should determine the ordering of the top related product.

I'm open for other ideas as well, but T-SQL would be preferable solution due to the performance reasons.

A: 

Well maybe something like the follwing:

select p.productId, p.name, r.rank
from products p inner join (
/* this inner select should bring in only products that have at least one keyword
=> shared with the requested product, and will count the actual number shared (for ranking)*/
    select related.productId, count(related.productId) as rank
    from
     products_keywords related inner join 
     products_keywords pk ON (pk.productId = @productId  AND related.keywordId = pk.keywordId)
    where related.productId <> @productId
    group by related.productId
) r on p.productId = r.productId
order by r.rank DESC /* added DESC (not in orignal solution, but needed to put higher ranked on top)*/

Now I seriously doubt that's an optimal sql statement, but it should get the job done. I can't verify it though since I just wrote it from scratch with no actual backing tables, or data to test against.

Josh
if you wanted weighted keywords for above and you choose your weights to be a number between 0 > x >= 1, you can set the rank to (count(related.productId)*(sum(pk.weight)/sum(pk.weight) + 1)) as rankor [C * (w/w+1) = rank] where C is the count and w is the summed weight.
Josh
+3  A: 

My first shot would be something like:

SELECT
    P.product_id,
    COUNT(*)
FROM
    Product_Keywords PK1
INNER JOIN Product_Keywords PK2 ON
    PK2.keyword_id = PK1.keyword_id
INNER JOIN Products P ON
    P.product_id = PK.product_id
WHERE
    PK1.product_id = @product_id
GROUP BY
    P.product_id
ORDER BY
    COUNT(*) DESC

The join of Product_Keywords to Product_Keywords (PK2 to PK1) might be rough, so I can't speak to performance. This is where I would start though and then look at optimization.

One thing to consider, as a follow-up to Assaf's comment, is that you could add a "weight" to the Product_Keywords and SUM(PK1.weight) + SUM(PK2.weight) for ranking. Just a thought.

EDIT: To elaborate on the weighting... you may decide that you want to allow keywords to be weighted. The actual method used to determine the weighting would be a business decision though, so I can't really give you too much guidance there.

As an example though, this question is about "programming", "keyword matching", and "SQL". Programming is pretty generic, so if two questions had that in common it still might not mean that they are that related so maybe you only weight it as 1. SQL is a little more specific, so that you might weight as a 5. Keyword matching is both the main focus of the question AND it's pretty specific, so you might weight that with a 10.

This is just an example of course and as I said, the exact determination of the weights as well as how you score it are dependent on the specific business. You might decide that matching the number of keywords is more important than the weights so maybe the weighting is only used as a tie-breaker, etc. HTH.

Tom H.
Can you please elaborate a little bit about the weight concept? How should I weight keywords?
muerte