tags:

views:

32

answers:

1

I have thousands of large sets of tag cloud data; I can retrieve a weighted tag clouds for each set with a simple select/group statement (for example)

SELECT tag, COUNT( * ) AS weight
FROM tags
WHERE set_id = $set_id
GROUP BY tag
ORDER BY COUNT( * ) DESC

What I'd like to know is this -- what is the best way to compare weighted tag clouds and find other sets that are most similar, taking the weight (the number of occurrences within the set) into account and possibly even computing a comparison score, all in one somewhat effiecient statement?

I found the web to be lacking quality literature on the topic, thought it somewhat broadly relevant and tried to abstract my example to keep it generally applicable.

A: 

First you need to normalize every tag cloud like you would do for a vector, assuming that a tag cloud is a n-dimensional vector in which every dimension rapresents a word and its value rapresents the weight of the word.

You can do it by calculating the norm (or magnitude) of every cloud, that is the square root of all the weights squared:

m = sqrt( w1*w1 + w2*w2 + ... + wn*wn)

then you generate your normalized tag cloud by dividing each weight for the norm of the cloud.

After this you can easily calculate similarity by using a scalar product between the clouds, that is just multiply every component of each pair and all all of them together. Eg:

v1 = { a: 0.12, b: 0.31; c: 0.17; e:  0.11 }
v2 = { a: 0.21, b: 0.11; d: 0.08; e:  0.28 }

similarity = v1.a*v2.a + v1.b*v1.b + 0 + 0 + v1.e*v2.e

if a vector has a tag that the other one doesn't then that specific product is obviously 0.

This similarity in within range [0,1], 0 means no correlation while 1 means equality.

Jack
While the theory seems sound, I'm not sure how this would be implemented when comparing thousands of sets of tags on the fly, in one happy statement..
FelixHCat
Usually these intensive tasks are not needed to be real-time so you don't really need to be able to do them within MySQL, just get the clouds and work on them in an asynchronous way. Then store the results inside the DB.
Jack