ansaurus

Question

Compare large sets of weighted tag clouds?

Answer 1

A:

First you need to normalize every tag cloud as you would a vector, treating a tag cloud as an n-dimensional vector in which every dimension represents a word and its value represents the weight of that word.

You can do it by calculating the norm (or magnitude) of every cloud, that is, the square root of the sum of all the squared weights:

m = sqrt( w1*w1 + w2*w2 + ... + wn*wn)

Then you generate your normalized tag cloud by dividing each weight by the norm of the cloud.
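As a minimal sketch of that normalization step, assuming each cloud is held as a plain Python dict mapping tag to weight (the `normalize` name and the sample cloud are made up for illustration):

```python
import math

def normalize(cloud):
    """Return a copy of the cloud scaled to unit length."""
    # Norm = square root of the sum of the squared weights.
    norm = math.sqrt(sum(w * w for w in cloud.values()))
    if norm == 0:
        # An empty or all-zero cloud cannot be scaled; return it unchanged.
        return dict(cloud)
    return {tag: w / norm for tag, w in cloud.items()}

# Hypothetical cloud: the norm is sqrt(3*3 + 4*4) = 5.
cloud = {"a": 3.0, "b": 4.0}
unit = normalize(cloud)  # weights become 0.6 and 0.8
```

After this step the squared weights of every cloud sum to 1, which is what keeps the later scalar product bounded.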

After this you can easily calculate similarity by using a scalar product between the clouds: just multiply the components of each pair and add all of the products together. E.g.:

v1 = { a: 0.12, b: 0.31, c: 0.17, e: 0.11 }
v2 = { a: 0.21, b: 0.11, d: 0.08, e: 0.28 }

similarity = v1.a*v2.a + v1.b*v2.b + 0 + 0 + v1.e*v2.e

If one vector has a tag that the other one doesn't, then that specific product is simply 0.

This similarity is within the range [0,1] (given non-negative weights): 0 means the clouds share no tags, while 1 means they are identical once normalized.
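A short sketch of the whole comparison, again assuming clouds are plain dicts of tag to weight. The `cosine_similarity` name is an assumption, and dividing the scalar product by both norms is equivalent to normalizing each cloud first:

```python
import math

def cosine_similarity(a, b):
    """Scalar product of the two clouds after scaling each to unit length."""
    # Tags missing from either cloud contribute 0, so only shared tags matter.
    dot = sum(w * b[tag] for tag, w in a.items() if tag in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty cloud is treated as unrelated to everything
    return dot / (norm_a * norm_b)

# The two example clouds from above (raw weights, not yet normalized).
v1 = {"a": 0.12, "b": 0.31, "c": 0.17, "e": 0.11}
v2 = {"a": 0.21, "b": 0.11, "d": 0.08, "e": 0.28}

sim = cosine_similarity(v1, v2)       # roughly 0.62 for these weights
same = cosine_similarity(v1, v1)      # a cloud compared with itself gives 1
disjoint = cosine_similarity({"x": 1.0}, {"y": 1.0})  # no shared tags gives 0
```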

Jack 2010-06-19 16:21:59

While the theory seems sound, I'm not sure how this would be implemented when comparing thousands of sets of tags on the fly, in one happy statement.

FelixHCat 2010-06-19 16:41:29

Usually these intensive tasks don't need to be real-time, so you don't really need to be able to do them within MySQL. Just fetch the clouds, work on them asynchronously, and then store the results in the DB.

Jack 2010-06-19 16:43:07
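One way the offline approach from the last comment might look, as a rough sketch only. It assumes the clouds have already been loaded into a dict keyed by cloud id, uses an in-memory SQLite database as a stand-in for the real MySQL database, and the `cloud_similarity` table and `store_similarities` function are hypothetical names:

```python
import itertools
import math
import sqlite3

def cosine_similarity(a, b):
    """Scalar product of two tag clouds after scaling each to unit length."""
    dot = sum(w * b[tag] for tag, w in a.items() if tag in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def store_similarities(conn, clouds):
    """Compute every pairwise similarity once and save it for later lookups."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cloud_similarity ("
        "id_a INTEGER, id_b INTEGER, similarity REAL, "
        "PRIMARY KEY (id_a, id_b))"
    )
    # Each unordered pair is computed once, with the smaller id stored first.
    rows = [
        (id_a, id_b, cosine_similarity(clouds[id_a], clouds[id_b]))
        for id_a, id_b in itertools.combinations(sorted(clouds), 2)
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO cloud_similarity VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)

# Hypothetical data: cloud id -> {tag: weight}.
clouds = {
    1: {"a": 1.0, "b": 2.0},
    2: {"a": 2.0, "b": 4.0},
    3: {"c": 5.0},
}
conn = sqlite3.connect(":memory:")  # stand-in for the real database
stored = store_similarities(conn, clouds)
```

Comparing all pairs grows quadratically with the number of clouds, which is why a background job fits better than a single query at request time.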
