views:

163

answers:

2

I have a function which takes two arrays containing the tokens/words of two texts and gives out the cosine similarity value which shows the relationship between both texts.

The function takes an array $tokensA (0=>house, 1=>bike, 2=>man) and an array $tokensB (0=>bike, 1=>house, 2=>car) and calculates the similarity which is given back as a floating point value.

function cosineSimilarity($tokensA, $tokensB) {
 $a = $b = $c = 0;
 $uniqueTokensA = $uniqueTokensB = array();
 $uniqueMergedTokens = array_unique(array_merge($tokensA, $tokensB));
 foreach ($tokensA as $token) $uniqueTokensA[$token] = 0;
 foreach ($tokensB as $token) $uniqueTokensB[$token] = 0;
 foreach ($uniqueMergedTokens as $token) {
  $x = isset($uniqueTokensA[$token]) ? 1 : 0;
  $y = isset($uniqueTokensB[$token]) ? 1 : 0;
  $a += $x * $y;
  $b += $x;
  $c += $y;
 }
 return $b * $c != 0 ? $a / sqrt($b * $c) : 0;
}

If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.

Is it possible to use MySQL's spatial columns to reduce the number of comparisons?

I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.

MySQL's spatial columns

  • You create spatial columns with: CREATE TABLE abc (clmnName TYPE)
  • possible types are listed here
  • here is how I select the data later [e.g. MultiPointFromText() or AsText()]
  • You insert values like this: INSERT INTO clmnName VALUES (GeomFromText('POINT(1 1)'))

But how do you use this for my problem?

PS: I'm looking for ways to reduce the number of comparisons with algorithms in this question. Vinko Vrsalovic told me that I should open another question for the spatial features.

+1  A: 

In fact you have only 75 * 74 / 2 = 2775 comparisons. You compare every word with 74 others, but you don't need to compare word1 with word2 and again word2 with word1. So it gives half of comparisons less.

Lukasz Lysik
Thanks, that's right. :) But it's still a lot. And I don't compare words but texts.
+1  A: 

While R-Trees in general can index data with arbitrary number of dimensions, MySQL spatial abilities are only limited to Geometry types (2 dimensions).

If your vectors are 2-dimensional and you can normalize them, then do the following:

  • Split the circle into twice the number of angles which fit your differences
  • Find the MBR of vectors with given cosine difference from the center of each sector
  • Find all vectors within the MBR
  • Do the fine filtering for exact difference.

In this case, however, it will be better just to precaculate the angle of the value and index it with a plain B-Tree index.

Quassnoi
I've added some details about my function and the vectors which the function takes. Do you think your approach is possible with these?
Since your vectors are located on the surface of the orthotope, it would be possible if you had a fixed number of dimensions (that is a fixed set of tokens) and `MySQL` would be able to build an `R-Tree` on this number of dimensions. Since neither of these is possible, this solution is not viable too.
Quassnoi
So I can forget about this approach with MySQL's spatial features and look for another way? No possibility?
`@marco92w`: of course never say "never" but as for now I don't see a way to use `MySQL`'s spatial abilities here.
Quassnoi
You can certainly implement an R-Tree in code, either extending MySQL or in PHP (or a PECL extension). It's not a minor endeavour, for sure.
Vinko Vrsalovic
In fact KD Trees might be a better choice for this kind of scenario
Vinko Vrsalovic
`@Vinko Vrsalovic`: I'm not sure that `R-Trees` or `KD-trees` will suit this task at all, given that the number of dimensions is not known in advance.
Quassnoi
R-Trees and KD-Trees are not in the list of spatial types I posted. Where do I find these types? And Quassnoi is right: Maybe there are 1,000,000 possible features since the amount of words is increasing every day.