ansaurus

Question

Speed up text comparisons (feature vectors) with spatial MySQL features

Answer 1

+1 A:

In fact, you have only 75 * 74 / 2 = 2775 comparisons. You compare every word with the 74 others, but you don't need to compare word1 with word2 and then word2 with word1 again, so that halves the number of comparisons.
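As a quick check of that arithmetic, here is a minimal sketch that enumerates the unordered pairs. The name `n_items` is only illustrative, and the items are represented by their indices rather than by real words or texts.

```python
from itertools import combinations
from math import comb

n_items = 75  # number of items to compare, taken from the figure above

# Every unordered pair (i, j) with i < j appears exactly once, so an item is
# never compared with itself and no pair is compared twice.
pairs = list(combinations(range(n_items), 2))

print(len(pairs))        # 2775
print(comb(n_items, 2))  # same value from the closed form n * (n - 1) / 2
```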

Lukasz Lysik 2009-09-22 15:05:35

Thanks, that's right. :) But it's still a lot. And I don't compare words but texts.

2009-09-22 15:14:12

Answer 2

+1 A:

While R-Trees in general can index data with an arbitrary number of dimensions, MySQL's spatial abilities are limited to Geometry types (2 dimensions).

If your vectors are 2-dimensional and you can normalize them, then do the following:

1. Split the circle into twice the number of angles which fit your differences.
2. Find the MBR of vectors with the given cosine difference from the center of each sector.
3. Find all vectors within the MBR.
4. Do the fine filtering for the exact difference.
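The coarse-then-fine idea in these steps could be sketched roughly as follows. This is a simplification, not the exact procedure: it computes one MBR per query instead of precomputing one per sector, and a plain list scan stands in for the spatial index lookup. The names `arc_mbr`, `similar_vectors` and `unit` are hypothetical, and the vectors are assumed to be 2-dimensional and already normalized.

```python
import math

def unit(degrees):
    """Helper for the example: unit vector at the given angle."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))

def arc_mbr(theta, delta):
    """Bounding box (x0, y0, x1, y1) of the unit-circle arc [theta - delta, theta + delta]."""
    angles = [theta - delta, theta + delta]
    # The box is also stretched by any axis extreme (multiple of pi/2) inside the arc.
    k = math.ceil((theta - delta) / (math.pi / 2))
    while k * math.pi / 2 <= theta + delta:
        angles.append(k * math.pi / 2)
        k += 1
    xs = [math.cos(a) for a in angles]
    ys = [math.sin(a) for a in angles]
    return min(xs), min(ys), max(xs), max(ys)

def similar_vectors(points, query, min_cos, eps=1e-9):
    """Unit vectors whose cosine similarity to the query is at least min_cos."""
    theta = math.atan2(query[1], query[0])
    delta = math.acos(min_cos)  # largest allowed angle difference
    x0, y0, x1, y1 = arc_mbr(theta, delta)
    # Coarse filter: everything inside the MBR (the part an R-Tree would answer).
    coarse = [p for p in points
              if x0 - eps <= p[0] <= x1 + eps and y0 - eps <= p[1] <= y1 + eps]
    # Fine filter: exact cosine check on the surviving candidates.
    return [p for p in coarse if p[0] * query[0] + p[1] * query[1] >= min_cos]

points = [unit(d) for d in (0, 10, 20, 90, 180)]
matches = similar_vectors(points, unit(0), math.cos(math.radians(15)))
print(len(matches))  # 2: the vectors at 0 and 10 degrees
```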

In this case, however, it would be better just to precalculate the angle of each vector and index it with a plain B-Tree index.
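A minimal sketch of that angle-based alternative, assuming 2-dimensional vectors: a sorted Python list searched with `bisect` stands in for the B-Tree index, and the class name `AngleIndex` and its methods are hypothetical. A cosine similarity of at least c corresponds to an angle difference of at most acos(c), so the lookup becomes a range scan on the angle, split in two when the range wraps around 0.

```python
import bisect
import math

TWO_PI = 2 * math.pi

def unit(degrees):
    """Helper for the example: unit vector at the given angle."""
    r = math.radians(degrees)
    return (math.cos(r), math.sin(r))

def angle_of(vector):
    """Angle of a 2-D vector in [0, 2*pi]; this is the precalculated value."""
    x, y = vector
    return math.atan2(y, x) % TWO_PI

class AngleIndex:
    """Sorted (angle, row id) pairs; a stand-in for a B-Tree on an angle column."""

    def __init__(self, vectors):
        self.entries = sorted((angle_of(v), i) for i, v in enumerate(vectors))
        self.angles = [a for a, _ in self.entries]

    def _scan(self, lo, hi):
        """Row ids with lo <= angle <= hi, located by binary search."""
        start = bisect.bisect_left(self.angles, lo)
        stop = bisect.bisect_right(self.angles, hi)
        return [row for _, row in self.entries[start:stop]]

    def within(self, query, max_angle):
        """Row ids whose angle differs from the query's by at most max_angle."""
        if max_angle >= math.pi:
            return [row for _, row in self.entries]  # the whole circle qualifies
        theta = angle_of(query)
        lo, hi = theta - max_angle, theta + max_angle
        if lo < 0:  # range wraps below 0
            return self._scan(lo + TWO_PI, TWO_PI) + self._scan(0.0, hi)
        if hi > TWO_PI:  # range wraps above 2*pi
            return self._scan(lo, TWO_PI) + self._scan(0.0, hi - TWO_PI)
        return self._scan(lo, hi)

vectors = [unit(d) for d in (0, 10, 350, 90, 180)]
index = AngleIndex(vectors)
print(sorted(index.within(unit(5), math.radians(20))))  # [0, 1, 2]
```

The wrap-around case is the one detail a plain range condition on the angle would miss: a query near 0 degrees also has to match vectors near 360 degrees.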

Quassnoi 2009-09-22 15:06:35

I've added some details about my function and the vectors which the function takes. Do you think your approach is possible with these?

2009-09-22 15:30:54

Since your vectors are located on the surface of the orthotope, it would be possible if you had a fixed number of dimensions (that is, a fixed set of tokens) and `MySQL` were able to build an `R-Tree` on that number of dimensions. Since neither of these is possible, this solution is not viable either.

Quassnoi 2009-09-22 16:13:13

So I can forget about this approach with MySQL's spatial features and look for another way? No possibility?

2009-09-22 18:08:49

`@marco92w`: of course, never say "never", but for now I don't see a way to use `MySQL`'s spatial abilities here.

Quassnoi 2009-09-22 19:09:09

You can certainly implement an R-Tree in code, either extending MySQL or in PHP (or a PECL extension). It's not a minor endeavour, for sure.

Vinko Vrsalovic 2009-09-22 21:30:40

In fact, KD-Trees might be a better choice for this kind of scenario.

Vinko Vrsalovic 2009-09-22 21:39:40

`@Vinko Vrsalovic`: I'm not sure that `R-Trees` or `KD-trees` will suit this task at all, given that the number of dimensions is not known in advance.

Quassnoi 2009-09-22 22:08:54

R-Trees and KD-Trees are not in the list of spatial types I posted. Where do I find these types? And Quassnoi is right: there may be 1,000,000 possible features, since the number of words is increasing every day.

2009-09-23 18:55:02
