views:

115

answers:

2

I have a function which takes two strings and gives out the cosine similarity value which shows the relationship between both texts.

If I want to compare 75 texts with each other, I need to make 5,625 single comparisons to have all texts compared with each other.

Is there a way to reduce this number of comparisons? For example sparse matrices or k-means?

I don't want to talk about my function or about ways to compare texts. Just about reducing the number of comparisons.

A: 

If your algorithm is pair-wise, then you probably can't reduce the number of comparisons, by definition.

You'll need to use a different algorithm, or at the very least pre-process your input if you want to reduce the number of comparisons.

Without the details of your function, it's difficult to give any concrete help.

Ben S
My function calculates the cosine similarity. It takes two arrays containing the tokens/words of the texts. I think you can only calculate cosine similarity pair-wise so you can't reduce the number of comparisons for cosine similarity, right?
Yes, but if you're interested only in certain data, you might be able to avoid doing some comparisons, like Vinko mentioned for similar strings.
Ben S
A: 

What Ben says it's true, to get better help you need to tell us what's the goal.

For example, one possible optimization if you want to find similar strings is storing the string vectors in a spatial data structure such as a quadtree, where you can outright discard the vectors that are too far away from each other, avoiding many comparisons.

Vinko Vrsalovic
Yes, I want to find similar strings. More details are in my comment on Ben's answer. My database (MySQL) seems to have these spatial types: http://dev.mysql.com/doc/refman/5.0/en/mysql-spatial-datatypes.html Theres's nothing about a quadtree!?
Many kinds of spatial indexes can serve you well. Read about those available form MySQL.
Vinko Vrsalovic
I've read a lot now about these spatial features. I've added a paragraph about it to my question. Can you give me further help?
I suggest you to open a different question about how to use mysql spatial extensions for your use case and to leave this question as it was in case somebody has a better algorithm to compare your strings.
Vinko Vrsalovic
Good idea :) The question about MySQL's spatial features is now here: http://stackoverflow.com/questions/1460618/speed-up-text-comparisons-feature-vectors-with-spatial-mysql-features So in this question here, I'm looking for algorithms which can help.