views: 1748
answers: 10

input: phrase 1, phrase 2

output: semantic similarity value (between 0 and 1), or the probability these two phrases are talking about the same thing
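For concreteness, the requested interface is something like this (a hypothetical signature; the code examples below are illustrative Python sketches, not definitive implementations):

def semantic_similarity(phrase_1: str, phrase_2: str) -> float:
    """Return a score in [0, 1]: roughly, the probability that
    the two phrases are talking about the same thing."""
    raise NotImplementedError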

+1  A: 

This requires that your algorithm actually knows what you're talking about. It can be done in some rudimentary form just by comparing words and looking for synonyms, etc., but any sort of accurate result would require some form of intelligence.

Rik
A: 

As far as simple methods go,

return ((double)rand())/RAND_MAX; /* rand() and RAND_MAX come from <stdlib.h> */

is as good as any. More complex solutions are available, involving huge thesauruses and knowledge bases, but practical results are still not much better than the code above.

ima
Actually, { return 0.5 } is even better - they are either similar or not.
ima
+2  A: 

I would look into latent semantic indexing for this. I believe you can create something similar to a vector space search index, but with semantically related terms being closer together, i.e. having a smaller angle between them. If I learn more, I will post here.
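For illustration, a minimal LSI-style sketch using scikit-learn (the library choice, toy corpus, and component count are my assumptions, not part of this answer):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the cat sat on the mat",
    "a feline rested on a rug",
    "stock prices fell sharply today",
    "the market dropped at the opening bell",
]

# Build a term-document matrix, then project it into a low-rank
# "semantic" space where co-occurring terms fall on nearby dimensions.
tfidf = TfidfVectorizer().fit_transform(corpus)
lsi = TruncatedSVD(n_components=2).fit_transform(tfidf)

# With a realistic corpus, related phrases come out with a smaller
# angle (higher cosine) than unrelated ones.
print(cosine_similarity(lsi[0:1], lsi[1:2]))  # cat vs. feline
print(cosine_similarity(lsi[0:1], lsi[2:3]))  # cat vs. stocks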

jonfm
+1  A: 

Try the Levenshtein Distance:

http://en.wikipedia.org/wiki/Levenshtein_distance
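A minimal pure-Python sketch of the distance itself (the function name and test strings are mine):

def levenshtein(a, b):
    # prev[j] holds the edit distance between a[:i-1] and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3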

belwood
I don't think that will work, because short phrases of slightly different length show up with a huge edit distance. Maybe if you modified it to get the edit-distances between certain words within the phrase...
JosephStyons
Edit distance and semantic distance aren't related.
chaos
+10  A: 


You might want to check out this paper:

Sentence similarity based on semantic nets and corpus statistics

I've implemented the algorithm described. Our context was very general (effectively any two English sentences), and we found the approach too slow and the results, while promising, not good enough (nor likely to become so without considerable extra effort).

You don't give a lot of context, so I can't necessarily recommend this, but reading the paper could be useful in understanding how to tackle the problem.

Regards,

Matt.

Matt Mower
I've implemented the algorithm too; it's not good enough, but acceptable.
btw0
+1  A: 

One simple solution is to use the dot product of character n-gram vectors. This is robust to ordering changes (which many edit-distance metrics are not) and captures many issues around stemming. It also sidesteps the AI-complete problem of full semantic understanding.

To compute the n-gram vector, just pick a value of n (say, 3), and hash every 3-character sequence in the phrase into a vector. Normalize the vector to unit length, then take the dot product of different vectors to detect similarity.
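A self-contained Python sketch of this (using a dictionary of counts in place of hashing; the names are illustrative):

from collections import Counter
from math import sqrt

def ngram_vector(phrase, n=3):
    # Count every n-character window, then scale the vector to unit length.
    counts = Counter(phrase[i:i + n] for i in range(len(phrase) - n + 1))
    norm = sqrt(sum(c * c for c in counts.values())) or 1.0
    return {g: c / norm for g, c in counts.items()}

def ngram_similarity(p1, p2, n=3):
    v1, v2 = ngram_vector(p1, n), ngram_vector(p2, n)
    # Dot product of two unit vectors = cosine similarity, in [0, 1].
    return sum(w * v2.get(g, 0.0) for g, w in v1.items())

print(ngram_similarity("the quick brown fox", "the brown quick fox"))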

+2  A: 

Besides Levenshtein, there are other methods for determining string similarity. I found this page very useful back when I needed them.

marko
+1  A: 

You might want to check out the WordNet project at Princeton University. One possible approach would be to first run each phrase through a stop-word list (to remove "common" words such as "a", "to", "the", etc.). Then, for each of the remaining words in one phrase, you could compute the semantic "similarity" to each of the words in the other phrase, using a distance measure based on WordNet. The distance measure could be something like: the number of arcs you have to traverse in WordNet to get from word1 to word2.

Sorry this is pretty high-level. I've obviously never tried this. Just a quick thought.
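A rough sketch of the word-to-word step using NLTK's WordNet interface (assuming NLTK and its WordNet corpus are installed; the stop-word list is a stand-in):

from nltk.corpus import wordnet as wn

STOP_WORDS = {"a", "an", "to", "the", "of", "and", "or"}  # placeholder list

def word_similarity(w1, w2):
    # path_similarity is 1 / (1 + shortest path length in the WordNet
    # graph), i.e. a similarity version of the arc-counting distance
    # described above; take the best score over all sense pairs.
    best = 0.0
    for s1 in wn.synsets(w1):
        for s2 in wn.synsets(w2):
            sim = s1.path_similarity(s2)
            if sim is not None and sim > best:
                best = sim
    return best

def content_words(phrase):
    return [w for w in phrase.lower().split() if w not in STOP_WORDS]

print(word_similarity("chair", "seat"))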

luv2lrn
+9  A: 

There's a short and a long answer to this.

The short answer:

Use the WordNet::Similarity Perl package. If Perl is not your language of choice, check the WordNet project page at Princeton, or google for a wrapper library.

The long answer:

Determining word similarity is a complicated issue, and research is still very hot in this area. To compute similarity, you need an appropriate representation of the meaning of a word. But what would be a representation of the meaning of, say, 'chair'? In fact, what is the exact meaning of 'chair'? If you think long and hard about this, it will twist your mind, you will go slightly mad, and finally take up a research career in Philosophy or Computational Linguistics to find the truth™. Both philosophers and linguists have tried to come up with an answer for literally thousands of years, and there's no end in sight.

So, if you're interested in exploring this problem a little more in depth, I highly recommend reading Chapter 20.7 of Speech and Language Processing by Jurafsky and Martin, some of which is available through Google Books. It gives a very good overview of the state of the art in distributional methods, which use word co-occurrence statistics to define a measure of word similarity. You are not likely to find libraries implementing these, however.
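To get a feel for the core idea, here is a toy distributional sketch (mine, not from the chapter): each word is represented by the words it co-occurs with, and two words count as similar if they keep similar company.

from collections import Counter, defaultdict
from math import sqrt

def cooccurrence_vectors(sentences, window=2):
    # Represent each word by counts of the words seen within
    # `window` positions of it.
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for i, w in enumerate(words):
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    vectors[w][words[j]] += 1
    return vectors

def cosine(c1, c2):
    dot = sum(v * c2.get(k, 0) for k, v in c1.items())
    n1 = sqrt(sum(v * v for v in c1.values()))
    n2 = sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

vecs = cooccurrence_vectors(["the cat drinks milk", "the dog drinks water"])
print(cosine(vecs["cat"], vecs["dog"]))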

nfelger
+2  A: 

I would have a look at statistical techniques that take into consideration the probability of each word appearing within a sentence. This lets you give less importance to popular words such as 'and', 'or', 'the', and more importance to words that appear less regularly and are therefore a better discriminating factor. For example, if you have two sentences:

1) The Smith-Waterman algorithm gives you a similarity measure between two strings.

2) We have reviewed the Smith-Waterman algorithm and we found it to be good enough for our project.

The fact that the two sentences share the terms "Smith-Waterman" and "algorithm" (which are not as common as 'and', 'or', etc.) will allow you to say that the two sentences might indeed be talking about the same topic.
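One standard realization of this weighting is tf-idf combined with cosine similarity; a minimal sketch using scikit-learn (my choice of library, not something this answer specifies):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "The Smith-Waterman algorithm gives you a similarity measure between two strings.",
    "We have reviewed the Smith-Waterman algorithm and we found it to be good enough for our project.",
]

# Only shared terms contribute to the dot product, and on a realistic
# corpus idf keeps frequent filler words from inflating the score.
tfidf = TfidfVectorizer().fit_transform(sentences)
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])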

Summarizing, I would suggest you have a look at: 1) string similarity measures; 2) statistical methods.

Hope this helps.

Gia