ansaurus

Question

How do I compare phrases for similarity?

Answer 1

+7 A:

One approach is the so-called bag-of-words model.

As you guessed, first you count how many times each word appears in the text (usually called a document in NLP lingo). Then you throw out the so-called stop words, such as "the", "a", "or" and so on.

You're left with words and word counts. Do this across enough documents and you get a comprehensive set of the words that appear in them. You can then create an index for these words: "aardvark" is 1, "apple" is 2, ..., "z-index" is 70092.

Now you can take your word bags and turn them into vectors. For example, if your document contains two occurrences of "aardvark" and nothing else, it would look like this:

[2 0 0 ... 70k zeroes ... 0].
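A minimal Python sketch of the indexing and vector steps described so far. The stop-word list, the regex tokenizer and the function names are assumptions made for illustration; positions here are 0-based, whereas the example above counts from 1.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "or", "and", "of", "to", "in", "is"}

def tokenize(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]

def build_index(documents):
    """Assign each distinct word a position, in alphabetical order."""
    vocabulary = sorted({w for doc in documents for w in tokenize(doc)})
    return {word: position for position, word in enumerate(vocabulary)}

def to_vector(text, index):
    """Turn a document into a dense vector of word counts."""
    vector = [0] * len(index)
    for word, count in Counter(tokenize(text)).items():
        if word in index:  # words outside the index are ignored
            vector[index[word]] = count
    return vector

documents = ["The aardvark ate an apple", "aardvark aardvark"]
index = build_index(documents)
aardvark_vector = to_vector("aardvark aardvark", index)
```

With a realistic vocabulary these vectors are almost entirely zeroes, so in practice you would likely keep only the non-zero counts (for example, the Counter itself) rather than a full list.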

After this you can compute the "angle" between two such vectors: the dot product divided by the product of the vectors' lengths gives the cosine of the angle. The smaller the angle (the closer the cosine is to 1), the closer the documents are.

This is a simple version and there are other, more advanced techniques. May the Wikipedia be with you.
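As a rough sketch, the angle comparison could look like this in Python, with the vectors given as plain lists of counts. The function names are my own, not from any library.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_u * norm_v)

def angle_degrees(u, v):
    """Angle between the vectors; smaller means more similar documents."""
    cosine = max(-1.0, min(1.0, cosine_similarity(u, v)))  # guard against rounding
    return math.degrees(math.acos(cosine))
```

Two documents with the same word proportions give a cosine of 1 (angle 0) regardless of their length, while documents with no words in common give a cosine of 0 (angle 90 degrees).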

Antti Rasinen 2008-09-16 09:18:01

Answer 2

+1 A:

From my (rather small) experience developing full-text search engines: I would look up questions that contain some of the words from the query (in your case, the query is your question). Of course, noise words should be ignored, and we might want to check the query for 'strong' words like 'ASP.Net' to narrow down the search scope. Inverted indexes (http://en.wikipedia.org/wiki/Index_(search_engine)#Inverted_indices) are commonly used to find the questions containing the words we are interested in.

After finding the questions that contain words from the query, we might want to calculate the distance between those words within each question, so that a question with the text 'phrases similarity' ranks higher than a question with the text 'discussing similarity, you hear following phrases...'.
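To make the inverted index idea concrete, here is a small Python sketch. The noise-word list, the tokenizer and the ranking by number of matching words are assumptions for illustration, not how any particular search engine does it.

```python
import re
from collections import defaultdict

# Illustrative noise-word list (an assumption; tune it for your data).
NOISE_WORDS = {"the", "a", "an", "or", "and", "of", "to", "in", "is",
               "how", "do", "i", "for", "what"}

def tokenize(text):
    """Lowercase, split into words and drop noise words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in NOISE_WORDS]

def build_inverted_index(questions):
    """Map each word to the set of question ids that contain it."""
    index = defaultdict(set)
    for question_id, text in enumerate(questions):
        for word in tokenize(text):
            index[word].add(question_id)
    return index

def candidates(index, query):
    """Question ids sharing words with the query, best match first."""
    hits = defaultdict(int)
    for word in set(tokenize(query)):
        for question_id in index.get(word, ()):
            hits[question_id] += 1
    # More shared words first; ties broken by question id.
    return sorted(hits, key=lambda qid: (-hits[qid], qid))

questions = [
    "How do I compare phrases for similarity?",
    "What is the space complexity of quicksort?",
    "Comparing strings in Java",
]
index = build_inverted_index(questions)
matches = candidates(index, "compare space similarity")
```

The lookup only touches the index entries for the query words, so it never has to scan every question.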
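One possible way to measure that distance is the length of the shortest run of words that contains every query word. This is a sketch under that assumption; the function name and tokenizer are made up for the example, and other proximity measures would work too.

```python
import re

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def smallest_window(text, query):
    """Length in words of the shortest span of text containing every
    query word, or None if some query word is missing."""
    wanted = set(tokenize(query))
    if not wanted:
        return None
    words = tokenize(text)
    counts = {}  # query words inside the current window -> occurrences
    best = None
    left = 0
    for right, word in enumerate(words):
        if word in wanted:
            counts[word] = counts.get(word, 0) + 1
        # Shrink from the left while the window still covers every query word.
        while len(counts) == len(wanted):
            width = right - left + 1
            if best is None or width < best:
                best = width
            left_word = words[left]
            if left_word in counts:
                counts[left_word] -= 1
                if counts[left_word] == 0:
                    del counts[left_word]
            left += 1
    return best

close = smallest_window("phrases similarity", "phrases similarity")
far = smallest_window("discussing similarity, you hear following phrases",
                      "phrases similarity")
```

Sorting the candidate questions by this window size (smallest first) puts the first text ahead of the second.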

Sergey Volegov 2008-09-16 09:26:40

Answer 3

+1 A:

@Hanno you should try the Levenshtein distance algorithm. Given an input string s and a list of strings t, iterate over each string u in t and return the one with the minimum Levenshtein distance to s.

http://en.wikipedia.org/wiki/Levenshtein_distance

See a Java implementation example at http://www.javalobby.org/java/forums/t15908.html
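For comparison, a short Python sketch of the same idea, using the standard dynamic-programming formulation with two rows. The helper name closest is an assumption for this example.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, s_char in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to ""
        for j, t_char in enumerate(t, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def closest(s, candidates):
    """Return the candidate with the smallest Levenshtein distance to s."""
    return min(candidates, key=lambda u: levenshtein(s, u))
```

Note that this compares characters, so it is better suited to catching typos and near-identical strings than to judging whether two differently worded phrases mean the same thing.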

smink 2008-09-16 10:11:25

Answer 4

+1 A:

To augment the bag-of-words idea:

There are a few ways you can also pay some attention to n-grams, that is, strings of two or more words kept in order. You might want to do this because a search for "space complexity" is more than a search for things with "space" AND "complexity" in them: the meaning of the phrase is more than the sum of its parts. If you get a result that talks about the complexity of outer space and the universe, it is probably not what the search for "space complexity" really meant.
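A small Python sketch of extracting n-grams and checking whether a phrase appears with its words in order. The function names and the simple regex tokenizer are assumptions for illustration.

```python
import re

def ngrams(text, n=2):
    """All runs of n consecutive words, in order, as tuples."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def contains_phrase(text, phrase):
    """True if the words of phrase occur in text adjacent and in order."""
    target = tuple(re.findall(r"[a-z0-9]+", phrase.lower()))
    if not target:
        return False
    return target in ngrams(text, len(target))

on_topic = contains_phrase("The space complexity of merge sort", "space complexity")
off_topic = contains_phrase("The complexity of outer space", "space complexity")
```

Both texts contain "space" and "complexity", but only the first contains the bigram, which is the distinction the paragraph above is after.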

A key idea from natural language processing here is that of mutual information, which allows you (algorithmically) to judge whether or not a phrase is really a specific phrase (such as "space complexity") or just words which are coincidentally adjacent. Mathematically, the main idea is to ask, probabilistically, if these words appear next to each other more often than you would guess by their frequencies alone. If you see a phrase with a high mutual information score in your search query (or while indexing), you can get better results by trying to keep these words in sequence.
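One common way to put a number on this is pointwise mutual information (PMI) over adjacent word pairs: log2 of P(x, y) / (P(x) P(y)). The sketch below estimates the probabilities from raw counts; the function name and the toy corpus are assumptions, and with so few counts the scores are only illustrative.

```python
import math
import re
from collections import Counter

def pmi_scores(documents):
    """PMI for every adjacent word pair seen in the documents."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in documents:
        words = re.findall(r"[a-z0-9]+", doc.lower())
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    total_words = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    scores = {}
    for (first, second), count in bigrams.items():
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_words
        p_second = unigrams[second] / total_words
        # Positive: the pair occurs more often than chance would suggest.
        scores[(first, second)] = math.log2(p_pair / (p_first * p_second))
    return scores

corpus = ["space complexity", "space complexity", "outer space", "time complexity"]
scores = pmi_scores(corpus)
```

A score above zero means the pair shows up together more often than its individual word frequencies would predict. PMI tends to overrate pairs built from very rare words, so it is usually combined with a minimum count threshold.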

Tyler 2008-09-16 13:08:38
