views: 135 · answers: 7
Here's the problem: I have a few thousand small text snippets, anywhere from a few words to a few sentences; the largest snippet is about 2k on disk. I want to be able to compare each snippet to every other and calculate a relatedness factor so that I can show users related information.

What are some good ways to do this? Are there known algorithms for this that work well, are there any GPL'd solutions, etc.?

I don't need this to run in realtime, as I can precalculate everything. I'm more concerned with getting good results than runtime.

I just thought I would ask the Stack Overflow community before going and writing my own thing. There HAVE to be people out there who have found good solutions to this before.

+3  A: 

I've never used it, but you might want to look into Levenshtein distance.
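For reference, Levenshtein distance counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. A minimal dynamic-programming sketch (illustrative, not a library implementation):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]
```

Note that this measures spelling differences, not meaning, which is the limitation Bob points out below.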

Alex
Works well in many scenarios
Todd Stout
Levenshtein gives edit distance, not semantic differences.
Bob
+1  A: 

Jeff talked about something like this on the podcast (podcast 32), about how they find the Related questions listed on the right side here.

One big tip was to remove all common words, like "the", "and", "this", etc. This will leave you with more meaningful words to compare.

And here is a similar question http://stackoverflow.com/questions/62328/is-there-an-algorithm-that-tells-the-semantic-similarity-of-two-phrases
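The stop-word filtering step could be sketched like this (the stop-word list here is illustrative, not exhaustive, and the Jaccard overlap is just one simple way to score the remaining words):

```python
import re

STOP_WORDS = {"the", "and", "this", "a", "an", "in", "of", "to", "is", "it"}

def content_words(snippet: str) -> set[str]:
    """Lowercase, split on non-letters, and drop common filler words."""
    words = re.findall(r"[a-z']+", snippet.lower())
    return {w for w in words if w not in STOP_WORDS}

def overlap_score(a: str, b: str) -> float:
    """Jaccard overlap of the remaining content words."""
    wa, wb = content_words(a), content_words(b)
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)
```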

Bob
A: 

This book may be relevant.

Edit: here is a related SO question

Dima
Thank you. Information Retrieval is the general topic, and this book probably has good information in it.
Matt
+2  A: 

These articles on semantic relatedness and semantic similarity may be helpful. And this SO question about Latent Semantic Analysis.

You could also look into Soundex for words that "sound alike" phonetically.
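Latent semantic analysis starts from a term-document matrix and then applies SVD to reduce its dimensionality. Even without the SVD step, plain cosine similarity over term-count vectors already gives a usable relatedness score; a minimal sketch:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between term-frequency vectors of two snippets."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0
```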

jjclarkson
Thanks. Latent semantic analysis looks promising, I will have to read up and see about implementing it.
Matt
A: 

This is quite doable for reasonably large texts, but harder for smaller ones.

I did it once like this, and it worked pretty well:

  • Filter out all "general" words (like a, an, the, in, etc.); this removes about 10-30% of the words.
  • Count the frequencies of the remaining words and keep the top x most frequent; these are your topics.
  • As an extra step, you can form groups of 2/3/4 consecutive words and compare them with the groups in other texts. I used this as a measure for plagiarism.
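The word-group step above amounts to comparing word n-grams; a small sketch (n=3 is an arbitrary choice here):

```python
def word_ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """All runs of n consecutive words in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(a: str, b: str, n: int = 3) -> int:
    """Number of word n-grams the two texts have in common."""
    return len(word_ngrams(a, n) & word_ngrams(b, n))
```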
Henri
A: 

See Manning and Raghavan course notes about MinHashing and searching for similar items, and a C#(?) version. I believe the techniques come from Ullman and Motwani's research.
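MinHashing approximates the Jaccard similarity of two word sets: under each of many independent hash functions, two sets produce the same minimum hash value with probability equal to their Jaccard similarity. A small sketch (salting MD5 with a seed is one illustrative way to get a family of hash functions):

```python
import hashlib

def minhash_signature(words: set[str], num_hashes: int = 64) -> list[int]:
    """For each salted hash function, the minimum hash value over the set."""
    return [min(int(hashlib.md5(f"{seed}:{w}".encode()).hexdigest(), 16)
                for w in words)
            for seed in range(num_hashes)]

def estimated_jaccard(a: set[str], b: set[str], num_hashes: int = 64) -> float:
    """Fraction of hash functions on which the two signatures agree."""
    sa = minhash_signature(a, num_hashes)
    sb = minhash_signature(b, num_hashes)
    return sum(x == y for x, y in zip(sa, sb)) / num_hashes
```

The payoff is that signatures are small and fixed-size, so a few thousand snippets can be compared pairwise cheaply.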

Yuval F
A: 

Phonetic algorithms

The article, Beyond SoundEx - Functions for Fuzzy Searching in MS SQL Server, shows how to install the SimMetrics library in SQL Server and use it. This library lets you compute relative similarity between strings and includes numerous algorithms.

I ended up mostly using Jaro Winkler to match on names. Here's more information where I asked about matching names on SO: http://stackoverflow.com/questions/988050/matching-records-based-on-person-name

A few algorithms based on Levenshtein distance are also available in the SimMetrics library and would probably be useful in your application.
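As a taste of the phonetic approach, here is a sketch of classic American Soundex (first letter plus three digit codes; SimMetrics bundles this alongside the edit-distance family):

```python
def soundex(name: str) -> str:
    """Classic American Soundex code: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":   # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]
```

Names that "sound alike" get the same code, e.g. Robert and Rupert both map to R163.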

Even Mien