tags:

views:

226

answers:

2

I'm aware of SOUNDEX and (double) Metaphone, but these don't let me test for the similarity of words as a whole - for example "Hi" sounds very similar to "Bye", but both of these methods will mark them as completely different.

Are there any libraries in Ruby, or any methods you know of, that are capable of determining the similarity between two words? (Either a boolean is/isn't similar, or numerical 40% similar)

edit: Extra bonus points if there is an easy method to 'drop in' a different dialect or language!

A: 

You might first preprocess the words using a thesaurus database, which will convert words with similar meaning to the same word. There are various thesaurus databases out there, unfortunately I couldn't find a decent free one for English ( http://www.gutenberg.org/etext/3202 is the one I found, but this doesn't show what relations the specific words have (like similar; opposite; alternate meaning; etc.), so all words on the same line have some relation, but you won't know what that relation is )

But for example for Hungarian there is a good free thesaurus database, but you don't have soundex/metaphone for hungarian texts...

If you have the database writing a program that preprocesses the texts isn't too hard (ultimately it's a simple search-replace, but you might want to preprocess the thesaurus database using simplex or methaphone too)

SztupY
Meaning isn't important here so, I'd want "hi" and "high" to score 100% similarity, "hi", "bye" and "die" to score close to 100% between each, but "encephalography" and "teacup" to be 0%. converting via a thesaurus will confuse the issue I think!
JP
+2  A: 

I think you're describing levenshtein distance. And yes, there are gems for that. If you're into pure ruby go for the text gem.

$ gem install text

The docs have more details, but here's the crux of it:

Text::Levenshtein.distance('test', 'test')    # => 0
Text::Levenshtein.distance('test', 'tent')    # => 1

If you're ok with native extensions...

$ gem install levenshtein

It's usage is similar. It's performance is very good. (It handles ~1000 spelling corrections per minute on my systems.)

If you need to know how similar two words are, use distance over word length.

If you want a simple similarity test, consider something like this:

Untested...but straight forward:

String.module_eval do
   def similar?(other, threshold=2)
    distance = Text::Levenshtein.distance(self, other)
    distance <= threshold
  end
end
Levi
I should also mention, levenshtein distance doesn't care what language you're using. Wikipedia can provide good details here: http://en.wikipedia.org/wiki/Levenshtein_distance
Levi
You could get really fancy, and calculate thresholds based on the size of the input strings. If the word is short (ie search terms) you probably want low thresholds.
Levi
Wow! This is fantastic! I'm trying to compile a list of (dictionary) words and their uniqueness by pronunciation (in a given dialect). Seeings as its all relative I think I'll iterate through every combination of words, sum each word's distance from every other, then divide by the maximum in the list. The goal is to make a URL 'shortener' that will make distinct, easy to remember vocal URLs. I might try to use the IPA representations of words in each dialect as well ("Grass" and "Pasta", American are similar and 'Southern' English accent are relatively different)
JP
Ooh, of course this doesn't take into account the variable distance between two characters; having converted to IPA: `smile -> snail => smaɪl -> sneɪl` would have a small Levenshtein distance, where `gait -> late => /geɪt/ -> /leɪt/` would have the same distance, but snail and smile sound more similar than gait and late, due to the phonetic differences between the `g` and `l` and similarity between `n` and `m` (Imagine words like 'fan' and 'fang'). Assuming I could create an accurate numerical map of the cost of conversions between various sounds, could this be imported into ruby Leveinshtein?
JP