views:

190

answers:

2

What is the best algorithm to match or compute the distance between two strings in C# when the order or number of times a word appears is not important?

Best means:

  • Would mostly agree with a human match
  • Elegant
  • Efficient
  • Scalable, so that an input string could be matched to a potentially large collection of other strings

Related questions:

Some notes:

  • Because of the order and occurrence independence, the inputs can be thought of as sets of unique words, not strings in the sense of arrays of characters
  • Not specifically looking for a database solution, although one would be interesting
  • I'm way too old for this to be a homework problem ;)
+1  A: 

This looks like a canonical case to apply standard information retrieval algorithms. Cosine distance is what first comes to mind, but there might be better matches to your particular case. This is a good link to start digging on that route:

http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html

Implementation example:

http://stackoverflow.com/questions/520241/cosine-similarity-of-vectors

Vinko Vrsalovic
+1  A: 

Seach for a method called "Double Metaphone" which I beleive for word per word comparision it is the best available. Counts for different languages as well! queit amazing.

If comparing string maybe you can use this along with a cosine similarity. will yeild perfect results.

Marwan
+1 I'll check it out, thanks :)
Thomas Bratt