Hello,

Part of a process I am working on requires applying string similarity algorithms.

The results will be stored in a dataset, let's call it SS_Dataset.

Based on this dataset, further decisions will have to be made.

My questions are:

  • Should I apply one or more string similarity algorithms to produce SS_Dataset?

  • Are there any comparisons between algorithms that calculate a 'distance' and those that calculate 'sounds like' similarity?

Does one family of algorithms produce more accurate results than the other? Does a combination give more accurate similarity results?

  • Can you recommend implementations that you have worked with?

My implementation will include packages from the following libraries:

http://www.dcs.shef.ac.uk/~sam/simmetrics.html

http://jtmt.sourceforge.net/

Regards,

A: 

Which is best depends entirely on what you're trying to do. Soundex and minimum edit distance (aka Levenshtein distance) are in broad use because they're easy to understand. Edit distance is good for catching typos in the input, and Soundex for misspellings that still sound like the intended word. I'm sorry I can't help beyond "you'll have to experiment yourself with how well those work for your particular purpose."
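To make the difference between the two families concrete, here is a minimal sketch in plain Python. It is not taken from either of the libraries you linked; the function names `levenshtein` and `soundex` are my own, and the Soundex shown is the common American variant, assuming plain ASCII names.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute, or match for free
        prev = curr
    return prev[-1]


# Consonant groups of American Soundex, mapped to the digits 1 to 6.
_CODES = {}
for digit, group in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in group:
        _CODES[ch] = str(digit)


def soundex(name: str) -> str:
    """Four-character phonetic key: first letter plus up to three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    last = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of same-coded consonants
        code = _CODES.get(ch, "")
        if code and code != last:
            result += code
        last = code  # a vowel resets the run, so a repeated code counts again
    return (result + "000")[:4]  # pad with zeros, keep four characters


if __name__ == "__main__":
    for x, y in [("Robert", "Rupert"), ("Smith", "Smyth"), ("Cat", "Bat")]:
        print(x, y, levenshtein(x, y), soundex(x), soundex(y))
```

Running it on a few pairs shows why you can't pick one in the abstract: "Robert" and "Rupert" are two edits apart but share a Soundex key, while "Cat" and "Bat" are one edit apart and get different keys. Which of those should count as a match is something only your data can tell you.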

redtuna
I know I have to experiment. Thank you for taking the time to reply, though. What I am trying to do is match records (in the abstract sense) of items from different resources. These records have nothing in common except for the NAME attribute. I need to minimize the chance of getting wrong matches, so I was thinking of applying multiple algorithms for both 'distance' and 'sounds-like' calculations. Cheers
andreas