views:

2113

answers:

5

Hi

I have a generic with some filenames (LIST1) and another biggeneric with a full list of names (LIST2). I need to match names from LIST1 to similar ones in LIST2. For example

LIST1
- **MAIZE_SLIP_QUANTITY_3_9.1.aif**

LIST 2
1- TUTORIAL_FAILURE_CLINCH_4.1.aif
2- **MAIZE_SLIP_QUANTITY_3_5.1.aif**
3- **MAIZE_SLIP_QUANTITY_3_9.2.aif**
4- TUTORIAL_FAILURE_CLINCH_5.1.aif
5- TUTORIAL_FAILURE_CLINCH_6.1.aif
6- TUTORIAL_FAILURE_CLINCH_7.1.aif
7- TUTORIAL_FAILURE_CLINCH_8.1.aif
8- TUTORIAL_FAILURE_CLINCH_9.1.aif
9- TUTORIAL_FAILURE_PUSH_4.1.aif

I've read about Levenshtein distance and used an implementation of it in a Framework (SignumFramework Utilities). It returns me distance=1 in lines 2 and 3. But in my case line 3 is a better match than line 2.

Is there another method better to compare similar strings? Something more flexible?

+2  A: 

There was a simlar question here, maybe some of the answers there will be relevant?

Steve Haigh
+1. SO needs more people like you.
Dead account
+4  A: 

When comparing as strings, "9.2" is not a better match than "5.1" for "9.1". If you want the version numbers to be evaluated numerically, you have to parse the strings so that you can compare the string parts and the numerical parts separately.

Guffa
I would say it depends on how you define similarity. If you base similarity only on Levenshtein similarity this is correct. But how about a metric that defines two strings sharing the longest common substring to be most similar?
0xA3
(continued) Or having the most n-grams in common? There are many ways strings can be compared, however, there is probably no one-suits-all metric.
0xA3
A: 

Your similarity criteria could be a combination of several other criteria. One could be the Levenshtein distance, others might e.g be the longest common substring or prefix/suffix.

The longest common substring problem is actually a special case of edit distance, when substitutions are forbidden and only exact character match, insert, and delete are allowable edit operations (see here).

Further metrics for string similarity are described here.

0xA3
A: 

A regular expression could be used to get the items that match the name. The version number could be collected in a regex group in the match and parsed into a .net object (e.g. decimal) that you could use to compare which one was closest.

Stevo3000
A: 

There's a fairly exhaustive set of answers to this SO question. At the bottom is link I put up to C# implementations for soundex, double metaphone, PHP similarity and levenstein.

plinth