views:

349

answers:

3

I need to sort a List based on the difference between the strings in the list and a target string.

What's the best way of implementing this kind of sorting algorithm?

I don't care too much about performance but the collection could potentially become big (let's say half a million tops).

Any Help Appreciated!

+4  A: 

I would recommend calculating the Levenshtein distance and then simply ordering by the integer result. (Magic code)

public void Example()
{
    string target = "target";

    List<string> myStings = new List<string>();

    myStings.Add("this");
    myStings.Add("that");

    myStrings = myStings.OrderBy(each => Levenshtein(each, target)).ToList();
}

public int Levenshtein(string stringA, string stringB)
{
    // Magic goes here
    return 0;
}

Without OrderBy for the old skool 2.0 guys?

List<string> myStrings;
myStrings.Sort(LevenshteinCompare);
...

public class LevenshteinCompare: IComparer<string>
{
    public int Compare(string x, string y)
    {
        // Magic goes here
    }
}
Dead account
This looks what I am looking for. Beg your pardon for my ignorance - is the Levensthein a simple character based difference count (number of different characters)?
JohnIdol
Levenshtein is commonly use for spell checkers, so L("doog","dog") and L("dag", "dog") both = 1. The first being "add o" the second being "swop a for o"
Dead account
Just as a matter of interest - if I was implementing in .NET 2.0 how would you substitute the OrderBy + Lambda expression?
JohnIdol
nice stuff - thanks for helping
JohnIdol
I've got a problem with this - it compiles and build fine but I can't debug into the Levenshtein method (the breakpoint doesn't get hit) and it looks like it is not ordering! Any idea?
JohnIdol
Try using it manually, rather than using the OrderBy method. Literally call it int i = (new LevenshteinCompare()).Compare("This", "That");
Dead account
I wasn't calling ToList after the OrderBy! It's all good now (edited you code)
JohnIdol
+1  A: 

Just to make the Magic Code a little less magic take a look at:

http://www.merriampark.com/ld.htm

http://www.merriampark.com/ldcsharp.htm

Oliver
+1  A: 

What's the best way of implementing this kind of sorting algorithm?

Being tongue-in-cheek, I'd suggest using the library implementation of quicksort, with the distance to the target string as the sorting key.

That's of course not a helpful answer. Why not? Because what you really want to know is "What's a good difference metric for strings?"

The answer to the real qusetion is, sadly, "it depends"; it depends on which properties of the distance you care about.

That being said, read up on the Levenstein Distance and what it really says about the strings.

You can modify the basic algorithm to skew the metric in favor of identical characters occurring in long runs by fiddling with the weighting of different steps in the dynamic programming matrix.

You can also use the Soundex algorithm, which says something about which strings sound similar (but that works best for short strings; I don't know what kind of input you use).

If the strings are of equal length, you can also use the hamming distance (count the number of indexes where the strings differ). That can probably be generalized to something by counting (unilaterally) non-existing indexes as always different, which gives you something Levenstein-like (kinda' sorta' maybe).

The short version: it depends. I've given some input, but I can't say which is going to be a good decision for you without some more information from you.

Jonas Kölker
Thanks for your answer and the overview - the only thing I care about in this case is the number of different characters - so it looks like a subcase of the Levenstein Distance (for equally long strings) would do for me!
JohnIdol