views:

419

answers:

3

I'm trying to implement a custom comparer on two lists of strings and use the .Except() linq method to get those that aren't one one of the lists. The reason I'm doing a custom comparer is because I need to do a "fuzzy" compare, i.e. one string on one list could be embedded inside a string on the other list.

I've made the following comparer

public class ItemFuzzyMatchComparer : IEqualityComparer<string>
{
    bool IEqualityComparer<string>.Equals(string x, string y)
    {
        return (x.Contains(y) || y.Contains(x));
    }

    int IEqualityComparer<string>.GetHashCode(string obj)
    {
        if (Object.ReferenceEquals(obj, null))
            return 0;
        return obj.GetHashCode();
    }
}

When I debug, the only breakpoint that hits is in the GetHashCode() method. The Equals() never gets touched. Any ideas?

+4  A: 

If all the hash codes returned are different, it never needs to compare for equality.

Basically the problem is that your hash and equality concepts are very different. I'm not entirely sure how you'd correct this, but until you've done so it certainly won't work.

You need to make sure that if Equals(a, b) returns true, then GetHashCode(a) == GetHashCode(b). (The reverse doesn't have to be true - hash collisions are acceptable, although obviously you want to have as few of them as possible.)

Jon Skeet
I'm starting to think this is a case of attempting to apply a custom compare on a collection of predefined objects (i.e. strings). If I were to have to collections of custom objects, then I could probably get it to work. I think I'll have to come up with a better way than that. :( I'll leave this as unanswered for a day to see if anyone else has a suggestion.
Joe
I ended up doing it in two steps. I generated a list of items that did partially match using a contains, then turned around and did an except using that subset against the first list. Thanks for your help.
Joe
+1  A: 

As Jon pointed out, you need to make sure that the hash-code of two strings that are equal (according to your comparison rule). This is unfortunatelly quite difficult.

To demonstrate the problem, Equals(str, "") returns true for all strings str, which essentially means that all strings are equal to an empty string and as a result, all strings must have the same hash-code as an empty string. Therefore, the only way to implement IEqualityComparer correctly is to return always the same hash-code:

public class ItemFuzzyMatchComparer : IEqualityComparer<string>  { 
  bool IEqualityComparer<string>.Equals(string x, string y)  { 
    return (x.Contains(y) || y.Contains(x)); 
  }  
  int IEqualityComparer<string>.GetHashCode(string obj)  { 
    if (Object.ReferenceEquals(obj, null)) return 0; 
    return 1; 
  } 
}

Then you can use the Except method and it will behave correctly. The only problem is that you'll (probably) get a pretty inefficient implementation, so if you needed better performance, you may have to implement your own Except. However, I'm not exactly sure how inefficient the LINQ implementation will be and I'm not sure if it is actually possible to have any efficient implementation for your comparison rule.

Tomas Petricek
A: 

Maybe this problem could be solved without the IEqualityComparer interface implementation. Jon and Thomas have good points about implementing that interface, and equality doesn't seem to define your problem. From your description, I think you could do this without using the Except extension during the compare. Instead, get the matches first, then do the Except. See if this does the job for you:

 List<String> listOne = new List<string>(){"hard", "fun", "code", "rocks"};
 List<String> listTwo = new List<string>(){"fund", "ode", "ard"};

 var fuzzyMatchList = from str in listOne
                      from sr2 in listTwo
                      where str.Contains(sr2) || sr2.Contains(str)
                      select str;
 var exceptList = listOne.Except(fuzzyMatchList);
Audie