ansaurus

Question

How to compare list of X to list of Y in C# by using generics?

Answer 1

+3 A:

If you want a reusable answer, not specific to class X and class Y, you'll need reflection. Take a look at Type.GetProperty and PropertyInfo.GetGetMethod.

EDIT: I seem to be getting downvotes from people who aren't familiar with reflection, so I'll add some example source code:

static class PropertyGetter<X>
{
  private static readonly Dictionary<string, Converter<X, object>> cached;

  public Converter<X, object> this[string propertyName]
  {
    get {
      Converter<X, object> result;
      lock (this) if (!cached.TryGetValue(propertyName, out result)) {
        PropertyInfo pi = typeof(X).GetProperty(propertyName, true);
        if (pi == null) throw new ArgumentException("Type " + typeof(X).Name + " has no property named " + propertyName, propertyName);
         MethodInfo getter = pi.GetGetMethod();
         if (getter == null) throw new ArgumentException("Type " + typeof(X).Name + " has a property named " + propertyName + " but it is not readable", propertyName);
         result = (Converter<X, object>)Delegate.CreateDelegate(typeof (Converter<X, object>), getter);
         cached.Add(propertyName, result);
       }
       return result;
     }
  }
}

public class Pair<S,T>
{
   public readonly S first;
   public readonly T second;
   public Pair(S s, T t) { first = s; second = t; }
}

List<Pair<X, Y>> FindCommonEntries<X, Y>(IEnumerable<X> listA, IEnumerable<Y> listB, string propertyNameA, string propertyNameB, out List<X> onlyA, out List<Y> onlyB)
{
    return FindCommonEntries<X,Y>(listA, listB, PropertyGetter<X>[propertyName], PropertyGetter<Y>[propertyName], out onlyA, out onlyB);
}

List<Pair<X, Y>> FindCommonEntries<X, Y>(IEnumerable<X> listA, IEnumerable<Y> listB, Converter<X, object> getA, Converter<Y, object> getB, out List<X> onlyA, out List<Y> onlyB)
{
    Dictionary<object, Pair<List<X>, bool>> mapA = new Dictionary<object, X>();
    foreach (X x in listA) {
      Pair<List<X>,bool> set;
      object key = getA(x);
      if (!mapA.TryGetValue(key, out set))
        mapA.Add(key, set = new Pair<List<X>, bool>(new List<X>(), false));
      set.first.Add(x);
    }

    onlyB = new List<Y>();
    List<Pair<X, Y>> common = new List<Pair<X, Y>>();
    foreach (Y y in listB) {
      Pair<List<X>,bool> match;
      if (mapA.TryGetValue(getB(y), out match)) {
        foreach (X x in match.first) common.Add(x, y);
        match.second = true;
      }
      else
        onlyB.Add(y);
    }

    onlyA = new List<X>();
    foreach (Pair<List<X>, bool> set in mapA.Values) {
      if (!set.second) onlyA.AddRange(set.first);
    }

    return common;
}

EDIT: Added lists of elements without any match.

EDIT: Separated reflection code, so it can be avoided by passing in a lambda.

EDIT: Use Converter delegate type instead of Func, since it's available in .NET 2.0 and actually describes the usage better.

EDIT: Cached property getter delegates to avoid paying the reflection overhead for each list, only for each list type.

EDIT: Added Pair<S,T> class to get rid of the final dependency on .NET > 2.0.

NOTE: A lot of the complexity of this answer deals with the possibility that entries in the specified property are non-unique. I return a Cartesian product of matches, yet without calculating the Cartesian product of both entire lists. Things could be simplified a lot if uniqueness of the key property is assumed.

Ben Voigt 2010-07-24 16:20:00

Reason for down-vote?

Ben Voigt 2010-07-24 16:46:12

Pity people downvote without explaining themselves. While I'm not a favor of using reflection in list-comparison situations as it can be over 400x slower than direct invocation (can slow down whole systems if lists are large), the answer in itself, even though not being generic, is still good. If C# 4.0 is possible, use `dynamic`, which is cached.

Abel 2010-07-24 16:53:14

@Abel: the answer is generic, it works with ANY `Type`, which doesn't have to be known at compile time and can be pulled from the `List<X>` type using `GetGenericParameters`. Also, performance is BETTER than `dynamic` if you use my suggestion of `GetGetMethod` and not the naive way of `PropertyInfo.GetValue`.

Ben Voigt 2010-07-24 16:59:28

Added example code: Clearly there's only one use of reflection per list, not one per element, so for long lists the reflection penalty disappears.

Ben Voigt 2010-07-24 17:40:31

@Ben Voigt, As per your question, this is pretty much the implementation I was thinking of which would be faster. I think this algorithm is at least close to optimal in terms of complexity.

tster 2010-07-24 18:19:38

One thing I might change to aid in readability: at the end`onlyA = mapA.Values.Where(set => !set.second).SelectMany(set => set.first).ToList();`

tster 2010-07-24 18:21:03

Yes, LINQ and `var` for `mapA` and `common` would improve readability slightly but I'd lose .NET 2.0 compatibility.

Ben Voigt 2010-07-24 18:24:02

I'm obsessive about short code, so I'm always looking to use LINQ.

tster 2010-07-24 18:25:02

`Tuple` and `Func` were introduced at the same time as LINQ, weren't they? I guess I have no real reason not to use LINQ then, although it seems wrong to mix LINQ with imperative style. I can at least get rid of `Func` though.

Ben Voigt 2010-07-24 18:41:39

Funny, I was halfway writing something similar, went to bed and didn't refresh the page, finished and tested the code, updated the post and now find that we seem to have somewhat similar answers, with different approaches, but both using Reflection. Note: you use a `Tuple`, which was introduced in .NET 4.0 only (for F#). Note (2): about my earlier remark, I didn't mean that it was a bad solution, I meant you have to be careful about it and know what you do. I'd like to have a reference on that the performance is better than `dynamic`, because in practice that seems not the case.

Abel 2010-07-25 11:18:56

You'll see that I realized in my previous comment that `Tuple` is a new type, in order to use this in .NET 2.0 I'd have to write a very simple `Pair` class. It just doesn't seem right to use `KeyValuePair` for this because neither element is a key. As for the performance, dynamic uses reflection internally, so it can only perform better if it can use its cache to avoid making multiple reflection calls. But my solution only makes each reflection call once anyway, and dynamic still has to do the cache lookup.

Ben Voigt 2010-07-25 14:09:47

quote *"dynamic uses reflection internally"* : plus extremely advanced and well working caching, which was exactly my point. If that weren't the case, the usefulness of `dynamic` would be next to void and equal to using VB's dynamic typing (which is not cached). So: `dynamic` will make each reflection lookup only once.

Abel 2010-07-25 20:07:00

This speaks to another difference between my answer and your new reflection-based answer: Mine does the conversion to delegate once per list and calls the delegate many times. Yours, like `dynamic`, does the conversion to delegate every time. This is cached so reflection only gets used once, but it still incurs a cache lookup for every list element. It's fairly easy to fix in your case, although you'll have to lose the extension method syntax, but I don't think `dynamic` has any way to avoid the repeated lookup.

Ben Voigt 2010-07-25 20:30:45

Answer 2

+7 A:

The best thing to do is to first create an interface that contains T1 only. Then you inherit each class like X and Y from this interface. Now you can easily create your generic classes or any helper classes based on this interface.

Alternatively, you may use reflection, or if you use C# 4.0, you can use dynamic. Classic reflection is way to slow for (large) lists, so unless you cache your method calls, you shouldn't take that approach. C# 4.0 however, provided method caching through the DLR, which is sufficiently fast in most cases.

Alternatively (2): when you want to do this "right" and you want to compare the lists using standard mechanisms like LINQ, you should implement IComparable. You can combinee that with generics to create type-safety.

// the interface, inherit from IComparable
public interface IX : IComparable<IX>
{
    string T1 { get; set; }
}

// create one base class
class XBase : IX
{
    public string T1 { get; set; }
    public int CompareTo(IX obj)
    {
        return this.T1.equals(obj.T1);
    }
}

// inherit all others from base class
class X : XBase
{
    public string T2 { get; set; }
    public string T3 { get; set; }
}

class Y : XBase
{
    public string T2 { get; set; }
    public string T3 { get; set; }

    public strign O1 { get; set; }
}

There are many other ways. The last method above has the advantage of only once writing the logic for T1 and CompareTo, which saves from clutter and creates clarity in your code.

Abel 2010-07-24 16:20:24

But this requires editing hundreds of classes. Better than dealing with thousands of pair-wise combinations, but still not ideal.

Ben Voigt 2010-07-24 16:47:12

@Ben Voigt: Probably, code is generated.

Roman Boiko 2010-07-24 16:50:21

@Ben Voigt: there's little information about how these classes are created, but as Roman says, it is unlikely that so many likewise classes are generated by hand. If so, it is *high time* that some big refactoring takes place. Btw, if refactoring is not possible, use extension methods instead. Less pretty, but it'll get you there.

Abel 2010-07-24 16:55:42

@Abel: are you suggesting an extension method on `System.Object`? That's really frowned on.

Ben Voigt 2010-07-24 17:00:45

Have you actually measured reflection versus dynamic?

Rune FS 2010-07-24 17:19:18

@Abel: Modification of a third-party code generator, such as xsd conversion, may not be possible or feasible. Having a code generator doesn't guarantee that changes can be propagated easily. In fact it may make changes totally impossible, if the code generator has to be rerun periodically and overwrites customizations.

Ben Voigt 2010-07-24 18:10:33

@Ben Voigt, It would be possible to make the classes implement an interface using partial classes so long as the code generator can output partial classes. Even if the code generator being used can't do it, they can write a script which could be run after the generator to put in the partial keyword.

tster 2010-07-24 18:24:06

@tster: But there's more going on here than an interface, a base class is being added. Partial classes can do that too -- as long as there isn't an existing base class. Mostly though it's just unnecessary, as you and I have both shown.

Ben Voigt 2010-07-24 18:27:13

@Ben Voigt: thanks for the remarks. Some answers: if there's a base class already, extension methods become easy. If not, yes, it is being frowned upon, and no, desperate situations seek desperate solutions (see my new answer for how to keep it a bit safe). About auto-generation, the answer lies not in changing the generator (while often you can add customization to keep your roundtrip), but in using partial classes. A parent interface is the easiest solution, typesafe and fast. If that is not an option, a less-perfect solution (incl. extension on Object), can and should be considered.

Abel 2010-07-25 11:26:19

@Rune FS: reflection vs. dynamic is a big difference, because `dynamic` only once has the performance hit of finding and creating the delegate. You can speed up reflection and create your own delegates, which will bring reflection closer to the same performance as using `dynamic` (my new answer and Ben Voigts answer both show how to do just that).

Abel 2010-07-25 11:30:18

@Abel: Better to say: "if there's a **common** base class already, extension methods become easy". It's entirely possible for the base class to vary, even among generated types.

Ben Voigt 2010-07-25 14:12:11

@Ben Voigt: sorry, I thought that was clear, but yes, that's a much better wording! :)

Abel 2010-07-25 19:54:01

Answer 3

+4 A:

I had a really hard time understanding the question. But I read it as "how can I find a diff between lists of these two objects based on their T1 value." However, like I said, this is a total guess as to what the actual question is.

Using linq here is a good start for you:

IEnumerable<string> intersectionT1s = listX.Select(x => x.T1).Intersect(listY.Select(y => y.T1);
IEnumerable<X> intersection = listX.Where(x => intersectionT1s.Contains(x.T1));
IEnumerable<X> onlyOnX = listX.Where(x => !listY.Any(y => y.T1 == x.T1));

I'll leave onlyOnY as an exercise for the reader.

Here is a generic intersection method which you can use:

public static class ExtensionMethods
{
    public static IEnumerable<TLeft> IntersectionOn<TLeft, TRight, TField>(this IEnumerable<TLeft> left,
        IEnumerable<TRight> right, Func<TLeft, TField> leftSelector, Func<TRight, TField> rightSelector)
    {
        var intersectionFields = left.Select(leftSelector).Intersect(right.Select(rightSelector));
        return left.Where(x => intersectionFields.Contains(leftSelector(x)));
    }
}

and the usage:

IEnumerable<X> intersection = listX.IntersectionOn(listY, x => x.T1, y => y.T1);

tster 2010-07-24 16:24:23

`Where<T>()` returns `IEnumerable<T>`, not `List<T>`. Same for `Select`.

Roman Boiko 2010-07-24 16:27:57

@Roman It's easy enough to add a quick `ToList()` to fix that.

drharris 2010-07-24 16:30:19

It don't work with my case, because if I Select(x => x.T1) then return type is type of T1 but I want return type of listX itself.

In The Pink 2010-07-24 16:37:04

This isn't generic or reusable across different class types.

Ben Voigt 2010-07-24 16:44:42

@drharris: of course, you're right. I only wanted to point the problem in case if someone tries to compile this code. Also, the second line returns `IEnumerable<X>`, not of string (which is close to what Ekkapop wanted from the first line, too). I like answers that are correct. :)

Roman Boiko 2010-07-24 16:46:14

@Ekkapop, updated to do what you need.@Roman Boiko, fixed.@Ben Voigt, We could make it generic... Let me see

tster 2010-07-24 17:23:46

@tster: You can't make it generic without invasive changes to all the classes. As soon as you make `X` a generic type argument, the lambda `x => x.T1` will cause a compiler error.

Ben Voigt 2010-07-24 17:30:45

@Ben Voigt, I made a new extension method which does the trick I think. Whenever you want an intersection you just specify which field to use to compare for each side using lambda expressions.

tster 2010-07-24 17:35:23

@tster: You're right, didn't think of accepting a parameter to select the property. Very good work.

Ben Voigt 2010-07-24 17:38:03

Thanks. I don't really like the method as is (looks really inefficient to me), but if he runs into performance problems with it he can always rewrite using a better algorithm.

tster 2010-07-24 17:41:09

@tster: You're right, and my answer could be trivially converted to accept delegates instead of property names and get rid of the reflection cost completely. Do you see any way to make mine more scalable?

Ben Voigt 2010-07-24 18:04:50

+1. Looks like answers are evolutioning and we get a lot of useful code. Of all leaders I didn't up-vote yours before, so do it now. :)

Roman Boiko 2010-07-26 07:41:24

The issue is, here, that things like `x => x.T1, y => y.T1` are not possible given there is (or seems to be, see OP) no common base class or interface. Or maybe I just misunderstand this solution.

Abel 2010-07-26 09:14:25

@Abel: through `leftSelector`, `rightSelector` user of this code can provide his own mapping **for each class** he uses. Like `x => x.SomeProperty`, `y => y.SomeOtherProperty`. This makes the solution very flexible.

Roman Boiko 2010-07-26 16:41:28

Answer 4

+2 A:

var elementInBoth = (from x in ListOfClassX
                     join y in ListOfClassY on x.T1 equals y.T1
                     select x).ToList();

var elementsOnlyInListOfClassX = ListOfClassX.Except(elementsInBoth);

now you can do a left join, or just two-step it. I prefer two-steps as it's more clear.

var elementsToRemoveFromListOfClassY = (from x in ListOfClassX
                                        join y in ListOfClass Y on x.T1 equals y.T1
                                        select y).ToList();

var elementsOnlyInListOfClassY = ListOfClassY.Except(elementsToRemoveFromListOfClassY);

Mike M. 2010-07-24 16:25:51

Neither generic or reusable across multiple class types, as the OP said is desired.

Ben Voigt 2010-07-24 16:45:17

You're right. I seemed to have missed that portion :x.

Mike M. 2010-07-24 17:22:19

what do you mean it's not generic or reusable across multiple classeswhat if it's parts of a method like thiscompare<X,Y>(X ListOfClassX,Y ListOfClassY) constraint appropriately

Rune FS 2010-07-24 17:23:34

@Rune: Because there is no constraint for "X has property T1". .NET generics don't do duck-typing. With C++ templates this approach would work (using imperative syntax for the query instead of, well, language integrated LINQ).

Ben Voigt 2010-07-24 17:27:50

Also, LINQ's join is potentially much worse complexity -- O(M * N) -- than the reflection-based solution I present -- O(M + N + m*n). Where M, N are the total list lengths and m,n are the counts of equivalent elements.

Ben Voigt 2010-07-24 17:29:32

@Ben- Assuming your big-O is correct- and i'm not going to go through the code to see if it is, then yours is also "potentially much worse complexity."

Mike M. 2010-07-24 18:59:54

@Mike: I think mine is "potentially as bad as join", but never worse, and sometimes much better, depending on whether there are a lot of common elements or mostly unique ones.

Ben Voigt 2010-07-24 19:20:44

@Mike M / @Ben Voigt: Interesting side discussion. In the case at hand, LINQ first retrieves all needed properties, caches those and then performs the join, so the method invocations is always M+N. After that, it gets interesting and it really depends on the join. In this scenario it is easy and the join is optimized and fast. But why would you want to do it yourself if for decades they have searched for the best algorithms (and found them). In very specific cases it might still be worth it though. But here, we don't even know what the OP means with "compare list of X to list of Y".

Abel 2010-07-25 11:43:43

You're right, LINQ IEnumerable.Join uses a Dictionary-like structure as well. But the LINQ code is still many times longer and more complicated than mine. Also generating the exceptions during the join should outperform a separate `Except` exclusion.

Ben Voigt 2010-07-25 14:40:58

Answer 5

+2 A:

_{second answer by me, posted as a new answer because it is quite different than my previous, and both have their value}

If we combine the suggestions made in this thread by Ben Voigt, Abel (me), Roman, tster and Mike M., and include my suggestion for caching, we come to the following:

I assume, for the following code, that changing the classes is not an option. I also follow the OP's request of using generics. The code is rudimentary, but should work with minor enhancements. The code is tested and works.

The problem to solve is not the generics, but the reflection, which is, simply put, too slow for this scenario, which is why I show you in the second code block how you can apply manual caching of delegates and how to find the method of a gettor.

Here's how you can apply it, which is pretty straightforward. It works with any class that has a property "T1", but you can use it with any type of member:

// this is how you use the Comparisons class:
List<A> aList = new List<A>();
List<B> bList = new List<B>();
aList.Add(new A("first3"));
aList.Add(new A("duplicate4"));
aList.Add(new A("duplicate1"));
aList.Add(new A("first2"));
bList.Add(new B("second3"));
bList.Add(new B("duplicate4"));
bList.Add(new B("duplicate1"));
bList.Add(new B("second2"));


// get all elements that are in both lists (duplicate1 and duplicate4)
var listDuplicates = Comparisons.GetDuplicatesFromList1(aList, bList);

// remove duplicates (keep "first3" and "first2")
var withoutDuplicates = aList.Except(listDuplicates).ToList();

Below is the actual code. I commented the harder parts. It's not the easiest type of code you'll encounter and to understand or apply it, you need knowledge of reflection, delegates and method invocation techniques.

// all logic goes into this class
public static class Comparisons
{
    // note: static, so don't use in multi-threading environments!
    // must use Delegate as type here, Func<XX, string> would not work, as we cannot possibly know what XX is
    // up front. This is not a problem, as Delegate is the parent of all Func<> and Action<>
    static Dictionary<Type, Delegate> methodLookup = new Dictionary<Type, Delegate>();

    private static Func<T, string> EnsureMethod<T>(T obj)
        where T : class, new()
    {
        Type type = obj.GetType();

        if(!methodLookup.ContainsKey(type))
        {
            // The tricky bit. We cannot use GetProperty here, because we later need a method
            // and we cannot use GetMethod, because it cannot find special methods (hidden gettors)
            MemberInfo[] members = type.GetMember("get_T1");
            if(members == null || members.Length > 1)
                throw new InvalidOperationException("Object must have one 'T1' gettor property");

            MethodInfo property = members[0] as MethodInfo;

            if(property == null)
                throw new InvalidOperationException("Object must have 'T1' property");


            // creating a delegate is the best way to speed up method invocation
            // this type of delegate is called an "open instance delegate", which is like
            // a static delegate with first parameter as the object to invoke on
            Func<T, string> propertyGettor = (Func<T, string>) Delegate.CreateDelegate(typeof(Func<T, string>), null, property);
            methodLookup.Add(type, propertyGettor);
        }

        // must cast here
        return (Func<T, string>)methodLookup[obj.GetType()];

    }

    // I use a generic extension method here. This is frowned upon by some language purists
    // you can always use a utility helper method, which is the alternative
    public static string GetPropertyT1<T>(this T obj)
        where T : class, new()
    {
        // do something with obj1 being null, this is the BCL default
        if (obj == null)
            throw new ArgumentNullException("Extension method object cannot be null for GetT1 method");

        // if the property is not found, an error is raised, so the following is safe:
        // only the first invocation for each type (class) of object is relatively slow
        Func<T, string> delegateObj1 = EnsureMethod(obj);


        // this now is lightning fast: it invokes the method on the instance of obj
        return delegateObj1.Invoke(obj);
    }

    // The actual method that does something, it will return all elements in list1
    // that are also found in list2, replace this with whatever logic you need
    public static IList<U> GetDuplicatesFromList1<U, V>(IEnumerable<U> list1, IEnumerable<V> list2)
        where U: class, new()
        where V: class, new()
    {
        var elementsList1InBoth = from x in list1
                                  join y in list2 on x.GetPropertyT1() equals y.GetPropertyT1()
                                  select x;

        return elementsList1InBoth.ToList();
    }


}


// your original classes as A and B, with no inheritance chain or other relations
public class A
{
    public A(){}
    public A(string value) { this.T1 = value; }
    public string T1 { get; set; }
}
public class B
{
    public B(){}
    public B(string value) { this.T1 = value; }
    public string T1 { get; set; }
    public string Tx { get; set; }
}

Abel 2010-07-25 11:08:43

Looking for the hidden *get_XXX* method is wrong. Use `PropertyInfo.GetGetMethod` (which was part of my answer even before any edits). Also, since you assume all properties are type `string`, you might benefit from using a generic class as your cache instead of a Dictionary... `class MethodLookup<XX> { public static Converter<XX, string> conversion; }`.

Ben Voigt 2010-07-25 14:16:00

I edited my answer to show how to use a generic class as a cache. Also, you don't seem to handle repeated values in the input lists (the lookup field might be repeated while other fields vary).

Ben Voigt 2010-07-25 14:27:32

@Ben Voigt: thanks for pointing that out, I wasn't aware of the GetGetMethod (and thought it was a typo earlier). But I agree that using build-in methods instead of a shortcut is neater. I wanted to keep my solution easy and as simple as possible, yet working. Many more optimizations/refactorings exist and should be applied when the situation warrants that.

Abel 2010-07-25 20:02:11

ansaurus

tags:

views:

answers:

How to compare list of X to list of Y in C# by using generics?

related questions