views:

117

answers:

4

Hello,

I have a complex LINQ query (using LINQ 2 EF) that can return duplicate results and I'm thus using the .Distinct() method to avoid duplicates. Here's the skeleton:

var subQuery1 = // one query...
var subQuery2 = // another query...
var result = subQuery1.Distinct().Union( subQuery2.Distinct() ).ToArray();

Each of the sub queries join a common user table with another table and perform a 'where' query, the results are later combined in the .Union(...). This worked fine until the table was modified to include an XML column, which results in this exception:

the xml data type cannot be selected as distinct because it is not comparable

In this case I don't care if the XML column is equivalent across the results. actually I only need to be assured that the primary key UserId is distinct in the results.

Is there a way to use Distinct() but ignore the XML column or a simpler way to assure that I remove records from the result with the same UserId in an efficient way? Ideally this would not retrieve duplicate records from the database and would not require post-processing to remove the duplicates.

Update: I've found out that if I serialize my queries to arrays ahead of time then there is no need for any kind of comparer since Linq2Objects doesn't have the XML distinct selection issue. For example I can do this:

var subQuery1 = // one query...
var subQuery2 = // another query...
var result = 
   subQuery1.Distinct().ToArray().Union( 
      subQuery2.Distinct().ToArray() )
   .ToArray();

So what I'm really looking for is a way to avoid serializing the intermediate queries and do a Linq2Entities call directly that will not fetch records with duplicate UserIds. Thanks for all the answers thus far.

A: 

This extension method should return a list of items with only the first item from each set of duplicates in it...

public static IEnumerable<Tsource> RemoveDuplicates<Tkey, Tsource>(this IEnumerable<Tsource> source, Func<Tsource, Tkey> keySelector)
{
    var hashset = new HashSet<Tkey>();
    foreach (var item in source)
    {
        var key = keySelector(item);
        if (hashset.Add(key))
            yield return item;
    }
}

it would be used on a list like this list.RemoveDuplicates(x => x.UserID). If there were two records in List with the same userID, it would only return the first

TerrorAustralis
+3  A: 

Write an IEqualityComparer<T> implementation for the object which contains your XML type and pass it to Distinct. In the Equals method you can implement equality semantics however you wish.

This is a handy T4 code-generation template I wrote myself for generating IEqualityComparer<T> implementations for my team's domain models:

<#@ template language="C#v3.5" debug="True" #>
<#@ output extension=".generated.cs" #>
<#
    var modelNames = new string[] {
        "ClassName1",
        "ClassName2",
        "ClassName3",
    };

    var namespaceName = "MyNamespace";
#>
using System;
using System.Collections.Generic;

namespace <#= namespaceName #>
{
<#
    for (int i = 0; i < modelNames.Length; ++i)
    {
        string modelName = modelNames[i];
        string eqcmpClassName = modelName + "ByIDEqualityComparer";
#>
    #region <#= eqcmpClassName #>

    /// <summary>
    /// Use this EqualityComparer class to determine uniqueness among <#= modelName #> instances
    /// by using only checking the ID property.
    /// </summary>
    [System.Diagnostics.DebuggerNonUserCode]
    public sealed partial class <#= eqcmpClassName #> : IEqualityComparer<<#= modelName #>>
    {
        public bool Equals(<#= modelName #> x, <#= modelName #> y)
        {
            if ((x == null) && (y == null)) return true;
            if ((x == null) || (y == null)) return false;

            return x.ID.Equals(y.ID);
        }

        public int GetHashCode(<#= modelName #> obj)
        {
            if (obj == null) return 0;

            return obj.ID.GetHashCode();
        }
    }

    #endregion
<#
        if (i < modelNames.Length - 1) WriteLine(String.Empty);
    } // for (int i = 0; i < modelNames.Length; ++i)
#>
}

It makes the assumption that each of your model classes have a property named "ID" which is the primary key, stored as something that implements Equals. Our convention forces all of our models to have this property. If your models all have differently-named ID properties, consider modifying either this T4 template to suit your needs or better yet, make life easier on yourself (not just for the sake of using this T4) and change your models to use the "ID" name.

James Dunne
The T4 template sure is handy, but to use the IEqualityComparer I have to 1st serailize both my queries to arrays (as linq2entities doesn't support the comparer) before I can remove duplicates. Nevertheless, its something that works for the time being, thanx!
TJB
@TJB: Ah. I don't have any experience with LINQ-to-entities. I just use the `IEqualityComparer<T>` for LINQ-to-objects on in-memory collections for a quick distinct-by-ID. Glad to be of service! T4 templates rule :)
James Dunne
+2  A: 

as James Dunne said, you'd want to use an IEqualityComparer

a quick mock up would be something like this. You'll need to replace "ObjectType" with whatever type is in your subQuery1 and subQuery2 of course. please note this is untested:

List<ObjectType> listQueries = new List<ObjectType>();

ObjectTypeEqualityComparer objectTypeComparer = new ObjectTypeEqualityComparer();

listQueries.AddRange(subQuery1);// your first query
listQueries.AddRange(subQuery2); // your second query
ObjectType[] result = listQueries.Distinct(objectTypeComparer).ToArray();


class ObjectTypeEqualityComparer : IEqualityComparer<ObjectType>
{
    public bool Equals(ObjectType obj1, ObjectType obj2)
    {
        return obj1.UserId == obj2.UserId ?  true : false;
    }

    public int GetHashCode(ObjectType obj)
    {
        return obj.UserId.GetHashCode();
    }

}
jb
Piggyback answer, eh? :) I was considering going back and updating my answer to include an example, but yours suffices just as well.
James Dunne
haha, in my defense, i was gonna write this out before i saw your answer :)
jb
+1  A: 

You could use morelinq's DistinctBy. I suspect (but have not verified) that this, as well as the IEqualityComparer and RemoveDuplicates answers, will retrieve the duplicate records from SQL Server and then remove the duplicates on the client. If someone provides a server-side solution, I'd recommend accepting their answer.

TrueWill
Right, I'm looking for something that can achieve the 'distinct' in sql rather than in post-processing
TJB