I have a list of objects of a custom class that basically contains two string fields (String1 and String2). What I need to know is whether any of these strings are duplicated anywhere in that list. So I want to know if "objectA.String1 == objectB.String1", or "objectA.String2 == objectB.String2", or "objectA.String1 == objectB.String2", or "objectA.String2 == objectB.String1". I also want to mark each object that contains a duplicated string by setting a bool (HasDuplicate) on the object.

So when the duplication detection has run I want to simply foreach over the list like so:

foreach(var item in duplicationList)
{
  if(item.HasDuplicate == true)
  {
    Console.WriteLine("Duplicate detected!");
  }
}

This seemed like a nice problem to solve with LINQ, but I cannot for the life of me figure out a good query. So I've solved it using a 'good-old' foreach, but I'm still interested in a LINQ version.

A: 
var dups = duplicationList.GroupBy(x => x).Where(y => y.Count() > 1).Select(y => y.Key);

foreach (var d in dups)
    Console.WriteLine(d);
mjsabby
I've tested your code in LINQPad using the following program:

void Main()
{
    var duplicationList = new List<TestObject>
    {
        new TestObject("1", "2"),
        new TestObject("3", "4"),
        new TestObject("1", "6")
    };
    var dups = duplicationList.GroupBy(x => x).Where(y => y.Count() > 1).Select(y => y.Key);
    dups.Dump("Duplicate dump: " + dups.Count());
}

public class TestObject
{
    public TestObject(string s1, string s2)
    {
        String1 = s1;
        String2 = s2;
        IsDuplicate = false;
    }

    public string String1;
    public string String2;
    public bool IsDuplicate;
}

It doesn't work. dups contains 0 values.
Jeroen-bart Engelen
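A likely reason the query above finds nothing: GroupBy(x => x) uses the default equality comparer, and for a class that does not override Equals/GetHashCode that means reference equality, so every object ends up in its own group. The following is a minimal sketch, assuming the TestObject shape from the comment; GroupingDemo and its method names are made up for illustration. It contrasts grouping by the object with grouping by the string values.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class TestObject
{
    public string String1;
    public string String2;
    public bool IsDuplicate;

    public TestObject(string s1, string s2)
    {
        String1 = s1;
        String2 = s2;
    }
}

// Hypothetical helper class, for illustration only.
public static class GroupingDemo
{
    // Groups by the object itself. TestObject does not override
    // Equals/GetHashCode, so each instance forms its own group of size 1.
    public static int CountObjectGroupsWithDuplicates(List<TestObject> items)
    {
        return items.GroupBy(x => x).Count(g => g.Count() > 1);
    }

    // Groups by the string values instead; strings compare by value.
    // Note: an object whose String1 equals its own String2 also counts here.
    public static List<string> DuplicatedStrings(List<TestObject> items)
    {
        return items
            .SelectMany(x => new[] { x.String1, x.String2 })
            .GroupBy(s => s)
            .Where(g => g.Count() > 1)
            .Select(g => g.Key)
            .ToList();
    }

    public static void Main()
    {
        var items = new List<TestObject>
        {
            new TestObject("1", "2"),
            new TestObject("3", "4"),
            new TestObject("1", "6")
        };

        Console.WriteLine(CountObjectGroupsWithDuplicates(items));      // prints 0
        Console.WriteLine(string.Join(",", DuplicatedStrings(items)));  // prints 1
    }
}
```

This only reports which strings are duplicated; it does not yet set the flag on the owning objects.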
+3  A: 

Here's a complete code sample which should work for your case.

class A
{
 public string Foo { get; set; }
 public string Bar { get; set; }
 public bool HasDupe { get; set; }
}

var list = new List<A> { 
            new A{ Foo="abc", Bar="xyz"}, 
            new A{ Foo="def", Bar="ghi"}, 
            new A{ Foo="123", Bar="abc"}  
           };

var dupes = 
    list.Where( a => 
                list
                .Except( new List<A>{a} )
                .Any( x => x.Foo == a.Foo || x.Bar == a.Bar || x.Foo == a.Bar || x.Bar == a.Foo) 
    ).ToList();

dupes.ForEach(a => a.HasDupe = true);
Winston Smith
Seems to work when I test it in LINQPad. Thanks!
Jeroen-bart Engelen
LINQPad is a great tool for figuring out problems like this - every C# developer should have a copy.
Winston Smith
nice answer! +1
Matias
A: 

First, if your object doesn't have the HasDuplicate property yet, declare an extension method that implements HasDuplicateProperties:

public static bool HasDuplicateProperties<T>(this T instance)
    where T : SomeClass 
    // where is optional, but might be useful when you want to enforce
    // a base class/interface
{
    // use reflection or something else to determine whether this instance
    // has duplicate properties
    return false;
}

You can use that extension method in queries:

var itemsWithDuplicates = from item in duplicationList
                          where item.HasDuplicateProperties()
                          select item;

The same works with a normal property:

var itemsWithDuplicates = from item in duplicationList
                          where item.HasDuplicate
                          select item;

or

var itemsWithDuplicates = duplicationList.Where(x => x.HasDuplicateProperties());
Sander Rijken
That's not my question. I wanted to know how to determine when I have a duplicate so I can set the bool. When the bool is set I know how to get all the objects from the list that have it set.
Jeroen-bart Engelen
+2  A: 

This should work:

public class Foo
{
    public string Bar;
    public string Baz;
    public bool HasDuplicates;
}

public static void SetHasDuplicate(IEnumerable<Foo> foos)
{
    var dupes = foos
        .SelectMany(f => new[] { new { Foo = f, Str = f.Bar }, new { Foo = f, Str = f.Baz } })
        .Distinct() // Eliminates double entries where Foo.Bar == Foo.Baz
        .GroupBy(x => x.Str)
        .Where(g => g.Count() > 1)
        .SelectMany(g => g.Select(x => x.Foo))
        .Distinct()
        .ToList();

    dupes.ForEach(d => d.HasDuplicates = true);    
}

What you are basically doing is

  1. SelectMany : create a list of all the strings, with their accompanying Foo
  2. Distinct : Remove double entries for the same instance of Foo (Foo.Bar == Foo.Baz)
  3. GroupBy : Group by string
  4. Where : Filter the groups with more than one item in them. These contain the duplicates.
  5. SelectMany : Get the foos back from the groups.
  6. Distinct : Remove double occurrences of foo from the list.
  7. ForEach : Set the HasDuplicates property.

Some advantages of this solution over Winston Smith's solution are:

  1. Easier to extend to more string properties. Suppose there were 5 properties. In his solution, you would have to write 25 comparisons (each of the 5 properties of one object against each of the 5 properties of the other) in the Any clause. In this solution, it's just a matter of adding the property in the first SelectMany call.
  2. Performance should be much better for large lists. Winston's solution iterates over the list for each item in the list, while this solution only iterates over it once. (Winston's solution is O(n²) while this one is O(n), assuming the hash-based grouping behaves as expected.)
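To illustrate the first advantage, here is a rough sketch of the same pipeline with the string properties passed in as selectors, so adding a property means adding one argument. Record, DuplicateFinder and FindDuplicates are hypothetical names, not part of the answer above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical class with three string fields, for illustration only.
public class Record
{
    public string A;
    public string B;
    public string C;
    public bool HasDuplicates;
}

public static class DuplicateFinder
{
    // Returns every item that shares at least one selected string with
    // a different item. Null strings are treated as equal to each other.
    public static List<T> FindDuplicates<T>(
        IEnumerable<T> items, params Func<T, string>[] selectors)
    {
        return items
            // Pair every item with each of its selected strings.
            .SelectMany(item => selectors.Select(sel => new { Item = item, Str = sel(item) }))
            // Drop repeats of the same string within the same item.
            .Distinct()
            .GroupBy(x => x.Str)
            .Where(g => g.Count() > 1)
            .SelectMany(g => g.Select(x => x.Item))
            .Distinct()
            .ToList();
    }
}

public static class Program
{
    public static void Main()
    {
        var records = new List<Record>
        {
            new Record { A = "abc", B = "xyz", C = "q" },
            new Record { A = "def", B = "ghi", C = "r" },
            new Record { A = "123", B = "s",   C = "abc" }
        };

        var dupes = DuplicateFinder.FindDuplicates(records, r => r.A, r => r.B, r => r.C);
        dupes.ForEach(d => d.HasDuplicates = true);

        // prints True False True
        Console.WriteLine(string.Join(" ", records.Select(r => r.HasDuplicates)));
    }
}
```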
Geert Baeyaert
Does a grouping lazily evaluate its group members? g.Skip(1).Any() might be an improvement over g.Count() > 1.
Jimmy
@Jimmy It doesn't really matter in this case, because the groups are not lazily evaluated. I do like the Skip(1).Any() trick though. For my own projects, I always have extension methods CountIs(int expected), CountIsGreaterThan(int expected)... which stop evaluating as soon as they know the answer.
Geert Baeyaert
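The extension methods mentioned in the last comment are not shown in the thread. A possible sketch, assuming they are built on Skip and Take so that enumeration stops as soon as the answer is known (the class name CountExtensions and the Naturals helper are invented for the example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CountExtensions
{
    // True when the sequence has more than 'expected' elements.
    // Enumerates at most expected + 1 elements.
    public static bool CountIsGreaterThan<T>(this IEnumerable<T> source, int expected)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (expected < 0) return true; // any count is greater than a negative number
        return source.Skip(expected).Any();
    }

    // True when the sequence has exactly 'expected' elements.
    // Enumerates at most expected + 1 elements (assumes expected < int.MaxValue).
    public static bool CountIs<T>(this IEnumerable<T> source, int expected)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (expected < 0) return false;
        return source.Take(expected + 1).Count() == expected;
    }
}

public static class CountDemo
{
    // Endless sequence, used to show that the checks stop early.
    public static IEnumerable<int> Naturals()
    {
        var i = 0;
        while (true) yield return i++;
    }

    public static void Main()
    {
        var three = new[] { 1, 2, 3 };
        Console.WriteLine(three.CountIsGreaterThan(2));      // prints True
        Console.WriteLine(three.CountIsGreaterThan(3));      // prints False
        Console.WriteLine(three.CountIs(3));                 // prints True
        Console.WriteLine(Naturals().CountIsGreaterThan(1)); // prints True, and terminates
    }
}
```

With such a helper, the filter in the answer above could be written as g.CountIsGreaterThan(1) instead of g.Count() > 1.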