Given a collection of records like this:

string ID1;
string ID2;
string Data1;
string Data2;
// ...
string DataN;

Initially Data1..N are null and can mostly be ignored for this question. ID1 and ID2 each uniquely identify the record. All records will have an ID2; some will also have an ID1. Given an ID2, there is a (time-consuming) method to get its corresponding ID1. Given an ID1, there is a (time-consuming) method to get Data1..N for the record. Our ultimate goal is to fill in Data1..N for all records as quickly as possible.

Our immediate goal is to (as quickly as possible) eliminate all duplicates in the list, keeping the one with more information.

For example, if Rec1 = {ID1="ABC", ID2="XYZ"} and Rec2 = {ID1=null, ID2="XYZ"}, then these are duplicates, but we must specifically remove Rec2 and keep Rec1.

That last requirement rules out the standard ways of removing duplicates (e.g. HashSet), since they treat the two sides of a "duplicate" pair as interchangeable.

+4  A: 

How about you split your original list into three: ones with all data, ones with ID1, and ones with just ID2.

Then do:

var unique = allData.Concat(id1Data.Except(allData))
                    .Concat(id2Data.Except(id1Data).Except(allData));

having defined equality just on the basis of ID2.

I suspect there are more efficient ways of expressing that, but the fundamental idea is sound as far as I can tell. Splitting the initial list into three is simply a matter of using GroupBy (and then calling ToList on each group to avoid repeated queries).
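For concreteness, a minimal sketch of that split (Record and a HasData flag are assumptions standing in for the question's type; ToLookup is used here rather than GroupBy purely because absent groups come back empty):

// Hypothetical grouping key: 0 = all data, 1 = has ID1, 2 = ID2 only.
var lookup = records.ToLookup(r => r.HasData ? 0 : (r.ID1 != null ? 1 : 2));
var allData = lookup[0].ToList();
var id1Data = lookup[1].ToList();
var id2Data = lookup[2].ToList();

Note that Except in the query above also needs the ID2-based equality, either from Equals/GetHashCode on the type or from the overload taking an IEqualityComparer<Record>.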

EDIT: Potentially nicer idea: split the data up as before, then do:

var result = new HashSet<...>(allData);
result.UnionWith(id1Data);
result.UnionWith(id2Data);

I believe that UnionWith keeps the existing elements rather than overwriting them with new but equal ones. On the other hand, that's not explicitly specified. It would be nice for it to be well-defined...

(Again, either make your type implement equality based on ID2, or create the hash set using an equality comparer which does so.)
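For reference, a minimal sketch of such a comparer and its use with the sets above (the name Id2Comparer is illustrative, and Record again stands in for the question's type):

// Equality purely by ID2, as described above.
class Id2Comparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) { return x.ID2 == y.ID2; }
    public int GetHashCode(Record obj) { return obj.ID2.GetHashCode(); }
}

var result = new HashSet<Record>(allData, new Id2Comparer());
result.UnionWith(id1Data); // if existing elements win, fuller records are kept
result.UnionWith(id2Data);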

Jon Skeet
A: 

I had a similar issue a couple of months ago.

Try something like this...

public static List<T> RemoveDuplicateSections<T>(List<T> sections) where T : INamedObject
{
    // Tracks Names already seen; the dictionary value is unused.
    Dictionary<string, int> uniqueStore = new Dictionary<string, int>();
    List<T> finalList = new List<T>();
    foreach (T currValue in sections)
    {
        // Keep only the first record seen for each Name.
        if (!uniqueStore.ContainsKey(currValue.Name))
        {
            uniqueStore.Add(currValue.Name, 0);
            finalList.Add(currValue);
        }
    }
    return finalList;
}
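Note that this keeps the first occurrence of each Name, so for the question's requirement the list would need to be ordered with ID1-bearing records ahead of their ID1-less duplicates. INamedObject isn't shown in the answer; something like the following is assumed, with Name mapping to the question's ID2:

public interface INamedObject { string Name { get; } }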
xxmrlnxx
A: 

This may smell quite a bit, but I think LINQ's Distinct will still work for you if you ensure that the two compared objects come out the same. The following comparer would do this:

private class Comp : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        var equalityOfB = x.ID2 == y.ID2;
        if (x.ID1 == y.ID1 && equalityOfB)
            return true;
        // Side effect: if one side lacks ID1, copy it across so the
        // record that survives Distinct ends up with the fuller data.
        if (x.ID1 == null && equalityOfB)
        {
            x.ID1 = y.ID1;
            return true;
        }
        if (y.ID1 == null && equalityOfB)
        {
            y.ID1 = x.ID1;
            return true;
        }
        return false;
    }

    public int GetHashCode(Item obj)
    {
        return obj.ID2.GetHashCode();
    }
}

Then you could use it on a list like this:

var l = new[] { 
  new Item { ID1 = "a", ID2 = "b" }, 
  new Item { ID1 = null, ID2 = "b" } };
var l2 = l.Distinct(new Comp()).ToArray();
flq
A: 
records.GroupBy(r => r, new RecordByIDsEqualityComparer())
       .Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())

Or, if you want to merge the records, use Aggregate instead of OrderByDescending/First.
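The two comparers aren't shown; here is a minimal sketch of what they might look like (the bodies are assumptions about the intent: equality by ID2, and "fullness" meaning ID1 is present):

class RecordByIDsEqualityComparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) { return x.ID2 == y.ID2; }
    public int GetHashCode(Record obj) { return obj.ID2.GetHashCode(); }
}

class RecordByFullnessComparer : IComparer<Record>
{
    // A record with ID1 present compares greater, so
    // OrderByDescending(...).First() picks the fuller record.
    public int Compare(Record x, Record y)
    {
        return (x.ID1 != null ? 1 : 0).CompareTo(y.ID1 != null ? 1 : 0);
    }
}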

Andrey Shchekin
A: 

You haven't mentioned whether you're using C# 2.0 or 3.0...

I'll give you the C# 2.0 version, and you can shorten it with LINQ if you like. LINQ would look much nicer, since you could replace all the "delegate"s with lambdas (a sketch follows the code).

(I did this to help me learn too - nice data problem)

If your collection is a generic List, for example:

using System;
using System.Collections.Generic;

class Program {

    public class Record {
        public string ID1;
        public string ID2;
        public string Data1;
        public string Data2;
        public string DataN;
        public override string ToString() {
            return string.Format("ID1 = {0}, ID2 = {1}", ID1, ID2);
        }
    }

    static void Main(string[] args) {

        // Test data (collection/object initializers are C# 3.0 syntax;
        // on 2.0, call Add for each record)
        List<Record> collection = new List<Record>() { 
                new Record() { ID1 = null, ID2 = "x.2" },
                new Record() { ID1 = "x.1", ID2 = "x.2" },
                new Record() { ID1 = "z.1", ID2 = "z.2" },
                new Record() { ID1 = null, ID2 = "s.2" },
                new Record() { ID1 = null, ID2 = "a.2" },
                new Record() { ID1 = "w.1", ID2 = "w.2" },
                new Record() { ID1 = null, ID2 = "w.2" }
        };

        // List of records with no ID1 (low priority data)
        List<Record> emptyID1 = collection.FindAll(delegate(Record r) {
            return r.ID1 == null;
        });

        // Temporarily remove all low-priority data
        collection.RemoveAll(delegate(Record r) {
            return r.ID1 == null;
        });

        // Check whether we lost any ID2 records when
        // removing low priority data and put ID2's back
        emptyID1.ForEach(delegate(Record emptyIdRecord) {
            if (collection.Find(delegate(Record r) {
                return r.ID2 == emptyIdRecord.ID2;
            }) == null) {
                collection.Add(emptyIdRecord);
            }
        });

        // Display results
        collection.ForEach(delegate(Record r) {
            Console.WriteLine(r.ToString());
        });

        Console.ReadLine();
    }
}
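For comparison, a C# 3.0 sketch of the same filtering with lambdas (intended to match the delegate version's behaviour):

// Pull out the ID1-less records, then add one back only if no
// surviving record shares its ID2.
var emptyID1 = collection.FindAll(r => r.ID1 == null);
collection.RemoveAll(r => r.ID1 == null);
emptyID1.ForEach(e => {
    if (collection.Find(r => r.ID2 == e.ID2) == null)
        collection.Add(e);
});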
Neil Fenwick