views: 317
answers: 10
Simple situation. I have a list of lists, almost table-like, and I am trying to find out whether any of the lists are duplicated.

Example:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

I would like to know that there are 4 total items, 2 of which are duplicates. I was thinking about doing something like a SQL checksum but I didn't know if there was a better/easier way.

I care about performance, and I care about ordering.

Additional Information That May Help

  • Things inserted into this list will never be removed
  • Not bound to any specific collection.
  • Don't care about the function signature
  • The type is not restricted to int
+2  A: 

You will have to iterate through each index of each list at least once, but you can potentially speed up the process by creating a custom hash table, so that you can quickly reject non-duplicate lists without having to do per-item comparisons.

Algorithm:

Create a custom hashtable (dictionary: hash -> list of lists)
For each list
  Take a hash of the list (one that takes order into account)
  Search in hashtable
  If you find matches for the hash
    For each list in the hash entry, re-compare the lists
      If you find a duplicate, return true
    If no duplicate was found, append the current list to this hash entry
  Else if you don't find matches for the hash
    Create a temp list
    Append the current list to our temp list
    Add the temp list to the dictionary as a new hash entry
You didn't find any duplicates, so return false

If you have a strong enough hashing algorithm for your input data, you might not even have to do the sub-comparisons, since there wouldn't be any hash collisions.

I have some example code. The missing bits are:

  • An optimization so that we do the dictionary lookup only once per list (for search and insert). Might have to make your own Dictionary/Hash Table class to do this?
  • A better hashing algorithm that you find by profiling a bunch of them against your data

Here is the code:

public bool ContainsDuplicate(List<List<int>> input)
{
    var encounteredLists = new Dictionary<int, List<EnumerableWrapper>>();

    foreach (List<int> currentList in input)
    {
        var currentListWrapper = new EnumerableWrapper(currentList);
        int hash = currentListWrapper.GetHashCode();

        if (encounteredLists.ContainsKey(hash))
        {
            foreach (EnumerableWrapper currentEncounteredEntry in encounteredLists[hash])
            {
                if (currentListWrapper.Equals(currentEncounteredEntry))
                    return true;
            }

            // Hash collision but no true duplicate: remember this list too,
            // so later duplicates of it can still be detected.
            encounteredLists[hash].Add(currentListWrapper);
        }
        else
        {
            var newEntry = new List<EnumerableWrapper>();
            newEntry.Add(currentListWrapper);
            encounteredLists[hash] = newEntry;
        }
    }

    return false;
}

sealed class EnumerableWrapper
{
    public EnumerableWrapper(IEnumerable<int> list)
    {
        if (list == null)
            throw new ArgumentNullException("list");
        this.List = list;
    }

    public IEnumerable<int> List { get; private set; }

    public override bool Equals(object obj)
    {
        bool result = false;

        var other = obj as EnumerableWrapper;
        if (other != null)
            result = Enumerable.SequenceEqual(this.List, other.List);

        return result;
    }

    public override int GetHashCode()
    {
        // Todo: Implement your own hashing algorithm here
        var sb = new StringBuilder();
        foreach (int value in List)
            sb.Append(value.ToString());
        return sb.ToString().GetHashCode();
    }
}
Merlyn Morgan-Graham
+1  A: 

Here's a potential idea (this assumes that the values are numerical):

Implement a comparer that multiplies each member of each collection by its index, then sums the whole thing:

Value:    0  5  8  3  2  0  5  3  5  1
Index:    1  2  3  4  5  6  7  8  9  10
Multiple: 0  10 24 12 10 0  35 24 45 10

Member CheckSum: 170

So, the whole "row" has a number that changes with the members and ordering. Fast to compute and compare.
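A minimal sketch of that index-weighted checksum (my own illustration, not code from this answer; the method name is made up):

static long RowChecksum(IList<int> row)
{
    // Multiply each member by its 1-based index, as in the table above, and sum.
    long sum = 0;
    for (int i = 0; i < row.Count; i++)
        sum += (long)(i + 1) * row[i];
    return sum;
}

Rows with equal checksums would still need an element-by-element comparison, since collisions are possible (see the comments below).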

Dave Swersky
Counterexample: { 1, 3 } and { 3, 2 } both come out to 7 under that scheme.
Merlyn Morgan-Graham
I figured there was a possibility of collisions. Perhaps it would be fast to also check for a mirror list like this to be sure?
Dave Swersky
@Dave: Sounds reasonable, though it might still be a rabbit hole. You might still have collisions, and you still have to deal w/ int overflow, for example. You could use it to compute a hash value, of course, then do re-compares upon collisions.
Merlyn Morgan-Graham
+1  A: 

If they are all single digits and have the same number of elements, you can concatenate them, so the first one becomes 123456, and check whether the numbers are the same.

Then you would have a list {123456, 123456, 142456, 325164}

which is easier to check for duplicates. If the individual members can be larger than 9 you would have to modify this.

Edit: added sample code. It can be optimized; it's just a quick example to explain what I meant.

var combined = new List<int>();

for(int i = 0; i < list.Count; i++)
{
    List<int> tempList = list[i];
    int temp = 0;
    for(int j = 0; j < tempList.Count; j++)
    {
        temp = temp * 10 + tempList[j];
    }
    combined.Add(temp);
}

for(int i = 0; i < combined.Count; i++)
{
    for(int j = i + 1; j < combined.Count; j++)
    {
        if(combined[i] == combined[j])
        {
            return true;
        }
    }
}
return false;
182764125216
Too suspicious an assumption. And what if there are more than 9 numbers?
Andrey
What if you have varying list sizes? You might be comparing items from different lists
Merlyn Morgan-Graham
@Andrey, it doesn't matter if there are more than 9 digits; if any individual digit is larger than 9 you would have to modify this slightly. And yes, it assumes a lot, but in the example they are all less than 9 and have the same number of items.
182764125216
@182764125216 it does matter. 9 digits is about what an int32 can take; an int64 takes more, but you still limit yourself. A hash code should be fixed length.
Andrey
@Andrey good point, thanks for the critique of my code.
182764125216
+1  A: 

You could also try probabilistic algorithms if duplicates are either very rare or very common, e.g. a Bloom filter.
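For illustration only, a rough sketch of how a Bloom-filter-style check could look here (my own code; the class name, bit count and hash mixing are made up). A negative answer is definite; a positive answer only flags a candidate duplicate that still needs to be confirmed with a real comparison such as SequenceEqual:

using System.Collections;
using System.Collections.Generic;

sealed class ListBloomFilter
{
    private readonly BitArray bits;
    private readonly int hashCount;

    public ListBloomFilter(int bitCount, int hashCount)
    {
        this.bits = new BitArray(bitCount);   // all bits start out false
        this.hashCount = hashCount;
    }

    // Records the list and reports whether it was *possibly* seen before.
    public bool PossiblySeenBefore(IEnumerable<int> list)
    {
        // Two order-sensitive hashes of the sequence.
        int h1 = 17, h2 = 1;
        foreach (int x in list)
        {
            h1 = unchecked(h1 * 31 + x);
            h2 = unchecked(h2 * 1000003 + (x ^ 7));
        }

        bool allSet = true;
        for (int i = 0; i < hashCount; i++)
        {
            int index = (int)((uint)unchecked(h1 + i * h2) % (uint)bits.Length);
            if (!bits[index]) allSet = false;
            bits[index] = true;
        }
        return allSet;
    }
}

Something like new ListBloomFilter(1 << 20, 4) would be cheap on memory; any hit it reports would then be verified with an exact comparison against the lists seen so far.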

Conrad Frix
A: 

Check out http://stackoverflow.com/questions/493673/c-3-0-need-to-return-duplicates-from-a-list; it shows you how to return duplicates from a list.

Example from that page:

var duplicates = from car in cars
                 group car by car.Color into grouped
                 from car in grouped.Skip(1)
                 select car;
Gage
What is your "group by" for an array?
Merlyn Morgan-Graham
And what is the algorithmic complexity of this?
Merlyn Morgan-Graham
As I said, it's not my algorithm; it's from the other page. And the group by is for a list (LINQ).
Gage
+1  A: 

What about writing your own list comparer:

class ListComparer:IEqualityComparer<List<int>>
{
     public bool Equals(List<int> x, List<int> y)
     {
        if(x.Count != y.Count)
          return false;

        for(int i = 0; i < x.Count; i++)
          if(x[i] != y[i])
             return false;

       return true;
     }

     public int GetHashCode(List<int> obj)
     {
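        // Note: this ignores obj and returns the comparer's own hash code, so every
        // list gets the same hash and Distinct falls back to calling Equals on each
        // candidate pair (see the complexity discussion in the comments below).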
        return base.GetHashCode();
     }
}

and then just:

var nonDuplicatedList = list.Distinct(new ListComparer());
var distinctCount = nonDuplicatedList.Count();
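
If you also want the counts from the question, a small extension (echoing a suggestion in the comments below) could be:

var duplicateCount = list.Count - distinctCount;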
ŁukaszW.pl
Your code will run in O(n^2*m). If you first calculate hash codes, then sort, then go over the list, you get O(n*m + n*log n + n), which is much better.
Andrey
This isn't the problem he was trying to solve. He was trying to determine if the list had duplicates, not what the unique set is.
Merlyn Morgan-Graham
`I would like to know that there are 4 total items, 2 of which are duplicates` to me means that he wants at least the count of distinct items.
ŁukaszW.pl
I guess you can subtract the distinct count from the input count. Please add that to your solution, then :) He should probably clarify exactly what he wants, though, because he also implied he wanted a boolean value by saying `I am trying to find out if any of the lists are duplicated`. One might or might not be an easier problem than the other, and one is certainly faster than the other.
Merlyn Morgan-Graham
I am trying to find out the number of distinct lists, in an efficient way.
Nix
Good solution. This code can run in O(n^2*m) as Andrey says, but it gets faster when there are fewer duplicates, and when there are no lists with the same number of elements it's O(n*m). Plus it doesn't involve generating hash codes; comparing the list elements is nicer than comparing hash codes. And *finally*, the questioner notes that the list type is not restricted to int, and this solution is very easily adapted to other types.
Kirk Broadhurst
+6  A: 

Let's try to get the best performance. If n is the number of lists and m is the length of the lists, then we can get O(n*m + n*log n + n), plus some probability of hash codes being equal for different lists.

Major steps:

  1. Calculate hash codes*
  2. Sort them
  3. Go over list to find dupes

* this is an important step. For simplicity you can calculate the hash as ... ^ (list[i] << i) ^ (list[i + 1] << (i + 1))

Edit, for those who think that PLINQ can boost things but a good algorithm can't: PLINQ can be added here as well, because all the steps are easily parallelizable.

My code:

static public void Main()
{
    List<List<int>> list = new List<List<int>>(){
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
      new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
    };
    var hashList = list.Select((l, ind) =>
    {
        uint hash = 0;
        for (int i = 0; i < l.Count; i++)
        {
            uint el = (uint)l[i];
            hash ^= (el << i) | (el >> (32 - i));
        }
        return new {hash, ind};
    }).OrderBy(l => l.hash).ToList();
    //hashList.Sort();
    uint prevHash = hashList[0].hash;
    int firstInd = 0;            
    for (int i = 1; i <= hashList.Count; i++)
    {
        if (i == hashList.Count || hashList[i].hash != prevHash)
        {
            for (int n = firstInd; n < i; n++)
                for (int m = n + 1; m < i; m++)
                {
                    List<int> x = list[hashList[n].ind];
                    List<int> y = list[hashList[m].ind];
                    if (x.Count == y.Count && x.SequenceEqual(y))
                        Console.WriteLine("Dupes: {0} and {1}", hashList[n].ind, hashList[m].ind);
                }                    
        }
        if (i == hashList.Count)
            break;
        if (hashList[i].hash != prevHash)
        {
            firstInd = i;
            prevHash = hashList[i].hash;
        }
    }
}
Andrey
I like this slightly better than mine (Dictionary<hash value, List<List<int>>>), because it requires less storage, and makes the algorithmic complexity very obvious. Implement it in C#, and I think we'll have a winner :)
Merlyn Morgan-Graham
Nice overall, but you need a better way of creating hash values for a general solution. That one has all sorts of likely collisions, and collides more and more the further you get into the list.
Rex Kerr
@Rex Kerr: totally agree. Please suggest a better hash function if you know one for this case.
Andrey
Assuming the inner lists' possible items are non-negative integers with a not-too-large maximum, one could try the following hash function (almost certainly requiring an arbitrary-precision arithmetic library like GMP due to the enormous hash values likely to occur): for each inner list, hash to the product of sequential primes raised to the power of the corresponding element in the list. For example, the list (0, 1, 2, 3, 4, 5, 6) would hash to 2^0 * 3^1 * 5^2 * 7^3 * 11^4 * 13^5 * 17^6 = 3,375,486,799,005,529,032,825. Clunky, yes, but it is injective :-)
William
@Will this is too much. These arbitrarily long integers are represented as arrays themselves; a hash must be fixed length.
Andrey
Anything of the form `x(i)*p^(n-i)` is a good choice, with `p` a non-tiny prime. It can be computed easily either iteratively or recursively: `hash(n+1) = hash(n)*p + x(n+1)`.
Rex Kerr
@Rex Kerr: is there any theory behind it? I would like to get the details.
Andrey
@Andrey - I'm not sure where the theory is written down, but it goes something like this: Small values are common, so you should avoid having 2,2 have the same hash as 4,1. The best way to do this is to make small numbers large (e.g. by multiplication; if you shift, and the number happens to be large, you throw away the large bits, while if you multiply it makes a difference to how it ends up wrapping into 32 bits). Then, you also don't want long lists to wrap back on themselves, which means you want to avoid multiplying by factors of 2. A non-tiny prime fulfills both criteria.
Rex Kerr
+3  A: 

Unless you're doing some seriously heavy lifting, perhaps the following simple code will work for you:

var lists = new List<List<int>>()
{
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 4, 2, 4, 5, 6 },
   new List<int>() {0 ,3, 2, 5, 1, 6, 4 }
};

var duplicates = from list in lists
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;

Obviously you could get better performance if you hand-tweak an algorithm such that you don't have to scan the lists each iteration, but there is something to be said for writing declarative, simpler code.

(Plus, thanks to the Awesomeness of LINQ®, by adding a .AsParallel() call to the above code, the algorithm will run on multiple cores, thus running potentially faster than the complex, hand-tweaked solutions mentioned in this thread.)
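
For reference, a sketch of what that parallel variant might look like (untested; the same query, just fed a parallel source):

var duplicates = from list in lists.AsParallel()
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;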

Judah Himango
`thus could potentially run faster than the complex, hand-tweaked solutions posted in this thread`. This depends on the size of the lists. Did you check your algorithmic complexity? I haven't, but if it is N^2, then you might have to increase your processing power exponentially to keep pace. I guess Moore's law would have you covered for now, but still :) For real-world scenarios, though, there will of course be an upper cap to the size of these lists
Merlyn Morgan-Graham
Praise Be To Moore's Law and the multicore revolution. :-)
Judah Himango
@Judah Himango we should work on algorithm improvements, not praise Moore and others. When people follow that route we get better hardware and **still** slow software, because no one makes it better.
Andrey
Your algorithm complexity is O(n*n*m), which is really **awful** when we can shoot for O(n*m + n*log n + n). And now assume you have a server application that runs this code for different clients simultaneously. Will PLINQ help you? Sorry, no, it will make it run slower.
Andrey
I made it very clear in my post that if you're doing heavy lifting (doing this in a tight loop, or on big data sets), you might be better off with a hand-tweaked algorithm. But for small lists, this will work just fine, and indeed may even run faster than some of the hand-tweaked algorithms due to it being easy to parallelize.
Judah Himango
Also note that this might actually run faster than hand-tweaked algorithms. Why? Because those hand-tweaked algorithms are tied down to 1 core. My simple code here can run on the 8 cores I've got right now, and will only continue to run faster as more cores are introduced.
Judah Himango
+2  A: 

Something like this will give you the correct results:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

var duplicates = list
    .ToLookup(l => String.Join(",", l.Select(i => i.ToString()).ToArray()))
    .Where(lk => lk.Count() > 1)
    .SelectMany(group => group);
theburningmonk
String joins, eh? Hackish, I say! :-)
Judah Himango
just a weeee bit :-P well, gets the right result, though I'd admit your solution's neater!
theburningmonk
@theburningmonk: You don't need the "," in your string join.
Merlyn Morgan-Graham
@theburningmonk: Instead of join-then-ToArray, you might be able to come up with a solution that returns an IEnumerable for each integer-to-string conversion; then your storage requirement might drop to only one int-sized string at a time.
Merlyn Morgan-Graham
@Merlyn - there's no overload of String.Join that doesn't require a delimiter, though; besides, I'd need the values to be separated somehow, otherwise these two would get me the same key in the lookup: { 1, 0, 2, 3 }, { 10, 2, 3 }
theburningmonk
@theburningmonk: That's true, you have to resolve cases like the one you described. I also just realized that you create an array, then a string, not the other way around. So you can disregard both my comments :) Although, for the record, `string.Join("", array)` will work.
Merlyn Morgan-Graham
+1  A: 

There are a number of good solutions here already, but I believe this one will consistently run the fastest unless there is some structure of the data that you haven't yet told us about.

  • Create a map from integer key to List<int>, and a map from integer key to List<List<int>>
  • For each List<int>, compute a hash using some simple function like (...((x0)*a + x1)*a + ...)*a + xN, which you can calculate recursively; a should be something like 1367130559 (i.e. some large prime that is not close to any interesting power of 2).
  • Add the hash and the list it comes from as a key-value pair, if it does not exist. If it does exist, look in the second map. If the second map has that key, append the new List<int> to the accumulating list. If not, take the List<int> you looked up from the first map and the List<int> you were testing, and add a new entry in the second map containing a list of those two items.
  • Repeat until you've passed through your entire first list. Now you have a hashmap with a list of potential collisions (the second map), and a hashmap with a list of keys (the first map).
  • Iterate through the second map. For each entry, take the List<List<int>> therein and sort it lexicographically. Now just walk through doing equality comparisons to count the number of different blocks.
  • Your total number of items is equal to the length of your original list.
  • Your number of distinct items is equal to the size of your first hashmap plus the sum of (number of blocks - 1) for each entry in your second hashmap.
  • Your number of duplicate items is the difference of those two numbers (and you can find out all sorts of other things if you want).

If you have N non-duplicate items, and M entries which are duplicates from a set of K items, then it will take you O(N+M+2K) to create the initial hash maps, at the very worst O(M log M) to do the sorting (and probably more like O(M log(M/K))), and O(M) to do the final equality test.
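
A minimal sketch of this bookkeeping (my own illustration of the steps above, using the multiplicative hash and the prime suggested earlier; the method and variable names are mine):

using System;
using System.Collections.Generic;

static void CountDistinct(List<List<int>> input)
{
    const long A = 1367130559;                                  // large prime from above
    var firstSeen  = new Dictionary<long, List<int>>();         // hash -> first list with that hash
    var collisions = new Dictionary<long, List<List<int>>>();   // hash -> all lists sharing that hash

    foreach (var row in input)
    {
        long hash = 0;
        foreach (int x in row)
            hash = unchecked(hash * A + x);                     // (...((x0)*a + x1)*a + ...)*a + xN

        if (!firstSeen.ContainsKey(hash))
            firstSeen[hash] = row;
        else if (collisions.ContainsKey(hash))
            collisions[hash].Add(row);
        else
            collisions[hash] = new List<List<int>> { firstSeen[hash], row };
    }

    // Distinct = one per hash key, plus any extra distinct blocks inside collision buckets.
    int distinct = firstSeen.Count;
    foreach (var bucket in collisions.Values)
    {
        bucket.Sort(CompareLexicographically);                  // sort so equal lists are adjacent
        int blocks = 1;
        for (int i = 1; i < bucket.Count; i++)
            if (CompareLexicographically(bucket[i - 1], bucket[i]) != 0)
                blocks++;
        distinct += blocks - 1;                                 // the first block is already counted
    }

    Console.WriteLine("Total: {0}, distinct: {1}, duplicates: {2}",
        input.Count, distinct, input.Count - distinct);
}

static int CompareLexicographically(List<int> x, List<int> y)
{
    for (int i = 0; i < Math.Min(x.Count, y.Count); i++)
        if (x[i] != y[i]) return x[i].CompareTo(y[i]);
    return x.Count.CompareTo(y.Count);
}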

Rex Kerr