ansaurus

Question

Performance when checking for duplicates

Answer 1

+3 A:

The HashSet class was added in .NET 3.5, but I'd guess its performance is on par with the Dictionary. If you have fewer than, say, 100 elements, a List will probably perform better.
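As a rough sketch of how a HashSet could be used for the duplicate check (this assumes .NET 3.5 or later; the class and method names `DuplicateFinder` and `FindDuplicates` are made up for illustration):

```csharp
using System;
using System.Collections.Generic;

static class DuplicateFinder
{
    // Returns each key that occurs more than once in the input, reported once,
    // in the order in which the first repeat is encountered.
    public static List<int> FindDuplicates(IEnumerable<int> keys)
    {
        HashSet<int> seen = new HashSet<int>();
        HashSet<int> reported = new HashSet<int>();
        List<int> duplicates = new List<int>();

        foreach (int key in keys)
        {
            // HashSet<T>.Add returns false when the item is already present,
            // so one call both tests for and records the key.
            if (!seen.Add(key) && reported.Add(key))
            {
                duplicates.Add(key);
            }
        }
        return duplicates;
    }
}
```

Whether this beats a Dictionary or a small List in practice would need to be measured on the real data.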

leppie 2008-09-18 12:12:21

A HashSet is exactly what I want. Unfortunately we're restricted to .NET 2.0; however, using the link from @Rob about making LINQ work in .NET 2.0, I'm trying to get the HashSet working in our environment.

Timothy Carter 2008-09-19 11:12:43

Answer 2

A:

I don't really get what you are asking.

First, it's just the opposite of what you say: the Dictionary has indexed access (it is a hash table), while the List doesn't.

If you already have the data in a dictionary then all keys are unique, there can be no duplicates.

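I suspect you have the data stored in another data type and you're copying it into the dictionary. If that's the case, inserting the data will work with two dictionaries (here both are assumed to be Dictionary<int, bool>, with the value unused).

foreach (int key in keys)
{
  if (MyDataDict.ContainsKey(key))
  {
    // key was seen before: record it as a duplicate, but only once
    if (!MyDuplicatesDict.ContainsKey(key))
      MyDuplicatesDict.Add(key, true);
  }
  else
    MyDataDict.Add(key, true); // first time we see this key
}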

Jorge Córdoba 2008-09-18 12:16:02

Answer 3

+1 A:

Edit: Never mind my comment. I thought you were talking about C++. I have no idea if my post is relevant in the C# world.

A hash table could be a tad faster. Binary trees (that's what is used in the dictionary) tend to be relatively slow because of the way memory gets accessed. This is especially true if your tree becomes very large.

However, before you change your data-structure, have you tried to use a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.

You may see a factor-of-10 speedup just from plugging a simple pool allocator into the dictionary template. AFAIK Boost has a component that can be used directly.

Another option: if you know that only 64,000 distinct values occur among your integers, you can write those to a file and create a perfect hash function for them. That way you can use the hash function to map your integers into the range 0 to 64,000 and index a bit array.

Probably the fastest way, but less flexible. You have to redo your perfect hash function (can be done automatically) each time your set of integers changes.

Nils Pipenbrinck 2008-09-18 12:17:43

Answer 4

A:

If you are checking for uniqueness of integers, and the range of integers is constrained enough, then you could just use an array.

For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way, if your maximum number is 1,000,000, you only need 31,250 ints, i.e. about 122 KB of memory, for the data structure.

Performance of a bitmap would be O(1) per check, which is hard to beat.
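A minimal sketch of such a bitmap, assuming all keys are non-negative and no larger than a known maximum (the class name `KeyBitmap` and its `TestAndSet` method are hypothetical, not framework types):

```csharp
using System;

// Hypothetical helper: one bit per possible key in [0, maxKey].
sealed class KeyBitmap
{
    private readonly uint[] bits;

    public KeyBitmap(int maxKey)
    {
        // One uint covers 32 keys; the +1 makes room for maxKey itself.
        bits = new uint[maxKey / 32 + 1];
    }

    // Marks the key as seen and returns true if it had already been seen.
    // Keys outside [0, maxKey] are not supported in this sketch.
    public bool TestAndSet(int key)
    {
        int word = key >> 5;           // which uint holds the bit (key / 32)
        uint mask = 1u << (key & 31);  // which bit inside that uint (key % 32)
        bool wasSet = (bits[word] & mask) != 0;
        bits[word] |= mask;
        return wasSet;
    }
}
```

A key is a duplicate exactly when `TestAndSet` returns true. The framework's System.Collections.BitArray is built on the same idea and may be enough if you don't want to write the bit twiddling yourself.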

Rob Walker 2008-09-18 12:21:50

Answer 5

A:

There was a question a while back on removing duplicates from an array. For the purpose of that question performance wasn't much of a consideration, but you might want to take a look at the answers, as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array, then a LINQ command like Enumerable.Distinct might give you better performance than something you write yourself. As it turns out, there is a way to get LINQ working on .NET 2.0, so this might be a route worth investigating.
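For illustration only, here is roughly what the LINQ route could look like, assuming LINQ to Objects is available (the `LinqDuplicates` class and its method names are invented for this sketch, and no performance claim is implied):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class LinqDuplicates
{
    // The input with repeated keys removed.
    public static int[] Unique(IEnumerable<int> keys)
    {
        return keys.Distinct().ToArray();
    }

    // The keys that occur more than once, each listed a single time.
    public static int[] Repeated(IEnumerable<int> keys)
    {
        return keys.GroupBy(k => k)
                   .Where(g => g.Count() > 1)
                   .Select(g => g.Key)
                   .ToArray();
    }
}
```

Both methods walk the whole input, so they are best compared against a hand-written loop with a profiler rather than assumed to be faster.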

Rob 2008-09-18 12:26:30

Answer 6

A:

If you're going to use a List, use BinarySearch:

// initialize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> DuplicateKeys = new Dictionary<int,int>();

foreach ( int Key in MyKeys )
{
   // this is an O(log N) operation
   int index = FoundKeys.BinarySearch( Key );
   if ( index < 0 )
   {
       // if the Key is not in our list,
       // index is the bitwise complement of the index of the next larger element,
       // i.e. the position it should occupy, and we maintain sorted-ness!
       // (Insert itself shifts the later elements, so it is O(N) in the worst case)
       FoundKeys.Insert( ~index, Key );
   }
   else
   {
       // Key was seen before: count how many extra times it occurs
       if ( DuplicateKeys.ContainsKey( Key ) )
       {
           DuplicateKeys[Key]++;
       }
       else
       {
           DuplicateKeys.Add( Key, 1 );
       }
   }
}

You can also use this for any type for which you can define an IComparer by using an overload: BinarySearch( T item, IComparer<T> comparer );
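As a hedged example of that overload, the sketch below applies the same sorted-list approach to string keys compared case-insensitively. The `SortedListDuplicates` class and `CountRepeats` method are hypothetical names; `StringComparer.OrdinalIgnoreCase` is used because it serves as both an IComparer<string> and an equality comparer.

```csharp
using System;
using System.Collections.Generic;

static class SortedListDuplicates
{
    // Counts how many extra times each key occurs, ignoring case.
    public static Dictionary<string, int> CountRepeats(IEnumerable<string> keys)
    {
        IComparer<string> comparer = StringComparer.OrdinalIgnoreCase;
        List<string> found = new List<string>();
        Dictionary<string, int> repeats =
            new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase);

        foreach (string key in keys)
        {
            // The list must stay sorted under the same comparer used to search it.
            int index = found.BinarySearch(key, comparer);
            if (index < 0)
            {
                found.Insert(~index, key);
            }
            else
            {
                int count;
                repeats.TryGetValue(key, out count); // count is 0 when the key is absent
                repeats[key] = count + 1;
            }
        }
        return repeats;
    }
}
```

The important detail is that searching and inserting use one and the same comparer; mixing orderings would leave the list unsorted from BinarySearch's point of view.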

2008-09-18 16:39:55
