I've been working on a project where I need to iterate through a collection of data and remove entries where the "primary key" is duplicated. I have tried using a List<int> and a Dictionary<int, bool>.

With the dictionary I found slightly better performance, even though I never need the Boolean tagged with each entry. My expectation is that this is because a List allows for indexed access and a Dictionary does not. What I was wondering is: is there a better solution to this problem? I do not need to access the entries again; I only need to track what "primary keys" I have seen and make sure I only perform additional work on entries that have a new primary key.

I'm using C# and .NET 2.0, and I have no control over fixing the input data to remove the duplicates from the source (unfortunately!). So you can have a feel for scaling: overall I'm checking for duplicates about 1,000,000 times in the application, but in subsets of no more than about 64,000 that need to be unique.
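For reference, here is a minimal sketch of the Dictionary<int, bool> approach described above; the 'keys' array and the loop body are placeholders standing in for the real data, not the asker's actual code:

using System.Collections.Generic;

// tracks which primary keys have already been seen; the bool value is never used
Dictionary<int, bool> seenKeys = new Dictionary<int, bool>();
int[] keys = { 1, 2, 2, 3, 1 };   // hypothetical stream of primary keys

foreach (int key in keys)
{
    if (!seenKeys.ContainsKey(key))
    {
        seenKeys.Add(key, true);
        // ... additional work for a new primary key goes here ...
    }
}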

+3  A: 

They have added the HashSet class in .NET 3.5, but I guess it will be on par with the Dictionary. If you have fewer than, say, 100 elements, a List will probably perform better.
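For example, a minimal sketch assuming .NET 3.5 (or a back-ported HashSet) is available; the 'keys' and 'ProcessEntry' names are placeholders, not part of the asker's code:

using System.Collections.Generic;

HashSet<int> seenKeys = new HashSet<int>();
int[] keys = { 1, 2, 2, 3, 1 };    // hypothetical stream of primary keys

foreach (int key in keys)
{
    // Add returns false if the key was already present,
    // so the set doubles as the duplicate check
    if (seenKeys.Add(key))
    {
        ProcessEntry(key);         // hypothetical "additional work" for a new key
    }
}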

leppie
A HashSet is exactly what I want; unfortunately we're restricted to .NET 2.0. However, using the link from @Rob about making LINQ work in .NET 2.0, I'm trying to get the HashSet working in our environment.
Timothy Carter
A: 

I don't really get what you are asking.

Firstly, it's just the opposite of what you say: the Dictionary has indexed (hashed) access to its keys, while the List does not.

If you already have the data in a dictionary, then all keys are unique; there can be no duplicates.

I suspect you have the data stored in another data type and you're copying it into the dictionary. If that's the case, inserting the data will work with two dictionaries:

foreach (int key in keys)
{
  if (!MyDataDict.ContainsKey(key))
  {
    // new key -- insert it into the data dictionary
    // ('value' stands for whatever data belongs with this key)
    MyDataDict.Add(key, value);
  }
  else
  {
    // key already present -- record it as a duplicate
    if (!MyDuplicatesDict.ContainsKey(key))
      MyDuplicatesDict.Add(key, true);
  }
}
Jorge Córdoba
+1  A: 

Edit: Never mind my comment. I thought you were talking about C++. I have no idea if my post is relevant in the C# world.

A hash table could be a tad faster. Binary trees (that's what's used in the dictionary) tend to be relatively slow because of the way the memory gets accessed. This is especially true if your tree becomes very large.

However, before you change your data structure, have you tried using a custom pool allocator for your dictionary? I bet the time is not spent traversing the tree itself but in the millions of allocations and deallocations the dictionary will do for you.

You may see a factor-of-10 speed boost just by plugging a simple pool allocator into the dictionary template. AFAIK Boost has a component that can be used directly.

Another option: if you know only 64,000 distinct integers exist in your data, you can write those to a file and create a perfect hash function for them. That way you can just use the hash function to map your integers into the 0 to 64,000 range and index a bit array.

This is probably the fastest way, but it's less flexible. You have to redo your perfect hash function (which can be done automatically) each time your set of integers changes.

Nils Pipenbrinck
A: 

If you are checking for uniqueness of integers, and the range of integers is constrained enough, then you could just use an array.

For better packing you could implement a bitmap data structure (basically an array, but each int in the array represents 32 ints in the key space by using 1 bit per key). That way, if your maximum number is 1,000,000, you only need about 31,250 ints (~122 KB of memory) for the data structure.

Performance of a bitmap would be O(1) per check, which is hard to beat.
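A minimal sketch of this idea using the built-in System.Collections.BitArray (available in .NET 2.0); the key range and the 'keys' array are assumptions for illustration:

using System.Collections;

const int maxKey = 1000000;                  // assumed upper bound on key values
BitArray seen = new BitArray(maxKey + 1);    // one bit per possible key (~122 KB)
int[] keys = { 5, 42, 5, 77 };               // hypothetical incoming keys

foreach (int key in keys)
{
    if (!seen[key])
    {
        seen[key] = true;
        // first occurrence -- do the additional work here
    }
    // else: duplicate, skip it
}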

Rob Walker
A: 

There was a question a while back on removing duplicates from an array. For the purposes of that question performance wasn't much of a consideration, but you might want to take a look at the answers, as they might give you some ideas. Also, I might be off base here, but if you are trying to remove duplicates from the array, then a LINQ method like Enumerable.Distinct might give you better performance than something you write yourself. As it turns out, there is a way to get LINQ working on .NET 2.0, so this might be a route worth investigating.
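As a rough sketch of the Distinct approach, assuming a LINQ-on-2.0 bridge such as LINQBridge is referenced (which still requires the C# 3.0 compiler even when targeting the 2.0 runtime); the 'keys' array is a placeholder:

using System.Collections.Generic;
using System.Linq;

int[] keys = { 1, 2, 2, 3, 1 };              // hypothetical incoming keys
IEnumerable<int> uniqueKeys = keys.Distinct();

foreach (int key in uniqueKeys)
{
    // the additional work only happens once per primary key
}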

Rob
A: 

If you're going to use a List, use BinarySearch:

 // initialize to a size if you know your set size
List<int> FoundKeys = new List<int>( 64000 );
Dictionary<int,int> FoundDuplicates = new Dictionary<int,int>();

foreach ( int Key in MyKeys )
{
   // this is an O(log N) operation
   int index = FoundKeys.BinarySearch( Key );
   if ( index < 0 ) 
   {
       // if the Key is not in our list, 
       // index is the bitwise complement of the index of the next larger value,
       // i.e. the position it should occupy, and we maintain sorted-ness!
       FoundKeys.Insert( ~index, Key );
   }
   else 
   {
       // the Key is already in the list -- count it as a duplicate
       if ( FoundDuplicates.ContainsKey( Key ) )
       {
           FoundDuplicates[Key]++;
       }
       else
       {
           FoundDuplicates.Add( Key, 1 );
       }
   } 
}

You can also use this for any type for which you can define an IComparer<T>, by using the overload BinarySearch( T item, IComparer<T> comparer ).
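For instance, a small sketch of that overload using the built-in StringComparer (which implements IComparer<string>); the list and value here are made up for illustration:

using System;
using System.Collections.Generic;

List<string> names = new List<string>();
// case-insensitive ordering via the IComparer<string> overload
int index = names.BinarySearch( "alice", StringComparer.OrdinalIgnoreCase );
if ( index < 0 )
{
    names.Insert( ~index, "alice" );
}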