I'm working on an application that uses a number of large dictionaries (up to 10^6 elements), whose size is unknown in advance (though I can guess in some cases). I'm wondering how Dictionary is implemented, i.e. how bad the effect is if I don't give an initial estimate of its size. Does it internally use a (self-growing) array the way List does? In that case, letting the dictionaries grow could leave a lot of large unreferenced arrays on the LOH.
The best way to find out is to use .NET Reflector:
http://www.red-gate.com/products/reflector/
Use the disassembled code to see the implementation.
Hash tables normally have something called a load factor: when the ratio of entries to buckets reaches this threshold, the backing bucket store is enlarged. IIRC the default is something like 0.72. If you had perfect hashing, it could be increased to 1.0.
Also, whenever the hash table needs more buckets, the entire collection has to be rehashed.
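As a rough illustration of that mechanism, here is a toy sketch. This is not the BCL code; the 0.72 threshold, the doubling of the bucket count, and the chaining strategy are assumptions used only to show when a rehash happens:

```csharp
using System.Collections.Generic;

// Toy hash table illustrating the load-factor mechanism described above.
// NOT the BCL implementation: the 0.72 threshold and the doubling of the
// bucket count are assumptions, only meant to show when rehashing occurs.
class ToyHashTable<TKey, TValue>
{
    private const double LoadFactor = 0.72;
    private List<KeyValuePair<TKey, TValue>>[] buckets = new List<KeyValuePair<TKey, TValue>>[7];
    private int count;

    public void Add(TKey key, TValue value)
    {
        // Once the fill ratio passes the load factor, grow and rehash everything.
        if (count + 1 > buckets.Length * LoadFactor)
            Resize(buckets.Length * 2);

        AddToBucket(buckets, key, value);
        count++;
    }

    private void Resize(int newSize)
    {
        // Every existing entry has to be re-bucketed, because the bucket
        // index depends on the table size.
        var newBuckets = new List<KeyValuePair<TKey, TValue>>[newSize];
        foreach (var bucket in buckets)
        {
            if (bucket == null) continue;
            foreach (var entry in bucket)
                AddToBucket(newBuckets, entry.Key, entry.Value);
        }
        buckets = newBuckets;
    }

    private static void AddToBucket(List<KeyValuePair<TKey, TValue>>[] table, TKey key, TValue value)
    {
        int index = (key.GetHashCode() & 0x7FFFFFFF) % table.Length;
        if (table[index] == null)
            table[index] = new List<KeyValuePair<TKey, TValue>>();
        table[index].Add(new KeyValuePair<TKey, TValue>(key, value));
    }
}
```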
Using Reflector, I found the following: the Dictionary keeps the data in a struct array. It keeps a count of how many empty places are left in that array. When you add an item and no empty place is left, it increases the size of the internal array (see below) and copies the data from the old array to the new array.
So I would suggest using the constructor that takes an initial capacity if you know there will be many entries.
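For example (the key/value types and the count of one million are just placeholders for whatever your application uses):

```csharp
using System.Collections.Generic;

// If you roughly know how many entries will end up in the dictionary,
// pass that estimate to the constructor so the internal arrays are
// allocated once instead of being grown and copied repeatedly.
var lookup = new Dictionary<string, int>(1000000);

for (int i = 0; i < 1000000; i++)
    lookup.Add("key" + i, i);   // no intermediate resizes for the first million entries
```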
EDIT: The logic is actually quite interesting: there is an internal class called HashHelpers that finds primes. To speed this up, it also stores some primes in a static array, from 3 up to 7199369 (some are missing; for the reason, see below). When you supply a capacity, it finds the next prime (equal or larger) in that array and uses it as the initial capacity. If you pass a number larger than anything in the array, it starts checking candidates manually.
So if nothing is passed as capacity to the Dictionary, the starting capacity is three.
Once the capacity is exceeded, it doubles the current capacity and then finds the next larger prime using the helper class. That is why not every prime is needed in the array: primes that are "too close together" aren't really needed.
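A rough sketch of that capacity logic (the real HashHelpers consults its precomputed table first; the brute-force search below corresponds to the slow path you hit beyond the table, and the numbers in Main are only examples):

```csharp
using System;

// Rough sketch of the capacity logic described above. NOT the actual
// HashHelpers code: the real class looks in its precomputed prime table
// first and only falls back to manual checking beyond it.
static class CapacitySketch
{
    static bool IsPrime(int candidate)
    {
        if (candidate < 2) return false;
        for (int divisor = 2; divisor * divisor <= candidate; divisor++)
            if (candidate % divisor == 0) return false;
        return true;
    }

    // Smallest prime >= min: roughly what happens to a requested capacity.
    static int NextPrime(int min)
    {
        int candidate = Math.Max(min, 2);
        while (!IsPrime(candidate)) candidate++;
        return candidate;
    }

    static void Main()
    {
        // Capacity you ask for -> prime-sized capacity actually used.
        Console.WriteLine(NextPrime(1000000));
        // Growth step as described above: double the old size, round up to a prime.
        Console.WriteLine(NextPrime(2 * 4999559));
    }
}
```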
So if we pass no initial value, we would get (I checked the internal array):
- 3
- 7
- 17
- 37
- 71
- 163
- 353
- 761
- 1597
- 3371
- 7013
- 14591
- 30293
- 62851
- 130363
- 270371
- 560689
- 1162687
- 2411033
- 4999559
Once we pass this size, the next step falls outside the internal array, and it will manually search for larger primes. This will be quite slow. You could initialize with 7199369 (the largest value in the array), or consider whether having more than about 5 million entries in a Dictionary means you should reconsider your design.
MSDN says: "Retrieving a value by using its key is very fast, close to O(1), because the Dictionary class is implemented as a hash table." and further on "the capacity is automatically increased as required by reallocating the internal array."
But you get fewer reallocations if you give an initial estimate. If you have all the items from the beginning, the LINQ method ToDictionary might be handy.
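For example (the anonymous item shape and the key/value selectors are only illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

// If all items are available up front, ToDictionary builds the dictionary
// in one pass from the source sequence.
var items = Enumerable.Range(0, 1000000)
                      .Select(i => new { Id = i, Name = "item" + i });

Dictionary<int, string> byId = items.ToDictionary(x => x.Id, x => x.Name);
```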