I have been looking into the code behind Dictionary&lt;TKey, TValue&gt;. What I find interesting is that the private Insert method works against a pre-sized buckets array that appears to hold empty slots. Inside Insert, the code checks whether any free slots remain and resizes if necessary; the new size is based on a prime number. Also, each dictionary entry is stored in a struct holding the hash code, key, and value.

My question: what is the purpose of this? Is it done to avoid trying to add items to the dictionary when sufficient memory might not be available?

NOTE: I didn't want to paste any of the code here since it requires disassembling to read.

+1  A: 

Every time the collection needs to be resized, it causes a bit of thrashing on the heap that takes some time. These 'empty slots' are pre-allocated to avoid that.

Most collections have constructors that let you specify an initial size (capacity), and some also accept a growth factor. If you know the exact size up front, specifying it is the best thing you can do as far as resizing is concerned.
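As a rough illustration of that advice (shown here in Java, whose HashMap takes an initial capacity much like Dictionary's capacity constructor; the numbers below are just example values, not anything from the original post):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizeDemo {
    public static void main(String[] args) {
        // Pre-sizing the map means the backing bucket array is allocated
        // once up front, instead of being rehashed repeatedly as entries
        // are added one by one.
        int expected = 10_000;
        // HashMap resizes when size exceeds capacity * loadFactor (0.75
        // by default), so over-allocate slightly to avoid any resize.
        Map<Integer, String> map = new HashMap<>(expected * 4 / 3 + 1);
        for (int i = 0; i < expected; i++) {
            map.put(i, "value-" + i);
        }
        System.out.println(map.size()); // prints 10000
    }
}
```

The same idea applies directly in .NET: `new Dictionary<int, string>(10000)` sizes the internal arrays once at construction.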

Andrew Barber
ah... that makes sense. Apparently the resizing is progressive *and* based on a prime number (roughly doubling, then moving to the next prime). I suppose this is another interesting aspect, and I wonder why a prime is used for the resize...
dboarman
Yeah; I think there's something statistical about using a prime number for the size: it tends to spread keys more evenly across buckets for 'typical' key patterns, or something like that. heh
Andrew Barber
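The intuition behind the prime sizing can be sketched with a small experiment (a hedged illustration in Java; the stride and table sizes are made-up example values): keys that share a common factor with the table size collide into a fraction of the buckets, while a prime table size spreads them out.

```java
public class PrimeBucketsDemo {
    // Count how many distinct buckets n strided keys land in,
    // using the usual bucket index = key % tableSize.
    static int distinctBuckets(int tableSize, int stride, int n) {
        boolean[] used = new boolean[tableSize];
        int count = 0;
        for (int i = 0; i < n; i++) {
            int bucket = (i * stride) % tableSize;
            if (!used[bucket]) {
                used[bucket] = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Keys that are multiples of 4: with 16 buckets they pile into
        // only the buckets divisible by 4, but with a prime size (17)
        // the stride is coprime to the table and every key gets its
        // own bucket.
        System.out.println(distinctBuckets(16, 4, 16)); // prints 4
        System.out.println(distinctBuckets(17, 4, 16)); // prints 16
    }
}
```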
+1  A: 

The Dictionary&lt;TKey,TValue&gt; object is not adding new empty values with this approach. What it's doing is pre-allocating the backing storage for the data it will later be asked to hold. The end goal is that the average insert does not require an allocation to complete: the item simply claims a free slot in the existing bucket and entry arrays.
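A minimal sketch of that layout (in Java, and simplified; this is my reading of the structure the question describes, not the actual BCL source): a buckets array of chain-head indices plus pre-allocated parallel entry arrays, so inserting into a non-full table touches no allocator at all.

```java
import java.util.Arrays;

// Toy fixed-capacity hashtable mimicking the described layout: each
// "entry struct" (hashCode, next, key, value) is split across parallel
// arrays that are all allocated once, up front.
public class MiniDictionary {
    private final int[] buckets;    // index of first entry per chain, -1 if empty
    private final int[] hashCodes;
    private final int[] nexts;      // index of next entry in the same chain
    private final String[] keys;
    private final int[] values;
    private int count;              // next free slot in the entry arrays

    public MiniDictionary(int capacity) {
        buckets = new int[capacity];
        Arrays.fill(buckets, -1);
        hashCodes = new int[capacity];
        nexts = new int[capacity];
        keys = new String[capacity];
        values = new int[capacity];
    }

    public void insert(String key, int value) {
        int hash = key.hashCode() & 0x7fffffff;
        int bucket = hash % buckets.length;
        // Overwrite the value if the key already exists in this chain.
        for (int i = buckets[bucket]; i >= 0; i = nexts[i]) {
            if (hashCodes[i] == hash && keys[i].equals(key)) {
                values[i] = value;
                return;
            }
        }
        if (count == keys.length) {
            throw new IllegalStateException("full; the real table would resize here");
        }
        int slot = count++;            // claim a pre-allocated slot: no heap allocation
        hashCodes[slot] = hash;
        keys[slot] = key;
        values[slot] = value;
        nexts[slot] = buckets[bucket]; // chain onto the front of the bucket
        buckets[bucket] = slot;
    }

    public Integer find(String key) {
        int hash = key.hashCode() & 0x7fffffff;
        for (int i = buckets[hash % buckets.length]; i >= 0; i = nexts[i]) {
            if (hashCodes[i] == hash && keys[i].equals(key)) return values[i];
        }
        return null; // not present
    }
}
```

Note how the chains live inside the entry arrays themselves (via the `nexts` indices) rather than as separately allocated linked-list nodes.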

The other items you mentioned, like the prime-number sizing and the stored hash codes, are properties common to most hashtable implementations. Instead of going over each of them here, I'm going to point you to the Wikipedia article on hash tables.

JaredPar
According to wiki: *For separate-chaining, the worst-case scenario is when all entries were inserted into the same bucket, in which case the hash table is ineffective and the cost is that of searching the bucket data structure. If the latter is a linear list, the lookup procedure may have to scan all its entries; so the worst-case cost is proportional to the number n of entries in the table.* Am I wrong, or is MS using this method?
dboarman
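That worst case is easy to demonstrate (a hedged sketch in Java; the chain below stands in for one overloaded bucket when a degenerate hash sends every key to the same index): lookups degrade to a linear scan proportional to n.

```java
public class WorstCaseDemo {
    // If every key hashes to one bucket, a lookup must walk that
    // bucket's chain entry by entry. Count the key comparisons.
    static int probesFor(String target, String[] chain) {
        int probes = 0;
        for (String key : chain) {
            probes++;
            if (key.equals(target)) return probes;
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        int n = 1000;
        String[] chain = new String[n]; // the single overloaded bucket
        for (int i = 0; i < n; i++) chain[i] = "key-" + i;

        System.out.println(probesFor("key-0", chain));   // prints 1    (best case)
        System.out.println(probesFor("key-999", chain)); // prints 1000 (worst case, O(n))
    }
}
```

In practice the stored hash codes let the table reject most non-matching entries with a cheap integer comparison before calling equals, but the scan length is still O(n) when everything collides.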