I am trying to optimize a piece of .NET 2.0 C# code that looks like this:

Dictionary<myType, string> myDictionary = new Dictionary<myType, string>();
// some other stuff
// inside a loop check if key is there and if not add element
if(!myDictionary.ContainsKey(currentKey))
{
   myDictionary.Add(currentKey, "");
}

It looks like whoever wrote this code used a Dictionary even though one isn't needed (only the key is being used to store a list of unique values), presumably because it's faster to search than a List of myType objects. This seems obviously wrong, since only the key of the dictionary is ever used, but I am trying to understand the best way to fix it.

Questions:

1) I seem to understand I would get a good performance boost even just using .NET 3.5 HashSet. Is this correct?

2) What would be the best way to optimize the code above in .NET 2.0 and why?

EDIT: This is existing code I am trying to optimize. It loops through tens of thousands of items and calls ContainsKey for each one of them. There's gotta be a better way of doing it (even in .NET 2.0)! :)

+8  A: 

I think you need to break this down into 2 questions

Is Dictionary<myType,string> the best available type for this scenario?

No. Based on your breakdown, HashSet<myType> is clearly the better choice because its usage pattern more accurately fits the scenario.
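As a minimal sketch of what the .NET 3.5 version could look like (the helper and its names are invented for illustration): HashSet<T>.Add already returns false for duplicates, so the ContainsKey-style check disappears entirely.

```csharp
using System;
using System.Collections.Generic;

static class HashSetSketch
{
    // Add silently ignores duplicates and returns false for them,
    // so no separate Contains check is needed.
    public static int CountDistinct(IEnumerable<string> items)
    {
        HashSet<string> seen = new HashSet<string>();
        foreach (string item in items)
        {
            seen.Add(item);
        }
        return seen.Count;
    }

    static void Main()
    {
        string[] items = { "a", "b", "a", "c" };
        Console.WriteLine(CountDistinct(items)); // prints 3
    }
}
```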

Will switching to HashSet<myType> give me a performance boost?

This is really subjective and only a profiler can give you the answer to this question. Likely you'll see a very minor memory size improvement per element in the collection. But in terms of raw computing power I doubt you'll see a huge difference. Only a profiler can tell you if there is one.

Before you ever make a performance related change to your code remember the golden rule.

Don't make any performance related changes until a profiler has told you precisely what is wrong with your code.

Making changes that violate this rule is just guessing. A profiler is the only way to measure the success of a performance fix.
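If a full profiler isn't handy, a rough sketch of that measurement (not a substitute for real profiling; the numbers and helper name here are invented) is to time the hot loop with System.Diagnostics.Stopwatch before and after the change:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

static class TimingSketch
{
    // Times the original ContainsKey-then-Add loop over n keys,
    // with roughly half the keys being duplicates.
    public static long TimeFillMs(int n)
    {
        Dictionary<int, string> dict = new Dictionary<int, string>();
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            int key = i % (n / 2);
            if (!dict.ContainsKey(key))
            {
                dict.Add(key, "");
            }
        }
        sw.Stop();
        return sw.ElapsedMilliseconds;
    }

    static void Main()
    {
        Console.WriteLine("elapsed ms: " + TimeFillMs(100000));
    }
}
```

Run the same loop with the candidate replacement and compare; anything under a few milliseconds' difference is noise.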

JaredPar
+1 but the last sentence should be the big bold one :)
Rex M
@Rex, I made it a bit bigger :)
JaredPar
+1 Agree. Last sentence is the most important part of the post
zebrabox
If this is .NET 2.0, then `Dictionary<TKey, TValue>` is probably the best choice because `HashSet<T>` was added only in .NET 3.5.
0xA3
+1 for profiling before optimizing!
TrueWill
+2  A: 

1) No. A dictionary does a hash on the key, so your lookup should already be O(1). A HashSet should need less memory, though. But honestly, the difference isn't big enough that you will really see a performance boost.

2) Give us some more detail as to what you are trying to accomplish. The code you posted is pretty simple. Have you measured yet? Are you seeing that this method is slow? Don't forget "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." -- Donald Knuth

Bob
Thanks - I am aware of premature optimization being bad. This is existing code I am trying to optimize; it loops through tens of thousands of items and calls ContainsKey for each one of them. There's gotta be a better way of doing it! :)
JohnIdol
+2  A: 
  1. Depending on the size of your keys, you may actually see performance degrade.

  2. One way in 2.0 would be to try to insert it and catch the exception (of course, this depends on how many duplicate keys you plan on having):

foreach (string key in keysToAdd)
{
    try
    {
        dictionary.Add(key, "myvalue");
    }
    catch (ArgumentException)
    {
        // key already exists; handle the duplicate here
    }
}
scottm
@divo, it depends on how many keys you are adding and whether you expect duplicates. The ContainsKey() call does a hash lookup every time, so over tens of thousands of iterations the "simple if" adds up. If you don't expect duplicates, you aren't using exceptions for control flow, because you aren't expecting an exception to be thrown. It can be cheaper than the if in that context.
scottm
I am expecting dupes and I need to filter them out - catching the exceptions, even if it's wrong to use it for flow control, sounds better than calling ContainsKey every single time (most of the items won't be dupes), but will it boost performance?
JohnIdol
You should follow JaredPar's advice on that question. If this portion of code is already noticeably affecting performance, you may try the change and see if you gain anything.
scottm
that was always the plan :)
JohnIdol
+1  A: 

The obvious mistake (if we discuss performance) I can see is the double work done by calling ContainsKey and then adding the key-value pair. When the pair is added using the Add method, the key is internally checked for presence again. The whole if block can be safely replaced by this:

... myDictionary[currentKey] = ""; ...

If the key already exists there, the value will simply be replaced and no exception will be thrown. Moreover, if the value is not used at all, I would personally use null to fill it. I can see no reason for using any string constant there.
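A small sketch of that replacement (the helper and its names are made up for illustration; the behavior relied on is just the documented indexer semantics of Dictionary<TKey, TValue>):

```csharp
using System;
using System.Collections.Generic;

static class IndexerSketch
{
    // The indexer inserts the key if absent and overwrites the value if
    // present, so the ContainsKey check can be dropped entirely.
    public static int CountDistinct(IEnumerable<string> keys)
    {
        Dictionary<string, string> myDictionary = new Dictionary<string, string>();
        foreach (string currentKey in keys)
        {
            myDictionary[currentKey] = null; // value is unused, null is fine
        }
        return myDictionary.Count;
    }

    static void Main()
    {
        string[] keys = { "a", "b", "a", "c", "b" };
        Console.WriteLine(CountDistinct(keys)); // prints 3
    }
}
```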

Alexander
A: 

The possible performance degrade mentioned by scottm is not for doing simple lookups. It is for calculating the intersection between 2 sets. HashSet does have slightly faster lookups than Dictionary. The performance difference really is going to be very small, though, as everyone says -- the lookup takes most of the time & creating the KeyValuePair takes very little.

For 2.0, you could make the "Value" object one of these:

public struct Empty {}

It may do slightly better than the "".
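A quick sketch of that idea (helper names invented): the empty struct stands in for the unused value, so no string reference is stored per entry.

```csharp
using System;
using System.Collections.Generic;

public struct Empty {}

static class EmptyValueSketch
{
    // Only the key carries information; the zero-size struct is a placeholder.
    public static int CountDistinct(IEnumerable<string> keys)
    {
        Dictionary<string, Empty> set = new Dictionary<string, Empty>();
        foreach (string key in keys)
        {
            set[key] = new Empty();
        }
        return set.Count;
    }

    static void Main()
    {
        Console.WriteLine(CountDistinct(new string[] { "x", "y", "x" })); // prints 2
    }
}
```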

Or you could try making a reference to System.Core.dll in your 2.0 project, so you can use the HashSet.

Also, make sure that GetHashCode and Equals are as efficient as possible for MyType. I've been bitten by using a dictionary on something with a really slow GetHashCode (I believe we tried to use a delegate as a key or something like that.)
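To illustrate that last point, here is a hypothetical myType with cheap, consistent Equals and GetHashCode overrides (the id field and helper are invented for the sketch; the key rule is that equal objects must produce equal hash codes):

```csharp
using System;
using System.Collections.Generic;

// Hypothetical key type; the id field is invented for this sketch.
public class MyType
{
    private readonly int id;

    public MyType(int id) { this.id = id; }

    public override bool Equals(object obj)
    {
        MyType other = obj as MyType;
        return other != null && other.id == this.id;
    }

    // Keep this cheap and consistent with Equals: equal ids, equal hashes.
    public override int GetHashCode()
    {
        return id;
    }

    public static int CountDistinct(IEnumerable<MyType> keys)
    {
        Dictionary<MyType, string> dict = new Dictionary<MyType, string>();
        foreach (MyType key in keys)
        {
            dict[key] = "";
        }
        return dict.Count;
    }
}
```

With these overrides, two distinct MyType instances with the same id count as one key, and the hash computation is a single field read.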