tags:

views:

598

answers:

5

Ok, here's my situation:

I have an Array of States, which may contain duplicates. To get rid of the duplicates, I can add them all to a Set.

However when I create the Set, it wants the initial capacity and load factor to be defined, but what should they be set to?

From googling, I have come up with:

String[] allStates = getAllStates();
Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75);

The problem with this, is that allStates can contain anwhere between 1 and 5000 states. So the Set will have capacity of over 5000, but only containing at most 50.

So alternatively set the max size of the Set could be set to be the max number of states, and the load factor to be 1.

I guess my questions really are:

  • What should you set the initial capacity to be when you don't know how many items are to be in the Set?
  • Does it really matter what it gets set to when the most it could contain is 50?
  • Should I even be worrying about it?

Thanks for any advice, Clarkey.

+6  A: 

Use the constructor where you don't need to specify these values, then reasonable defaults are chosen.

starblue
Spot on. Worry about performance only if becomes an issue
basszero
Well, sensibly avoid performance issues but don't microoptimise.
Tom Hawtin - tackline
A: 

Make a good guess. There is no hard rule. If you know there's likely to be say 10-20 states, i'd start off with that number (20).

tehvan
+1  A: 

The safe bet is go for a size that is too small.

Because resizing is ameliorated by an exponential growth algorithm (see the stackoverflow podcast from a few weeks back), going small will never cost you that much. If you have lots of sets (lucky you), then it will matter to performance if they are oversize.

Load factor is a tricky one. I suggest leaving it at the default. I understand: Below about 0.70f you are making the array too large and therefore slower. Above 0.80f and you'll start getting to many key clashes. Presumably probing algorithms will require lower load factors than bucket algorithms.

Also note that the "initial capacity" means something slightly different than it appears most people think. It refers to the number of entries in the array. To get the exact capacity for a number of elements, divide by the desired load factor (and round appropriately).

Tom Hawtin - tackline
+1  A: 

Assuming that you know there won't be more than 50 states (do you mean US States?), the

Set<String> uniqueStates = new HashSet<String>(allStates.length, 0.75);

quoted is definitely wrong. I'd suggest you go for an initial capacity of 50 / 0.75 = 67, or perhaps 68 to be on the safe side.

I also feel the need to point out that you're probably overthinking this intensely. Resizing the arraylist twice from 16 up to 64 isn't going to give you a noticeable performance hit unless this is right in the most performance-critical part of the program.

So the best answer is probably to use:

new HashSet<String>();

That way, you won't come back a year later and puzzle over why you chose such strange constructor arguments.

Zarkonnen
A: 

I second Zarkonnen. Your last question is the most important one. If this happens to occur in a hotspot of your application it might be worth the effort to look at it and try to optimise, otherwise CPU cycles are cheaper than burning up your own neurons.

Jeroen van Bergen