Unfortunately, you are going to have to benchmark this yourself, because the relative performance will depend critically on the actual String values, and also on the relative probability that you will test a string that is not in your mapping. And of course, it depends on how `String.equals()` and `String.hashCode()` are implemented, as well as the details of the `HashMap` and `List` classes used.
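For example, a rough measurement along these lines would at least tell you where the break-even point is for your data. The key set, probe, and iteration count below are purely illustrative; for trustworthy numbers a harness such as JMH is preferable, since a hand-rolled loop like this is exposed to JIT warm-up and dead-code-elimination effects:

```java
import java.util.*;

// Naive benchmark sketch: compare a hash lookup with a linear list scan
// for the same probe key. Illustrative only; use JMH for real measurements.
public class LookupBench {
    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alpha", "beta", "gamma", "delta", "epsilon");
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) map.put(keys.get(i), i);

        String probe = "epsilon";        // a "hit"; also measure a key not in the set
        int iterations = 5_000_000;
        long sink = 0;                   // consumed below so the loops are not optimized away

        long t0 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += map.containsKey(probe) ? 1 : 0;
        long t1 = System.nanoTime();
        for (int i = 0; i < iterations; i++) sink += keys.contains(probe) ? 1 : 0;
        long t2 = System.nanoTime();

        System.out.printf("hash: %d ms, list: %d ms (sink=%d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, sink);
    }
}
```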
In the case of a `HashMap`, a lookup will typically involve calculating the hash of the key String, and then comparing the key String with one or more entry key Strings. The hashcode calculation looks at all characters of the String, so its cost depends on the length of the key String. The `equals` operation will typically examine all of the characters when it returns `true`, and considerably fewer when it returns `false`. The actual number of times that `equals` is called for a given key string depends on how the hashed key strings are distributed. Normally, you'd expect an average of 1 or 2 calls to `equals` for a "hit" and maybe up to 3 for a "miss".
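To make that cost structure concrete, here is a much-simplified sketch of a chained hash-table lookup. This is illustrative rather than `HashMap`'s actual implementation (which also spreads the hash bits and can treeify long bins), but the shape is the same: one `hashCode()` pass over the key, then `equals()` only against entries whose hash landed in the same bucket.

```java
// Simplified chained hash table for String -> int (illustrative names).
final class SimpleStringMap {
    private static final class Node {
        final int hash; final String key; final int value; final Node next;
        Node(int hash, String key, int value, Node next) {
            this.hash = hash; this.key = key; this.value = value; this.next = next;
        }
    }

    private final Node[] table = new Node[16];

    void put(String key, int value) {
        int h = key.hashCode();
        int bucket = (h & 0x7fffffff) % table.length;
        table[bucket] = new Node(h, key, value, table[bucket]);
    }

    Integer get(String key) {
        int h = key.hashCode();                       // scans every character of the key
        int bucket = (h & 0x7fffffff) % table.length;
        for (Node n = table[bucket]; n != null; n = n.next) {
            if (n.hash == h && n.key.equals(key)) {   // equals() only when the hashes match
                return n.value;
            }
        }
        return null;                                  // a "miss"
    }
}
```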
In the case of a `List`, a lookup will call `equals` for an average of half the entry key Strings in the case of a "hit" and for all of them in the case of a "miss". If you know the relative distribution of the keys that you are looking up, you can improve the performance in the "hit" case by ordering the list so that the most frequently looked-up keys come first. But the "miss" case cannot be optimized.
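A corresponding sketch of the linear scan (again with illustrative names): `equals` runs against each entry until a match is found, so ordering by lookup frequency shortens the average "hit" scan, while a "miss" always compares against every entry.

```java
final class ListLookup {
    // Linear scan over the key list; equals() is called once per entry visited.
    static int indexOfKey(java.util.List<String> keys, String probe) {
        for (int i = 0; i < keys.size(); i++) {
            if (keys.get(i).equals(probe)) {   // may stop early on a "hit"
                return i;
            }
        }
        return -1;                             // a "miss": every entry was compared
    }
}
```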
In addition to the trie alternative suggested by @aioobe, you could also implement a specialized String to integer hashmap using a so-called perfect hash function. This maps each of the actual key strings to a unique hash within a small range. The hash can then be used to index an array of key/value pairs. This reduces a lookup to exactly one call to the hash function and one call to `String.equals`. (And if you can assume that the supplied key will always be one of the mapped strings, you can dispense with the call to `equals`.)
The difficulty of the perfect hash approach is in finding a function that works for the set of keys in the mapping and is not too expensive to compute. AFAIK, this has to be done by trial and error.
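Here is one way that trial-and-error search might look for a very simple hash family (multiply the String's `hashCode()` by an odd constant and take the top bits of the result). The class and method names are made up for illustration, and this particular family cannot separate keys whose `hashCode()` values are identical, so a real perfect-hash generator would use something more sophisticated.

```java
import java.util.*;

// Trial-and-error construction of a collision-free String -> int table for a
// fixed key set (illustrative sketch, not a production perfect-hash generator).
final class PerfectStringMap {
    private final String[] keys;
    private final int[] values;
    private final int multiplier;

    PerfectStringMap(Map<String, Integer> source) {
        int size = Integer.highestOneBit(source.size() * 4);  // power-of-two table
        int m = 1;
        String[] k;
        outer:
        while (true) {
            m += 2;                                            // try odd multipliers in turn
            k = new String[size];
            for (String key : source.keySet()) {
                int slot = slot(key, m, size);
                if (k[slot] != null) continue outer;           // collision: try the next multiplier
                k[slot] = key;
            }
            break;                                             // every key got its own slot
        }
        this.multiplier = m;
        this.keys = k;
        this.values = new int[size];
        for (Map.Entry<String, Integer> e : source.entrySet()) {
            values[slot(e.getKey(), m, size)] = e.getValue();
        }
    }

    private static int slot(String key, int multiplier, int size) {
        // Multiplicative hashing: take the top log2(size) bits of the product.
        return (key.hashCode() * multiplier) >>> (32 - Integer.numberOfTrailingZeros(size));
    }

    /** Returns the mapped value, or -1 if the key is not one of the original keys. */
    int get(String key) {
        int slot = slot(key, multiplier, keys.length);
        return key.equals(keys[slot]) ? values[slot] : -1;     // exactly one equals() call
    }
}
```

Whether something like this actually beats `HashMap` then comes down to how cheap the hash you found is relative to `String.hashCode()` plus the extra `equals()` calls it saves.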
But the reality is that simply using a `HashMap` is a safe option, because it gives `O(1)` performance with a relatively small constant of proportionality (unless the entry keys are pathological).
(FWIW, my guess is that the break-even point where `HashMap.get()` becomes better than `List.contains()` is less than 10 entries, assuming that the strings have an average length of 5 to 10 characters.)