views: 277

answers: 4
I've been doing some performance analysis on the software I develop, and I've found that lookups in a global dictionary of URLs take about 10% of the application's "load" phase time. The dictionary is implemented as a C++ STL std::map, which has O(log n) lookups. I'm going to move it to a hash_map, which has roughly constant-time lookups. The STL string class doesn't have a hash code property, and it certainly doesn't cache a hash code. That means each lookup requires re-generating the hash code.

I'm skeptical that caching the hash code is worth the effort. It would mean changing many lines of code to use a new string class with a cached hash code property. Given that the current implementation does O(log n) full string comparisons on every lookup, I think reducing it to basically one string traversal (by the hash function) per lookup is a big win.
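
For illustration, this is roughly the kind of wrapper I have in mind (a sketch only; the names are made up, and std::hash<std::string> stands in for whatever hash function the table would actually use):

#include <cstddef>
#include <functional>
#include <string>

// Hypothetical cached-hash wrapper: it wraps std::string rather than
// inheriting from it, and computes the hash exactly once.
class HashedString {
public:
    explicit HashedString(const std::string& s)
        : str_(s), hash_(std::hash<std::string>()(s)) {}

    std::size_t hash() const { return hash_; }
    const std::string& str() const { return str_; }

    friend bool operator==(const HashedString& a, const HashedString& b) {
        // Cheap reject on the cached hash before the full string compare.
        return a.hash_ == b.hash_ && a.str_ == b.str_;
    }

private:
    std::string str_;  // kept immutable so the cached hash stays valid
    std::size_t hash_; // computed once at construction
};

// Table hasher that just returns the cached value.
struct HashedStringHash {
    std::size_t operator()(const HashedString& h) const { return h.hash(); }
};

Every call site would then have to construct and pass these wrappers instead of plain strings, which is exactly the churn I'm not sure is justified.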

Does anyone have experience with caching string hash codes? Has it ever proven worth the effort?

+2  A: 

One word of warning.

While a hash map can have constant-time lookups, it can also end up with O(N) lookups. That's not the common case, but it does happen.

So while you always have to pay the O(log N) cost with a map, you are also guaranteed it will never be worse than that.
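
To illustrate (hypothetical code, using std::unordered_map for brevity): a degenerate hash function sends every key to the same bucket, so the table becomes one long chain and every lookup is a linear scan.

#include <cstddef>
#include <string>
#include <unordered_map>

// Pathological hasher: every key lands in bucket 0, so lookups in
// this table degrade from ~O(1) to O(N).
struct WorstHash
{
    std::size_t operator()(const std::string&) const { return 0; }
};

std::unordered_map<std::string, int, WorstHash> degenerate_table;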

R Samuel Klatchko
On implementations written by monkeys smashing a keyboard.
GMan
It depends on the hashing function as well, doesn't it? A good hash function should produce few collisions. If I recall correctly, the load factor can play a part too, depending on the kind of hash map.
Skurmedel
@GMan: I guess a lot of monkeys have found gainful employment as programmers and Shakespearean ghostwriters, then, because I've seen some pretty bad hashes.
Steven Sudit
@Skurmedel: Yes, a hash that does `return 0;` or some other crappy thing, or a hash map with a size of 1. Neither of those really exists in real code. Hash maps are almost always ~O(1). @Steven: Yuck; why are people making their own hashes? That's as foolish as thinking it's easy to make a good random number generator. :)
GMan
@GMan: malicious input designed to collide for your hash? ;-)
Steve Jessop
@Steve: Clearly the work of monkeys. >:(
GMan
Unless you have a perfect hash function designed for your fixed set of inputs, it can happen. I got bitten by this once about 10 years ago and it's made me cautious about hash tables ever since.
R Samuel Klatchko
@GMan: Not so much making up their own as randomly picking a known one that happens to be terrible. There are plenty of hashes out there in real code that have generally poor distribution, or at least huge Achilles' heels when subjected to a certain range of inputs. So, yes, I generally agree that we should trust the hash algorithms in reliable libraries, since there are also plenty of algorithms that yield good results for almost any input.
Steven Sudit
@R Samuel Klatchko: if you're really concerned about the worst-case performance of a hash table, you can limit it to O(log N) instead of O(N) (use trees instead of lists for hash collisions).
Jerry Coffin
@Jerry: Really cool idea!
David Gladfelter
@Jerry: Yes, interesting, particularly if the tree structure helps avoid a lot of copying when it's time to expand the table. Although I'm not sure whether it can.
Steven Sudit
@Jerry - interesting idea. Do you know of any implementations of unordered_map that use that?
R Samuel Klatchko
@Steven Sudit: For the most part, when you use it you just don't expand the table -- degradation only starts to become noticeable if you overfill the table by a *huge* margin (>100000:1). @R Samuel: I've done an implementation (basically just a std::vector<std::map>), but I don't know of anybody else having done it.
Jerry Coffin
@R Samuel: you can't implement `unordered_map` itself that way, because the key type of an `unordered_map` isn't necessarily orderable (no `operator<` or `std::less` specialization), so you can't necessarily have a map of them. You can do what Jerry did, which is define your own associative map which requires that the keys are hashable *and* ordered (and which takes the corresponding optional extra functor parameters).
Steve Jessop
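
For reference, a bare-bones sketch of the vector-of-maps idea discussed above (hypothetical code, not any library's actual implementation, with std::hash as the default hasher; as Steve notes, the keys must be both hashable and ordered):

#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Each bucket is a balanced tree, so even if every key collides,
// a lookup is O(log N) rather than O(N).
template <typename Key, typename Value, typename Hash = std::hash<Key> >
class TreeBucketMap
{
public:
    explicit TreeBucketMap(std::size_t bucket_count = 101)
        : buckets_(bucket_count) {}

    void insert(const Key& k, const Value& v) { bucket(k)[k] = v; }

    // Returns a null pointer if the key is absent.
    const Value* find(const Key& k) const
    {
        const std::map<Key, Value>& b = bucket(k);
        typename std::map<Key, Value>::const_iterator it = b.find(k);
        return it == b.end() ? 0 : &it->second;
    }

private:
    std::map<Key, Value>& bucket(const Key& k)
    {
        return buckets_[Hash()(k) % buckets_.size()];
    }
    const std::map<Key, Value>& bucket(const Key& k) const
    {
        return buckets_[Hash()(k) % buckets_.size()];
    }

    std::vector<std::map<Key, Value> > buckets_;
};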
+1  A: 

I don't have experience with caching hash codes, but I've done some work recently converting std::map to std::tr1::unordered_map. Two thoughts come to mind. First, profile that relatively simple change by itself, because it sometimes makes things worse, depending on what your code is doing; it might give you enough speedup on its own before you try optimizing further. Second, what does your profiler say about the other 90% of your initialization time? Even if you optimized the global dictionary down to zero time, you would improve performance by at most 10%.
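
The mechanical change is often just the container type. A sketch, with hypothetical names (on pre-C++11 toolchains the TR1 header location varies: <tr1/unordered_map> on GCC, <unordered_map> on MSVC):

#include <map>
#include <string>
#include <tr1/unordered_map> // GCC's TR1 header

// Before: balanced tree, O(log n) lookups, each doing several full
// string comparisons.
std::map<std::string, std::string> url_table;

// After: hash table, ~O(1) lookups, one hash of the key per lookup.
std::tr1::unordered_map<std::string, std::string> url_table2;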

Kristo
Hi Kristo, thanks for the tips. To answer your question, I've already killed about 40% of the previous load time and I'm running out of low-hanging fruit. That 10% is relatively juicy and within reach.
David Gladfelter
+2  A: 

You'll of course need to profile to check your results. Change to a hash map, and then see where most of your time is spent. Unless you're hashing keys left and right, I doubt most of your time will be spent there. Hashing is intended to be a fast operation, otherwise a hash map would have no advantages over an ordered container.

The compiler itself will know if a string hasn't been changed, and can probably cache the result for you (within the same scope). That said, you don't want to inherit from std::string; STL classes weren't made for that.

Rather, make a std::pair and pass that around:

typedef std::pair<const std::string, const size_t> string_hash_pair;

You'd then need to overload the hash_value function for your type (going by Boost here, not TR1; I don't know how similar they are), in the same namespace as the pair is defined:

size_t hash_value(const string_hash_pair& pPair)
{
    return pPair.second; // don't actually hash
}

And that's it. Note that in the pair, both string and size_t are immutable. This is because if the string changes, your hash is wrong. So we make it const, and we may as well make the hash const too.

You'll want a helper function:

string_hash_pair make_string_hash(const std::string& pStr)
{
    return std::make_pair(pStr, boost::hash_value(pStr));
}

Now if you're going to be using a string for look-ups, just make a pair out of it and you get constant-time hashing.
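
One way to wire that up (a hypothetical sketch; passing the hasher explicitly sidesteps the question of where Boost's hash lookup finds hash_value):

#include <cstddef>
#include <string>
#include <boost/unordered_map.hpp>

// Hasher that just reuses the cached value from the pair.
struct string_hash_pair_hasher
{
    std::size_t operator()(const string_hash_pair& p) const
    {
        return p.second; // no re-hash
    }
};

typedef boost::unordered_map<string_hash_pair, std::string,
                             string_hash_pair_hasher> UrlMap;

void example(const UrlMap& urls)
{
    // The hash is computed once, inside make_string_hash, and then
    // reused for every probe the table makes.
    string_hash_pair key = make_string_hash("http://example.com/");
    UrlMap::const_iterator it = urls.find(key); // use *it as needed
}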

That said, I really doubt this much work is necessary. Hashing functions really are trivial, usually. Also, don't make your own. Use a pre-existing tried-and-tested hash; it's quite easy to make a crappy hash.

GMan
If I understand correctly, the hash of the key is calculated exactly once during lookup, no matter how big the table is, so caching it shouldn't help much. The hash is calculated during insertion, but each item is only inserted once.
Steven Sudit
@Steven: It's calculated once per look-up. But if you look up with the same key multiple times, you may calculate the hash multiple times. I think he wants to avoid that.
GMan
@GMan: That's true. I can't tell from the OP whether the same key will be looked up more than once, but if so then that would justify it more. Of course, there might also be ways to reorganize the code so that values are not looked up any more than absolutely necessary.
Steven Sudit
+3  A: 

When you compare the hash map to the map, also try a Trie, or related data structure (whatever you can get off the shelf):

http://stackoverflow.com/questions/1036504/trie-implementation

Unfortunately you may then spend a lot of time worrying about cache-friendliness. In that respect a Trie is similar to the tree you already have, and a hash map will probably be better-behaved than a naively-allocated tree.

Also, I'm a little confused by the question. If you're looking up the same string object multiple times, such that caching its hash value is worthwhile, shouldn't you just be caching the result of the lookup? The whole point of a hash table is that different objects which are value-equal hash to the same value. If you aren't computing the same hash several times from distinct strings containing the same characters, then your hash table probably isn't doing its job.

If you mean caching the values of the keys already in the hash table, that's up to the hash table.
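
Concretely, caching the result of the lookup might look like this (a hypothetical sketch):

#include <string>
#include <unordered_map>

typedef std::unordered_map<std::string, std::string> UrlTable;

void use(const std::string& value); // hypothetical consumer

// Look the key up once and reuse the result, instead of hashing and
// probing for the same key on every iteration.
void process(const UrlTable& table, const std::string& url, int times)
{
    UrlTable::const_iterator it = table.find(url); // hash computed once
    if (it == table.end())
        return;
    for (int i = 0; i < times; ++i)
        use(it->second); // reuse the cached lookup result
}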

Steve Jessop