views: 911
answers: 8

Can someone please shed some light on how popular languages like Python and Ruby implement hash tables internally for symbol lookup? Do they use the classic "array with linked list" method, or a balanced tree?

I need a simple (few LOC) and fast method for indexing the symbols in a DSL written in C. I was wondering what others have found most efficient and practical.

+3  A: 

Balanced trees sort of defeat the purpose of hash tables since a hash table can provide lookup in (amortized) constant time, whereas the average lookup on a balanced tree is O(log(n)).

Separate chaining (array with linked list) really works quite well if you have enough buckets, and your linked list implementation uses a pooling allocator rather than malloc()ing each node from the heap individually. I've found it to be just about as performant as any other technique when properly tuned, and it is very easy and quick to write. Try starting with 1/8 as many buckets as source data.
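
A minimal sketch of that approach in C, with all names and sizes invented for the example -- a fixed bucket array, chained nodes, and a preallocated pool instead of per-node malloc():

    #include <stdlib.h>
    #include <string.h>

    #define NBUCKETS 256    /* power of two, so we can mask instead of mod */
    #define POOLSIZE 4096   /* all nodes come from one up-front block */

    typedef struct Node {
        const char  *key;
        void        *value;
        struct Node *next;
    } Node;

    static Node   pool[POOLSIZE];   /* pooling allocator: one big block... */
    static size_t pool_used;        /* ...handed out node by node */
    static Node  *buckets[NBUCKETS];

    static unsigned long hash(const char *s)   /* djb2 string hash */
    {
        unsigned long h = 5381;
        while (*s != '\0')
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    static int ht_put(const char *key, void *value)
    {
        unsigned long i = hash(key) & (NBUCKETS - 1);
        Node *n;
        for (n = buckets[i]; n != NULL; n = n->next)
            if (strcmp(n->key, key) == 0) {     /* key exists: update */
                n->value = value;
                return 0;
            }
        if (pool_used == POOLSIZE)
            return -1;                          /* pool exhausted */
        n = &pool[pool_used++];                 /* no per-node malloc */
        n->key = key;
        n->value = value;
        n->next = buckets[i];
        buckets[i] = n;
        return 0;
    }

    static void *ht_get(const char *key)
    {
        Node *n = buckets[hash(key) & (NBUCKETS - 1)];
        for (; n != NULL; n = n->next)
            if (strcmp(n->key, key) == 0)
                return n->value;
        return NULL;
    }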

You can also use open addressing with quadratic or similar probing; Python's dictionaries, for example, use open addressing rather than chaining.
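
For the open-addressing case, the lookup skeleton looks something like this (a sketch only; CPython's actual probe sequence is a perturbation-based recurrence rather than textbook quadratic probing, but the open-addressing structure is the same):

    #include <stddef.h>
    #include <string.h>

    typedef struct {
        const char *key;    /* NULL marks an empty slot */
        void       *value;
    } Slot;

    /* Triangular-number (quadratic) probing: with a power-of-two
       table size this sequence visits every slot exactly once. */
    static Slot *oa_find(Slot *table, size_t cap,
                         const char *key, unsigned long h)
    {
        size_t mask = cap - 1;          /* cap must be a power of two */
        for (size_t i = 0; i <= mask; i++) {
            size_t j = (h + i * (i + 1) / 2) & mask;
            if (table[j].key == NULL || strcmp(table[j].key, key) == 0)
                return &table[j];       /* empty slot or match */
        }
        return NULL;                    /* table full, key absent */
    }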

Crashworks
Logarithmic defeats constant time?
Nick D
@tydok - to "defeat the purpose" means to fail to meet whatever objective the other solution meets, so it means "worse than", not "better than".
Daniel Earwicker
My gaffe :)
Nick D
+3  A: 

Attractive Chaos has a comparison of Hash Table Libraries and an update. The source code is available, and it is in C and C++.

Lear
+1  A: 

What Crashworks meant to say was...

The purpose of a hash table is constant-time lookup, insertion, and deletion: each operation is O(1) amortized. If you use a tree instead, the worst-case time per operation is O(log n) for a balanced tree, where n is the number of nodes. But do we really have hashes implemented as trees?

Thanks for pointing out my lack of clarity -- I've fixed my answer.
Crashworks
A hash implemented as a tree is a tree with a hash-like API on the front.
Blank Xavier
+7  A: 

The classic "array of hash buckets" you mention is used in every implementation I've seen.

One of the most instructive versions is the hash implementation in the Tcl language, in the file tcl/generic/tclHash.c. More than half of the lines in the file are comments explaining everything in detail: allocation, search, the different hash table types, strategies, and so on. Side note: the code implementing the Tcl language is really readable.
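
From memory, the string-hashing core of that file boils down to something like this (paraphrased, not verbatim -- see tclHash.c itself for the real thing and its comments):

    /* Classic Tcl-style string hash: result = result * 9 + c,
       written as a shift-and-add. Paraphrased from memory. */
    static unsigned int hash_string(const char *string)
    {
        unsigned int result = 0;
        for (const char *p = string; *p != '\0'; p++)
            result += (result << 3) + (unsigned char)*p;
        return result;
    }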

zvr
Earlier versions of the code are even more readable due to the reduced amount of ifdeffery, though later versions are more useful in critical ways (supporting key customization and other things like that).
Donal Fellows
A: 

IMO, hashes are a Bad Thing.

Programmers are both unable (real data not being equal to test data) and unwilling (they don't check at all, and don't check in the field) to verify the fitness of their hashing algorithm.

Indeed, most programmers I have met have no idea there could BE a problem here.

As such, the practical upshot of using a hash is that a significant potential failure mode is introduced into the code, which means, of course, that a certain proportion of programs will fail only once they are in the field, requiring a fix to a post-deployment installation.

Unless you are using large data sets and so must have O(1), use a balanced tree. It is better simply by being reliable.

Blank Xavier
It's pretty easy to botch an R-B tree as well.
Crashworks
Wow, who knew Python, Ruby, Perl, PHP and Javascript were all so wrong?
Schwern
This answer could use some clarification - by "significant potential failure mode", do you mean an actual bug (what?), or just worse-than-O(1) performance? Because using a balanced tree instead will simply guarantee worse-than-O(1) performance! Hardly an improvement. Even without careful tuning, a hash will outperform a balanced tree in the majority of cases, so it makes more sense to use a hash, then decide if performance is acceptable.
Daniel Earwicker
I mean pathological behaviour, where the input data becomes significantly or grossly mismatched to the behaviour of the hashing algorithm.
Blank Xavier
The assertion that "Even without careful tuning, a hash will outperform a balanced tree in the majority of cases" is EXACTLY what I mean by saying that programmers are unaware of the issues involved in hashing algorithms. You are asserting *that you got your algorithm right and that it's as right for real data as it is for your test data*. You cannot make such a statement in the general case, and the fact that you *do* means you don't realise what you're saying.
Blank Xavier
@Crashworks: you misunderstand. An RB tree with a bug is a broken RB tree. It will not work. A hash with a bad algorithm is *not bugged*. It will not crash, nor will it produce an invalid hash. It will, however, produce a pathologically distributed hash table. These are two completely different failure modes. Conflating them is a fundamental error.
Blank Xavier
It's true that really understanding hash functions and producing good ones requires some specialist knowledge. But you CAN generally give good enough guidelines for a reasonable programmer to produce ACCEPTABLE hash functions in most cases (and IDEs can auto-generate them). Hashing unfortunately is a really useful technique, so even though some find it difficult to grasp, in this case, their usefulness makes it worth persevering with them (or hiring better programmers).
Neil Coffey
@Neil Coffey: "If you can afford the time, always use a cryptographic hash." ...not that it addresses the problem of understanding what makes a good hash function, or anything. And it means that for small sets the balanced tree may well be faster.
dmckee
Instead of making this blanket statement that hashes are bad, why not illuminate the issues? Since the answerer is opaque, I assume they mean: if the hashing function has grossly non-uniform outputs for the input data, it might produce a really unbalanced hash table. Determining uniformity of outputs is hard, and when people use bad hashing algorithms, they might get surprisingly bad lookup performance. So be careful doing this at home! At least with trees you can guarantee the performance, if you get the algorithm right :)
Gregg Lind
I have to agree that hash tables are bad. I almost always use red-black trees or AVL trees instead. But this question is not about that. He asked *how* hash tables are implemented, not whether he should use them.
Zifre
@Gregg: "Programmers are both unable and unwilling to verify the fitness of their hashing algorithm." What is opaque about that?
Blank Xavier
@Zifre: yes, and I made a related comment with an important point on the same subject. Do you expect all comments to purely answer, and only answer, the question? Such a policy would be to the detriment of all.
Blank Xavier
This reminds me of the programmer that was afraid of using git because it uses SHA1 digests to identify commits. "They MIGHT collide! You CAN'T PROVE they won't collide! It's NOT SAFE!" This is all technically correct (the best kind of correct), but practically irrelevant. Throwing out hashes because you can't guarantee performance is just as silly. Finally, and most importantly, what in the screaming hell are you doing writing your own hashes and trees for production code in 2009?
Schwern
> This is all technically correct (the best kind of correct), but practically irrelevant.

The behaviour of SHA1 is well understood. It is a single algorithm for a single specific purpose, and for that purpose it works well. This is a profoundly different state of affairs from the use of a general-purpose hash, possibly with a novel hashing algorithm dreamed up during development, which seems to work okay with the test data.
Blank Xavier
> Throwing out hashes because you can't guarantee performance is just as silly.

This mis-emphasises the problem. The issue is that you run the risk of pathological performance, e.g. performance sufficiently poor to render your application unfit for purpose.
Blank Xavier
sigh - no way to quote in comments, FFS...
Blank Xavier
"Finally, and most importantly, what in the screaming hell are you doing writing your own hashes and trees for production code in 2009?"IME, most places do C code without using libraries of pre-written data structures. Also, sometimes you have to write a specialised variant of a data structure which simply isn't available.
Blank Xavier
@Schwern: +1 for "what in the screaming hell" :-) OK, I have written my own in the past *where I had a use-case that justified it* but I was very careful with implementation choices. In particular, hash function choice is remarkably hard and the literature on the topic especially unrevealing (as long as you're hanging chains off the hash buckets, it's not critical to get it right).
Donal Fellows
+8  A: 

Perl uses an array with linked lists to hold collisions. It has a simple heuristic to automatically double the size of the array as necessary. There's also code to share keys between hashes to save a little memory. You can read about it in the dated but still relevant Perl Illustrated Guts under "HV". If you're truly adventurous you can dig into hv.c.
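
The grow step itself is simple to sketch in C (illustrative code, not Perl's hv.c): once entries outnumber buckets, allocate a table twice the size and relink every node into its new bucket.

    #include <stdlib.h>

    typedef struct Node {
        unsigned long hashval;   /* cached hash: growing needs no rehash */
        struct Node  *next;
        /* key/value fields omitted for brevity */
    } Node;

    /* Double the bucket array and redistribute every node. Returns
       the new array, or NULL (leaving the old one intact) on failure. */
    static Node **grow(Node **old, size_t oldcap, size_t *newcap)
    {
        size_t cap = oldcap * 2;
        Node **tab = calloc(cap, sizeof *tab);
        if (tab == NULL)
            return NULL;
        for (size_t i = 0; i < oldcap; i++) {
            for (Node *n = old[i], *next; n != NULL; n = next) {
                next = n->next;
                size_t j = n->hashval & (cap - 1);  /* cap: power of two */
                n->next = tab[j];
                tab[j] = n;
            }
        }
        free(old);
        *newcap = cap;
        return tab;
    }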

The hashing algorithm used to be pretty simple, but it's probably a lot more complicated now with Unicode. Because the algorithm was predictable, there was a DoS attack whereby the attacker generated data that would cause hash collisions: for example, a huge list of keys sent to a web site as POST data. The Perl program would likely split it and dump it into a hash, which would then shove everything into one bucket. Lookups in the resulting hash were O(n) rather than O(1). Throw a whole lot of POST requests at a server and you might clog the CPU. As a result, Perl now perturbs the hash function with a bit of random data.
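
The mitigation is easy to sketch in C: mix per-process random state into the hash so an attacker cannot precompute colliding keys (hypothetical names, not Perl's actual code):

    #include <time.h>

    /* Seed chosen once per process. A real implementation should
       draw it from the OS entropy source (e.g. /dev/urandom);
       time-based seeding here is only for illustration. */
    static unsigned long hash_seed;

    static void hash_init(void)
    {
        hash_seed = (unsigned long)time(NULL) * 2654435761UL + 1UL;
    }

    static unsigned long seeded_hash(const char *s)
    {
        unsigned long h = hash_seed;        /* perturb the start state */
        while (*s != '\0')
            h = h * 33 + (unsigned char)*s++;
        return h;
    }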

You also might want to look at how Parrot implements basic hashes which is significantly less terrifying than the Perl 5 implementation.

As for "most efficient and practical", use someone else's hash library. For god's sake don't write one yourself for production use. There's a hojillion robust and efficient ones out there already.

Schwern
+1  A: 

If you can read Java, you might want to check out the source code for its various map implementations, in particular HashMap, TreeMap and ConcurrentSkipListMap. The latter two keep the keys ordered.

Java's HashMap uses the standard technique you mention of chaining at each bucket position. It uses fairly weak 32-bit hash codes and stores the keys in the table. The Numerical Recipes authors also give an example (in C) of a hash table essentially structured like Java's but in which (a) you allocate the nodes of the bucket lists from an array, and (b) you use a stronger 64-bit hash code and dispense with storing keys in the table.
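
As a concrete example of such a 64-bit hash in C, here is FNV-1a (a well-known function of comparable strength, not necessarily the one the Numerical Recipes authors use):

    #include <stdint.h>

    /* FNV-1a, 64-bit: simple, fast, and far better spread than a
       typical 32-bit hashCode()-style function. */
    static uint64_t fnv1a_64(const char *s)
    {
        uint64_t h = 14695981039346656037ULL;   /* FNV offset basis */
        while (*s != '\0') {
            h ^= (unsigned char)*s++;
            h *= 1099511628211ULL;              /* FNV prime */
        }
        return h;
    }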

Neil Coffey
A: 

Lua tables use an utterly ingenious implementation which for arbitrary keys behaves like an 'array of buckets', but if you use consecutive integers as keys, it has the same representation and space overhead as an array. In the implementation, each table has a hash part and an array part.
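
Roughly, the layout looks like this in C (a simplified sketch of the idea, not Lua's actual ltable.c structures):

    #include <stddef.h>

    typedef struct HNode {
        const char   *key;
        double        value;
        struct HNode *next;
    } HNode;

    /* Hybrid table: integer keys 1..asize live in the array part;
       every other key falls back to the chained hash part. */
    typedef struct {
        double  *array;   /* array part: values for keys 1..asize */
        size_t   asize;
        HNode  **hash;    /* hash part: buckets for all other keys */
        size_t   hsize;
    } Table;

    static double *get_int(Table *t, size_t i)
    {
        if (i >= 1 && i <= t->asize)
            return &t->array[i - 1];  /* array speed: O(1), no hashing */
        return NULL;  /* would fall back to the hash part (omitted) */
    }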

I think this is way cool :-)

Norman Ramsey