Consider an array versus a hashtable where the keys are just the integral indexes of a list.

Their average-case insertion, lookup, and removal big-O bounds are all O(1) constant time. I understand that you may get some low-level wins in cache locality with an array, and there is a marginal (mostly-constant) overhead to the hashtable operations, but hashtables get you sparseness for free, which in some applications is a big win.
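
Concretely, the two representations under discussion look something like this (a minimal sketch; JavaScript's `Map` stands in for a generic hashtable):

```javascript
// The same logical "list" with dense integer keys, stored two ways.
const asArray = ["a", "b", "c"];
const asTable = new Map([[0, "a"], [1, "b"], [2, "c"]]);

console.log(asArray[1]);     // O(1): base address + index * slot size
console.log(asTable.get(1)); // O(1) average: hash, find bucket, compare keys
```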

What other significant (or small) contrasts am I missing?

Context: I get into a discussion about this sometimes when interviewing programming candidates. Usually the context is "how would you implement the Javascript array type inside of the JS VM?" For densely-packed data, I side with a native array, but I want to have better reasoning than the intuition that "it seems less like overkill".

+1  A: 

Because a list is typically ordered whereas a hashtable isn't. If you add elements to a list and expect the ordering to remain consistent, a hashtable provides no guarantees about the order you get back out, while an array retains the ordering.

tvanfosson
Thanks tvanfosson. I updated the question: I realize I wasn't clear initially. I assume the keys in the table are just the integer indexes of the "list", meaning that the "order is preserved" if you access the objects by integer "index".
quixoto
That's not true of any efficient hash table I've ever heard of.
bmargulies
@bmargulies: Of course you're correct for an abstract enumeration through the hash buckets. But if all my keys represent indexes into a list, I can write a `for` loop over my known "key" range and get the values back out of the same indexes I put them in at.
quixoto
@Ben - Maybe I'm missing something but wouldn't you have to have a bucket per integer to make this happen? If you have more than one integer hash into the same bucket, you can't preserve order. If you do have one integer per bucket, how is that different than a sparsely populated array?
tvanfosson
@tvanfosson: Let's say the number of buckets and the hash algorithm are implementation-specific. But assuming you have some `put(key, obj)` operation and a `get(key)` operation, you can iterate over your indexes using the get operation with increasing keys; see the sketch below. (It could be that the answer is just "all the magic is in this implementation-specific part".)
quixoto
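
A minimal sketch of that iteration, with `Map`'s `set`/`get` standing in for the generic `put`/`get` mentioned above:

```javascript
// Recovering "list order" from a hashtable by iterating over the
// integer key range, regardless of insertion order or bucket layout.
const table = new Map();
table.set(2, "c");
table.set(0, "a");
table.set(1, "b");

for (let i = 0; i < 3; i++) {
  console.log(table.get(i)); // prints "a", "b", "c" in index order
}
```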
A: 

Because hash functions aren't free. Linear factors are important. Worst-case times are important. Count the instructions.

For the particular case you cite, the underlying implementation of Javascript, there may be so much other overhead that it wipes these issues out. Still, if someone tries to do something mathematical that really hits an array hard with simple numeric keys, an array has got to be better.

bmargulies
+3  A: 

An array is just a special case of a hash table where the hash function is the identity function

f(x) := x;

and the modulo used is the same as the table size (and thus the array size), so every key maps directly to its own slot.

If you do not allow duplicate keys, no bucket ever holds more than one entry, so you do not need the "next" collision-chain pointers, and voila, we have an array.

Because of the absence of a complex hash function and modulo calculation, this is very fast, but it is only practical when the array can be kept small (very large arrays with lots of empty slots waste memory and might trigger nasty things like swapping/thrashing to disk).
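
To make this concrete, here is a minimal sketch of a chained hashtable whose hash is the identity function (the `IntHashTable` name and API are illustrative, not any particular implementation). Once the bucket count covers the whole key range, the chains never grow past one entry and the structure degenerates into a plain array:

```javascript
// A minimal chained hashtable over integer keys.
class IntHashTable {
  constructor(numBuckets) {
    this.buckets = new Array(numBuckets).fill(null);
  }
  hash(key) {
    return key % this.buckets.length; // identity hash, then modulo
  }
  put(key, value) {
    const i = this.hash(key);
    // The "next" pointer is the collision chain; it is only ever
    // needed when numBuckets is smaller than the key range.
    this.buckets[i] = { key, value, next: this.buckets[i] };
  }
  get(key) {
    for (let e = this.buckets[this.hash(key)]; e !== null; e = e.next) {
      if (e.key === key) return e.value;
    }
    return undefined;
  }
}

const t = new IntHashTable(16); // buckets >= key range: behaves like an array
t.put(3, "c");
console.log(t.get(3)); // "c"
```

With fewer buckets than keys, the same structure falls back to chaining, which is exactly the extra overhead the question asks about.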

Ritsaert Hornstra
This is the correct answer.
slebetman
Interesting way of looking at it. Thanks. OK, so basically, there are no further obvious distinctions beyond what I perceive: really, a hashtable in this scenario is an array with somewhat more overhead and free sparseness, useful in a "does it fit your data/scenario better" kind of way.
quixoto
+1  A: 

When you look at it from the angle of someone who wants to implement the pseudo-array behaviour of Javascript, you are right that a hashtable is the better way to do this, especially since Javascript arrays have no fixed length and need to be able to accommodate entries at any index. Arrays in Javascript just look like arrays but behave more like hashtables.
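
For instance, a minimal sketch of that behaviour (modern engines typically back such an array with a sparse, dictionary-like representation rather than a million-slot buffer):

```javascript
// A Javascript array accepts a write at any index without allocating
// the gap in between, which is hashtable-like behaviour.
const a = [];
a[1000000] = "x";
console.log(a.length);  // 1000001, but no million slots were filled in
console.log(500 in a);  // false: the "hole" is not a real entry
```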

But in a language that is a bit closer to the machine, the performance and space benefits of using a real array for densely packed data are quite noteworthy. The advantages of a hashtable are essentially limited to sparse data, which is not what you would or should use an array for in the first place; that sparse case really is better served by a hashtable with integer keys.

Insertion, lookup, and removal are also O(1) for arrays in all cases, but with a much lower constant factor than hashtables (and not just because of cache locality), and arrays need less space per entry. If you remove or insert entries in a way that makes the following entries shift their index accordingly, the operation is O(n), with n being the number of entries that need to be moved; but it is also O(n) for a hashtable, again with a much higher constant overhead, since every moved entry must be re-inserted under a new key. That is the kind of operation for which you would be better off with a linked list. Growing an array is also less costly than growing a hashtable, which might need to rehash all of its entries.
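
A quick sketch of that shifting cost, with JavaScript's `Map` standing in for an integer-keyed hashtable (the insertion point is arbitrary):

```javascript
// Inserting in the middle shifts every following entry: O(n) either way.
// Array: one contiguous move of the tail.
const arr = [10, 20, 40];
arr.splice(2, 0, 30);
console.log(arr); // [10, 20, 30, 40]

// Integer-keyed map: every shifted entry is a full re-insertion under
// a new key, the same O(n) but with a larger constant per move.
const m = new Map([[0, 10], [1, 20], [2, 40]]);
for (let i = m.size - 1; i >= 2; i--) m.set(i + 1, m.get(i)); // re-key the tail
m.set(2, 30);
console.log([...m.entries()].sort((x, y) => x[0] - y[0]));
// [[0,10],[1,20],[2,30],[3,40]]
```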

All of the different collection types have their specific advantages and disadvantages. That's why there are so many of them in use.

x4u