views: 9610

answers: 10

I'm building a symbol table for a project I'm working on. I was wondering what people's opinions are on the advantages and disadvantages of the various methods available for storing and creating a symbol table.

I've done a fair bit of searching, and the most commonly recommended are binary trees, linked lists, or hash tables. I was wondering what the advantages and/or disadvantages of all of the above are (I can't find anything on this).

Thanks, Ben

Update: I'm working in C++.

A: 

This depends on several things, of course. I'd say that a linked list is right out, since it has few suitable properties to work as a symbol table. A binary tree might work, if you already have one and don't have to spend time writing and debugging it. My choice would be a hash table, I think that is more or less the default for this purpose.

unwind
A: 

This question goes through the different containers in C#, but they are similar in any language you use.

Mats Fredriksson
+5  A: 

The standard trade-offs between these data structures apply.

  • Binary Trees
    • medium complexity to implement (assuming you can't get them from a library)
    • inserts are O(logN)
    • lookups are O(logN)
  • Linked lists (unsorted)
    • low complexity to implement
    • inserts are O(1)
    • lookups are O(N)
  • Hash tables
    • high complexity to implement
    • inserts are O(1) on average
    • lookups are O(1) on average

Edited: fixed expense of linked list insert.

Darron
For an unsorted linked list, inserts are O(1), not O(N), which, along with O(1) removal when doubly-linked, is usually the motivation to use them, not their implementation complexity. Another motivation is that they can grow unbounded, with no copying. Not that I'd suggest one in this case.
P Daddy
Also I would argue that a hash-table is about as easy to implement as a correctly balanced binary tree. But this is highly subjective.
JBF
Yes, implementation complexity is subjective. But I think that a minimal linked list is simpler than a minimal hash table. And then adding auto-balancing vs. collisions and resizing when full doesn't swap the order.
Darron
+8  A: 

Your use case is presumably going to be "insert the data once (e.g., application startup) and then perform lots of reads but few if any extra insertions".

Therefore you need to use an algorithm that is fast for looking up the information that you need.

I'd therefore think the hash table is the most suitable structure to use: a lookup simply hashes your key and uses the result to reach the target data - O(1). The others are O(N) (a linked list of size N - you have to walk the list one node at a time, N/2 comparisons on average) and O(log N) (a binary tree - you halve the search space with each step, but only if the tree is balanced, so this depends on your implementation; an unbalanced tree can perform significantly worse).

Just make sure that there are enough spaces (buckets) in the hash table for your data (re: Soraz's comment on this post). Most framework implementations (Java, .NET, etc.) will be of a quality that means you won't need to worry about the implementation details.

Did you do a course on data structures and algorithms at university?

JeeBee
haven't left high school... so no. all self taught :)
benofsky
Ah, good stuff :)
JeeBee
O(1) for hash table lookups only applies if the number of buckets is a good fraction of the total set. I.e., if you are storing 1 million entries in 512 buckets, then you will still be doing 2048 straight compares per lookup, which is far more than log(n) of 1 million (about 20 straight compares per lookup).
Soraz
Good point Soraz.
JeeBee
A quality implementation of a hash table, with a quality hashing algorithm, will give O(1). A bad implementation of a binary tree could also be worse than O(log N). So, for the level of question asked, saying a hash table is O(1) is probably more than good enough.
Darron
Symbol tables have other properties, that make hash tables often not the most suitable. -1
Stephan Eggermont
@Stephan: do elaborate. I claim that hash tables are by far the most common data structure used for symbol tables.
Konrad Rudolph
A: 

Unless you expect your symbol table to be small, I would steer clear of linked lists. A list of 1000 items will take 500 comparisons, on average, to find any item within it.

A binary tree can be much faster, so long as it's balanced. If you persist the contents, the serialised form will likely be sorted, and when it's re-loaded the resulting tree will be wholly unbalanced as a consequence - it'll behave just like the linked list, because that's basically what it has become. Balanced-tree algorithms solve this, but make the whole shebang more complex.

A hashmap (so long as you pick a suitable hashing algorithm) looks like the best solution. You've not mentioned your environment, but just about all modern languages have a Hashmap built in.

Martin Cowie
+3  A: 

A couple of things to watch out for.

  • Binary trees only have O(log n) lookup and insert complexity if the tree is balanced. If your symbols are inserted in a pretty random fashion, this shouldn't be a problem. If they're inserted in order, you'll be building a linked list. (For your specific application they shouldn't be in any kind of order, so you should be okay.) If there's a chance that the symbols will be too orderly, a Red-Black Tree is a better option.

  • Hash tables give O(1) average insert and lookup complexity, but there's a caveat here, too. If your hash function is bad (and I mean really bad) you could end up building a linked list here as well. Any reasonable string hash function should do, though, so this warning is really only to make sure you're aware that it could happen. You should be able to just test that your hash function doesn't have many collisions over your expected range of inputs, and you'll be fine. One other minor drawback is if you're using a fixed-size hash table. Most hash table implementations grow when they reach a certain size (load factor to be more precise, see here for details). This is to avoid the problem you get when you're inserting a million symbols into ten buckets. That just leads to ten linked lists with an average size of 100,000.

  • I would only use a linked list if I had a really short symbol table. It's easiest to implement, but the best case performance for a linked list is the worst case performance for your other two options.

Bill the Lizard
As to 1: That is a good point. When I've implemented symbol tables in the past, I've generally found that my entries are encountered in pretty much random (alphabetical) order. Because of that, there really wasn't enough payoff to make it worth balancing the tree.
T.E.D.
+8  A: 

What everybody seems to forget is that for small N, i.e., few symbols in your table, the linked list can be much faster than the hash table, although in theory its asymptotic complexity is indeed higher.

There is a famous quote from Pike's Notes on Programming in C: "Rule 3. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don't get fancy." http://www.lysator.liu.se/c/pikestyle.html

I can't tell from your post whether you will be dealing with a small N or not, but always remember that the best algorithm for large N is not necessarily good for small N.

JBF
what would define N as big?
benofsky
That's implementation-dependent. If you happen to know the algorithm for calculating the hash values, you can ballpark how expensive it would be versus n/2 identity comparisons (the average for a linked list) or log(n) identity comparisons (the average for a binary tree).
Hank Gay
You don't mention which language you're working in, but if it has good built-in support for dictionaries/hashtables/whatever-that-lang-called-it, e.g., Python, it's probably easiest to just learn to stop worrying and love the built-in.
Hank Gay
As Hank wrote, what counts as big is impossible to guess without knowing your input data set, your hash algorithm, your programming language (whether strings are interned or not), etc. Often you can get it wrong even knowing all of the above. Go with what's easiest to code, and fix it later if it's too slow.
JBF
Also, the avg. for a binary tree should have been (log n) / 2
Hank Gay
Also "time to debug weird errors" is much higher with fancy algorithms. Keep it simple, until simple proves to be untenable.
Jeff Allen
I know this is well after the fact, but I just have to comment on this. If you are writing a program for large N's, then use the complex and fast algorithm. For that program, if you get some cases where N is small, who cares if it is "relatively" slow. If it is fast enough for large N, it will surely be fast enough for small N as well.
lkessler
+2  A: 

I like Bill's answer, but it doesn't really synthesize things.

From the three choices:

Linked lists are relatively slow to lookup items from (O(n)). So if you have a lot of items in your table, or you are going to be doing a lot of lookups, then they are not the best choice. However, they are easy to build, and easy to write too. If the table is small, and/or you only ever do one small scan through it after it is built, then this might be the choice for you.

Hash tables can be blazingly fast. However, for it to work you have to pick a good hash for your input, and you have to pick a table big enough to hold everything without a lot of hash collisions. What that means is you have to know something about the size and quantity of your input. If you mess this up, you end up with a really expensive and complex set of linked lists. I'd say that unless you know ahead of time roughly how large the table is going to be, don't use a hash table. This disagrees with your "accepted" answer. Sorry.

That leaves trees. You have an option here though: to balance or not to balance. What I've found by studying this problem on C and Fortran code we have here is that the symbol table input tends to be sufficiently random that you only lose about a tree level or two by not balancing the tree. Given that balanced trees are slower to insert elements into and are harder to implement, I wouldn't bother with them. However, if you already have access to nice debugged component libraries (e.g., C++'s STL), then you might as well go ahead and use the balanced tree.

T.E.D.
Whilst I do agree with your point about HashTables, my answer was for a very specific use-case - read once, few additions (if any) and lots of reads - therefore assuming the HashTable was of the correct size (autogrowing or set as 1.2x size of input) it is the best option.
JeeBee
Situations where you know the size of your input ahead of time are a rather unusual and special case. In that special case, sure, use a hash table. But Ben gave no indication whatsoever that *his* case met this rare condition.
T.E.D.
+4  A: 

It sounds like the following may all be true:

  • Your keys are strings.
  • Inserts are done once.
  • Lookups are done frequently.
  • The number of key-value pairs is relatively small (say, fewer than a K or so).

If so, you might consider a sorted list over any of these other structures. This would perform worse than the others during inserts, as a sorted list is O(N) on insert, versus O(1) for a linked list or hash table, and O(log2 N) for a balanced binary tree. But lookups in a sorted list may be faster than in any of these other structures (I'll explain this shortly), so you may come out on top. Also, if you perform all your inserts at once (or otherwise don't require lookups until all insertions are complete), then you can simplify insertions to O(1) and do one much quicker sort at the end. What's more, a sorted list uses less memory than any of these other structures, but the only way this is likely to matter is if you have many small lists. If you have one or a few large lists, then a hash table is likely to out-perform a sorted list.

Why might lookups be faster with a sorted list? Well, it's clear that it's faster than a linked list, with the latter's O(N) lookup time. With a binary tree, lookups only remain O(log2 N) if the tree remains perfectly balanced. Keeping the tree balanced (red-black, for instance) adds to the complexity and insertion time. Additionally, with both linked lists and binary trees, each element is a separately-allocated1 node, which means you'll have to dereference pointers and likely jump to potentially widely varying memory addresses, increasing the chances of a cache miss.

As for hash tables, you should probably read a couple of other questions here on StackOverflow, but the main points of interest here are:

  • A hash table can degenerate to O(N) in the worst case.
  • The cost of hashing is non-zero, and in some implementations it can be significant, particularly in the case of strings.
  • As with linked lists and binary trees, each entry is a node storing more than just the key and value; it is also separately allocated in some implementations, so you use more memory and increase the chances of a cache miss.

Of course, if you really care about how any of these data structures will perform, you should test them. You should have little problem finding good implementations of any of these for most common languages. It shouldn't be too difficult to throw some of your real data at each of these data structures and see which performs best.

  1. It's possible for an implementation to pre-allocate an array of nodes, which would help with the cache-miss problem. I've not seen this in any real implementation of linked lists or binary trees (not that I've seen every one, of course), although you could certainly roll your own. You'd still have a slightly higher possibility of a cache miss, though, since the node objects would be necessarily larger than the key/value pairs.
P Daddy
+1  A: 

Other comments have focused on adding/retrieving elements, but this discussion isn't complete without considering what it takes to iterate over the entire collection. The short answer here is that hash tables require less memory to iterate over, but trees require less time.

For a hash table, the memory overhead of iterating over the (key, value) pairs does not depend on the capacity of the table or the number of elements stored in the table; in fact, iterating should require just a single index variable or two.

For trees, the amount of memory required always depends on the size of the tree. You can either maintain a queue of unvisited nodes while iterating or add additional pointers to the tree for easier iteration (making the tree, for purposes of iteration, act like a linked list), but either way, you have to allocate extra memory for iteration.

But the situation is reversed when it comes to timing. For a hash table, the time it takes to iterate depends on the capacity of the table, not the number of stored elements. So a table loaded at 10% of capacity will take about 10 times longer to iterate over than a linked list with the same elements!