Hi everyone! I'm choosing a data structure for storing a large array of strings in memory. Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running. The crucial point is that the search procedure should be as fast as possible; saving memory is not important. I'm inclined toward hash_set from the standard library, which allows searching for elements in roughly constant time, but there's no guarantee that this time will be short. Can anyone suggest a better standard solution?

Many thanks!

A: 

A hash_set with a suitable number of buckets would be ideal. Alternatively, a vector with the strings in dictionary order, searched using binary search, would work well too.
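
For reference, a minimal sketch of both options. hash_set was never standardized, so this uses the modern std::unordered_set; reserving buckets up front stands in for "a suitable number of buckets", and the names here are illustrative:

    #include <algorithm>
    #include <string>
    #include <unordered_set>
    #include <vector>

    int main() {
        std::vector<std::string> words = {"cherry", "apple", "banana"};

        // Option 1: hash set with a suitable bucket count chosen up front.
        std::unordered_set<std::string> hashed;
        hashed.reserve(words.size());   // avoids rehashing during the bulk insert
        hashed.insert(words.begin(), words.end());
        bool found1 = hashed.count("banana") > 0;

        // Option 2: sorted vector searched with binary search.
        std::vector<std::string> sorted = words;
        std::sort(sorted.begin(), sorted.end());
        bool found2 = std::binary_search(sorted.begin(), sorted.end(),
                                         std::string("banana"));

        return (found1 && found2) ? 0 : 1;
    }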

Will A
+3  A: 

Try a Prefix Tree

A trie is better than a binary search tree for searching elements. For a comparison against hash tables, see this question.
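
A bare-bones trie sketch, assuming lowercase-ASCII keys (a real implementation would handle a wider alphabet and use a more compact node layout):

    #include <array>
    #include <memory>
    #include <string>

    // Each node has one child slot per letter and a flag marking word ends.
    struct TrieNode {
        std::array<std::unique_ptr<TrieNode>, 26> child;
        bool terminal = false;
    };

    void insert(TrieNode& root, const std::string& s) {
        TrieNode* node = &root;
        for (char c : s) {
            auto& next = node->child[c - 'a'];
            if (!next) next = std::make_unique<TrieNode>();
            node = next.get();
        }
        node->terminal = true;
    }

    bool contains(const TrieNode& root, const std::string& s) {
        const TrieNode* node = &root;
        for (char c : s) {
            node = node->child[c - 'a'].get();
            if (!node) return false;   // no path for this prefix
        }
        return node->terminal;
    }

    int main() {
        TrieNode root;
        insert(root, "banana");
        return (contains(root, "banana") && !contains(root, "ban")) ? 0 : 1;
    }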

Marco
*snicker* trie a prefix tree *snicker* :)
Billy ONeal
I thought of that too, but then realized that all the pointer-chasing it entails is massively inefficient. Also, it makes lookups run in O(n) where n is the length of the string.
Clark Gaebel
Judy arrays are quite exciting too, although I've never had cause to use one.
Steve Jessop
If n is the length of the string, then calculating a hash value is O(n) as well. Apart from that - excluding special cases - a naive hash table might be faster than a naive trie. Admittedly, it does not offer the same functionality, e.g. prefix matching or ordered enumeration.
andras
As @andras pointed out, a hash table is also O(n) on the length of the string. Hash lookups are O(1) only if the time needed to compute the hash function is constant, or low enough to be ignored.
MAK
@MAK: And in practice for strings you usually have to do an equality test after the hash check anyway, so there's little point in hashes that avoid using every byte of the string.
Steve Jessop
A: 

There is, for example, google-sparsehash. It includes a dense hash set/map (re)implementation that may perform better than the standard library's hash set/map; see its performance page. Make sure that you are using a good hash function. (My subjective vote: murmur2.)
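
Basic usage looks roughly like this; the header path and template defaults vary between sparsehash versions, so treat the details here as assumptions:

    #include <functional>
    #include <string>
    #include <sparsehash/dense_hash_set>   // older releases install <google/dense_hash_set>

    int main() {
        google::dense_hash_set<std::string, std::hash<std::string>> strings;
        strings.set_empty_key("");   // dense_hash_set requires a sentinel key
                                     // that is never inserted as a real value
        strings.insert("apple");
        strings.insert("banana");
        return strings.find("banana") != strings.end() ? 0 : 1;
    }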

Strings will be inserted at the beginning of the program and will not be added or deleted while the program is running.

If the strings are immutable - so insertion/deletion is "infrequent", so to speak - another option is to build a Directed Acyclic Word Graph (DAWG) or a Compact Directed Acyclic Word Graph, which might* be faster than a hash table and has a better worst-case guarantee.

*Standard disclaimer applies: depending on the use case, implementations, data set, phase of the moon, etc. Theoretical expectations may differ from observed results because of factors not accounted for (e.g. cache and memory latency, time complexity of certain machine instructions, etc.).

andras
A: 

Well, assuming you truly want an array and not an associative container as you've mentioned, the allocation strategy described in Raymond Chen's blog would be efficient.

Billy ONeal
A: 

The two standard data structures for fast string lookup are hash tables and tries, particularly Patricia tries. A good hash implementation and a good trie implementation should give similar performance, as long as the hash implementation is good enough to limit the number of collisions. Since you never modify the set of strings, you could try to build a perfect hash. If performance is more important than development time, try all solutions and benchmark them.

A complementary technique that could save lookups in the string table is to use atoms: each time you read a string that you know you're going to look up in the table, look it up immediately, and store a pointer to it (or an index in the data structure) instead of storing the string. That way, testing the equality of two strings is a simple pointer or integer equality (and you also save memory by storing each string once).
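
A sketch of the atom idea, assuming a simple interning table (the names are hypothetical):

    #include <string>
    #include <unordered_set>

    // Hypothetical atom table: each distinct string is stored exactly once,
    // identified by a stable pointer, so equality becomes pointer comparison.
    class AtomTable {
        std::unordered_set<std::string> pool_;
    public:
        const std::string* intern(const std::string& s) {
            // References into an unordered_set stay valid across inserts
            // (only iterators are invalidated), so the pointer is stable.
            return &*pool_.insert(s).first;
        }
    };

    int main() {
        AtomTable atoms;
        const std::string* a = atoms.intern("lookup-key");
        const std::string* b = atoms.intern("lookup-key");
        return (a == b) ? 0 : 1;   // same atom: plain pointer equality
    }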

Gilles
And you can do some fancy stuff to look up atoms quickly - for example, store them sorted first by length and then in lexicographical order. Store a pointer to the first and last string of each given length. Then you can look up a string by first jumping to the right block, then binary chopping.
Steve Jessop
A: 

Your best bet would be as follows:

  1. Building your structure:
    1. Insert all your strings (char*s) into an array.
    2. Sort the array lexicographically.
  2. Lookup:
    1. Use a binary search on your array.

This maintains cache locality, allows for efficient lookup (a space of ~4 billion strings can be searched with at most 32 comparisons), and is dead simple to implement. There's no need to get fancy with tries: they are complicated, and slower than they appear (especially if you have long strings).
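
A sketch of this scheme with char* keys, using std::lower_bound for the binary search (names are illustrative):

    #include <algorithm>
    #include <cstring>
    #include <vector>

    static bool less_cstr(const char* a, const char* b) {
        return std::strcmp(a, b) < 0;
    }

    int main() {
        // 1. Build: insert the strings, then sort lexicographically (once).
        std::vector<const char*> table = {"cherry", "apple", "banana"};
        std::sort(table.begin(), table.end(), less_cstr);

        // 2. Lookup: binary search over the sorted array.
        const char* key = "banana";
        auto it = std::lower_bound(table.begin(), table.end(), key, less_cstr);
        bool found = (it != table.end()) && std::strcmp(*it, key) == 0;
        return found ? 0 : 1;
    }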

Random sidenote: Combined with http://blogs.msdn.com/b/oldnewthing/archive/2005/05/19/420038.aspx, you'll be unstoppable!

Clark Gaebel
@Clark: hmm, is there some way that, once you know the order of strings, you can build a new stringpool which keeps strings close together that are likely to be the "next" comparison? So for example with 128 strings and imagine I can fit about 8 to a page, numbers 64, 32, 96, 16, 48, 80, 112 should all be on the same page, then numbers 1-8 should be on another page, 9-15 on another, and so on. Hence only two pages hit per lookup, guaranteed. Obviously it's a bit more fiddly than that since the strings won't all be the same length...
Steve Jessop
... and of course it's likely *way* more trouble than it's worth, since if you're doing a lot of lookups you'll end up with the whole thing in cache the whole time anyway. But the questioner did say lookup was the *only* crucial thing, so presumably has a lot of development time on his hands ;-)
Steve Jessop
@Clark: Yes, only 32 comparisons, but you should note that those are string comparisons, which could be relatively expensive, rather than character comparisons.
Billy ONeal
@Billy: I don't think so. For one, it hits the cache a lot less; for another, it exits early and will only compare the whole string on success.
Clark Gaebel
"it hits the cache a lot less" - probably, comparing naive implementations. A Patricia Tree could actually hit fewer cache lines than this solution does: log(n) nodes (same as your log(n) strings), and less data visited per node (minimal differentiating string fragment). Unfortunately, binary chops and simple binary trees for strings share the property that as you get lexicographically closer to your target, quite likely you get several strings with common prefixes, so you get "slow" failed comparisons. Tries look good when you have 100k strings starting, "http://stackoverflow.com/questions/".
Steve Jessop
@Clark: We are usually talking about worst case in algorithm design, not best case. Yes, for a small number of strings, the comparison will probably be faster than a hash. But when you get close to `2^30` strings, the comparison function is much more likely to be linear in the length of the contained strings.
Billy ONeal
... because you compare that 35-byte prefix once instead of log(100k) = 17 times. Which admittedly is still peanuts, but we're presupposing that performance here really is critical, meaning that it's many millions of peanuts. So I don't think we can reject a well-tuned trie-derivative without testing it.
Steve Jessop
@Billy: I don't think you want to talk about "worst case" and "hash" in the same sentence. Hashtable lives by yelling "expected case! expected case! Assume average rate of collisions!" whenever anyone looks at it funny ;-)
Steve Jessop
@Steve: I was not talking about a hash table -- my comment is strictly on this answer, which would be to use an ordered associative container.
Billy ONeal
+2  A: 

If lookup time really is the only important thing, then at startup time, once you have all the strings, you could compute a perfect hash over them, and use this as the hashing function for a hashtable.
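
As a toy illustration of that startup step (not the code-generation approach discussed below), one can brute-force a hash seed until the fixed key set maps collision-free; real generators such as gperf are far more sophisticated. FNV-1a and all names here are my own choices:

    #include <cstdint>
    #include <string>
    #include <vector>

    static uint64_t fnv1a(const std::string& s, uint64_t seed) {
        uint64_t h = seed;
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;   // FNV prime
        }
        return h;
    }

    // Search for a seed under which every key occupies a distinct slot.
    // Assumes 'slots' is comfortably larger than keys.size(); otherwise
    // the loop may run for a very long time (or forever, if slots < keys).
    uint64_t find_perfect_seed(const std::vector<std::string>& keys,
                               size_t slots) {
        for (uint64_t seed = 0; ; ++seed) {
            std::vector<bool> used(slots, false);
            bool ok = true;
            for (const auto& k : keys) {
                size_t slot = fnv1a(k, seed) % slots;
                if (used[slot]) { ok = false; break; }
                used[slot] = true;
            }
            if (ok) return seed;   // every key landed in its own slot
        }
    }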

The problem is how you'd execute the hash - any kind of byte-code-based computation is probably going to be slower than using a fixed hash and dealing with collisions. But if all you care about is lookup speed, then you can require that your process has the necessary privileges to load and execute code. Write the code for the perfect hash, run it through a compiler, load it. Test at runtime whether it's actually faster for these strings than your best known data-agnostic structure (which might be a Trie, a hashtable, a Judy array or a splay tree, depending on implementation details and your typical access patterns), and if not fall back to that. Slow setup, fast lookup.

It's almost never truly the case that speed is the only crucial point.

Steve Jessop