Ok, tries have been around for a while. A typical implementation gives you O(m) lookup, insert and delete operations independently of the size n of the data set, where m is the key length. However, that same implementation takes up to 256 words per input byte in the worst case, because each node holds a full 256-pointer child array.
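For concreteness, here is a minimal sketch of the kind of implementation I mean (Python, names illustrative; each node carries a 256-slot child array):

    class TrieNode:
        def __init__(self):
            self.children = [None] * 256  # one slot per possible byte value
            self.is_key = False           # marks the end of a stored key

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, key: bytes):
            # O(m) time, but every freshly allocated node costs 256 slots
            node = self.root
            for b in key:
                if node.children[b] is None:
                    node.children[b] = TrieNode()
                node = node.children[b]
            node.is_key = True

        def lookup(self, key: bytes) -> bool:
            # O(m), independent of the number of stored keys
            node = self.root
            for b in key:
                node = node.children[b]
                if node is None:
                    return False
            return node.is_key

Storing a single 10-byte key that shares no prefix with anything else allocates ten such nodes, i.e. 2560 pointer slots, which is where the 256-words-per-byte figure comes from.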

Other data structures, notably hash tables, give you expected O(m) lookup, insertion and deletion, with some implementations even claiming constant-time lookup. In the worst case, however, the routines either do not halt or take O(nm) time.

The question is: is there a data structure that provides O(m) lookup, insertion and deletion while keeping a memory footprint comparable to that of hashing or search trees?

I should say that I am only interested in worst-case behaviour, both time- and space-wise.

+1  A: 

Did you try Patricia tries (also known as crit-bit or radix tries)? I think they solve the worst-case space issue.

Luther Blissett
What's the difference between a radix trie and a regular trie?
Gabe
Patricia tries seem a viable alternative. I will try to derive or look up the worst-case bounds. Do you have any references?
Luís Fernando Schultz Xavier
Gabe: they have strings instead of single characters on their edges, so chains of fan-out-1 nodes collapse into one edge (see the sketch at the end of this thread).
Luís Fernando Schultz Xavier
Here's a write-up on DJB's (Dan Bernstein, of qmail fame) take on Patricia, crit-bit: http://www.imperialviolet.org/2008/09/29/critbit-trees.html My default assumption is that if DJB uses it, it's probably great. I remember he announced a paper on its performance, but I don't know whether that has appeared yet.
Luther Blissett
Thank you for the great reference! I will be looking for an analysis of this data structure.
Luís Fernando Schultz Xavier
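To make the edge-compression idea from the comments concrete, here is a rough radix-trie sketch (Python, illustrative only; a real crit-bit tree branches on the first distinguishing bit instead of scanning edge labels):

    class RadixNode:
        def __init__(self):
            self.edges = {}      # edge label (a string) -> child node
            self.is_key = False

    def common_prefix_len(a, b):
        n = 0
        while n < min(len(a), len(b)) and a[n] == b[n]:
            n += 1
        return n

    def radix_insert(root, key):
        node = root
        while key:
            for label in list(node.edges):
                p = common_prefix_len(label, key)
                if p == 0:
                    continue
                if p < len(label):
                    # split the edge at the point of divergence
                    mid = RadixNode()
                    mid.edges[label[p:]] = node.edges.pop(label)
                    node.edges[label[:p]] = mid
                    node = mid
                else:
                    node = node.edges[label]
                key = key[p:]
                break
            else:
                # no edge shares a prefix: one new edge swallows the rest
                child = RadixNode()
                node.edges[key] = child
                node = child
                key = ""
        node.is_key = True

    def radix_lookup(root, key):
        node = root
        while key:
            for label, child in node.edges.items():
                if key.startswith(label):
                    node, key = child, key[len(label):]
                    break
            else:
                return False
        return node.is_key

Since nodes are only created at branch points or at ends of keys, the node count is bounded by the number of stored keys, which is what tames the worst-case space.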
A: 

There is a structure known as a suffix array. I can't remember the exact results in this area, but I think lookups have gotten down to roughly O(m + log n) with this structure, and it is much more compact than your typical tree-based indexing methods.

Dan Gusfield's book is the Bible of string algorithms.
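For illustration, a minimal sketch of the lookup, assuming Python and a deliberately naive construction (not how a production suffix array would be built):

    def build_suffix_array(text):
        # naive O(n^2 log n) construction, for illustration only;
        # practical algorithms get this down to O(n log n) or O(n)
        return sorted(range(len(text)), key=lambda i: text[i:])

    def sa_contains(text, sa, pattern):
        # binary search over the sorted suffixes: O(m log n) comparisons,
        # improvable to O(m + log n) with precomputed LCP information
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
                lo = mid + 1
            else:
                hi = mid
        return lo < len(sa) and text[sa[lo]:sa[lo] + len(pattern)] == pattern

    # sa = build_suffix_array("banana")
    # sa_contains("banana", sa, "nan")  -> True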

San Jacinto
Suffix arrays are usually static.
Luís Fernando Schultz Xavier
My apologies. I didn't know this was a requirement.
San Jacinto
A: 

I don't think there is reason to worry about the worst case, for two reasons:

  1. You'll never have more total active branches, summed over all trie nodes, than the total size of the stored data.
  2. Node size only becomes an issue if there is huge fan-out in the data you're sorting/storing; mnemonics would be an example of that. If you're relying on the trie as a compression mechanism, then a hash table would do no better for you.

If you need to compress and you have few/no common subsequences, then you need to design a compression algorithm based on the specific shape of the data rather than based on generic assumptions about strings. For example, in the case of a fully/highly populated mnemonic data set, a data structure that tracked the "holes" in the data rather than the populated data might be more efficient.

That said, it might pay to avoid a fixed trie-node size if you have moderate fan-out. You could make each node of the trie a hash table: start with a small table and grow it as elements are inserted. Worst-case insertion would then be O(c * m), reached when every hash table on the path has to be reorganized as it grows, where c is the number of possible characters (unique atomic elements).
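A sketch of how that might look, with Python dicts standing in for the growable per-node hash tables (illustrative, not a tuned implementation):

    class HashTrieNode:
        def __init__(self):
            self.children = {}   # the dict doubles as a growable hash table
            self.is_key = False

    class HashTrie:
        def __init__(self):
            self.root = HashTrieNode()

        def insert(self, key):
            node = self.root
            for ch in key:
                # the per-node table grows only as fan-out demands
                node = node.children.setdefault(ch, HashTrieNode())
            node.is_key = True

        def lookup(self, key):
            node = self.root
            for ch in key:
                node = node.children.get(ch)
                if node is None:
                    return False
            return node.is_key

Space is then proportional to the fan-out each node actually exhibits rather than to the alphabet size c.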

David Gladfelter
Point (1) is correct, but as for point (2), I am not interested in compression; my worry is that, as you pointed out, many low-fan-out nodes may waste a lot of empty array entries. What I want is a bound on the number of words/bytes per byte inserted that my trie will use. It could be 4 or 8, but not 256*4 = 1K. And I am definitely not looking for anything below 1, i.e., I am not trying to do compression.
Luís Fernando Schultz Xavier