views:

239

answers:

5

What are good data structures for auto-completion algorithms? What data structures allow for efficiently finding strings containing a particular substring?

+2  A: 

If you are looking to do something similar to the way Google implements it's autocomplete, you might want to check out a ternary search tree:

http://igoro.com/archive/efficient-auto-complete-with-a-ternary-search-tree/

However, if you want to find any random substring within a string, try a Generalised suffix tree.

http://en.wikipedia.org/wiki/Generalised_suffix_tree

Corey Sunwold
Doesn't that only work if you only want to match prefixes? e.g. a ternary search tree helps you match "ab" in "abcd", but not "bc" in "abcd" (might be being thick, don't know much about ternary search trees, and only gave the link a fleeting glance).
Dominic Rodger
I think so, in general it does work in a x "starts with" y sort of way. However, in practice this seems to be how all autocomplete features that I have ever used worked.
Corey Sunwold
@Corey - fwiw a number of the auto-complete widgets I use day-to-day match anywhere in the string; nonetheless - useful link, so +1.
Dominic Rodger
@Dominic Roger Can you provide some examples? I haven't seen anything like that, but I'm interested to know where something like that would be useful.
Corey Sunwold
@Corey - the one I'm thinking of is in our inhouse bug-reporting system, where you can look up people by any part of their name (I tend to use surnames, because there are more of them, so you get narrowed-down results quicker - and in a `<forename> <surname>` system, surnames are obviously not prefixes).
Dominic Rodger
+2  A: 

Check out suffix array and suffix tree.

Nick D
Man, I've been looking for Ukkonen's algorithm for years and never knew about it! I have an application that needs efficient matching of substrings with errors. I've even asked in forums like this in the past and not gotten any good pointers. You've made my day!
swestrup
@swestrup: I'm glad I helped you tracing that info :) You should get a copy of *The Algorithm Design Manual*, http://www.amazon.com/Algorithm-Design-Manual-Steven-Skiena/dp/1848000693/ref=sr_1_1?ie=UTF8)
Nick D
+1  A: 

http://en.wikipedia.org/wiki/Trie

frankc
+1  A: 

If you're doing prefixes (which is what most autocompletes do) then a ternary search tree is also what I'd recommend. If you're doing general infixes, then go with a suffix tree, as mentioned above.

swestrup
Nah, its a dumb idea. Use suffix trees. Much better.
swestrup
@swestrup - if it's dumb, edit your answer
Dominic Rodger
A: 

As an alternative to Suffix Arrays, Trees and Tries, take a look at Directed Acyclic Word Graphs (DAWGs) and the Compressed variant (CDAWGs). They can be built in linear time, take up linear space, and allow for substring search.

With a more complicated search function, you can even support a limited set of wildcards.

Lucas