ansaurus

Question

How to implement autocomplete on a massive dataset

Answer 1

+3 A:

As I pointed out in "How to implement incremental search on a list", you should use a structure like a trie or Patricia trie for searching patterns in large texts.

For discovering patterns in the middle of some text there is one simple solution. I am not sure it is the most efficient one, but I usually do it as follows.

When I insert some new text into the trie, I insert it, then remove the first character and insert it again, then remove the second character and insert again, and so on until the whole text is consumed. In other words, every suffix of the text gets inserted. You can then discover every substring of every inserted text with a single search from the root.
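A minimal sketch of that insertion scheme, in Python. The names `Node`, `SuffixTrie`, `add` and `search` are made up for illustration, and storing the ids of the matching texts on every node is my own assumption about how to get from a node back to the texts; it is not something the answer specifies.

```python
class Node:
    """One trie node: child links plus the ids of texts passing through it."""

    def __init__(self) -> None:
        self.children: dict[str, "Node"] = {}
        self.text_ids: set[int] = set()


class SuffixTrie:
    """Trie holding every suffix of every added text (hypothetical helper)."""

    def __init__(self) -> None:
        self.root = Node()
        self.texts: list[str] = []

    def add(self, text: str) -> None:
        text_id = len(self.texts)
        self.texts.append(text)
        lowered = text.lower()
        # Insert the text, then the text minus its first character, and so on.
        for start in range(len(lowered)):
            node = self.root
            for ch in lowered[start:]:
                node = node.children.setdefault(ch, Node())
                node.text_ids.add(text_id)

    def search(self, fragment: str) -> list[str]:
        """Return all texts containing `fragment`, walking one node per character."""
        node = self.root
        for ch in fragment.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        return [self.texts[i] for i in sorted(node.text_ids)]
```

Note that a text of length m can add up to m(m+1)/2 nodes this way, so memory grows quickly with long texts.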

And it is really incredibly fast. To find all texts that contain a given sequence of n characters, you have to inspect at most n nodes and perform one lookup in the list of children for every node. Depending on how the child collection is implemented (array, list, binary tree, skip list), you might be able to identify the required child node in as few as 5 search steps, assuming case-insensitive Latin letters only. Interpolation search might be helpful for large alphabets and for nodes with many children, such as those usually found near the root.
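To make the "5 search steps" figure concrete, here is a small sketch that assumes the children are kept as a sorted array of characters and looked up with binary search. `find_child` is a hypothetical helper that also counts its probes.

```python
import string


def find_child(keys: list[str], ch: str) -> tuple[int, int]:
    """Binary-search sorted `keys` for `ch`; return (index or -1, probes used)."""
    lo, hi, probes = 0, len(keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if keys[mid] == ch:
            return mid, probes
        if keys[mid] < ch:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, probes


letters = list(string.ascii_lowercase)  # a fully populated node: 26 children
# Worst case over all 26 letters; binary search needs at most 5 probes here.
worst = max(find_child(letters, ch)[1] for ch in letters)
```

A hash map per node would avoid the search entirely, at the cost of more memory per node.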

Daniel Brückner 2009-03-24 20:00:57

A trie works great for finding matches at the beginning of a string. However, with my current dataset, removing the first char and then inserting again didn't end up working very well; it just started using way too much memory: > 1 GB before it was half done with the dataset.

aquinas 2009-03-25 13:17:26

This may be a case of premature optimization: when I just ran a naive "contains" search, the runtime was less than 100 milliseconds. Lucene also looks really cool, so I might try that for fun. Another idea would be to use a combination of trie and naive search.

aquinas 2009-03-25 13:20:01

Start with the trie, and if you have fewer than 20 "starts with" matches, fall back to naive. Why can I only insert 300 characters??!
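A rough sketch of that combination, under some assumptions: `PrefixNode`, `build_trie` and `suggest` are invented names, the threshold of 20 is the one from the comment, and putting the prefix hits ahead of the naive hits is just one possible ordering.

```python
MIN_PREFIX_HITS = 20  # threshold mentioned in the comment above


class PrefixNode:
    def __init__(self) -> None:
        self.children: dict[str, "PrefixNode"] = {}
        self.item_ids: list[int] = []  # items whose text starts with this path


def build_trie(items: list[str]) -> PrefixNode:
    root = PrefixNode()
    for item_id, item in enumerate(items):
        node = root
        for ch in item.lower():
            node = node.children.setdefault(ch, PrefixNode())
            node.item_ids.append(item_id)
    return root


def suggest(items: list[str], root: PrefixNode, query: str,
            min_hits: int = MIN_PREFIX_HITS) -> list[str]:
    # Step 1: "starts with" matches from the trie.
    node = root
    for ch in query.lower():
        node = node.children.get(ch)
        if node is None:
            break
    prefix_ids = node.item_ids if node is not None else []
    results = [items[i] for i in prefix_ids]
    if len(results) >= min_hits:
        return results
    # Step 2: too few prefix hits, so add naive "contains" matches after them.
    seen = set(prefix_ids)
    needle = query.lower()
    results += [item for i, item in enumerate(items)
                if i not in seen and needle in item.lower()]
    return results
```

The naive scan only runs when the prefix lookup comes up short, so common queries stay on the fast path.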

aquinas 2009-03-25 13:20:52

Answer 2

+2 A:

I would use something along the lines of a trie, and have the value of each leaf node be a list of the possibilities that contain the word represented by the leaf node. You could sort them in order of likelihood, or dynamically sort/filter them based on other words the user has entered into the search box, etc. It will execute very quickly and in a reasonable amount of RAM.
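One way this could look, as a sketch only. `WordNode`, `WordIndex`, `add` and `complete` are hypothetical names, splitting entries on whitespace is an assumption, and the alphabetical sort is merely a stand-in for a real likelihood ordering.

```python
class WordNode:
    def __init__(self) -> None:
        self.children: dict[str, "WordNode"] = {}
        # Filled only where a whole word ends. Such a node is not always a
        # leaf, e.g. "pine" ends on the path to "pineapple".
        self.entries: list[str] = []


class WordIndex:
    def __init__(self) -> None:
        self.root = WordNode()

    def add(self, entry: str) -> None:
        """Index `entry` under every distinct word it contains."""
        for word in set(entry.lower().split()):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, WordNode())
            node.entries.append(entry)

    def complete(self, prefix: str) -> list[str]:
        """Return entries containing any word that starts with `prefix`."""
        node = self.root
        for ch in prefix.lower():
            node = node.children.get(ch)
            if node is None:
                return []
        found: set[str] = set()
        stack = [node]
        while stack:  # walk the subtree below the prefix
            current = stack.pop()
            found.update(current.entries)
            stack.extend(current.children.values())
        return sorted(found)  # placeholder for sorting by likelihood
```

Because only whole words are inserted, this stays far smaller than a trie of all suffixes, but it cannot match in the middle of a word.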

rmeador 2009-03-24 20:01:13

Answer 3

A:

You keep the items on the server side (perhaps in a DB, if the dataset is really large and complex) and send AJAX calls from the client's browser that return the results as JSON or XML. You can do this in response to the user typing, or with a timer.
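For illustration, a bare-bones server side using only the Python standard library. Everything specific here is an assumption: the `/suggest` path, the `q` parameter, the in-memory `ITEMS` list standing in for a database, and the cap of 10 results.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

ITEMS = ["apple", "apricot", "banana", "grape"]  # placeholder for a real DB
MAX_RESULTS = 10  # assumed cap so responses stay small


def lookup(query: str) -> list[str]:
    """Naive case-insensitive "contains" search over ITEMS."""
    needle = query.strip().lower()
    if not needle:
        return []
    return [item for item in ITEMS if needle in item.lower()][:MAX_RESULTS]


class SuggestHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        url = urlparse(self.path)
        if url.path != "/suggest":  # hypothetical endpoint name
            self.send_error(404)
            return
        query = parse_qs(url.query).get("q", [""])[0]
        body = json.dumps(lookup(query)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int = 8000) -> HTTPServer:
    """Build the server; call .serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), SuggestHandler)
```

With `make_server().serve_forever()` running, the browser would request something like `/suggest?q=ap` and get a JSON array back.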

Assaf Lavie 2009-03-24 20:02:53

Answer 4

A:

Not algorithmically related to what you are asking, but make sure you have a delay (lag) of 200 ms or more after the keypress(es), so that you know the user has stopped typing before you issue the asynchronous request. That way you will reduce redundant HTTP requests to the server.
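In the browser this would normally be a few lines of JavaScript around a timer; the same idea sketched in Python, to show the logic only. `Debouncer` and `keypress` are invented names, and `action` stands for whatever actually sends the request.

```python
import threading
from typing import Callable, Optional


class Debouncer:
    """Run `action` only after `delay` seconds without a new keypress."""

    def __init__(self, delay: float, action: Callable[[str], None]) -> None:
        self.delay = delay
        self.action = action
        self._timer: Optional[threading.Timer] = None
        self._lock = threading.Lock()

    def keypress(self, text: str) -> None:
        with self._lock:
            # A new keypress cancels the pending request, if any.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self.delay, self.action, args=(text,))
            self._timer.daemon = True
            self._timer.start()
```

For example, `Debouncer(0.2, send_request)` with a hypothetical `send_request` function would fire once, with the latest text, 200 ms after the user stops typing.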

cherouvim 2009-03-24 20:22:26

Answer 5

A:

Don't try to implement this yourself (unless you're just curious). Use something like Lucene or Endeca - it will save you time and hair.

Jim Arnold 2009-03-24 20:56:11

Lucene seems really cool, thanks for the suggestion! But, yeah, of COURSE I'm curious! :)

aquinas 2009-03-25 13:21:26
