What is a fast and efficient way to implement the server-side component for an autocomplete feature in an HTML input box?

I am writing a service to autocomplete user queries in our web interface's main search box, with the completions displayed in an AJAX-powered dropdown. The data we run queries against is simply a large table of concepts our system knows about, which matches roughly the set of Wikipedia page titles. Speed is obviously of utmost importance for this service, as the responsiveness of the web page is important to the user experience.

The current implementation simply loads all concepts into memory in a sorted set and performs a simple O(log n) lookup on each user keystroke. The tail set is then used to provide additional matches beyond the closest match. The problem with this solution is that it does not scale: it is already running up against the VM heap-space limit (I've set -Xmx2g, which is about the most we can push on our 32-bit machines), and this prevents us from expanding our concept table or adding more functionality. Switching to 64-bit VMs on machines with more memory isn't an immediate option.
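For concreteness, a minimal sketch of the current approach (class and method names are illustrative):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Sketch of the in-memory approach described above: every concept lives
// in a TreeSet, and tailSet() walks the matches after an O(log n) seek.
class InMemoryCompleter {
    private final TreeSet<String> concepts = new TreeSet<String>();

    void add(String concept) {
        concepts.add(concept);
    }

    List<String> complete(String prefix, int max) {
        List<String> matches = new ArrayList<String>();
        for (String s : concepts.tailSet(prefix)) {
            if (!s.startsWith(prefix) || matches.size() >= max) {
                break;
            }
            matches.add(s);
        }
        return matches;
    }
}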

I've been hesitant to start working on a disk-based solution, as I am concerned that disk seek time will kill performance. Are there possible solutions that will let me scale better, either entirely in memory or with some fast disk-backed implementations?

Edits:

@Gandalf: For our use case it is important that the autocompletion is comprehensive and isn't just extra help for the user. As for what we are completing, it is a list of concept-type pairs. For example, possible entries are [("Microsoft", "Software Company"), ("Jeff Atwood", "Programmer"), ("StackOverflow.com", "Website")]. We are using Lucene for the full search once a user selects an item from the autocomplete list, but I am not yet sure Lucene would work well for the autocomplete itself.

@Glen: No databases are being used here. When I'm talking about a table I just mean the structured representation of my data.

@Jason Day: My original approach to this problem was to use a trie, but the memory bloat with that was actually worse than the sorted set due to the large number of object references needed. I'll read up on ternary search trees to see if they could be of use.

+4  A: 

With a set that large I would try something like a Lucene index to find the terms you want, and set a timer task that gets reset after every keystroke, with a half-second delay. This way, if a user types multiple characters quickly it doesn't query the index on every stroke, only when the user pauses. Usability testing will tell you how long that pause should be.

import java.util.Timer;
import java.util.TimerTask;

Timer findQuery = new Timer();
...
public void keyStrokeDetected() {
   // A cancelled Timer cannot be reused, so replace it to drop any pending query.
   findQuery.cancel();
   findQuery = new Timer();
   // `text` must be final so the anonymous TimerTask can capture it.
   final String text = widget.getEnteredText();
   TimerTask task = new TimerTask() {
      public void run() {
         // ...query the Lucene index for matches on `text`
      }
   };
   findQuery.schedule(task, 350); // fires only if no keystroke arrives for 350 ms
}

That's pseudocode, but it's the idea. Also, if the set of query terms is fixed, the Lucene index can be pre-created and optimized.

Gandalf
I don't think the keystroke stuff here is really necessary, as that doesn't sound like the problem. But I do agree that you might want to put all of your content into a Lucene index. Lucene is insanely fast for this kind of thing.
John Gardner
A: 

If you can't physically load all the data into RAM then you're going to have to deal with having some on disk.

What DB are you using?

For example, Oracle has an option where you can keep an entire table in memory and perform your queries against that.

MySQL also claims to have some in-memory capabilities, but I don't know much about MySQL.

You could then do away with your Java-based cache, or use the cache only for the most popular/recent searches.

Obviously, once you run out of RAM some of the data will be on disk when you query for it, but depending on the load on the system this will only be an issue for the first keypress, not subsequent ones, as the row will be in memory after that.

If the disk seek is slowing you down, then you could investigate using SSD drives to speed up your reads.

Glen
+3  A: 

I had a similar requirement.

I used a relational database with a single, well-indexed synthetic table (avoiding joins and views to speed up lookups), and an in-memory cache (Ehcache) to store the most-used entries.

By using an MRU cache you'll get instant response times for most lookups, and there is probably nothing that beats a relational database at accessing an indexed column of a big table stored on disk.
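For illustration, here is a minimal MRU cache in plain Java using LinkedHashMap's access-order mode; Ehcache gives you the same behavior out of the box, plus tunable eviction and persistence:

import java.util.LinkedHashMap;
import java.util.Map;

// Minimal MRU cache: get() refreshes an entry's recency, and the
// least-recently-used entry is evicted once capacity is exceeded.
class MruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    MruCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
    }
}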

This is a solution for big datasets that you can't store on the client, and it works pretty fast (in my case, every non-cached lookup returned in under 0.5 seconds). It's also horizontally scalable - you can always add additional application and database servers.

You could also play with caching only the most-used results on the client, especially if you've already implemented it. In my case the server-side solution is fast enough, and client load times are slow enough as it is, so it isn't warranted.

P.S. Having the client query only after the user pauses for a certain amount of time, as suggested above, is a good way to avoid repeated lookups. On my client I query the database only after the first three characters are entered, since anything shorter returns too many results in all instances.

Domchi
A: 

Maybe I misunderstood your question, but couldn't you use a jQuery plugin to AJAX the info to your app?

I have used this one before:

Ajax Auto Suggest v2

Antonio Louro
On the web interface side I am using jQuery for the AJAX callback. I'm asking about the server side of things here.
toluju
+1  A: 

I've done this for small data sets using a ternary search tree. The DDJ code is not too difficult to convert to Java, but it assumes the entire data set will fit into memory. There are on-disk implementations of ternary search trees (here is one in Python), but of course they will be less performant. Since ternary search trees excel at partial matches, though, the performance might be suitable for your needs.
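A minimal in-memory ternary search tree with prefix matching might look like the sketch below (modeled loosely on the DDJ description; not tuned for production):

import java.util.ArrayList;
import java.util.List;

class TernarySearchTree {
    private static class Node {
        char c;
        Node left, mid, right;
        boolean isWord;
        Node(char c) { this.c = c; }
    }

    private Node root;

    public void insert(String word) {
        root = insert(root, word, 0);
    }

    private Node insert(Node n, String w, int i) {
        char c = w.charAt(i);
        if (n == null) n = new Node(c);
        if (c < n.c)      n.left  = insert(n.left, w, i);
        else if (c > n.c) n.right = insert(n.right, w, i);
        else if (i < w.length() - 1) n.mid = insert(n.mid, w, i + 1);
        else n.isWord = true;
        return n;
    }

    // Collect up to `limit` completions for the given prefix.
    public List<String> prefixMatches(String prefix, int limit) {
        List<String> out = new ArrayList<String>();
        if (prefix.length() == 0) return out;
        Node n = find(root, prefix, 0);
        if (n == null) return out;
        if (n.isWord) out.add(prefix);
        collect(n.mid, new StringBuilder(prefix), out, limit);
        return out;
    }

    private Node find(Node n, String w, int i) {
        if (n == null) return null;
        char c = w.charAt(i);
        if (c < n.c) return find(n.left, w, i);
        if (c > n.c) return find(n.right, w, i);
        if (i == w.length() - 1) return n;
        return find(n.mid, w, i + 1);
    }

    // In-order walk of the subtree, extending the prefix one char at a time.
    private void collect(Node n, StringBuilder prefix, List<String> out, int limit) {
        if (n == null || out.size() >= limit) return;
        collect(n.left, prefix, out, limit);
        prefix.append(n.c);
        if (n.isWord && out.size() < limit) out.add(prefix.toString());
        collect(n.mid, prefix, out, limit);
        prefix.deleteCharAt(prefix.length() - 1);
        collect(n.right, prefix, out, limit);
    }
}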

Jason Day
A: 

Are there possible solutions that will let me scale better

Yes, Oracle. This is the kind of thing databases are built for. Just index the relevant columns. If you are running up against the wall of in-memory solutions, then the trade-off with disk seek time or network latency is probably moot, especially if you insert a caching layer in between.
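For instance, a prefix lookup against an indexed column might look like this sketch (table, column, and variable names are illustrative; a B-tree index on the concept column makes the LIKE 'prefix%' predicate a range scan rather than a full table scan):

import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// `connection` is an open JDBC connection and `prefix` the user's input.
List<String> suggestions = new ArrayList<String>();
PreparedStatement stmt = connection.prepareStatement(
    "SELECT concept, concept_type FROM concepts WHERE concept LIKE ? ORDER BY concept");
stmt.setString(1, prefix + "%"); // parameterized to avoid SQL injection
stmt.setMaxRows(10);             // cap the number of completions returned
ResultSet rs = stmt.executeQuery();
while (rs.next()) {
    suggestions.add(rs.getString("concept"));
}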

Also, you may be able to decrease the number of hits if you tweak your client-side code a little, such as setting a minimum number of typed characters before a query is run, or adding a fraction of a second of delay after the user stops typing. If you are already doing those, set them a bit higher.

jiggy
+1  A: 

I ended up resolving this via Lucene; the initial performance tests seem sufficient for our use case. A little hacking was necessary to make the prefix queries work, as I was running into the TooManyClauses exception when expanding queries such as "Jeff At*". I ended up wrapping my IndexReader with a FilterIndexReader and setting a hard cap on the number of terms returned by a prefix term call. Here's my code:

import java.io.IOException;

import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

Directory directory = FSDirectory.getDirectory(indexDir);
IndexReader reader = IndexReader.open(directory);

// Wrap the reader so term enumeration stops before prefix expansion
// can exceed BooleanQuery's clause limit and throw TooManyClauses.
FilterIndexReader filteredReader = new FilterIndexReader(reader) {
  @Override public TermEnum terms(Term t) throws IOException {
    final TermEnum origEnum = super.terms(t);

    return new TermEnum() {
      private int count = 0;

      // Stop enumerating just short of the max clause count.
      @Override public boolean next() throws IOException {
        if (count++ < (BooleanQuery.getMaxClauseCount() - 10))
          return origEnum.next();
        else return false;
      }

      @Override public Term term() {
        return origEnum.term();
      }

      @Override public int docFreq() {
        return origEnum.docFreq();
      }

      @Override public void close() throws IOException {
        origEnum.close();
      }
    };
  }
};

IndexSearcher searcher = new IndexSearcher(filteredReader);
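A prefix query against this searcher then looks something like the following (the "concept" field name is an assumption, and this uses the old Lucene 2.x search signatures):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

// Expand the user's prefix into a query and fetch the top completions.
// "concept" is a hypothetical field name for this example.
Query query = new PrefixQuery(new Term("concept", "jeff at"));
TopDocs results = searcher.search(query, null, 10); // top 10 matches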
toluju
+2  A: 

For those who stumble upon this question...

I just posted a server-side autocomplete implementation on Google Code. The project includes a Java library that can be integrated into existing applications and a standalone HTTP AJAX autocomplete server.

My hope is that it enables people to incorporate efficient autocomplete into their applications. Kick the tires!

Shilad Sen
How do you start the server? java -jar autocomplete-server-0.3.jar does not work. Thanks for the info.
Alfred
Good question. I added an example to the autocomplete-server homepage and released a new version (0.4).
Shilad Sen
Thanks for the feedback.
Alfred
A: 

I used a hashtable and mmap(), and a term list with 10,000,000+ records isn't a problem. See the demo here: http://olegh.ath.cx/autocomplete.html
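In Java, the rough equivalent of mmap() is a MappedByteBuffer; a minimal sketch (the file name and record layout here are hypothetical):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Memory-map the on-disk term list so the OS page cache, rather than
// the JVM heap, holds the data. "terms.dat" is a hypothetical file of
// offset-indexed records.
RandomAccessFile file = new RandomAccessFile("terms.dat", "r");
MappedByteBuffer terms = file.getChannel()
    .map(FileChannel.MapMode.READ_ONLY, 0, file.length());
// Lookups read records directly from `terms` (e.g. terms.get(offset))
// without deserializing the whole table into heap objects.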

maxihatop
A: 

I see all kinds of suggestions have been made here. We tried several approaches and finally settled on one that is elegant and scales well.

If you need AutoComplete for Solr, here is something that does it:

http://sematext.com/products/autocomplete/index.html

You can see how (well) it works at http://search-lucene.com/.

Otis Gospodnetic