views: 396
answers: 2

I'm implementing a "Google Suggest" like autocomplete feature for tag searching using jQuery's autocomplete.

I need to provide a web service to jQuery giving it a list of suggestions based on what the user has typed. I see 2 ways of implementing the web service:

1) just store all the tags in a database and search the DB using user input as prefix. This is simple, but I'm concerned about latency.
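As a rough sketch of option 1, here is a self-contained prefix search using the stdlib `sqlite3` module (in Django you would express the same query through the ORM, e.g. `Tag.objects.filter(name__istartswith=prefix)`; the table layout and sample data below are illustrative only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT PRIMARY KEY, frequency INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [("apple", 213), ("apples", 40), ("apricot", 7), ("banana", 99)],
)

def suggest(prefix, limit=10):
    # Prefix match; ordering by frequency puts popular tags first.
    rows = conn.execute(
        "SELECT name FROM tags WHERE name LIKE ? || '%' "
        "ORDER BY frequency DESC LIMIT ?",
        (prefix, limit),
    )
    return [name for (name,) in rows]

print(suggest("ap"))  # ['apple', 'apples', 'apricot']
```

With an index on the tag column, a prefix query like this is exactly the shape databases are optimized for.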

2) Use an in-process trie to store all the tags and search it for matching results. Since everything stays in-process, I expect this to have much lower latency. But there are several difficulties:

- What's a good way to initialize the trie on process start-up? Presumably I'll store the tag data in a DB, retrieve it, and build the trie when the process first starts up, but I'm not sure how. I'm using Python/Django.
- When a user creates a new tag, I need to insert it into the trie. But say I have 5 Django processes and hence 5 tries - how do I tell the other 4 tries that they need to insert the new tag too?
- How do I make sure the trie is thread-safe, since my Django processes are threaded (I'm using mod_wsgi)? Or do I not have to worry about thread safety because of Python's GIL?
- Is there any way to store each tag's frequency of use within the trie as well? How do I tell where the tag's string ends and where the frequency starts - e.g. if I store apple213 in the trie, is it "apple" with frequency 213, or is it "apple2" with frequency 13?
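For what it's worth, the apple213 ambiguity disappears if the frequency is stored as a value on the terminal node rather than appended to the key. A minimal sketch (class and method names are my own, not from any library):

```python
class TrieNode:
    __slots__ = ("children", "frequency")

    def __init__(self):
        self.children = {}       # char -> TrieNode
        self.frequency = None    # None means "no tag ends here"


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tag, frequency=0):
        node = self.root
        for ch in tag:
            node = node.children.setdefault(ch, TrieNode())
        # Frequency lives on the node, so "apple"/213 and "apple2"/13
        # are distinct entries and can never be confused.
        node.frequency = frequency

    def suggest(self, prefix, limit=10):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        # Collect every completion under this node, then rank by frequency.
        results, stack = [], [(node, prefix)]
        while stack:
            cur, word = stack.pop()
            if cur.frequency is not None:
                results.append((word, cur.frequency))
            for ch, child in cur.children.items():
                stack.append((child, word + ch))
        results.sort(key=lambda item: -item[1])
        return [word for word, _ in results[:limit]]
```

Initializing it at start-up would then just be a loop over the tags table, calling `insert(name, frequency)` for each row.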

Any help on the issues above or any suggestions on a different approach would be really appreciated.

+1  A: 

I would use the first option. KISS - Keep It Simple, Stupid.

For small amounts of data there shouldn't be much latency. We run the same kind of thing for a name search and results appear pretty quickly on a few thousand rows.

Hope that helps,

Josh

+3  A: 

Don't be concerned about latency before you measure things -- make up a bunch of pseudo-tags, stick them in the DB, and measure latencies for typical queries. Depending on your DB setup, your latency may be just fine and you're spared wasted worries.
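A quick way to run that experiment, sketched with an in-memory SQLite table of random pseudo-tags (the numbers you get here are illustrative - repeat it against your real DB setup and a realistic row count):

```python
import random
import sqlite3
import string
import timeit

random.seed(0)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tags (name TEXT)")

# Generate a bunch of 8-character pseudo-tags.
pseudo_tags = [
    "".join(random.choice(string.ascii_lowercase) for _ in range(8))
    for _ in range(50000)
]
conn.executemany("INSERT INTO tags VALUES (?)", ((t,) for t in pseudo_tags))
conn.execute("CREATE INDEX idx_tags_name ON tags (name)")

def query():
    # A typical autocomplete query: short prefix, small result cap.
    conn.execute(
        "SELECT name FROM tags WHERE name LIKE 'ab%' LIMIT 10"
    ).fetchall()

per_query = timeit.timeit(query, number=100) / 100
print("avg latency: %.3f ms" % (per_query * 1000))
```

If the measured number is already well under your latency budget, the trie buys you nothing but complexity.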

Do always worry about threading, though - the GIL doesn't make race conditions go away (control might switch among threads at any pseudocode instruction boundary, as well as when C code in an underlying extension or builtin is executing). You need first to check the threadsafety attribute of the DB API module you're using (see PEP 249), and then use locking appropriately or spawn a small pool of dedicated threads that perform DB interactions (receiving requests on a Queue.Queue and returning results on another, the normal architecture for sound and easy threading in Python).
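The dedicated-thread architecture mentioned above can be sketched like this, using the modern `queue` module (spelled `Queue.Queue` in Python 2) and a fake dict-backed lookup standing in for the real DB call:

```python
import queue
import threading

requests = queue.Queue()

def db_worker(handle):
    # One thread owns the DB interaction; handle(query) does the actual work.
    while True:
        query, reply_q = requests.get()
        if query is None:            # sentinel: shut down
            break
        reply_q.put(handle(query))

def ask(query):
    # Called from any request thread; blocks until the worker replies.
    reply_q = queue.Queue()
    requests.put((query, reply_q))
    return reply_q.get()

# Demo: a fake "DB" lookup in place of a real cursor.
fake_db = {"ap": ["apple", "apricot"], "ba": ["banana"]}
worker = threading.Thread(
    target=db_worker, args=(lambda q: fake_db.get(q, []),), daemon=True
)
worker.start()

result = ask("ap")
print(result)   # ['apple', 'apricot']

requests.put((None, None))          # tell the worker to exit
worker.join()
```

Since only the worker thread ever touches the connection, request threads never race on it regardless of the driver's threadsafety level.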

Alex Martelli