views:

287

answers:

10

In my Android app I want to have an input field with autocomplete. The number of items will be about 300000. The best solution seems to be to put the items into a file (on the sdcard), one item per line; each line would have the same number of characters so that I can seek to a specific line number. If the user enters something in the text field, I would binary search the file (via RandomAccessFile) and show suggestions.
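The fixed-width scheme described above can be sketched roughly like this (the record width, padding convention, and class/method names are my assumptions, not actual code from the app):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Sketch of prefix binary search over a file of fixed-width, sorted records.
// RECORD_LEN (data + '\n') is assumed; shorter entries are space-padded.
public class FixedWidthSearch {
    static final int RECORD_LEN = 16; // 15 chars + newline, hypothetical

    static String readRecord(RandomAccessFile f, long index) throws IOException {
        byte[] buf = new byte[RECORD_LEN - 1];
        f.seek(index * RECORD_LEN); // constant-time jump to any line
        f.readFully(buf);
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }

    // Returns the index of the first record >= prefix, or the record
    // count if every record sorts before the prefix.
    static long lowerBound(RandomAccessFile f, String prefix) throws IOException {
        long lo = 0, hi = f.length() / RECORD_LEN;
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (readRecord(f, mid).compareTo(prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

From the returned index you would scan forward, collecting records until one no longer starts with the prefix.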

I want the autocomplete to be super fast (ideally under 100ms, but I guess that's impossible). What optimizations can I do?

Update 1: I will convert the user's input to lowercase English characters (a-z) with spaces. So 'A/b' would be converted to 'a b' and then searched.
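The normalization in Update 1 could look something like this (a minimal sketch; the class name and the choice to collapse runs of non-letters into one space are my assumptions):

```java
// Maps input to lowercase a-z, turning every other character into a
// single space, so 'A/b' becomes "a b" as described in Update 1.
public class Normalizer {
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        boolean lastWasSpace = true; // collapse runs of non-letters
        for (char c : input.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sb.append(c);
                lastWasSpace = false;
            } else if (!lastWasSpace) {
                sb.append(' ');
                lastWasSpace = true;
            }
        }
        return sb.toString().trim();
    }
}
```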

Update 2: I now realize I need one additional thing: to search for word-starting substrings.

+5  A: 

Why don't you just use a SQLite DB rather than a text file?
I don't think you can do anything better speed-wise than a portable database in your situation.

Razor
Internally, SQLite also uses binary search. But SQLite stores data in a btree, which means less file access operations are required than when using a sorted flat file. And the SQLite database is most likely smaller than a fixed size flat file.
Thomas Mueller
I want a feature to download additional data which would make using a db more complicated. And would it be really faster?
fhucho
I'm pretty sure it would be a LOT faster but you won't know until you test.
Falmarri
@fhucho Adding more data to a DB rather than a text file is hardly more complicated. Also bear in mind that you'd have to implement **and optimize** the search logic yourself if you don't use a DB.
Razor
@Razor Adding 300000 rows to a database would be very slow and would slow down the user's phone for some time. That would result in a bad user experience.
fhucho
@fhucho why can't you just distribute your app with the DB insertions *already* done? SQLite DBs consist of a single file, populate it on your PC or something, and add the binary ready made DB to the files necessary to your program.
Razor
@Razor my mistake, I thought I couldn't use a downloaded file as a db. But I will use a trie because I need to search word-starting substrings (I updated the question). Thanks for the help anyway...
fhucho
Worth noting that with SQL you can search the beginning, end, middle, or any part of a word. With the solution you're headed for, if the user looks for "angeles", it wouldn't find "los angeles" in your list. Might not be something you are interested in, though.
Razor
+1  A: 

check this out http://en.wikipedia.org/wiki/Binary_search_algorithm

In a sorted file you have a binary search worst case of O(log n). The next best thing would be some sort of hash mapping, which is O(1), although that is complicated for partial words and would produce a huge mapping table.

Joe Hopfgartner
+2  A: 

One strategy could be to narrow the results using the RandomAccessFile and binary search. Then, once the set of possible entries is small enough, load that portion into memory and do an in-memory search.

This will improve performance, because as people type you can quickly search the same portion of the file that you have loaded in memory.
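A sketch of this "narrow on disk, then refine in memory" idea (the class shape, the null-return convention for a cache miss, and all names are my assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Once binary search has reduced the candidate range below some
// threshold, keep that slice cached and refine it as the user types.
public class SuggestionCache {
    private final List<String> slice;   // candidates loaded from the file
    private final String loadedPrefix;  // prefix the slice was loaded for

    SuggestionCache(List<String> slice, String prefix) {
        this.slice = slice;
        this.loadedPrefix = prefix;
    }

    // Cheap refinement: no disk access as long as the new input still
    // starts with the loaded prefix. Returns null when the caller must
    // go back to the file and reload a new slice.
    List<String> suggest(String input) {
        if (!input.startsWith(loadedPrefix)) return null;
        List<String> out = new ArrayList<>();
        for (String s : slice) {
            if (s.startsWith(input)) out.add(s);
        }
        return out;
    }
}
```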

jjnguy
Good idea, thanks.
fhucho
+1  A: 

Preprocess your possibilities into a search tree ahead of time, instead of doing it at runtime.

Thorbjørn Ravn Andersen
+1  A: 

I would suggest seeing if you can use a standard library for this purpose. Maybe Apache Lucene can be used on Android phones. If so, you can build an index (word prefix -> the id of a word in the Android SQLite database). Here is a discussion about the kind of algorithm Lucene is using.

Skarab
Here is an example of an application that uses a Lucene index internally: https://tech.lds.org/wiki/index.php/Gospel_Library_for_Android. So it should be possible for you to use it. If Lucene is not performing well, you can start to hack ;) .
Skarab
+1  A: 

A major problem with one-word-per-line storage is that there is no random access for lines in constant time (accessing line X consists in counting X newline characters from the beginning of the file) so your binary search would suffer.

What you need in this specific (auto-complete) situation is a Prefix Tree or a variation of it (combining several nodes into one, or turning sub-trees smaller than a certain size into plain old sorted list of words).

Victor Nicollet
Well, that's why he said each line would be the same length. Move forward 20 lines == move forward (20 x line-length) characters... However, I'm in the trie camp.
romacafe
A: 

I could also do something like this (below is a preprocessed file):

aa - line 1
ab - line 17
.
.
zz - line 299819 

If the user inputs something starting with "aa", I would read lines 1-17 and search them sequentially.

fhucho
+4  A: 

What you're looking for is called a trie:

http://forums.sun.com/thread.jspa?threadID=5295936

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
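A minimal trie sketch in Java, extended to cover the question's Update 2: each entry is also inserted under every word-starting suffix (so "trafalgar square" is findable via "squa"). The class shape and the suffix-insertion trick are my assumptions, not part of the quoted definition:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Prefix tree whose terminal nodes store the full original entry.
public class Trie {
    private final TreeMap<Character, Trie> children = new TreeMap<>();
    private String word; // full entry stored at terminal nodes

    // Insert the entry once per word-starting suffix, so lookups match
    // word-starting substrings ("squa" -> "trafalgar square").
    void insert(String entry) {
        String[] words = entry.split(" ");
        for (int i = 0; i < words.length; i++) {
            String suffix = String.join(" ",
                    java.util.Arrays.copyOfRange(words, i, words.length));
            insertKey(suffix, entry);
        }
    }

    private void insertKey(String key, String entry) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.word = entry;
    }

    // All entries reachable under the given prefix.
    List<String> complete(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        List<String> out = new ArrayList<>();
        node.collect(out);
        return out;
    }

    private void collect(List<String> out) {
        if (word != null) out.add(word);
        for (Trie child : children.values()) child.collect(out);
    }
}
```

Note the memory trade-off: inserting every word-starting suffix multiplies the node count, which is why the compact representations discussed below matter on a phone.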

Ryan
Because you're developing for mobile, you might be concerned about space. (A true digital trie can be fairly space-intensive) A (very old) article about a nice compromise is http://www.drdobbs.com/184410528, but the links to the illustrations no longer work :(
romacafe
...Perhaps a better article can be found here: http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html. Still old, though!
romacafe
I finally decided to use a trie instead of binary search, because I need to search word-starting substrings too.
fhucho
+1  A: 

100ms is plenty of time. The biggest worry would be the display updates, I'd think.

If you're wanting to avoid an actual database, this is easy enough to do with a simple index file in addition to your main file.

You could store the first N bytes (4 maybe?) of the string and a file offset into the main file in an index entry every 32 records or so, and binary search across that. You could then linearly search through up to 32 records after the binary search got you close.

You can tune the index frequency from 32 records to whatever makes sense given your average string length and the size of a single read on your media. If you had 512-byte filesystem reads and 8-byte average strings, you'd do an index entry every 64 records, etc. There's not much point in having more than one index record per minimum disk read size.
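A sketch of building and querying such a sparse index (the 4-byte key width and every-32-records spacing come from the answer; the class and method names are my assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Every 32nd record contributes one (key prefix, byte offset) pair;
// a binary search over this small index narrows the scan in the main
// file to at most 32 records.
public class SparseIndex {
    static final int EVERY = 32;

    static class Entry {
        final String key;   // first few chars of the record
        final long offset;  // byte offset into the main file
        Entry(String key, long offset) { this.key = key; this.offset = offset; }
    }

    static List<Entry> build(List<String> sortedRecords) {
        List<Entry> index = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < sortedRecords.size(); i++) {
            String rec = sortedRecords.get(i);
            if (i % EVERY == 0) {
                index.add(new Entry(rec.substring(0, Math.min(4, rec.length())), offset));
            }
            offset += rec.length() + 1; // +1 for '\n'
        }
        return index;
    }

    // Byte offset at which to start the linear scan for `prefix`:
    // the last index entry whose key sorts <= the prefix.
    static long seekOffset(List<Entry> index, String prefix) {
        int lo = 0, hi = index.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).key.compareTo(prefix) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo == 0 ? 0 : index.get(lo - 1).offset;
    }
}
```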

The index file could be generated easily, and you could then manage the main file with a simple text editor.

darron
I will probably do something like that. The only problem is that if I wanted an ability to search for "middle words" (user types "squar" -> "Trafalgar square" appears in autocomplete dropdown), the main file would have to be much bigger.
fhucho
@fhucho: Excellent point. I suppose if you want substring matches your index is going to be something like a disk-based TRIE and pretty huge. My first-pass idea on how to do that would need like N! indexes per string of length N. Does anyone else know a good way to do substring autocomplete efficiently?
darron
Just to clarify, I want to search only word-starting substrings. So "quare" doesn't match "Trafalgar square", but "squa" does. I may also first show lines that start with the entered string, and then perform a slower search that matches word-starting substrings.
fhucho
+1  A: 

A trie is the obvious answer, and already mentioned, but additionally the tr13 library might be what you are looking for. It is garbage-collector friendly (a single raw byte array or byte buffer), compact, and definitely fast enough for your case. Keys are typically UTF-8 strings, although they can be any byte sequences. Values likewise, although there is also an alternative for variable-length ints (vints), used to get very compact String-to-int lookups (especially for smallish sets of ints).

StaxMan