views:

287

answers:

10

In my Android app I want to have an input field with autocomplete. The number of items will be about 300000. The best solution seems to be to put the items into a file (on the sdcard), one item per line; each line would have the same number of characters so that I can seek to a specific line number. If the user enters something in the text field, I would binary search the file (via RandomAccessFile) and show suggestions.
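The fixed-width scheme described above can be sketched roughly like this (the record width, padding convention, and class/method names are my assumptions, not actual code from the app):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Sketch of prefix binary search over a file of fixed-width, sorted records.
// RECORD_LEN (data + '\n') is assumed; shorter entries are space-padded.
public class FixedWidthSearch {
    static final int RECORD_LEN = 16; // 15 chars + newline, hypothetical

    static String readRecord(RandomAccessFile f, long index) throws IOException {
        byte[] buf = new byte[RECORD_LEN - 1];
        f.seek(index * RECORD_LEN); // constant-time jump to any line
        f.readFully(buf);
        return new String(buf, StandardCharsets.US_ASCII).trim();
    }

    // Returns the index of the first record >= prefix, or the record
    // count if every record sorts before the prefix.
    static long lowerBound(RandomAccessFile f, String prefix) throws IOException {
        long lo = 0, hi = f.length() / RECORD_LEN;
        while (lo < hi) {
            long mid = (lo + hi) >>> 1;
            if (readRecord(f, mid).compareTo(prefix) < 0) lo = mid + 1;
            else hi = mid;
        }
        return lo;
    }
}
```

From the returned index you would scan forward, collecting records until one no longer starts with the prefix.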

I want the autocomplete to be super fast (ideally under 100ms, but I guess that's impossible). What optimizations can I do?

Update 1: I will convert the user's input to lowercase English characters (a-z) with spaces. So 'A/b' would be converted to 'a b' and then searched.
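The normalization in Update 1 could look something like this (a minimal sketch; the class name and the choice to collapse runs of non-letters into one space are my assumptions):

```java
// Maps input to lowercase a-z, turning every other character into a
// single space, so 'A/b' becomes "a b" as described in Update 1.
public class Normalizer {
    static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        boolean lastWasSpace = true; // collapse runs of non-letters
        for (char c : input.toLowerCase().toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sb.append(c);
                lastWasSpace = false;
            } else if (!lastWasSpace) {
                sb.append(' ');
                lastWasSpace = true;
            }
        }
        return sb.toString().trim();
    }
}
```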

Update 2: I now realize I need one additional thing: to search for word-starting substrings.

+5  A: 

Why don't you just use a SQLite DB rather than a text file?
I don't think you can do anything better speed-wise than a portable database in your situation.

Razor
Internally, SQLite also uses binary search. But SQLite stores data in a btree, which means less file access operations are required than when using a sorted flat file. And the SQLite database is most likely smaller than a fixed size flat file.
Thomas Mueller
I want a feature to download additional data which would make using a db more complicated. And would it be really faster?
fhucho
I'm pretty sure it would be a LOT faster but you won't know until you test.
Falmarri
@fhucho Adding more data to a DB rather than a text file is hardly more complicated. Also bear in mind that you'd have to implement **and optimize** the search logic yourself if you don't use a DB.
Razor
@Razor Adding 300000 rows to a database would be very slow and would slow down the user's phone for some time. That would result in a bad user experience.
fhucho
@fhucho why can't you just distribute your app with the DB insertions *already* done? SQLite DBs consist of a single file, populate it on your PC or something, and add the binary ready made DB to the files necessary to your program.
Razor
@Razor my mistake, I thought I couldn't use a downloaded file as a db. But I will use a trie because I need to search word-starting substrings (I updated the question). Thanks for the help anyway...
fhucho
Worth noting that with SQL you can search the beginning, end, middle, or any part of a word. With the solution you're headed for, if the user looks for "angeles", it wouldn't find "los angeles" in your list. Might not be something you are interested in, though.
Razor
+1  A: 

check this out http://en.wikipedia.org/wiki/Binary_search_algorithm

In a sorted file you have a binary search worst case of O(log n). The next best thing would be some sort of hash mapping, which is O(1), although that is complicated for partial words and would produce a huge mapping table.

Joe Hopfgartner
+2  A: 

One strategy could be to narrow the results using the RandomAccessFile and binary search. Then, once the set of possible entries is small enough, load that portion into memory and do an in-memory search.

This will improve performance, because as people type you can quickly search the same portion of the file that you have loaded in memory.
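A sketch of this "narrow on disk, then refine in memory" idea (the class shape, the null-return convention for a cache miss, and all names are my assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Once binary search has reduced the candidate range below some
// threshold, keep that slice cached and refine it as the user types.
public class SuggestionCache {
    private final List<String> slice;   // candidates loaded from the file
    private final String loadedPrefix;  // prefix the slice was loaded for

    SuggestionCache(List<String> slice, String prefix) {
        this.slice = slice;
        this.loadedPrefix = prefix;
    }

    // Cheap refinement: no disk access as long as the new input still
    // starts with the loaded prefix. Returns null when the caller must
    // go back to the file and reload a new slice.
    List<String> suggest(String input) {
        if (!input.startsWith(loadedPrefix)) return null;
        List<String> out = new ArrayList<>();
        for (String s : slice) {
            if (s.startsWith(input)) out.add(s);
        }
        return out;
    }
}
```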

jjnguy
Good idea, thanks.
fhucho
+1  A: 

Preprocess your possibilities into a search tree ahead of time, instead of doing it at runtime.

Thorbjørn Ravn Andersen
+1  A: 

I would suggest seeing if you can use a standard library for this purpose. Maybe Apache Lucene can be used on Android phones. If so, you can build an index (word prefix -> the id of a word in the Android SQLite database). Here is a discussion about the kind of algorithm Lucene is using.

Skarab
Here is an example of an application that uses a Lucene index internally: https://tech.lds.org/wiki/index.php/Gospel_Library_for_Android. So it should be possible for you to use it. If Lucene is not performing well, you can start to hack ;) .
Skarab
+1  A: 

A major problem with one-word-per-line storage is that there is no random access for lines in constant time (accessing line X consists in counting X newline characters from the beginning of the file) so your binary search would suffer.

What you need in this specific (auto-complete) situation is a Prefix Tree or a variation of it (combining several nodes into one, or turning sub-trees smaller than a certain size into plain old sorted list of words).

Victor Nicollet
Well, that's why he said each line would be the same length. Move forward 20 lines == move forward (20 x line-length) characters... However, I'm in the trie camp.
romacafe
A: 

I could also do something like this (below is a preprocessed file):

aa - line 1
ab - line 17
.
.
zz - line 299819 

If the user inputs something starting with "aa", I would read lines 1-17 and search them sequentially.

fhucho
+4  A: 

What you're looking for is called a trie:

http://forums.sun.com/thread.jspa?threadID=5295936

In computer science, a trie, or prefix tree, is an ordered tree data structure that is used to store an associative array where the keys are usually strings. Unlike a binary search tree, no node in the tree stores the key associated with that node; instead, its position in the tree shows what key it is associated with. All the descendants of a node have a common prefix of the string associated with that node, and the root is associated with the empty string. Values are normally not associated with every node, only with leaves and some inner nodes that correspond to keys of interest.
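A minimal trie sketch in Java, extended to cover the question's Update 2: each entry is also inserted under every word-starting suffix (so "trafalgar square" is findable via "squa"). The class shape and the suffix-insertion trick are my assumptions, not part of the quoted definition:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Prefix tree whose terminal nodes store the full original entry.
public class Trie {
    private final TreeMap<Character, Trie> children = new TreeMap<>();
    private String word; // full entry stored at terminal nodes

    // Insert the entry once per word-starting suffix, so lookups match
    // word-starting substrings ("squa" -> "trafalgar square").
    void insert(String entry) {
        String[] words = entry.split(" ");
        for (int i = 0; i < words.length; i++) {
            String suffix = String.join(" ",
                    java.util.Arrays.copyOfRange(words, i, words.length));
            insertKey(suffix, entry);
        }
    }

    private void insertKey(String key, String entry) {
        Trie node = this;
        for (char c : key.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Trie());
        }
        node.word = entry;
    }

    // All entries reachable under the given prefix.
    List<String> complete(String prefix) {
        Trie node = this;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>();
        }
        List<String> out = new ArrayList<>();
        node.collect(out);
        return out;
    }

    private void collect(List<String> out) {
        if (word != null) out.add(word);
        for (Trie child : children.values()) child.collect(out);
    }
}
```

Note the memory trade-off: inserting every word-starting suffix multiplies the node count, which is why the compact representations discussed below matter on a phone.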

Ryan
Because you're developing for mobile, you might be concerned about space. (A true digital trie can be fairly space-intensive) A (very old) article about a nice compromise is http://www.drdobbs.com/184410528, but the links to the illustrations no longer work :(
romacafe
...Perhaps a better article can be found here: http://www.javaworld.com/javaworld/jw-02-2001/jw-0216-ternary.html. Still old, though!
romacafe
I finally decided to use a trie instead of binary search, because I need to search word-starting substrings too.
fhucho
+1  A: 

100ms is plenty of time. The biggest worry would be the display updates, I'd think.

If you're wanting to avoid an actual database, this is easy enough to do with a simple index file in addition to your main file.

You could store the first N bytes (4 maybe?) of the string and a file offset into the main file in an index entry every 32 records or so, and binary search across that. You could then linearly search through up to 32 records after the binary search got you close.

You can tune the index frequency from 32 records to whatever makes sense given your average string length and the size of a single read on your media. If you had 512-byte filesystem reads and 8-byte average strings, you'd do an index entry every 64 records, etc. There's not much point in having more than one index record per minimum disk read size.
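A sketch of building and querying such a sparse index (the 4-byte key width and every-32-records spacing come from the answer; the class and method names are my assumptions):

```java
import java.util.ArrayList;
import java.util.List;

// Every 32nd record contributes one (key prefix, byte offset) pair;
// a binary search over this small index narrows the scan in the main
// file to at most 32 records.
public class SparseIndex {
    static final int EVERY = 32;

    static class Entry {
        final String key;   // first few chars of the record
        final long offset;  // byte offset into the main file
        Entry(String key, long offset) { this.key = key; this.offset = offset; }
    }

    static List<Entry> build(List<String> sortedRecords) {
        List<Entry> index = new ArrayList<>();
        long offset = 0;
        for (int i = 0; i < sortedRecords.size(); i++) {
            String rec = sortedRecords.get(i);
            if (i % EVERY == 0) {
                index.add(new Entry(rec.substring(0, Math.min(4, rec.length())), offset));
            }
            offset += rec.length() + 1; // +1 for '\n'
        }
        return index;
    }

    // Byte offset at which to start the linear scan for `prefix`:
    // the last index entry whose key sorts <= the prefix.
    static long seekOffset(List<Entry> index, String prefix) {
        int lo = 0, hi = index.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (index.get(mid).key.compareTo(prefix) <= 0) lo = mid + 1;
            else hi = mid;
        }
        return lo == 0 ? 0 : index.get(lo - 1).offset;
    }
}
```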

The index file could be generated easily, and you could then manage the main file with a simple text editor.

darron
I will probably do something like that. The only problem is that if I wanted an ability to search for "middle words" (user types "squar" -> "Trafalgar square" appears in autocomplete dropdown), the main file would have to be much bigger.
fhucho
@fhucho: Excellent point. I suppose if you want substring matches your index is going to be something like a disk-based TRIE and pretty huge. My first-pass idea on how to do that would need like N! indexes per string of length N. Does anyone else know a good way to do substring autocomplete efficiently?
darron
Just to clarify, I want to search only word-starting substrings. So "quare" doesn't match "Trafalgar square", but "squa" does. I may also first show lines that start with the entered string, and then perform a slower search that matches word-starting substrings.
fhucho
+1  A: 

A trie is the obvious answer, and already mentioned, but additionally the tr13 library might be what you are looking for. It is garbage-collector friendly (a single raw byte array or byte buffer), compact, and definitely fast enough for your case. Keys are typically UTF-8 strings, although they can be any byte sequences. Values likewise, although there is also an alternative for variable-length ints (vints), used to get very compact String-to-int lookups (especially for smallish sets of ints).

StaxMan