Hello!

I wonder how systems for full-text search are implemented so they can query millions of entries very fast. Please note: I'm not talking about systems which tokenize the content by splitting it at whitespace, but about systems which can match even parts from the middle of tokens (which is a real challenge).

Background information
I experimented with a home-made string cache (written in Java) which is able to search for strings, given a substring as the query. The substring is not required to be at the beginning of the retrieved strings.

It works on a huge array of strings. Caching is done using a
TreeMap<Character,TreeSet<String>>.

Adding entry
For each unique character in the to-be-added string:
Get the set for that character, and add the string to it.

Example: "test" is first split into its unique characters "t", "e", "s".
Then we retrieve the sets for those three keys and add "test" to each of them. (A sketch of the whole scheme follows the Querying section below.)

Querying
Querying is done by splitting the query into its unique characters, retrieving a Set<String> for each character, building the intersection of all those sets, and finally filtering the intersection using contains() to ensure the correct order of the query characters.
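A minimal sketch of this scheme, reconstructed from the description above (class and method names are made up; this is not the exact original code):

    import java.util.*;

    class CharIndex {
        private final TreeMap<Character, TreeSet<String>> index = new TreeMap<>();

        // Adding: register the string under each of its unique characters.
        // (The TreeSet deduplicates, so iterating all characters is fine.)
        void add(String s) {
            for (char c : s.toCharArray()) {
                index.computeIfAbsent(c, k -> new TreeSet<>()).add(s);
            }
        }

        // Querying: intersect the posting sets of all query characters,
        // then filter with contains() to enforce the character order.
        List<String> query(String q) {
            if (q.isEmpty()) return Collections.emptyList();
            Set<String> candidates = null;
            for (char c : q.toCharArray()) {
                Set<String> set = index.get(c);
                if (set == null) return Collections.emptyList();
                if (candidates == null) {
                    candidates = new HashSet<>(set);
                } else {
                    candidates.retainAll(set);
                }
            }
            List<String> result = new ArrayList<>();
            for (String s : candidates) {
                if (s.contains(q)) result.add(s);
            }
            return result;
        }
    }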

Benchmark
On a 3 GHz machine, I added 2,000,000 strings with an average length of 10 and random content.
I then ran 100 queries. Timings: min 0.4 s, average 0.5 s, max 0.6 s.
1.5 GB of memory were consumed.

A: 

You may want to look at Lucene. But I think that, in general, such systems do tokenize the input text: maybe not just at whitespace, but also into shorter in-word sequences. I do not think one-character tokens are feasible.

For oriental languages (where there is no whitespace), bi-grams are often used; these are two-character sequences. The main differences from English are that two characters often already form a word, and that the base character set to draw from is much bigger, so a single bi-gram carries a lot of information and there are many more unique bi-grams.
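For illustration, extracting bi-grams is just a sliding window of width two (a minimal sketch; the method name is made up):

    import java.util.*;

    class Bigrams {
        // Slide a window of width 2 over the text; each window is one bi-gram.
        static List<String> bigrams(String text) {
            List<String> result = new ArrayList<>();
            for (int i = 0; i + 2 <= text.length(); i++) {
                result.add(text.substring(i, i + 2));
            }
            return result;
        }

        public static void main(String[] args) {
            // Both bi-grams of this three-character string are themselves words.
            System.out.println(bigrams("東京都")); // [東京, 京都]
        }
    }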

Thilo
+1  A: 

One way to do it is to store the sorting permutation of all tails of your text (a tail being the text from a certain point up to the end).

Then, to find a substring, you binary-search for it among those sorted tails. Using 32-bit ints, the memory needed would be 4 bytes per original character.
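This structure is known as a suffix array. A naive sketch of the idea (names are made up, and the construction here is the simple slow one; real implementations build the array in O(n log n) or even O(n) time):

    import java.util.*;

    class SuffixArraySearch {
        private final String text;
        // Sorted start positions of all tails. A primitive int[] would give
        // the 4-bytes-per-character figure; Integer[] is used here only so
        // Arrays.sort can take a comparator.
        private final Integer[] sa;

        SuffixArraySearch(String text) {
            this.text = text;
            sa = new Integer[text.length()];
            for (int i = 0; i < sa.length; i++) sa[i] = i;
            // Naive build: compare whole tails. Fine for a sketch,
            // far too slow for millions of entries.
            Arrays.sort(sa, (a, b) -> text.substring(a).compareTo(text.substring(b)));
        }

        // Any tail that starts with q marks an occurrence of q.
        boolean contains(String q) {
            int lo = 0, hi = sa.length - 1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;
                String tail = text.substring(sa[mid]);
                if (tail.startsWith(q)) return true;
                if (tail.compareTo(q) < 0) lo = mid + 1; else hi = mid - 1;
            }
            return false;
        }
    }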

P.S.: I heard there is a way to accomplish a similar thing by storing the Burrows-Wheeler transform of the text (1 character per original character), but I can't seem to find any references to it.

yairchu
Permutations! THAT'S a great idea! I'll just split each "word" into characters, sort them, and store the result as a map key (such a result would be the "initial permutation", or "permutation 0"). Querying would be done similarly!
ivan_ivanovich_ivanoff
Oh, sorry, my proposal with permutations would not allow searching for partial character sequences :(
ivan_ivanovich_ivanoff
+1  A: 

I implemented such a system, for one of those suggest drop-downs on a website, using n-gram indexing, 3-grams in particular. You split a word into its constituent n-grams; for the word "hello" you would get "hel", "ell", "llo". Then you build an index with the n-grams as keys and the words they occur in as the values. (I used a trie for speed; memory was a lesser concern.)

Next, for a given query, you break it up into n-grams by the same process as during indexing and perform a lookup on each n-gram, getting a list of possible matches. From that list you select the words that had the highest number of matched n-grams.

There are various heuristics you can use as well. One is that matches at the beginning of a word are usually more important, so you can pad all words with a leading '$'.
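A condensed sketch of this scheme (my own names; a plain HashMap stands in for the trie mentioned above):

    import java.util.*;

    class TrigramIndex {
        private static final int N = 3;
        private final Map<String, Set<String>> index = new HashMap<>();

        // Pad with a leading '$' so the first n-gram marks the word start.
        private static List<String> ngrams(String word) {
            String padded = "$" + word;
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + N <= padded.length(); i++) {
                grams.add(padded.substring(i, i + N));
            }
            return grams;
        }

        void add(String word) {
            for (String g : ngrams(word)) {
                index.computeIfAbsent(g, k -> new HashSet<>()).add(word);
            }
        }

        // Rank candidates by how many query n-grams they share. Because the
        // query is padded too, a match at the start of a word scores one
        // extra hit, implementing the "beginning matters more" heuristic.
        List<String> query(String q) {
            Map<String, Integer> hits = new HashMap<>();
            for (String g : ngrams(q)) {
                for (String word : index.getOrDefault(g, Collections.emptySet())) {
                    hits.merge(word, 1, Integer::sum);
                }
            }
            List<String> ranked = new ArrayList<>(hits.keySet());
            ranked.sort((a, b) -> hits.get(b) - hits.get(a));
            return ranked;
        }
    }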

eulerfx