Hello!
I wonder how systems for full-text search are implemented so that they can query millions of entries very fast. Please note: I'm not talking about systems that tokenize content by splitting it at whitespace, but about systems that can match even parts from the middle of tokens (which is a real challenge).
Background information
I experimented with a home-made string cache (written in Java) which searches
for strings, given a substring as the query. The substring is not required
to be at the beginning of the retrieved strings.
It works on a huge array of strings.
Caching is done using a TreeMap<Character,TreeSet<String>>.
Adding an entry
For each unique character in the string being added:
Get the set for that character, and add the string to it.
Example: "test" contains the unique characters "t", "e", and "s".
We retrieve the sets for those three keys and add "test" to each of them.
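A minimal sketch of what this index and the add step might look like in Java; the class name SubstringIndex and the field/method names are my own hypothetical choices, not the original code:

    import java.util.TreeMap;
    import java.util.TreeSet;

    class SubstringIndex {
        // One TreeSet of strings per character; a string becomes a member of the
        // set for every distinct character it contains.
        final TreeMap<Character, TreeSet<String>> index = new TreeMap<>();

        void add(String s) {
            for (char c : s.toCharArray()) {
                // TreeSet ignores duplicate adds, so repeated characters are harmless.
                index.computeIfAbsent(c, k -> new TreeSet<>()).add(s);
            }
        }
    }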
Querying
Querying is done by splitting the query into its unique characters,
retrieving a Set<String> for every character, building the intersection of
all those sets, and finally searching the intersection using contains()
to ensure the correct order of the query characters.
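A corresponding sketch of the query step, again with hypothetical names, taking the map built above as a parameter. retainAll() computes the set intersection, and the final contains() pass enforces that the query characters appear contiguously and in order:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;
    import java.util.TreeMap;
    import java.util.TreeSet;

    class SubstringQuery {
        // 'index' is the TreeMap built by the add step above.
        static List<String> query(TreeMap<Character, TreeSet<String>> index, String q) {
            if (q.isEmpty()) {
                return Collections.emptyList();
            }
            Set<String> candidates = null;
            for (char c : q.toCharArray()) {
                TreeSet<String> set = index.get(c);
                if (set == null) {
                    return Collections.emptyList(); // a query character was never indexed
                }
                if (candidates == null) {
                    candidates = new HashSet<>(set); // copy, so retainAll() cannot mutate the index
                } else {
                    candidates.retainAll(set); // set intersection
                }
            }
            // The intersection only guarantees each character occurs somewhere in the
            // string; contains() enforces contiguity and the correct order.
            List<String> result = new ArrayList<>();
            for (String s : candidates) {
                if (s.contains(q)) {
                    result.add(s);
                }
            }
            return result;
        }
    }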
Benchmark
On a 3 GHz machine, I added 2,000,000 strings with an average length
of 10 and random content.
I ran 100 queries. They took: min 0.4 s, average 0.5 s, max 0.6 s.
The index consumed 1.5 GB of memory.