ansaurus

Question

How to structure an index for type ahead for extremely large dataset using Lucene or similar?

Answer 1

+3 A:

If each record is relatively small (less than a few words) you can try a Trie data structure:

http://en.wikipedia.org/wiki/Trie

It is built for lightening fast prefix matching and it is relatively space efficient. I've used this data structure for the exact auto-complete functionality you're looking for, and I know of others that have done this for high volume production websites. In my experience you can expect response times of tens of milliseconds for a single query.

You can implement a Trie yourself pretty easily, or there are implementations you can download. See

http://stackoverflow.com/questions/623892/where-do-i-find-a-standard-trie-based-map-implementation-in-java

Depending on what implementation you use, it should be relatively simple to tag each indexed record with a relevance score, which you can then use to sort by when you get a list of records from a query.

bajafresh4life 2010-05-05 02:40:00

Answer 2

+1 A:

You might not need anything too fancy. Your "must have" list can be met by a simple database engine (e.g. BerkeleyDB or ESENT). Put all the words into a table and then use seeks to look up the words.

A b-tree with 8kb pages should get at least 250 strings/page, which would result in 1M leaf pages, giving a b-tree of height 3. Even with a 5400 RPM laptop disk, I/O latency is less than 15ms so, in the worst case, you will be able to get results in ~50ms in the worst case (completely uncached data and a slow drive).

(I built a typeahead application that uses the ESENT-based PersistentDictionary class. With 200K records I get a ~35ms response for the first lookup, where the data isn't cached at all. After doing a bunch of queries the response time drops to ~5ms).

To support a lot of concurrent users you can either add more cache or faster disks. Completely caching all the data is probably possible (8GB of RAM is quite affordable these days) and the typeahead data will certainly be small enough to fit on an SSD, which would provide a ridiculous number of IOPS. I might go for the SSD because that will give great performance even when the cache is cold (e.g. after a restart).

A database-engine based solution should be extremely quick to build.

Laurion Burchall 2010-05-06 00:12:33

Answer 3

+1 A:

Here is how we do it in SOLR:

The key to the search if to have the correct datatype with the appropriate filter factories.

Setup a datatype in your schema called textPrefix

Example:

<!--
 This type is used for type ahead style searching and starts with searching. 
-->
−
<fieldType name="textPrefix" class="solr.TextField" positionIncrementGap="1" sortMissingLast="true">
−
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- Remove non alpha-numeric characters -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<!-- Remove leading "the "-->
<filter class="solr.PatternReplaceFilterFactory" pattern="^the\s" replacement="" replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="6"/>
</analyzer>
−
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- Remove non alpha-numeric characters -->
<filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
<filter class="solr.TrimFilterFactory"/>
<!-- Remove leading "the "-->
<filter class="solr.PatternReplaceFilterFactory" pattern="^the\s" replacement="" replace="all"/>
<filter class="solr.TrimFilterFactory"/>
</analyzer>
</fieldType>

Then in your schema document create a new data field as such:

<field name="CustomerNamePrefix" type="textPrefix" indexed="true" stored="false"/>

Then store a copy of the Customer Name into this CustomerNamePrefix field.

Now when you query against this field you can simply use the first letters of the name and it will give you best match for those letters. Depending on how you do your query you could boost results based on other factors inside you document.

Example:

http://solrserver:8080/solr/select/?q=CustomerNamePrefix:jame&amp;q.alt=*:*&amp;start=0&amp;rows=10&amp;mm=1

JamesR 2010-09-24 00:17:13

ansaurus

tags:

views:

answers:

How to structure an index for type ahead for extremely large dataset using Lucene or similar?

related questions