views: 305

answers: 3

I have a dataset of 200 million+ records and am looking to build a dedicated backend to power a type-ahead solution. Lucene is of interest given its popularity and license type, but I'm open to other open-source suggestions as well. I am looking for advice, tales from the trenches, or, even better, direct instruction on what I will need in terms of hardware and software architecture. Requirements:

Must have:

  • The ability to do starts-with (prefix) matching (I type in 'st' and it should match 'Stephen').
  • The ability to return results very quickly; I'd say 500ms is an upper bound.

Nice to have:

  • The ability to feed relevance information into the indexing process so that, for example, more popular terms are returned ahead of others rather than in plain alphabetical order, Google-style.
  • In-word substring matching, so that, for example, 'st' would match 'bestseller'.

Note:

  • This index will purely be used for type ahead, and does not need to serve standard search queries.
  • I am not worried about getting advice on how to set up the front end or AJAX, as long as the index can be queried as a service or directly via Java code.

Upvotes for any useful information that gets me closer to an enterprise-level type-ahead solution.

+3  A: 

If each record is relatively small (less than a few words), you can try a Trie data structure:

http://en.wikipedia.org/wiki/Trie

It is built for lightning-fast prefix matching and it is relatively space-efficient. I've used this data structure for the exact auto-complete functionality you're looking for, and I know of others who have done the same for high-volume production websites. In my experience you can expect response times of tens of milliseconds for a single query.

You can implement a Trie yourself pretty easily, or there are implementations you can download. See

http://stackoverflow.com/questions/623892/where-do-i-find-a-standard-trie-based-map-implementation-in-java

Depending on which implementation you use, it should be relatively simple to tag each indexed record with a relevance score, which you can then sort by when you get a list of records from a query (see the sketch below).
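
A minimal sketch of the idea in Java (class and method names here are illustrative assumptions, not from any particular library): each node keeps its children in a map, completed words carry a relevance score, and a lookup walks to the prefix node and collects everything underneath, ranked by score.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PrefixTrie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        String word;  // non-null only if a complete word ends at this node
        int score;    // relevance score used to rank completions
    }

    private final Node root = new Node();

    public void add(String word, int score) {
        Node node = root;
        for (char c : word.toLowerCase().toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.word = word;
        node.score = score;
    }

    // Walk down to the prefix node, then collect every word beneath it,
    // most relevant first.
    public List<String> complete(String prefix, int limit) {
        Node node = root;
        for (char c : prefix.toLowerCase().toCharArray()) {
            node = node.children.get(c);
            if (node == null) return new ArrayList<>(); // no matches
        }
        List<Node> hits = new ArrayList<>();
        collect(node, hits);
        hits.sort(Comparator.comparingInt((Node n) -> n.score).reversed());
        List<String> out = new ArrayList<>();
        for (Node n : hits.subList(0, Math.min(limit, hits.size()))) {
            out.add(n.word);
        }
        return out;
    }

    private void collect(Node node, List<Node> hits) {
        if (node.word != null) hits.add(node);
        for (Node child : node.children.values()) collect(child, hits);
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        trie.add("Stephen", 90);
        trie.add("Steve", 75);
        trie.add("Stella", 10);
        System.out.println(trie.complete("st", 10)); // [Stephen, Steve, Stella]
    }
}

For 200M+ records you would want a more compact node representation and possibly precomputed top-N completions per node, but the lookup pattern stays the same.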

bajafresh4life
+1  A: 

You might not need anything too fancy. Your "must have" list can be met by a simple database engine (e.g. BerkeleyDB or ESENT). Put all the words into a table and then use seeks to look up the words.

A b-tree with 8KB pages should hold at least 250 strings per page, which for 200M records works out to roughly 1M leaf pages (200M ÷ 250 = 800K) and a b-tree of height 3. Even with a 5400 RPM laptop disk, I/O latency is under 15ms, so in the worst case (completely uncached data and a slow drive) you will be able to get results in ~50ms.

(I built a typeahead application that uses the ESENT-based PersistentDictionary class. With 200K records I get a ~35ms response for the first lookup, where the data isn't cached at all. After doing a bunch of queries the response time drops to ~5ms).

To support a lot of concurrent users you can either add more cache or faster disks. Completely caching all the data is probably possible (8GB of RAM is quite affordable these days) and the typeahead data will certainly be small enough to fit on an SSD, which would provide a ridiculous number of IOPS. I might go for the SSD because that will give great performance even when the cache is cold (e.g. after a restart).

A database-engine based solution should be extremely quick to build.
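
A minimal sketch of that seek pattern, using Java's in-memory TreeMap as a stand-in for the b-tree (names are illustrative; with BerkeleyDB or ESENT you would do the same seek-and-scan with a cursor):

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class SeekLookup {
    private final TreeMap<String, Integer> words = new TreeMap<>();

    public void add(String word, int popularity) {
        words.put(word.toLowerCase(), popularity);
    }

    // Seek to the first key >= prefix, then scan forward while keys still
    // start with the prefix -- the same access pattern a b-tree cursor
    // would use against pages on disk.
    public List<String> startsWith(String prefix, int limit) {
        String p = prefix.toLowerCase();
        List<String> out = new ArrayList<>();
        for (String key : words.tailMap(p).keySet()) {
            if (!key.startsWith(p) || out.size() >= limit) {
                break;
            }
            out.add(key);
        }
        return out;
    }

    public static void main(String[] args) {
        SeekLookup lookup = new SeekLookup();
        lookup.add("Stephen", 90);
        lookup.add("Steve", 75);
        lookup.add("Samantha", 40);
        System.out.println(lookup.startsWith("st", 10)); // [stephen, steve]
    }
}

The tailMap call is the "seek"; the forward scan stops as soon as a key no longer starts with the prefix, so each query only touches a handful of adjacent pages.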

Laurion Burchall
+1  A: 

Here is how we do it in Solr:

The key to the search is to have the correct field type with the appropriate filter factories.

Set up a field type in your schema called textPrefix.

Example:

<!--
  This type is used for type-ahead style searching and starts-with searching.
-->
<fieldType name="textPrefix" class="solr.TextField" positionIncrementGap="1" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove non-alphanumeric characters -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Remove leading "the " -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^the\s" replacement="" replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Index leading-edge n-grams: "stephen" is indexed as s, st, ste, step, steph, stephe -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="6"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Remove non-alphanumeric characters -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-zA-Z0-9 ]" replacement="" replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Remove leading "the " -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="^the\s" replacement="" replace="all"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>

Then, in your schema document, create a new field of this type:

<field name="CustomerNamePrefix" type="textPrefix" indexed="true" stored="false"/>

Then store a copy of the Customer Name into this CustomerNamePrefix field.
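
For example, assuming the source field is named CustomerName (an assumption; substitute your actual field name), a copyField directive in the schema handles this automatically at index time:

<copyField source="CustomerName" dest="CustomerNamePrefix"/>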

Now when you query against this field you can simply use the first letters of the name and it will return the best matches for those letters. Depending on how you construct your query, you could boost results based on other factors inside your document.

Example:

http://solrserver:8080/solr/select/?q=CustomerNamePrefix:jame&q.alt=*:*&start=0&rows=10&mm=1
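
Since the question asked for access directly via Java code, here is a rough SolrJ sketch of the same query (Solr 1.x-era client API; the URL and field name come from the example above, the rest is illustrative):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class TypeAheadClient {
    public static void main(String[] args) throws Exception {
        // Point the client at the Solr instance from the example URL.
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://solrserver:8080/solr");
        // The user's partial input, matched against the textPrefix field.
        SolrQuery query = new SolrQuery("CustomerNamePrefix:jame");
        query.setRows(10); // only the top suggestions are needed
        QueryResponse response = server.query(query);
        for (SolrDocument doc : response.getResults()) {
            System.out.println(doc);
        }
    }
}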
JamesR