What are the main differences between search engines (DtSearch, Lucene.net, Sphinx, Google, etc.) that should influence the decision as to which one to use to search proprietary data?

The data to be searched consists of presentation-free data that is marked up with metadata in the form of name/value pairs. We're not interested in the format-parsing abilities of the various tools. Also, the search results need to be well-structured, presentation-free data that is amenable to aggregation with search results from other (similarly structured) repositories.

Some relevant search engine characteristics that need to inform the decision are listed below. Further suggestions or descriptions of experiences are welcome.

• Cost
• Ease of use
• Can be configured to return specific tags only
• Can 'identify' specific terms and give results containing them a higher weighting
• Fast: < 0.3 seconds to return search results or %E6 records/documents
• Supports tags with types (find weather='sunny' but not personality='sunny')
• Supports weightings to give relevancy ranking
• Returns results in ranked order by relevancy
• Supports synonyms
• Supports stemming
• Supports stop words
• Supports spelling corrections
• Amenable to parallelisation of index building (if index based)
• Fast to reindex (if index based)
• Fast to update index (if index based)
• Combines results from multiple indexes (if index based)
• Proximity checks: gives higher relevance to words found close together
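To make two of these requirements concrete (typed tags, and combining ranked results from several indexes), here is a minimal in-memory sketch in Python. It is not how any of the engines named above work internally; the names `Hit`, `search`, `merge`, the sample documents and the weight values are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    doc_id: str
    score: float
    source: str  # which index the hit came from


def search(index, source, field, value, weights):
    """Return hits whose tag `field` equals `value`.

    Matching is per field, so weather='sunny' does not match
    personality='sunny'. The score is simply the field's weight
    (1.0 if the field has no configured weight).
    """
    hits = []
    for doc_id, tags in index.items():
        if tags.get(field) == value:
            hits.append(Hit(doc_id, weights.get(field, 1.0), source))
    return hits


def merge(*result_lists):
    """Combine hits from several indexes, highest score first."""
    combined = [hit for hits in result_lists for hit in hits]
    return sorted(combined, key=lambda h: (-h.score, h.doc_id))


# Hypothetical sample data: two separate repositories of name/value pairs.
index_a = {
    "a1": {"weather": "sunny", "city": "Leeds"},
    "a2": {"personality": "sunny"},
}
index_b = {
    "b1": {"weather": "sunny"},
    "b2": {"weather": "rainy"},
}
weights = {"weather": 2.0}  # assumed weighting, for illustration only

results = merge(
    search(index_a, "A", "weather", "sunny", weights),
    search(index_b, "B", "weather", "sunny", weights),
)
for hit in results:
    print(hit.doc_id, hit.score, hit.source)
```

Because the results are plain structured records rather than rendered output, merging them with hits from another similarly structured repository is just a sort over the combined list.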

+1  A: 

In relation to relevancy, the Google Search Appliance allows a little tweaking. They believe that allowing too much tweaking will give poor relevancy, and I do believe that Google knows relevancy.

It is unlikely that users will find a search engine other than Google easier to use.

Liam
+2  A: 

I like Solr with the DataImportHandler. It supports most of your bullet points, and is not too difficult to set up, as long as you don't mind editing some XML configuration files. It's easier than many enterprise class search engines.

There is nothing wrong with GSA (Google Search Appliance), but for the amount of control that you desire, Solr is a better option.
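As a rough illustration of how a couple of the question's requirements (field-scoped matching and returning specific fields only) map onto a Solr request, the sketch below builds a query URL using Solr's standard `q`, `fl`, `rows` and `wt` parameters. The host, port and core name (`mycore`) and the helper `build_solr_query` are assumptions for the example, not something from this thread, and the code only constructs the URL without contacting a server.

```python
from urllib.parse import urlencode


def build_solr_query(base_url, field, value, return_fields, rows=10):
    """Build a Solr /select URL for a field-scoped phrase query.

    q  : restricts the match to one field, e.g. weather:"sunny"
    fl : comma-separated list of fields to return
    wt : response format
    """
    params = {
        "q": f'{field}:"{value}"',
        "fl": ",".join(return_fields),
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical Solr location and core name.
url = build_solr_query(
    "http://localhost:8983/solr/mycore",
    "weather",
    "sunny",
    ["id", "weather", "score"],
)
print(url)
```

Weightings and proximity can be expressed in the same query string with the Lucene query syntax, for example `weather:sunny^2` for a boost or `"heavy rain"~5` for a proximity match.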

Lucene/Solr

Geordie
A: 

You can't go wrong with Solr. It is the most flexible, reliable, powerful and scalable information retrieval platform I know of. And it is free!!!

I used DtSearch for many months before moving to Solr. DtSearch is very limited in terms of configuration, features and schema options. If you plan to do faceting, look no further than Solr. DtSearch faceting is slow.

David