What are the main differences between search engines (dtSearch, Lucene.net, Sphinx, Google, etc.) that should influence the decision as to which one to use for searching proprietary data?
The data to be searched is presentation-free and marked up with metadata in the form of name/value pairs, so we’re not interested in the various tools’ format-parsing abilities. The search results also need to be well-structured, presentation-free data that can be aggregated with search results from other (similarly structured) repositories.
Some relevant search engine characteristics that need to inform the decision are listed below. Further suggestions or descriptions of experiences are welcome.
• Cost
• Ease of use
• Can be configured to return specific tags only
• Can ‘identify’ specific terms and give results containing them a higher weighting
• Fast: < 0.3 seconds to return search results or %E6 records/documents
• Supports tags with types (find weather=’sunny’ but not personality=’sunny’)
• Supports weightings to give relevancy ranking
• Returns results in ranked order by relevancy
• Supports synonyms
• Supports stemming
• Supports stop words
• Supports spelling corrections
• Amenable to parallelisation of index building (if index based)
• Fast to reindex (if index based)
• Fast to update index (if index based)
• Can combine results from multiple indexes (if index based)
• Proximity checks: gives higher relevance to words found close together
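
To make a few of these requirements concrete (typed tags, returning specific tags only, and combining ranked results from multiple indexes), here is a minimal Python sketch of the behaviour we are after. It is not tied to any of the engines above; the `search` and `merge` functions, the sample records, and the term-frequency score are all hypothetical and exist only for illustration.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]          # one document: tag name -> value
Hit = Tuple[float, Record]       # (relevance score, returned tags)

def search(index: List[Record], tag: str, term: str,
           return_tags: List[str]) -> List[Hit]:
    """Match `term` only inside the named tag and return only `return_tags`."""
    hits = []
    for rec in index:
        words = rec.get(tag, "").lower().split()
        count = words.count(term.lower())
        if count:
            # Toy relevance score (an assumption): fraction of words that match.
            score = count / len(words)
            hits.append((score, {t: rec[t] for t in return_tags if t in rec}))
    return hits

def merge(*result_sets: List[Hit]) -> List[Hit]:
    """Combine hits from several indexes into one list, best score first."""
    merged = [hit for result_set in result_sets for hit in result_set]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)

# Two hypothetical repositories with the same name/value structure.
index_a = [
    {"id": "a1", "weather": "sunny", "personality": "gloomy"},
    {"id": "a2", "weather": "rainy", "personality": "sunny"},
]
index_b = [
    {"id": "b1", "weather": "sunny with sunny spells", "personality": "calm"},
]

results = merge(
    search(index_a, "weather", "sunny", ["id", "weather"]),
    search(index_b, "weather", "sunny", ["id", "weather"]),
)
for score, record in results:
    print(score, record)
```

Record a2 is not returned, because ‘sunny’ appears only in its personality tag, and the personality tag is absent from every result. A real engine would score relevance in its own way, and scores from separately built indexes may not be directly comparable, so the simple sort in `merge` is an assumption rather than a recommendation.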