I've been using the (Java) Highlighter for Lucene (in the Sandbox package) for some time. However, it isn't very accurate at matching the correct terms in search results. It works well for simple queries: for example, searching for two separate words will highlight both words in the result fragments.

It doesn't cope well with more complicated queries, though. Even in the simplest case, a phrase query such as "Stack Overflow" will highlight every occurrence of Stack or Overflow rather than just the phrase, which gives the user the impression that the search isn't working very well.
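To make the difference concrete, here is a minimal sketch in plain Java (no Lucene involved) contrasting term-based highlighting with phrase-aware highlighting. The class name, the method names, the whitespace tokenizer and the `<b>` markers are all assumptions made for illustration; this is not how the Sandbox Highlighter is actually implemented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of Lucene.
class PhraseHighlightSketch {

    // Term-based: wraps every token that equals any query term, wherever it occurs.
    public static String highlightTerms(String text, Set<String> terms) {
        String[] tokens = text.split(" ");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) out.append(' ');
            if (terms.contains(tokens[i].toLowerCase())) {
                out.append("<b>").append(tokens[i]).append("</b>");
            } else {
                out.append(tokens[i]);
            }
        }
        return out.toString();
    }

    // Phrase-aware: wraps only runs of adjacent tokens that match the whole phrase in order.
    public static String highlightPhrase(String text, List<String> phrase) {
        if (phrase.isEmpty()) return text;
        String[] tokens = text.split(" ");
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < tokens.length) {
            if (i > 0) out.append(' ');
            if (matchesAt(tokens, i, phrase)) {
                out.append("<b>");
                for (int j = 0; j < phrase.size(); j++) {
                    if (j > 0) out.append(' ');
                    out.append(tokens[i + j]);
                }
                out.append("</b>");
                i += phrase.size();
            } else {
                out.append(tokens[i]);
                i++;
            }
        }
        return out.toString();
    }

    // True if the phrase (already lower-cased) starts at token index 'start'.
    private static boolean matchesAt(String[] tokens, int start, List<String> phrase) {
        if (start + phrase.size() > tokens.length) return false;
        for (int j = 0; j < phrase.size(); j++) {
            if (!tokens[start + j].toLowerCase().equals(phrase.get(j))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String text = "a stack trace is not a Stack Overflow question";
        System.out.println(highlightTerms(text, new HashSet<>(Arrays.asList("stack", "overflow"))));
        System.out.println(highlightPhrase(text, Arrays.asList("stack", "overflow")));
    }
}
```

The first call marks the lone "stack" as well as "Stack" and "Overflow" separately, which is the behaviour I'm complaining about; the second marks only the adjacent pair.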

I tried applying the fix here, but it came with a lot of performance caveats and in the end was simply unusable. Performance is especially a problem with wildcard queries. This is due to the way the highlighting works: instead of working only on the query string and the text, it parses the query as Lucene would and then looks for all the matches that Lucene has made. Unfortunately, this means that for certain wildcard queries it can be looking for matches to 2000+ clauses on large documents, and it's simply not fast enough.
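As a rough illustration of where those clauses come from, the sketch below expands a trailing-wildcard query against a sorted set of indexed terms, one clause per matching term. It is plain Java with hypothetical names (`WildcardExpansionSketch`, `expandPrefix`) and a toy term list; it only mimics the idea of query rewriting and does not call Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration only; not part of Lucene.
class WildcardExpansionSketch {

    // Expands a query such as "high*" into every indexed term that starts with the prefix.
    // Each returned term stands for one clause the highlighter would then have to consider.
    public static List<String> expandPrefix(SortedSet<String> indexedTerms, String prefix) {
        // subSet is inclusive of 'prefix' and exclusive of the upper bound.
        return new ArrayList<>(indexedTerms.subSet(prefix, prefix + Character.MAX_VALUE));
    }

    public static void main(String[] args) {
        SortedSet<String> indexedTerms = new TreeSet<>(Arrays.asList(
                "hi", "high", "highlight", "highlighter", "highway", "low"));
        List<String> clauses = expandPrefix(indexedTerms, "high");
        System.out.println(clauses.size() + " clauses: " + clauses);
    }
}
```

With a real index the term list is far larger than this toy one, so a short prefix can easily expand to thousands of clauses.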

Is there a faster implementation of an accurate highlighter?

+1  A: 

You could look into using Solr. http://lucene.apache.org/solr

Solr is a sort of generic search application that uses Lucene and supports highlighting. It's possible that the highlighting in Solr is usable as an API outside of Solr. You could also look at how Solr does it for inspiration.

Sindri Traustason
Thanks, taking a look at Solr - I think I've always confused it with Nutch in the past and assumed they were the same thing, silly me. I notice in the Solr docs it seems to separate out a PhraseHighlighter and a standard Highlighter, so I'm not imbued with much confidence I'm afraid :(
Mat Mannion
Unfortunately, the Solr highlighter just delegates to the highlighter in the Lucene Sandbox - it doesn't do anything clever :(
Mat Mannion
+1  A: 

I've been reading up on the subject and came across SpanQuery, which returns the span (the start and end positions) of the matched term or terms in the field that matched.
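To show what a span gives you, here is a small sketch in plain Java that reports the token positions at which a phrase matches, as a start position plus an exclusive end position. The class and method names are made up for the example and the matching is a naive scan; it is only meant to illustrate the idea, not to reproduce Lucene's span API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class SpanSketch {

    // Returns {start, end} token positions for each place the phrase occurs; end is exclusive.
    public static List<int[]> findSpans(String[] tokens, List<String> phrase) {
        List<int[]> spans = new ArrayList<>();
        if (phrase.isEmpty()) return spans;
        for (int start = 0; start + phrase.size() <= tokens.length; start++) {
            boolean match = true;
            for (int j = 0; j < phrase.size(); j++) {
                if (!tokens[start + j].equalsIgnoreCase(phrase.get(j))) {
                    match = false;
                    break;
                }
            }
            if (match) spans.add(new int[] {start, start + phrase.size()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String[] tokens = "a stack trace is not a Stack Overflow question".split(" ");
        for (int[] span : findSpans(tokens, Arrays.asList("stack", "overflow"))) {
            System.out.println("match at tokens " + span[0] + " to " + span[1]);
        }
    }
}
```

A highlighter that works from positions like these can mark only the tokens inside a matching span and leave a stray "stack" elsewhere in the text alone.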

dlamblin
+2  A: 

There is a new, faster highlighter (it needs to be patched in for now, but it will be part of release 2.9):

https://issues.apache.org/jira/browse/LUCENE-1522

and a back-reference to this question

pro
Thanks for pointing that out Peter, I'll give that a go and see if it's usable for us.
Mat Mannion