I have been reading the second edition of Lucene in Action. It gives an example of highlighting, but unfortunately that example requires the original text so that it can get the positions of the terms. The highlighter is the one in contrib, which implies it is the sponsored or official highlighter.

Does anyone know of another highlighter that does not require the original text but works from the term positions in the index? (Sorry if I got the terminology wrong.)

A: 

Both the standard highlighter and FastVectorHighlighter can use the index if you store term vectors (with positions and offsets). In fact, FVH can only use the index. You can see an example of this on page 274 of Lucene in Action. The relevant line of code is:

TokenStream stream = TokenSources.getAnyTokenStream(searcher.getIndexReader(), sd.doc, "title", doc, analyzer);

That will get the token stream from the index.

Xodarap
Thinking about it a bit more, I think my original question is flawed: not all terms are stored (e.g. stop words), so it is not possible to rebuild an accurate original fragment for highlighting purposes. Is this a correct assumption?
mP
If you analyze your text in a way that removes stop words, then yes, stop words will be removed. What I do is have two copy fields: one that is indexed but not stored, and one that is stored but not indexed. The indexed one is stemmed, etc. The stored one just uses a whitespace tokenizer. This actually takes the same amount of space as a single stored+indexed field, and it gets around the issue you described about stop words being removed.
Xodarap
Which basically amounts to storing the original text in full alongside the analyzed form. Thanks for the tips.
mP
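
The stop-word point raised in the comments can be illustrated with a small toy sketch. It is plain Java and does not use the Lucene API: the `Token` class, the `analyze` method, and the stop-word list are all hypothetical stand-ins for what an analyzer and term vectors with offsets would provide. The sketch assumes tokens carry character offsets into the source text, and compares highlighting against the stored text with rebuilding a fragment from the tokens alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OffsetHighlightSketch {

    /** Toy stand-in for an indexed token: the term plus its character offsets in the source text. */
    static final class Token {
        final String term;
        final int start;
        final int end;

        Token(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Hypothetical stop-word list, for illustration only.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a", "of", "in"));

    /** Toy analyzer: splits on whitespace, lowercases, drops stop words, records offsets. */
    static List<Token> analyze(String text) {
        List<Token> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) {
                String term = text.substring(start, i).toLowerCase();
                if (!STOP_WORDS.contains(term)) tokens.add(new Token(term, start, i));
            }
        }
        return tokens;
    }

    /** Highlights matches by using token offsets to cut into the stored text. */
    static String highlight(String storedText, List<Token> tokens, String query) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        for (Token t : tokens) {
            if (t.term.equals(query)) {
                out.append(storedText, last, t.start);
                out.append("<b>").append(storedText, t.start, t.end).append("</b>");
                last = t.end;
            }
        }
        out.append(storedText, last, storedText.length());
        return out.toString();
    }

    /** Rebuilds a fragment from the tokens alone; removed stop words and original casing are lost. */
    static String rebuildFromTokens(List<Token> tokens, String query) {
        StringBuilder out = new StringBuilder();
        for (Token t : tokens) {
            if (out.length() > 0) out.append(' ');
            out.append(t.term.equals(query) ? "<b>" + t.term + "</b>" : t.term);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "The art of Lucene in Action";
        List<Token> tokens = analyze(text);
        System.out.println(highlight(text, tokens, "lucene"));
        System.out.println(rebuildFromTokens(tokens, "lucene"));
    }
}
```

With the stored text available, the first call yields `The art of <b>Lucene</b> in Action`. Rebuilding from the tokens alone yields `art <b>lucene</b> action`, which is why keeping a stored copy of the field alongside the analyzed one is the practical choice here.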