Lucene stop phrases filter

tags:

lucene


I'm trying to write a filter for Lucene, similar to StopWordsFilter (thus extending TokenFilter), but I need to remove phrases (sequences of tokens) instead of single words.

The "stop phrases" are themselves represented as sequences of tokens; punctuation is not considered.

I think I need to do some kind of buffering of the tokens in the token stream, and when a full phrase is matched, I discard all tokens in the buffer.

What would be the best approach to implement a "stop phrases" filter given a stream of words like Lucene's TokenStream?

You'll really have to write your own Analyzer, I should think, since whether some sequence of words is a "phrase" depends on cues, such as punctuation, that are not available after tokenization.

Jonathan Feinberg 2009-10-07 15:49:57

Actually, punctuation can be discarded: I need to match phrases that can themselves be described as sequences of word tokens.

Enrico Detoma 2009-10-07 15:54:39

Please edit your question to make this clear.

Jonathan Feinberg 2009-10-07 15:59:42

In this thread I was given a solution: use Lucene's CachingTokenFilter as a starting point.

That solution was actually the right way to go.

Enrico Detoma 2009-10-15 22:46:01
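
For illustration only, the buffering idea described in the question can be sketched in plain Java, independent of any Lucene class. This is not the code from the thread mentioned above, and it does not use the TokenFilter or CachingTokenFilter API. The class name StopPhraseSketch and its methods are hypothetical, tokens are assumed to be plain strings, and stop phrases are assumed to be matched exactly, with the longest match winning. The sketch keeps a look-ahead buffer as long as the longest stop phrase, drops any stop phrase found at the head of the buffer, and otherwise emits the head token.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

/** Hypothetical sketch: plain Java over strings, not a Lucene TokenFilter. */
class StopPhraseSketch implements Iterator<String> {
    private final Iterator<String> input;
    private final List<List<String>> stopPhrases;
    private final int maxLen;                       // length of the longest stop phrase
    private final Deque<String> buffer = new ArrayDeque<>();

    StopPhraseSketch(Iterator<String> input, List<List<String>> stopPhrases) {
        this.input = input;
        this.stopPhrases = stopPhrases;
        int max = 0;
        for (List<String> phrase : stopPhrases) {
            max = Math.max(max, phrase.size());
        }
        this.maxLen = max;
    }

    /** Reads ahead until the buffer holds maxLen tokens or the input ends. */
    private void fill() {
        while (buffer.size() < maxLen && input.hasNext()) {
            buffer.addLast(input.next());
        }
    }

    /** Returns the length of the longest stop phrase at the head of the buffer, or 0. */
    private int matchAtHead() {
        int best = 0;
        for (List<String> phrase : stopPhrases) {
            if (phrase.isEmpty() || phrase.size() > buffer.size() || phrase.size() <= best) {
                continue;
            }
            Iterator<String> it = buffer.iterator();
            boolean matches = true;
            for (String word : phrase) {
                if (!word.equals(it.next())) {
                    matches = false;
                    break;
                }
            }
            if (matches) {
                best = phrase.size();
            }
        }
        return best;
    }

    /** Discards every stop phrase that starts at the head of the buffer. */
    private void skipStopPhrases() {
        fill();
        int n;
        while ((n = matchAtHead()) > 0) {
            for (int i = 0; i < n; i++) {
                buffer.removeFirst();
            }
            fill();
        }
    }

    @Override
    public boolean hasNext() {
        skipStopPhrases();
        return !buffer.isEmpty();
    }

    @Override
    public String next() {
        skipStopPhrases();
        return buffer.removeFirst();   // throws NoSuchElementException when exhausted
    }

    /** Convenience helper: filters a whole token list at once. */
    static List<String> filter(List<String> tokens, List<List<String>> stopPhrases) {
        List<String> out = new ArrayList<>();
        StopPhraseSketch it = new StopPhraseSketch(tokens.iterator(), stopPhrases);
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}
```

A real TokenFilter would additionally have to carry each token's attributes (offsets, position increments) through the buffer and decide how to adjust position increments for removed phrases; the sketch ignores those concerns.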
