I'm trying to write a filter for Lucene, similar to StopFilter (thus extending TokenFilter), but I need to remove phrases (sequences of tokens) instead of single words.

The "stop phrases" are themselves represented as sequences of tokens; punctuation is not considered.

I think I need to do some kind of buffering of the tokens in the token stream, and when a full phrase is matched, I discard all tokens in the buffer.

What would be the best approach to implement a "stop phrases" filter given a stream of words like Lucene's TokenStream?

A: 

You'll really have to write your own Analyzer, I should think, since whether or not some sequence of words is a "phrase" is dependent on cues, such as punctuation, that are not available after tokenization.

Jonathan Feinberg
Actually, punctuation can be discarded: I need to match phrases that can themselves be described as sequences of word tokens.
Enrico Detoma
Please edit your question to make this clear.
Jonathan Feinberg
A: 

In this thread I was given a solution: use Lucene's CachingTokenFilter as a starting point.

That solution was actually the right way to go.
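
For illustration, here is a minimal sketch of the buffering idea from the question, written in plain Java over an Iterator<String> rather than against Lucene's API. The class and method names (StopPhraseSketch, removeStopPhrases, longestMatch) are hypothetical, and the longest-match-first policy is an assumption, not something the thread prescribes. A real TokenFilter would have to apply the same lookahead to token attributes instead of plain strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;

// Hypothetical helper, not part of Lucene: shows only the lookahead-buffer logic.
class StopPhraseSketch {

    /** Returns the tokens with every occurrence of a stop phrase removed. */
    static List<String> removeStopPhrases(Iterator<String> tokens,
                                          List<List<String>> stopPhrases) {
        // The window must hold the longest phrase, and at least one token.
        int window = 1;
        for (List<String> phrase : stopPhrases) {
            window = Math.max(window, phrase.size());
        }

        List<String> out = new ArrayList<>();
        LinkedList<String> buffer = new LinkedList<>(); // lookahead buffer
        while (true) {
            // Refill the buffer from the underlying stream.
            while (buffer.size() < window && tokens.hasNext()) {
                buffer.add(tokens.next());
            }
            if (buffer.isEmpty()) {
                break; // stream exhausted and nothing left to emit
            }
            int matched = longestMatch(buffer, stopPhrases);
            if (matched > 0) {
                // A full phrase starts at the head of the buffer: discard it.
                for (int i = 0; i < matched; i++) {
                    buffer.removeFirst();
                }
            } else {
                // No phrase starts here: emit one token and slide the window.
                out.add(buffer.removeFirst());
            }
        }
        return out;
    }

    /** Length of the longest stop phrase that is a prefix of the buffer, or 0. */
    static int longestMatch(List<String> buffer, List<List<String>> stopPhrases) {
        int best = 0;
        for (List<String> phrase : stopPhrases) {
            int n = phrase.size();
            if (n > best && n <= buffer.size() && buffer.subList(0, n).equals(phrase)) {
                best = n;
            }
        }
        return best;
    }
}
```

Tokens are only emitted once no stop phrase can start at the head of the buffer, so a partially matched phrase is never lost: its tokens simply come out one at a time as the window slides.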

Enrico Detoma