I'm trying to write a filter for Lucene, similar to StopWordsFilter (thus implementing TokenFilter), but I need to remove phrases (sequence of tokens) instead of words.
The "stop phrases" are represented themselves as a sequence of tokens: punctuation is not considered.
I think I need to do some kind of buffering of the tokens in the token stream, and when a full phrase is matched, I discard all tokens in the buffer.
What would be the best approach to implements a "stop phrases" filter given a stream of words like Lucene's TokenStream?