views: 148

answers: 1

Hello,

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (though it does other useful things, like keeping domain names and email addresses intact and recognizing acronyms). How can I get an analyzer that does everything StandardTokenizer does but leaves terms like @user and #hashtag untouched?

My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing those characters with other alphanumeric strings. For example:

String newText = tweetText.replaceAll("#", "hashtag");  // "#lucene" -> "hashtaglucene"
newText = newText.replaceAll("@", "addresstag");        // "@user"   -> "addresstaguser"

Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?

Thanks in advance!

Amaç

+2  A: 

StandardAnalyzer basically passes your text through a StandardTokenizer and then a StandardFilter (which removes various characters from the standard tokens, such as the 's at the ends of words), followed by a LowerCaseFilter (to lowercase the tokens) and finally a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

An easy way to get started is to implement your own analyzer that performs the same steps as StandardAnalyzer but uses a WhitespaceTokenizer as the first item in the chain that processes the input stream.
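
A minimal sketch of such an analyzer, assuming the Lucene 2.9/3.0-era API (TwitterAnalyzer is just a placeholder name; StandardFilter is omitted because it keys off StandardTokenizer's token types):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class TwitterAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace-only splitting leaves @user and #hashtag intact
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Remove stop words ("as", "in", "for", ...), keeping position increments
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return stream;
    }
}

The trade-off, as the comments below point out, is losing StandardTokenizer's special handling of host names, email addresses, and acronyms.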

For more details on the inner workings of the analyzers, you can have a look over here.

Thomas
Thanks. I already tried implementing my own Analyzer by using WhitespaceTokenizer instead of StandardTokenizer, but that leaves host names, email addresses, and some other things unrecognized and tokenized erroneously. I would like to process the stream with my custom TwitterTokenizer (which handles @s and #s and does nothing else) and then feed the resulting stream into a StandardTokenizer and go on from there. However, as far as I understand, an Analyzer can have only one Tokenizer at the beginning of the chain.
Amaç Herdağdelen
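
One way around the single-Tokenizer constraint is a CharFilter, which rewrites the character stream before any tokenizer sees it. A minimal sketch with MappingCharFilter, again assuming the Lucene 2.9/3.0-era API (MappedStandardAnalyzer is a placeholder name, and like the string-replace approach this still rewrites the @ inside email addresses):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MappedStandardAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Map '#' and '@' to alphanumeric markers before tokenization,
        // so "#tag" reaches StandardTokenizer as "hashtagtag"
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("#", "hashtag");
        map.add("@", "addresstag");
        Reader mapped = new MappingCharFilter(map, CharReader.get(reader));
        return new StandardTokenizer(Version.LUCENE_30, mapped);
    }
}
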
Another approach could be to use PerFieldAnalyzerWrapper and make a second pass through the content to explicitly look for hash tags and user references and put them in separate fields of your document (e.g. 'tags' and 'replies'). The analyzers for those fields then only return tokens for occurrences of #tag and @user, respectively. (A rough wiring sketch follows at the end of this thread.)
Thomas
Yeah, that makes sense. Thanks!
Amaç Herdağdelen
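
A rough wiring sketch of the PerFieldAnalyzerWrapper idea, once more against the Lucene 2.9/3.0-era API; PrefixOnlyAnalyzer and buildAnalyzer are hypothetical names, and the prefix analyzer here simply splits on whitespace and keeps only #- or @-prefixed tokens:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: emits only whitespace-delimited tokens
// that start with the given prefix character ('#' or '@')
class PrefixOnlyAnalyzer extends Analyzer {
    private final char prefix;

    PrefixOnlyAnalyzer(char prefix) { this.prefix = prefix; }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenFilter(new WhitespaceTokenizer(reader)) {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);

            @Override
            public boolean incrementToken() throws IOException {
                while (input.incrementToken()) {
                    // Keep "#tag" / "@user" tokens, skip everything else
                    if (termAtt.termLength() > 1 && termAtt.termBuffer()[0] == prefix) {
                        return true;
                    }
                }
                return false;
            }
        };
    }
}

public class TweetAnalyzers {
    public static Analyzer buildAnalyzer() {
        // Route 'tags' and 'replies' to the prefix analyzers;
        // all other fields still go through StandardAnalyzer
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        wrapper.addAnalyzer("tags", new PrefixOnlyAnalyzer('#'));
        wrapper.addAnalyzer("replies", new PrefixOnlyAnalyzer('@'));
        return wrapper;
    }
}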