views: 148

answers: 1

Hello,

My question in a nutshell: Does anyone know of a TwitterAnalyzer or TwitterTokenizer for Lucene?

More detailed version:

I want to index a number of tweets in Lucene and keep terms like @user or #hashtag intact. StandardTokenizer does not work because it discards the punctuation (though it does other useful things, like keeping domain names and email addresses intact and recognizing acronyms). How can I get an analyzer that does everything StandardTokenizer does but leaves terms like @user and #hashtag untouched?

My current solution is to preprocess the tweet text before feeding it into the analyzer, replacing those characters with other alphanumeric strings. For example:

String newText = tweetText.replaceAll("#", "hashtag");  // "#lucene" -> "hashtaglucene"
newText = newText.replaceAll("@", "addresstag");        // "@user"   -> "addresstaguser"

Unfortunately this method breaks legitimate email addresses but I can live with that. Does that approach make sense?

Thanks in advance!

Amaç

+2  A: 

StandardAnalyzer basically passes your text through a StandardTokenizer and then a StandardFilter (which removes various characters from the standard tokens, such as the 's at the ends of words), followed by a LowerCaseFilter (to lowercase the tokens) and finally a StopFilter. That last one removes insignificant words like "as", "in", "for", etc.

An easy way to get started is to implement your own analyzer that performs the same steps as StandardAnalyzer but uses a WhitespaceTokenizer as the first item in the chain that processes the input stream.
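
A minimal sketch of such an analyzer, assuming the Lucene 2.9/3.0-era API (TwitterAnalyzer is just a placeholder name; StandardFilter is omitted because it keys off StandardTokenizer's token types):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class TwitterAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Whitespace-only splitting leaves @user and #hashtag intact
        TokenStream stream = new WhitespaceTokenizer(reader);
        stream = new LowerCaseFilter(stream);
        // Remove stop words ("as", "in", "for", ...), keeping position increments
        stream = new StopFilter(true, stream, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        return stream;
    }
}

The trade-off, as the comments below point out, is losing StandardTokenizer's special handling of host names, email addresses, and acronyms.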

For more details on the inner workings of the analyzers, you can have a look over here.

Thomas
Thanks. I already tried implementing my own Analyzer by using WhitespaceTokenizer instead of StandardTokenizer, but that leaves host names, email addresses, and some other things unrecognized and tokenized erroneously. I would like to process the stream with my custom TwitterTokenizer (which handles @s and #s and does nothing else) and then feed the resulting stream into a StandardTokenizer and go on from there. However, as far as I understand, an Analyzer can have only one Tokenizer at the beginning of the chain.
Amaç Herdağdelen
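
One way around the single-Tokenizer constraint is a CharFilter, which rewrites the character stream before any tokenizer sees it. A minimal sketch with MappingCharFilter, again assuming the Lucene 2.9/3.0-era API (MappedStandardAnalyzer is a placeholder name, and like the string-replace approach this still rewrites the @ inside email addresses):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharReader;
import org.apache.lucene.analysis.MappingCharFilter;
import org.apache.lucene.analysis.NormalizeCharMap;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public class MappedStandardAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Map '#' and '@' to alphanumeric markers before tokenization,
        // so "#tag" reaches StandardTokenizer as "hashtagtag"
        NormalizeCharMap map = new NormalizeCharMap();
        map.add("#", "hashtag");
        map.add("@", "addresstag");
        Reader mapped = new MappingCharFilter(map, CharReader.get(reader));
        return new StandardTokenizer(Version.LUCENE_30, mapped);
    }
}
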
Another approach could be to use PerFieldAnalyzerWrapper and make a second pass through the content to explicitly look for hash tags and user references and put them in separate fields of your document (e.g. 'tags' and 'replies'). The analyzers for those fields then only return tokens for occurrences of #tag and @user, respectively. (A rough wiring sketch follows at the end of this thread.)
Thomas
Yeah, that makes sense. Thanks!
Amaç Herdağdelen
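
A rough wiring sketch of the PerFieldAnalyzerWrapper idea, once more against the Lucene 2.9/3.0-era API; PrefixOnlyAnalyzer and buildAnalyzer are hypothetical names, and the prefix analyzer here simply splits on whitespace and keeps only #- or @-prefixed tokens:

import java.io.IOException;
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

// Hypothetical analyzer: emits only whitespace-delimited tokens
// that start with the given prefix character ('#' or '@')
class PrefixOnlyAnalyzer extends Analyzer {
    private final char prefix;

    PrefixOnlyAnalyzer(char prefix) { this.prefix = prefix; }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new TokenFilter(new WhitespaceTokenizer(reader)) {
            private final TermAttribute termAtt = addAttribute(TermAttribute.class);

            @Override
            public boolean incrementToken() throws IOException {
                while (input.incrementToken()) {
                    // Keep "#tag" / "@user" tokens, skip everything else
                    if (termAtt.termLength() > 1 && termAtt.termBuffer()[0] == prefix) {
                        return true;
                    }
                }
                return false;
            }
        };
    }
}

public class TweetAnalyzers {
    public static Analyzer buildAnalyzer() {
        // Route 'tags' and 'replies' to the prefix analyzers;
        // all other fields still go through StandardAnalyzer
        PerFieldAnalyzerWrapper wrapper =
            new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        wrapper.addAnalyzer("tags", new PrefixOnlyAnalyzer('#'));
        wrapper.addAnalyzer("replies", new PrefixOnlyAnalyzer('@'));
        return wrapper;
    }
}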