Hi,

If I want Lucene to preserve the dots in acronyms (for example, U.K, U.S.A.), which analyzer do I need to use, and how? I also want to supply a set of stop words to Lucene while doing this.

A: 

StandardTokenizer preserves the dots that occur between letters. You can use StandardAnalyzer, which uses StandardTokenizer, or you could create your own analyzer built on StandardTokenizer.

Correction: StandardAnalyzer will not help, because it applies StandardFilter, which removes the dots from acronyms. Instead, construct your own analyzer from StandardTokenizer plus whatever additional filters you need (such as a lowercase filter), leaving out StandardFilter.

Shashikant Kore
Thanks for your comments. FYI, I'm already using StandardAnalyzer in my code: protected readonly StandardAnalyzer _analyzer = new StandardAnalyzer(stop_words); but it removes the dots from acronyms.
Jimmy
A: 

A WhitespaceAnalyzer will preserve the dots, and a StopFilter removes a list of stop words. You should define exactly the analysis you need, and then combine analyzers and token filters to achieve it, or write your own analyzer.
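To make the intended analysis concrete, here is a minimal sketch in plain Java of what a whitespace-splitting step followed by a stop-word step produces. It deliberately does not use Lucene's API: the class name AcronymSketch and the method analyze are hypothetical, and the lowercasing step is an assumption added so that stop words match regardless of case (whitespace tokenization on its own does not change case). It only illustrates why dots survive this kind of pipeline.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only: mimics "split on whitespace, lowercase,
// drop stop words" without touching any Lucene classes.
class AcronymSketch {

    // Returns the tokens of `text`, split on whitespace only, lowercased,
    // with any token found in `stopWords` removed. Dots are never stripped
    // because whitespace is the only delimiter.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input yields a single empty piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and", "are"));
        // Prints: [u.s.a., u.k, allies]
        System.out.println(analyze("The U.S.A. and the U.K are allies", stopWords));
    }
}
```

One caveat with splitting on whitespace alone: any punctuation attached to a word stays attached, so "U.K," (with a trailing comma) would be kept as a different token from "U.K". If that matters, the analysis needs an extra step that trims such characters while leaving the inner dots alone.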

Yuval F