Preserving dots of an acronym while indexing in Lucene

+2 Q:

Hi,

If I want Lucene to preserve the dots of acronyms (for example, U.K., U.S.A., etc.), which analyzer do I need to use, and how? I also want to supply a set of stop words to Lucene while doing this.

StandardTokenizer preserves the dots occurring between letters. You can use StandardAnalyzer, which uses StandardTokenizer, or you can create your own analyzer with StandardTokenizer.

Correction: StandardAnalyzer will not help, because it uses StandardFilter, which removes the dots from acronyms. Instead, construct your own analyzer from StandardTokenizer plus whatever additional filters you need (such as a lower-case filter), leaving out StandardFilter.

Shashikant Kore 2009-07-19 08:27:25

Thanks for your comments. FYI, I'm already using StandardAnalyzer in my code: protected readonly StandardAnalyzer _analyzer = new StandardAnalyzer(stop_words); but it removes the dots from acronyms.

Jimmy 2009-07-19 17:18:45

+1 A:

A WhitespaceAnalyzer will preserve the dots, because it splits the text on whitespace only. A StopFilter removes the words in a given stop-word list. You should define exactly the analysis you need, and then either combine existing analyzers and token filters to achieve it or write your own analyzer.
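
To make the effect of such a chain concrete, here is a minimal plain-Java sketch of what whitespace tokenization followed by lower-casing and stop-word removal does to the text. It deliberately does not call Lucene, since the analyzer API differs between versions; the class name AcronymAnalysisSketch and the method analyze are hypothetical names made up for this illustration, and the lower-casing step is an assumption borrowed from the filter suggestion in the other answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration only; this is NOT Lucene's API.
class AcronymAnalysisSketch {

    // Splits on whitespace, lower-cases each token, and drops stop words.
    // Dots inside or at the end of a token (as in "U.S.A.") are left untouched.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            String token = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "and"));
        // prints [u.s.a., u.k., signed]
        System.out.println(analyze("The U.S.A. and the U.K. signed", stopWords));
    }
}
```

Note that splitting on whitespace alone keeps all punctuation, so a word at the end of a sentence keeps its trailing full stop too. If that matters, the analysis needs an extra step that distinguishes acronyms from ordinary sentence-final words.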

Yuval F 2009-07-20 08:37:44
