views:

41

answers:

2

I want to index a "compound word" like "New York" as a single term in Lucene, not as the two terms "new" and "york", so that if someone searches for "new place", documents containing "new york" won't match.

I don't think n-grams (i.e. NGramTokenizer) are what I need, because I don't want to index every n-gram, only some specific compounds.

I've done some research and I know I should write my own Analyzer, and maybe my own Tokenizer, but I'm a bit lost when it comes to extending TokenStream/TokenFilter/Tokenizer.

Thanks

+1  A: 

I presume you have some way of detecting the multi-word units (MWUs) that you want to preserve. Then what you can do is replace the whitespace inside them with an underscore and use a WhitespaceAnalyzer instead of a StandardAnalyzer (which throws away punctuation), perhaps combined with a LowerCaseFilter.
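A minimal sketch of that approach against the Lucene 3.0 API (the MwuAnalyzer name, the MWU list, and the joinMwus helper are illustrative, not part of Lucene):

    import java.io.Reader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.WhitespaceTokenizer;

    // Whitespace tokenization plus lowercasing, so "new_york" survives as one token.
    public class MwuAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            return new LowerCaseFilter(new WhitespaceTokenizer(reader));
        }

        // Hypothetical pre-processing helper: join each known MWU with
        // underscores before the text reaches the analyzer.
        public static String joinMwus(String text, Iterable<String> mwus) {
            for (String mwu : mwus) {
                text = text.replace(mwu, mwu.replace(' ', '_'));
            }
            return text;
        }
    }

Remember to run the same underscore substitution over query strings as well, otherwise a query for "new york" will never match the indexed term "new_york".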

Writing your own Tokenizer requires quite a bit of Lucene black magic. I've never been able to wrap my head around the Lucene 2.9+ APIs, but check out the TokenStream docs if you really want to try.
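For the curious, here is a rough sketch of what such a filter could look like under the 2.9/3.0 attribute-based API, assuming you have a known set of two-word MWUs (the class name, the mwus set, and the merging logic are all illustrative; offsets of merged tokens are not adjusted):

    import java.io.IOException;
    import java.util.Set;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.util.AttributeSource;

    // Merges known two-word units ("new york") into single terms ("new_york").
    public final class MwuMergeFilter extends TokenFilter {
        private final TermAttribute termAtt;
        private final Set<String> mwus;            // e.g. contains "new york"
        private AttributeSource.State pending;     // buffered lookahead token

        public MwuMergeFilter(TokenStream input, Set<String> mwus) {
            super(input);
            this.termAtt = addAttribute(TermAttribute.class);
            this.mwus = mwus;
        }

        @Override
        public boolean incrementToken() throws IOException {
            // Start from the buffered token if the last call looked ahead.
            if (pending != null) {
                restoreState(pending);
                pending = null;
            } else if (!input.incrementToken()) {
                return false;
            }
            String first = termAtt.term();
            AttributeSource.State firstState = captureState();

            // Peek at the next token to see whether the pair is a known MWU.
            if (input.incrementToken()) {
                String second = termAtt.term();
                if (mwus.contains(first + " " + second)) {
                    restoreState(firstState);
                    termAtt.setTermBuffer(first + "_" + second);
                    return true;                   // emit the merged term
                }
                pending = captureState();          // not an MWU: save for later
            }
            restoreState(firstState);
            return true;                           // emit the first token alone
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            pending = null;
        }
    }

You would chain this after a WhitespaceTokenizer and LowerCaseFilter, so the MWU set only needs lowercase entries.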

larsmans
A: 

I did it by creating a field that is indexed but not analyzed, using Field.Index.NOT_ANALYZED:

    doc.add(new Field("fieldName", "value", Field.Store.YES,
                      Field.Index.NOT_ANALYZED, Field.TermVector.YES));

That way the field value bypasses the StandardAnalyzer entirely.

I worked on Lucene 3.0.2.
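One thing to watch with this setup (a sketch assuming the field above): since the field is never analyzed, a QueryParser run through StandardAnalyzer would lowercase and split the query text and never match, so query the field with an exact TermQuery instead:

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ExactMatchExample {
        // The indexed term is the exact, un-analyzed string, so match it verbatim.
        public static Query exactQuery(String value) {
            return new TermQuery(new Term("fieldName", value));
        }
    }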

Jakub
But that way each field would contain only one compound word, right? Since the field value isn't split into parts, Lucene will treat the whole field as a single term, right?
Felipe Hummel
That's true, the field value is not split into parts. Given the String "one two three" as the value, it will be stored as one token. For me that doesn't matter, because I store the entities extracted by LingPipe: one entity - one term.
Jakub