Hello

I have a StandardAnalyzer setup that retrieves words and their frequencies from a single document, using a TermVectorMapper that populates a HashMap.

But suppose I use the following text as a field in my document:

addDoc(w, "lucene Lawton-Browne Lucene");

The word frequencies returned in the HashMap are:

browne 1 lucene 2 lawton 1

The problem is the words ‘lawton’ and ‘browne’. If this is an actual ‘double-barrelled’ name, can Lucene recognise it as ‘Lawton-Browne’, so that the name is treated as a single word?

I’ve tried combinations of:

addDoc(w, "lucene \"Lawton-Browne\" Lucene");

and the same with single quotes, but without success.

Thanks

Mr Morgan.

A: 

Escape the special characters.

See the Lucene documentation here:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters

Aaron Saunders
This might work in the query parser syntax, where the character is escaped, but in my example, using addDoc(w, "lucene Lawton\\-Browne Lucene"); the output remains unchanged. I've tried a WhitespaceAnalyzer, which gives me the name as one word, but it doesn't count duplicates of the same word that differ in case (e.g. "lucene" and "Lucene") as one word.
Mr Morgan
I believe a WhitespaceAnalyzer should work fine. Can you please post some more code around the addDoc in order to clarify this?
Yuval F
A WhitespaceAnalyzer seems to give me what I want if I convert all the tokens to lower case before the call to the analyzer itself. But this analyzer doesn't support stop words, which is a bit of a bind.
Mr Morgan
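For illustration, the behaviour being asked for in this thread (split on whitespace only, lower-case each token, drop stop words, then count) can be sketched in plain Java. This is not Lucene code and does not use any Lucene API; the class name TermFrequencySketch, the method countTerms, and the stop-word list are all hypothetical. In Lucene itself, the corresponding approach would probably be a custom Analyzer that chains a WhitespaceTokenizer with a LowerCaseFilter and a StopFilter, though the exact constructors depend on the Lucene version.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java sketch (no Lucene involved); all names here are hypothetical.
class TermFrequencySketch {

    // Splits on whitespace only, so "Lawton-Browne" stays a single token,
    // lower-cases each token, skips stop words, and counts what remains.
    static Map<String, Integer> countTerms(String text, Set<String> stopWords) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String raw : text.trim().split("\\s+")) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (term.isEmpty() || stopWords.contains(term)) {
                continue; // ignore empty input and stop words
            }
            freq.merge(term, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Assumed stop-word list, purely for the example.
        Set<String> stopWords = Set.of("a", "an", "the", "of");
        System.out.println(countTerms("lucene Lawton-Browne Lucene", stopWords));
    }
}
```

With the example text from the question, this keeps the hyphenated name as one term and counts "lucene" and "Lucene" together.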