I have a Lucene index that has several documents in it. Each document has multiple fields such as:

Id
Project
Name
Description

The Id field is a unique identifier such as a GUID; Project is the user's project ID, and a user can only view documents for their own project; Name and Description contain text that can include special characters.

When a user performs a search on the Name field, I want to match as well as I can. For example, searching for:

First

Will return both:

First.Last 

and

First.Middle.Last

Name can also be something like:

Test (NameTest)

where typing 'Test', 'Name', or '(NameTest)' should find the result.

However, if I say that Project is 'ProjectA', then that needs to be an exact match (case-insensitive). The same goes for the Id field.

Which fields should I set up as Tokenized and which as Untokenized? Also, is there a good Analyzer I should consider to make this happen?

I am stuck trying to decide the best route to implement the desired searching.

+1  A: 

Your Id field should be untokenized, for the simple reason that it doesn't look like it can be tokenized (tokenization is whitespace-based) unless you write your own tokenizer. All your other fields can be tokenized.

Perform a phrase query on the project name: look up PhraseQuery, or enclose your project name in double quotes, which will make it match exactly. Example: "\"My Fancy Project\""
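The backslash escaping is easy to get wrong in Java source, so here is a minimal sketch of building that quoted clause as a string for QueryParser; the field name `project` is an assumption:

```java
public class PhraseQueryString {
    // Wrap the project name in double quotes so QueryParser treats it
    // as a single phrase rather than separate terms. In Java source,
    // each embedded quote is written as \".
    static String exactProjectClause(String projectName) {
        return "project:\"" + projectName + "\"";
    }

    public static void main(String[] args) {
        // prints: project:"My Fancy Project"
        System.out.println(exactProjectClause("My Fancy Project"));
    }
}
```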

For the name field a simple query should work fine.

I'm not sure whether there are situations where you want a combination of fields. In that case, look up BooleanQuery, which lets you combine different queries with boolean operators.
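A combined query can also be sketched as the query-syntax string that QueryParser would parse into an equivalent BooleanQuery; the field names `name`, `description`, and `projectId` are assumptions here:

```java
public class ScopedSearch {
    // Build a query-syntax string equivalent to:
    //   (name contains term OR description contains term) AND projectId = id
    // The leading '+' marks a clause as required; the parenthesized pair
    // without '+' forms the optional (OR) part. QueryParser would parse
    // this into a BooleanQuery with the same structure.
    static String scopedQuery(String term, String projectId) {
        return "+(name:" + term + " description:" + term + ")"
             + " +projectId:" + projectId;
    }

    public static void main(String[] args) {
        // prints: +(name:test description:test) +projectId:3
        System.out.println(scopedQuery("test", "3"));
    }
}
```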

Mikos
I do plan on being able to do a boolean query across both Name and Description for something like 'test'. In that case, I want to return all documents that contain 'test' in either field. I would also like my queries scoped by a project Id. Example: (name or description contains 'test') AND project id = 3 (exact match). I presume project Id would be untokenized, and Name and Description would be tokenized using a standard analyzer. Would a standard BooleanQuery using the QueryParser class achieve my goal?
Brandon
Yes, the above should work. If your project id is just a number or some identifier (a "term" in Lucene terms), you can use a TermQuery.
Mikos
I followed what you said, but am running into a bit of a hiccup. When inserting a tokenized field, I escape the special characters. When performing the search using a QueryParser, I escape the search value before searching with a StandardAnalyzer. One problem: if I have two objects in my index whose names are 'Test' and 'Test (Test)' respectively, and I perform a search for 'Test (Test)' with the special characters escaped, I get back both objects. I know it is creating two terms, 'Test' and '\(Test\)', from my input, but it doesn't make sense to me why it matches both.
Brandon
I should add that I expected it to perform an 'AND' operation on the terms, matching only documents whose field value met all of the term criteria.
Brandon
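For reference, the escaping discussed above can be sketched in plain Java as a simplified mirror of Lucene's QueryParser.escape (this is a sketch, not the library method itself). One thing worth noting in this context: escaping only stops the parser from treating these characters as query syntax; an analyzer such as StandardAnalyzer may still strip the escaped punctuation when it tokenizes the text, so '\(Test\)' can still end up as the bare term 'test'.

```java
public class QueryEscape {
    // Prefix each character that has meaning in Lucene's query syntax
    // with a backslash, so the parser reads it as literal text.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ("\\+-!():^[]\"{}~*?|&".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: Test \(Test\)
        System.out.println(escape("Test (Test)"));
    }
}
```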