I have a Lucene index that has several documents in it. Each document has multiple fields such as:

Id
Project
Name
Description

The Id field is a unique identifier such as a GUID; Project is the user's project ID, and a user can only view documents for their own project; Name and Description contain text that can include special characters.

When a user performs a search on the Name field, I want to match as well as I can. For example, searching for:

First

Will return both:

First.Last 

and

First.Middle.Last

Name can also be something like:

Test (NameTest)

where typing 'Test', 'Name', or '(NameTest)' should find the result.

However, if I say that Project is 'ProjectA', then that needs to be an exact match (case-insensitive). The same goes for the Id field.

Which fields should I set up as Tokenized and which as Untokenized? Also, is there a good Analyzer I should consider to make this happen?

I am stuck trying to decide the best route to implement the desired searching.

+1  A: 

Your Id field should be untokenized, for the simple reason that it doesn't look like it can be tokenized (tokenization is whitespace-based) unless you write your own tokenizer. All your other fields can be tokenized.

Perform a phrase query on the project name: look up PhraseQuery, or enclose your project name in double quotes, which will make it match exactly. Example: "\"My Fancy Project\""
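The backslash escaping is easy to get wrong in Java source, so here is a minimal sketch of building that quoted clause as a string for QueryParser; the field name `project` is an assumption:

```java
public class PhraseQueryString {
    // Wrap the project name in double quotes so QueryParser treats it
    // as a single phrase rather than separate terms. In Java source,
    // each embedded quote is written as \".
    static String exactProjectClause(String projectName) {
        return "project:\"" + projectName + "\"";
    }

    public static void main(String[] args) {
        // prints: project:"My Fancy Project"
        System.out.println(exactProjectClause("My Fancy Project"));
    }
}
```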

For the name field a simple query should work fine.

I'm not sure whether there are situations where you want a combination of fields. In that case, look up BooleanQuery, which lets you combine different queries with boolean operators.
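A combined query can also be sketched as the query-syntax string that QueryParser would parse into an equivalent BooleanQuery; the field names `name`, `description`, and `projectId` are assumptions here:

```java
public class ScopedSearch {
    // Build a query-syntax string equivalent to:
    //   (name contains term OR description contains term) AND projectId = id
    // The leading '+' marks a clause as required; the parenthesized pair
    // without '+' forms the optional (OR) part. QueryParser would parse
    // this into a BooleanQuery with the same structure.
    static String scopedQuery(String term, String projectId) {
        return "+(name:" + term + " description:" + term + ")"
             + " +projectId:" + projectId;
    }

    public static void main(String[] args) {
        // prints: +(name:test description:test) +projectId:3
        System.out.println(scopedQuery("test", "3"));
    }
}
```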

Mikos
I do plan on being able to do a boolean query across both Name and Description for something like 'test'. In that case, I want to return all documents that contain 'test' in either field. I would also like my queries scoped by a project Id. Example: (name or description contains 'test') AND project id = 3 (exact match). I presume project Id would be untokenized, and Name and Description would be tokenized using a standard analyzer. Would a standard BooleanQuery using the QueryParser class achieve my goal?
Brandon
Yes, the above should work. If your project id is just a number or some identifier (a "term" in Lucene terms), you can use a TermQuery.
Mikos
I followed what you said, but am running into a bit of a hiccup. When inserting a tokenized field, I escape the special characters. When performing the search using a QueryParser, I escape the search value before searching with a StandardAnalyzer. One problem: if I have two objects in my index whose names are 'Test' and 'Test (Test)' respectively, and I perform a search for 'Test (Test)' with the special characters escaped, I get back both objects. I know it is creating two terms, 'Test' and '\(Test\)', from my input, but it doesn't make sense to me why it matches both.
Brandon
I should add that I expected it to perform an 'AND' operation on the terms, matching only documents whose field value met all of the term criteria.
Brandon
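For reference, the escaping discussed above can be sketched in plain Java as a simplified mirror of Lucene's QueryParser.escape (this is a sketch, not the library method itself). One thing worth noting in this context: escaping only stops the parser from treating these characters as query syntax; an analyzer such as StandardAnalyzer may still strip the escaped punctuation when it tokenizes the text, so '\(Test\)' can still end up as the bare term 'test'.

```java
public class QueryEscape {
    // Prefix each character that has meaning in Lucene's query syntax
    // with a backslash, so the parser reads it as literal text.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if ("\\+-!():^[]\"{}~*?|&".indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: Test \(Test\)
        System.out.println(escape("Test (Test)"));
    }
}
```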