In my Lucene documents I have a field "company" in which the company name is tokenized. I need the tokenization for another part of my application, but for this query I need to run a PrefixQuery over the whole company field.

Example:

  • My Brand
    • my
    • brand
  • brahmin farm
    • brahmin
    • farm

A regular prefix query for "bra" returns both documents, because each has a term starting with "bra".
What I want is for only the last entry to be returned, because only its first term starts with "bra".

Any suggestions?

+1  A: 

Create another indexed field, where the company name is not tokenized. When necessary, search on that field rather than the tokenized company name field.
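As a rough illustration of why the untokenized copy works, here is a minimal plain-Java sketch that models such a field as a sorted dictionary of whole, lowercased company names. It is not Lucene code: the class `WholeFieldPrefixIndex` and its methods are hypothetical names made up for this example. In Lucene itself the equivalent would be a non-analyzed field queried with a PrefixQuery, and since a PrefixQuery is not run through an analyzer, you would probably need to lowercase the value yourself at index time, as the sketch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical model of an untokenized "company" field (not the Lucene API).
public class WholeFieldPrefixIndex {
    // Sorted map: lowercased whole company name -> ids of documents with that name.
    private final NavigableMap<String, List<Integer>> exact = new TreeMap<>();

    // Index the whole field value as a single key, without tokenizing it.
    public void add(int docId, String company) {
        String key = company.toLowerCase(Locale.ROOT).trim();
        exact.computeIfAbsent(key, k -> new ArrayList<>()).add(docId);
    }

    // Return ids of documents whose whole company name starts with the prefix.
    public List<Integer> startsWith(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        // Keys sharing a prefix are contiguous in sorted order, so seek to the
        // first key >= prefix and stop at the first key that no longer matches.
        for (Map.Entry<String, List<Integer>> e : exact.tailMap(p, true).entrySet()) {
            if (!e.getKey().startsWith(p)) {
                break;
            }
            hits.addAll(e.getValue());
        }
        return hits;
    }
}
```

With "My Brand" and "brahmin farm" indexed this way, a lookup for "bra" matches only "brahmin farm", because "my brand" as a whole does not start with that prefix.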


If you want fast searches, you need to have index entries that point directly at the records of interest. There might be something that you can do with the proximity data to filter records, but it will be slow. I see the problem as: how can a "contains" query over a complete field be performed efficiently?

You might be able to minimize the increase in index size by creating (for each current field) a "first term" field and "remaining terms" field. This would eliminate duplication of the first term in two fields. For "normal" queries, you look for query terms in either of these fields. For "startswith" queries, you search only the "first term" field. But this seems like more trouble than it's worth.
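The split into a "first term" field and a "remaining terms" field could look roughly like the plain-Java sketch below. It only models the idea with two sorted maps; the class `SplitTermIndex` and its method names are hypothetical, and whitespace splitting stands in for whatever analyzer the real index uses. One limitation this makes visible: a "startswith" query here can only match within the first term, so a prefix that spans two terms (such as "my b") would not be found.

```java
import java.util.Locale;
import java.util.Map;
import java.util.NavigableMap;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical model of the two-field layout (not the Lucene API).
public class SplitTermIndex {
    private final NavigableMap<String, Set<Integer>> firstTerm = new TreeMap<>();
    private final NavigableMap<String, Set<Integer>> remainingTerms = new TreeMap<>();

    // Put the first token in one field and all later tokens in the other,
    // so no term is stored twice for the same document.
    public void add(int docId, String company) {
        String[] terms = company.toLowerCase(Locale.ROOT).trim().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
            NavigableMap<String, Set<Integer>> field = (i == 0) ? firstTerm : remainingTerms;
            field.computeIfAbsent(terms[i], k -> new TreeSet<>()).add(docId);
        }
    }

    // Collect ids for every term in one field that starts with the prefix.
    private static void collect(NavigableMap<String, Set<Integer>> field,
                                String prefix, Set<Integer> out) {
        for (Map.Entry<String, Set<Integer>> e : field.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            out.addAll(e.getValue());
        }
    }

    // "Normal" query: the prefix may match a term in either field.
    public Set<Integer> anyTermPrefix(String prefix) {
        String p = prefix.toLowerCase(Locale.ROOT);
        Set<Integer> out = new TreeSet<>();
        collect(firstTerm, p, out);
        collect(remainingTerms, p, out);
        return out;
    }

    // "Startswith" query: only the first-term field is searched.
    public Set<Integer> fieldStartsWith(String prefix) {
        Set<Integer> out = new TreeSet<>();
        collect(firstTerm, prefix.toLowerCase(Locale.ROOT), out);
        return out;
    }
}
```

Every normal query now has to consult two fields instead of one, which is part of why this layout may be more trouble than it is worth.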

erickson
This would be a solution, but it would also increase my index size quite a lot. I would have to duplicate all my fields that way (about 15) for 2500K+ records. I was hoping to find a way to simply do a startswith over a complete field.
borisCallens