ansaurus

Question

Answer 1

+3 A:

Lucene can do this very efficiently if one creates an "index-only" field with only the third and fourth digits in it. The complete value can be "stored" (or stored and indexed if other queries use the whole number) in the original field.
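As a minimal sketch of the "parse out the key digits" step, the helper below pulls the third and fourth characters out of a number string so they can be written to the index-only field. The class and method names are hypothetical, and it assumes the number is held as a plain string of at least four characters.

```java
final class KeyDigits {
    // Returns the third and fourth characters of the number
    // (zero-based positions 2 and 3). Hypothetical helper, not part of Lucene.
    static String extract(String number) {
        if (number == null || number.length() < 4) {
            throw new IllegalArgumentException("need at least four characters: " + number);
        }
        return number.substring(2, 4);
    }

    public static void main(String[] args) {
        // For "1255789" the third and fourth characters are "55".
        System.out.println(extract("1255789"));
    }
}
```

The returned value would go into the separate indexed, not stored, field, so a plain term query on that field can replace the wildcard search.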

Update: A followup comment asked, "Is [there] a way to create a temporary index on only the second digit?"

Using a ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.

Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.

erickson 2009-04-20 18:01:18

And what if I don't have the possibility of adding another index? We already have an index on those numbers. Is there a way to create a temporary index on only the second digit?

Khan 2009-04-23 18:02:44

Answer 2

+1 A:

Thank you, erickson. Your solution using ParallelReader is probably the best, if only I could use temporary indexes, but because we cache the search query, we would need those indexes later.

But as you said before, it is better to start with an index on the relevant digits straightaway.

I have another solution.

NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*

It is efficient enough for the search I'm doing, and it bypasses the first-character wildcard limitation. I wouldn't use it if there were more digits to check or if they were farther from the start. I'm now testing this on a million rows, and it's pretty efficient for our needs.
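The ten clauses above do not have to be typed by hand. The sketch below builds the query string by enumerating the first digit, which keeps the wildcard out of the first position; the class name, method name, and parameters are hypothetical, and the result is only a string intended for the query parser.

```java
import java.util.ArrayList;
import java.util.List;

final class NotClauses {
    // Builds "NOT field:0?55* NOT field:1?55* ... NOT field:9?55*".
    // The first digit is enumerated so no clause starts with a wildcard.
    static String build(String field, String keyDigits) {
        List<String> clauses = new ArrayList<>();
        for (char first = '0'; first <= '9'; first++) {
            clauses.add("NOT " + field + ":" + first + "?" + keyDigits + "*");
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(build("field", "55"));
    }
}
```

This only stays manageable because a single leading digit is enumerated; each additional leading position that has to be spelled out would multiply the number of clauses by ten.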

Khan 2009-04-24 13:30:10

Yes, that's a good workaround too. If you are sure that the first two characters in the field are always 01-99, can you just use "NOT field:??55*"?

erickson 2009-04-24 16:06:54

I tried that; the limitation is this one: http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Wildcard%20Searches "Note: You cannot use a * or ? symbol as the first character of a search"

Khan 2009-09-10 15:46:38

Lucene number extracting