Hi, I have a number-extraction problem. I want to get all matches that do not contain a certain number, e.g. 125501874, 125001873. Every number that has 55 at position 2 (that is, right after the first two digits) should be excluded.

The first digit ranges from 0 to 9 and the second from 1 to 9, so the real range is [01-99] (we cannot have 00 as the first two digits).

With Lucene I wanted to add NOT field:[01-99]55*

But it doesn't seem to work. Is there an easy way to find ??55* and disregard it in a Search("NOT field:[01-99]55*")?

Thank you Lucene guru

+3  A: 

Lucene can do this very efficiently if one creates an "index-only" field with only the third and fourth digits in it. The complete value can be "stored" (or stored and indexed if other queries use the whole number) in the original field.
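
A minimal sketch of deriving the value for that extra field, in plain Java with no Lucene calls. The class name `KeyDigits` and the field name `field_key` mentioned in the comments are hypothetical, and it assumes the numbers are handled as strings of digits.

```java
// Hypothetical helper: computes the value to put in an extra,
// index-only field (called "field_key" here for illustration).
final class KeyDigits {

    // Returns the third and fourth characters of the number,
    // or an empty string if the value is too short to have them.
    static String extract(String number) {
        if (number == null || number.length() < 4) {
            return "";
        }
        return number.substring(2, 4);
    }

    public static void main(String[] args) {
        System.out.println(extract("125501874")); // 55
        System.out.println(extract("125001873")); // 50
    }
}
```

With such a field in place, excluding the unwanted numbers becomes a plain term match on `field_key` rather than a wildcard on the full value.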


Update: A followup comment asked, "Is [there] a way to create a temporary index on only the second digit?"

A ParallelReader "vertically partitions" the fields of an index. One partition could hold the current index, with its fields, while the other is a temporary index with the new field, possibly stored in a RAMDirectory.

Assuming the number is "stored" in the original index, iterate over each document in the original index, retrieve the stored field, parse out the key digits, and add a Document to the temporary index with the new field. As the ParallelReader documentation states, it is imperative that the document numbers match in both indexes.
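
The alignment requirement can be sketched without any Lucene calls. In this illustration plain Java lists stand in for the two indexes, and the class name `ParallelKeys` is hypothetical. The point is only that entry i of the derived list belongs to document i of the original, so every original document gets exactly one entry, even when no key digits can be parsed from it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in: list positions play the role of document numbers.
final class ParallelKeys {

    // Builds the derived "column" in the same order as the stored values,
    // emitting a placeholder (empty string) rather than skipping a document,
    // so positions in both lists stay in step.
    static List<String> buildKeyColumn(List<String> storedValues) {
        List<String> keys = new ArrayList<>();
        for (String value : storedValues) {
            if (value != null && value.length() >= 4) {
                keys.add(value.substring(2, 4));
            } else {
                keys.add("");
            }
        }
        return keys;
    }
}
```

In a real index the same loop would read each stored field and add one Document to the temporary index per original document.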

erickson
And what if I can't add another index? We already have an index on those numbers. Is there a way to create a temporary index on only the second digit?
Khan
+1  A: 

Thank you, erickson. Your solution using ParallelReader is probably the best, if only I could use temporary indexes; because we cache the search query, we will need those later.

But like you said before, it is better to start with an index on the relevant digits straightaway.

I have another solution.

NOT field:0?55*
NOT field:1?55*
...
NOT field:9?55*

It is efficient enough for the search I'm doing, and it bypasses the first-character wildcard limitation. I wouldn't use it if there were more digits to check or if they were farther from the start. I'm now testing this on a million rows, and it's pretty efficient for our needs.
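
The ten clauses can be generated rather than typed by hand. This is a small sketch in plain Java; the class and method names are hypothetical, and it only builds the query text, leaving parsing and searching to whatever code already runs the query.

```java
// Hypothetical helper: expands the leading wildcard into one clause
// per possible first digit, so no clause starts with ? or *.
final class PrefixExpansion {

    static String excludeClauses(String field, String digits) {
        StringBuilder query = new StringBuilder();
        for (char first = '0'; first <= '9'; first++) {
            if (query.length() > 0) {
                query.append(' ');
            }
            // e.g. NOT field:0?55*
            query.append("NOT ").append(field).append(':')
                 .append(first).append('?').append(digits).append('*');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        System.out.println(excludeClauses("field", "55"));
    }
}
```

The resulting text would be appended to the rest of the search query, as with the hand-written clauses above.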

Khan
Yes, that's a good workaround too. If you are sure that the first two characters in the field are always 01-99, can you just use "NOT field:??55*"?
erickson
I tried that; the limitation is this one: http://lucene.apache.org/java/2_3_2/queryparsersyntax.html#Wildcard%20Searches "Note: You cannot use a * or ? symbol as the first character of a search"
Khan