Q: I'm indexing part numbers with Lucene.Net, and users may write the same part number with spaces or dashes in different places. For example, a list of part numbers includes:

JRB-1000
JRB 1000
JRB1000
JRB100-0
-JRB1000

If a user searches on 'JRB1000' or 'JRB 1000', I would like to return a match for all of the part numbers above.

A:

Write a custom Analyzer that either splits these into several tokens (JRB, 1000; relatively easy and forgiving to users) or concatenates them into a single token (JRB1000; harder but precise). Implementing your own Analyzer amounts to overriding the tokenStream method of an existing one, and perhaps writing a custom TokenFilter class.

Apply your new Analyzer to both the documents being indexed and the queries; a minimal sketch follows below.

(Links are for the Java version, but .NET should be similar.)

larsmans
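A minimal sketch of the single-token variant, assuming the Lucene 3.1+ Java API (PartNumberAnalyzer and StripSeparatorsFilter are illustrative names, not from the thread): a KeywordTokenizer emits the whole field value as one token, and a custom TokenFilter strips the separators.

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Collapses every variant of a part number to one canonical token,
// e.g. "JRB 100-0" -> "JRB1000".
public final class PartNumberAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer emits the entire field value as a single token.
        return new StripSeparatorsFilter(new KeywordTokenizer(reader));
    }

    private static final class StripSeparatorsFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        StripSeparatorsFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Rewrite the term buffer in place, dropping spaces and dashes.
            char[] buf = termAtt.buffer();
            int len = termAtt.length();
            int out = 0;
            for (int i = 0; i < len; i++) {
                char c = buf[i];
                if (c != ' ' && c != '-') {
                    buf[out++] = c;
                }
            }
            termAtt.setLength(out);
            return true;
        }
    }
}
```

At index time, a PerFieldAnalyzerWrapper can restrict this analyzer to the part-number field while other fields keep the default analyzer; passing the same wrapper to the QueryParser keeps query-time analysis consistent.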
If your analyzer just removes spaces and dashes, and then uses what remains as the tokens, it may suffice.
Yuval F
"Removing spaces" means either default behavior (which doesn't work) or treating everything as one token. It's the cases `JRB1000` -> `JRB 1000` and vice versa that cause the trouble here. (Unless the part number is a separate field?)
larsmans
Yes, part number is a separate field. I have managed to get this mostly working with a custom analyzer and tokenizer that removes the spaces and dashes and uses the result as the token. This works when searching for 'JRB1000'; however, it does not work when searching for 'JRB 1000', despite passing the custom analyzer to the QueryParser. I'm beginning to think that Lucene may not be the right tool for the job here: if all it is doing is stripping the spaces and dashes from the index and query, I could quite easily do this by adding a lookup table to my database.
ChrisR
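One likely culprit here: the classic QueryParser splits the query text on whitespace before it ever calls the analyzer, so 'JRB 1000' reaches the analyzer as two separate chunks and never gets collapsed. A hedged sketch of one workaround under the same Lucene 3.1+ assumption: bypass the parser for this one field and analyze the raw user input directly (the field name is illustrative).

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class PartNumberQuery {
    // Analyze the user's raw input as one string and build a TermQuery
    // from the single token the analyzer produces.
    public static Query build(Analyzer analyzer, String field, String userInput)
            throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(userInput));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        String token = ts.incrementToken() ? termAtt.toString() : "";
        ts.end();
        ts.close();
        return new TermQuery(new Term(field, token));
    }
}
```

With the analyzer sketched above, `build(new PartNumberAnalyzer(), "partNumber", "JRB 1000")` should yield a TermQuery for JRB1000. Quoting the input as a phrase ("JRB 1000") may also work with the classic QueryParser, since quoted strings are handed to the analyzer whole.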
Have you checked whether the contents of your index are correct? You can do that using Luke (http://www.getopt.org/luke/). That's a Java tool, but it *should* work for Lucene.Net as the index format is identical.
larsmans
Yep, have used Luke to verify that the index is OK.
ChrisR
I've decided to use a lookup table in my database rather than Lucene for this. I've accepted your answer, larsmans, as it actually answered my question; thanks for your help. The code for the custom analyzer and tokenizer is available on GitHub if anyone else needs it: http://gist.github.com/624076
ChrisR
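For anyone comparing the two approaches, the lookup-table route amounts to storing a normalized copy of each part number and normalizing the user's input the same way before querying; a rough sketch, with hypothetical table and column names:

```java
public final class PartNumbers {
    // Same rule the custom analyzer applied: drop spaces and dashes.
    public static String normalize(String raw) {
        return raw.replace(" ", "").replace("-", "");
    }
    // Matching SQL (illustrative):
    //   SELECT part_number FROM parts
    //   WHERE normalized_part_number = ?   -- bind normalize(userInput)
}
```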