Q: I'm indexing part numbers with Lucene.Net, and users may write the same part number with spaces or dashes in different places. For example, a list of part numbers includes:

JRB-1000
JRB 1000
JRB1000
JRB100-0
-JRB1000

If a user searches on 'JRB1000' or 'JRB 1000', I would like to return a match for all of the part numbers above.

A:

Write a custom Analyzer that either splits these into several tokens (JRB, 1000; relatively easy and forgiving to users) or concatenates them into a single token (JRB1000; harder but precise). Implementing your own Analyzer amounts to overriding the tokenStream method of an existing one, and perhaps writing a custom TokenFilter class.

Apply your new Analyzer to both the documents being indexed and the queries; a minimal sketch follows below.

(Links are for the Java version, but .NET should be similar.)

larsmans
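A minimal sketch of the single-token variant, assuming the Lucene 3.1+ Java API (PartNumberAnalyzer and StripSeparatorsFilter are illustrative names, not from the thread): a KeywordTokenizer emits the whole field value as one token, and a custom TokenFilter strips the separators.

```java
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Collapses every variant of a part number to one canonical token,
// e.g. "JRB 100-0" -> "JRB1000".
public final class PartNumberAnalyzer extends Analyzer {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // KeywordTokenizer emits the entire field value as a single token.
        return new StripSeparatorsFilter(new KeywordTokenizer(reader));
    }

    private static final class StripSeparatorsFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

        StripSeparatorsFilter(TokenStream input) {
            super(input);
        }

        @Override
        public boolean incrementToken() throws IOException {
            if (!input.incrementToken()) {
                return false;
            }
            // Rewrite the term buffer in place, dropping spaces and dashes.
            char[] buf = termAtt.buffer();
            int len = termAtt.length();
            int out = 0;
            for (int i = 0; i < len; i++) {
                char c = buf[i];
                if (c != ' ' && c != '-') {
                    buf[out++] = c;
                }
            }
            termAtt.setLength(out);
            return true;
        }
    }
}
```

At index time, a PerFieldAnalyzerWrapper can restrict this analyzer to the part-number field while other fields keep the default analyzer; passing the same wrapper to the QueryParser keeps query-time analysis consistent.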
If your analyzer just removes spaces and dashes, and then uses what remains as the tokens, it may suffice.
Yuval F
"Removing spaces" means either default behavior (which doesn't work) or treating everything as one token. It's the cases `JRB1000` -> `JRB 1000` and vice versa that cause the trouble here. (Unless the part number is a separate field?)
larsmans
Yes, part number is a separate field. I have managed to get this mostly working with a custom analyzer and tokenizer that removes the spaces and dashes and uses the result as the token. This works when searching for 'JRB1000'; however, it does not work when searching for 'JRB 1000', despite passing the custom analyzer to the QueryParser. I'm beginning to think that Lucene may not be the right tool for the job here: if all it is doing is stripping the spaces and dashes from the index and query, I could quite easily do this by adding a lookup table to my database.
ChrisR
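One likely culprit here: the classic QueryParser splits the query text on whitespace before it ever calls the analyzer, so 'JRB 1000' reaches the analyzer as two separate chunks and never gets collapsed. A hedged sketch of one workaround under the same Lucene 3.1+ assumption: bypass the parser for this one field and analyze the raw user input directly (the field name is illustrative).

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class PartNumberQuery {
    // Analyze the user's raw input as one string and build a TermQuery
    // from the single token the analyzer produces.
    public static Query build(Analyzer analyzer, String field, String userInput)
            throws IOException {
        TokenStream ts = analyzer.tokenStream(field, new StringReader(userInput));
        CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        String token = ts.incrementToken() ? termAtt.toString() : "";
        ts.end();
        ts.close();
        return new TermQuery(new Term(field, token));
    }
}
```

With the analyzer sketched above, `build(new PartNumberAnalyzer(), "partNumber", "JRB 1000")` should yield a TermQuery for JRB1000. Quoting the input as a phrase ("JRB 1000") may also work with the classic QueryParser, since quoted strings are handed to the analyzer whole.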
Have you checked whether the contents of your index are correct? You can do that using Luke (http://www.getopt.org/luke/). That's a Java tool, but it *should* work for Lucene.Net as the index format is identical.
larsmans
Yep, have used Luke to verify that the index is OK.
ChrisR
I've decided to use a lookup table in my database rather than Lucene for this. I've accepted your answer, larsmans, as it actually answered my question; thanks for your help. The code for the custom analyzer and tokenizer is available on GitHub if anyone else needs it: http://gist.github.com/624076
ChrisR
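For anyone comparing the two approaches, the lookup-table route amounts to storing a normalized copy of each part number and normalizing the user's input the same way before querying; a rough sketch, with hypothetical table and column names:

```java
public final class PartNumbers {
    // Same rule the custom analyzer applied: drop spaces and dashes.
    public static String normalize(String raw) {
        return raw.replace(" ", "").replace("-", "");
    }
    // Matching SQL (illustrative):
    //   SELECT part_number FROM parts
    //   WHERE normalized_part_number = ?   -- bind normalize(userInput)
}
```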