ansaurus

Question

How can I set up Solr to tokenize on whitespace and punctuation?

Answer 1


How about:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" />

That should prevent `one2three4` from being split.

Raoul Duke 2010-10-08 13:39:31

That's what I thought, but `generateWordParts` still splits on numerics regardless of that setting. Screenshot: http://i.imgur.com/WpgCl.png

claytron 2010-10-08 14:01:30

Do you have it correctly configured for query time? In your OP I only see an index-time analyzer defined. It works for me with Solr 1.4, so I guess it's either a bug in 1.3 or a configuration issue on your part.

Raoul Duke 2010-10-08 14:09:39

It does the same at query time. I'm also starting to think it is a bug in 1.3.0.

claytron 2010-10-08 14:12:21

It's certainly possible. I had to submit a few patches to fix bugs in 1.3 myself :-/

Raoul Duke 2010-10-08 14:15:40

That was it. Tested on 1.4.1 and it worked as expected. Thanks!

claytron 2010-10-08 15:07:48

Turns out it wasn't a bug. The `splitOnNumerics` option wasn't added until the Solr 1.4 release, so 1.3 doesn't support it. If the Solr wiki weren't read-only, I'd add a note that those options were introduced in version 1.4.

claytron 2010-10-12 14:35:18

Ah, good to know, thanks!

Raoul Duke 2010-10-12 22:37:05
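
For anyone landing here later, the pieces from this thread could be combined roughly as follows. This is only a sketch, not the asker's actual schema: the field type name `text_ws_punct` is made up for illustration, and the whitespace tokenizer is assumed from the question title. It needs Solr 1.4 or later, since `splitOnNumerics` does not exist in 1.3.

```xml
<!-- Hypothetical field type name; goes in the <types> section of schema.xml -->
<fieldType name="text_ws_punct" class="solr.TextField">
  <!-- Index-time analysis -->
  <analyzer type="index">
    <!-- Split the input on whitespace first -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Then split tokens on punctuation, but not on letter/digit
         boundaries, so one2three4 stays a single token (Solr 1.4+) -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnNumerics="0"/>
  </analyzer>
  <!-- Query-time analysis mirrors index time so both sides tokenize alike -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            splitOnNumerics="0"/>
  </analyzer>
</fieldType>
```

Defining both analyzers explicitly addresses the query-time point raised above. The analysis page in the Solr admin UI is a quick way to check how a sample value such as `one2three4` actually gets tokenized.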
