I have been trying to get my Solr schema (using Solr 1.3.0) to create terms that are tokenized on whitespace and punctuation. Here are some examples of what I would like to see happen:

terms given -> terms tokenized

foo-bar -> foo,bar
one2three4 -> one2three4
multiple words/and some-punctuation -> multiple,words,and,some,punctuation

I thought that this combination would work:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"/>
  </analyzer>
</fieldType>

The problem is that this also splits on letter-to-number transitions:

one2three4 -> one,2,three,4

I have tried various combinations of WordDelimiterFilterFactory settings, but none have proven useful. Is there a filter or tokenizer that can handle what I require?

+1  A: 

How about:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0" />

That should prevent one2three4 from being split.

Raoul Duke
That's what I thought, but `generateWordParts` still splits on numerics regardless of that setting. http://i.imgur.com/WpgCl.png
claytron
Do you have it correctly configured for query time? In your OP I only see an index-time analyzer defined. It works for me with Solr 1.4, so I guess it's either a bug in 1.3 or a configuration issue on your part.
Raoul Duke
It does the same at query time. I'm starting to think that it is a bug in 1.3.0 as well.
claytron
It's certainly possible. I had to submit a few patches to fix bugs in 1.3 myself :-/
Raoul Duke
That was it. Tested on 1.4.1 and it worked as expected. Thanks!
claytron
Turns out it wasn't a bug. The `splitOnNumerics` option wasn't added until the Solr 1.4 release, so 1.3.0 simply ignores it. If the Solr wiki weren't read-only, I'd add a note that those options were introduced in version 1.4.
claytron
Ah, good to know, thanks!
Raoul Duke
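
For anyone landing here later, the outcome of the thread can be sketched as a complete field type. This is only a sketch, assuming Solr 1.4 or later (where `splitOnNumerics` exists). The query-time analyzer is an addition that simply mirrors the index-time one, as suggested in the comments. It is not taken from the original question.

```xml
<!-- Sketch only: requires Solr 1.4+; splitOnNumerics is not available in 1.3.0 -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Split the input on whitespace first -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Split on punctuation, but not on letter/number transitions -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>
  </analyzer>
  <!-- Assumption: query-time analysis mirrors index-time analysis -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" splitOnNumerics="0"/>
  </analyzer>
</fieldType>
```

As reported above, this behaved as expected on 1.4.1, keeping one2three4 as a single term. It is worth confirming on the Solr admin analysis page for your own version, since other WordDelimiterFilterFactory options left at their defaults may still affect the output.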