This is the explanation for the tokenizer in Lucene:
- Splits words at punctuation characters, removing punctuation. However, a dot that's not followed by whitespace is considered part of a token.
- Splits words at hyphens, unless there's a number in the token, in which case the whole token is interpreted as a product number and is not split.
- Recognizes email addresses and internet hostnames as one token.
This comes from the Lucene StandardTokenizer documentation, and it explains why the tokenizer would be splitting your word.
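As a quick way to see this behavior, here is a minimal sketch that prints the tokens StandardAnalyzer produces. It assumes a recent Lucene version (5+, where the no-argument constructor exists); the field name "f" and the sample text are arbitrary:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenDemo {
    public static void main(String[] args) throws IOException {
        try (StandardAnalyzer analyzer = new StandardAnalyzer();
             // "f" is just a placeholder field name
             TokenStream ts = analyzer.tokenStream("f", "semi-final St-Constant")) {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                System.out.println(term.toString()); // prints: semi, final, st, constant
            }
            ts.end();
        }
    }
}
```

The hyphenated inputs come out as separate lowercased tokens, which is exactly the splitting described above.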
This is probably the hardest thing to correct: human error. If an individual types in "semifinal", that is theoretically not the same as searching for "semi-final". So if you have numerous words that could be typed in different ways, e.g.:
St-Constant
Saint Constant
Saint-Constant
you're stuck with the task of verifying both "st" and "saint", as well as the hyphenated and non-hyphenated forms. Your token set would be huge, and each word would need to be compared to see if they matched.
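One way to keep that comparison manageable is to generate and store every spelling you are willing to accept, then test incoming queries against that set. Here is a minimal sketch in Java; the `variants` helper and the "st"/"saint" expansion rule are assumptions of mine, not anything Lucene provides:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class VariantExpander {

    // Hypothetical helper: produce the spelling variants to store and test.
    static Set<String> variants(String input) {
        Set<String> out = new LinkedHashSet<>();
        String spaced = input.toLowerCase().replace('-', ' ');
        out.add(spaced);                     // "st constant"
        out.add(spaced.replace(" ", "-"));   // "st-constant"
        out.add(spaced.replace(" ", ""));    // "stconstant"
        // Expand the "st" abbreviation to "saint" and recurse once.
        if (spaced.startsWith("st ")) {
            out.addAll(variants("saint" + spaced.substring(2)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints all six accepted spellings of St-Constant.
        System.out.println(variants("St-Constant"));
    }
}
```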
I'm still looking to see if there is a good way of approaching this. Otherwise, if you don't have many words to handle, store all the possibilities and test against them; or use a loop that starts at the first letter and moves through each position, splitting the string in half to form two words and testing the whole way through to see if either split matches. But again, who's to say you only have two words? If you are verifying more than two words, then you have the problem of splitting the word into multiple sections (a sketch of both approaches follows the example below).
For example:
saint-jean-sur-richelieu
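Here is a rough sketch of that loop; the `dictionary` set is a hypothetical list of known words. `splitsIntoTwo` tries every split point for the two-word case, and `splitsIntoWords` recurses so that names like saint-jean-sur-richelieu, which break into more than two sections, are also handled:

```java
import java.util.Set;

public class SplitCheck {

    // Try every split point: does the word break into exactly two known words?
    static boolean splitsIntoTwo(String word, Set<String> dictionary) {
        for (int i = 1; i < word.length(); i++) {
            if (dictionary.contains(word.substring(0, i))
                    && dictionary.contains(word.substring(i))) {
                return true;                 // e.g. "semifinal" -> "semi" + "final"
            }
        }
        return false;
    }

    // Recursive version for more than two sections:
    // peel a known word off the front, then test the remainder the same way.
    static boolean splitsIntoWords(String word, Set<String> dictionary) {
        if (dictionary.contains(word)) {
            return true;
        }
        for (int i = 1; i < word.length(); i++) {
            if (dictionary.contains(word.substring(0, i))
                    && splitsIntoWords(word.substring(i), dictionary)) {
                return true;                 // "saintjeansurrichelieu" -> saint+jean+sur+richelieu
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("semi", "final", "saint", "jean", "sur", "richelieu");
        System.out.println(splitsIntoTwo("semifinal", dict));               // true
        System.out.println(splitsIntoWords("saintjeansurrichelieu", dict)); // true
    }
}
```

Note that the recursive version can get slow on long inputs; if that matters, the standard fix is to memoize the results per starting index.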
If I come up with anything else, I will let you know.