Hi,

I'm playing around with Lucene and noticed that the use of a hyphen (e.g. "semi-final") will result in two words ("semi" and "final") in the index. How is this supposed to match if the user searches for "semifinal", in one word?

Edit: I'm just playing around with the StandardTokenizer class actually, maybe that is why? Am I missing a filter?

Thanks!

(Edit) My code looks like this:

            // Requires: using System; using System.IO;
            //           using Lucene.Net.Analysis; using Lucene.Net.Analysis.Standard;
            //           using Lucene.Net.Analysis.Tokenattributes;
            StandardAnalyzer sa = new StandardAnalyzer();
            TokenStream ts = sa.TokenStream("field", new StringReader("semi-final"));

            // The token text lives in the TermAttribute; ts.ToString() does not
            // return it (attribute API as in the Lucene.Net 2.9 line).
            TermAttribute termAttr = (TermAttribute)ts.AddAttribute(typeof(TermAttribute));

            while (ts.IncrementToken())
            {
                Console.WriteLine("Token: " + termAttr.Term());
            }
A: 

This is the documented behavior of the StandardTokenizer in Lucene:

 - Splits words at punctuation characters, removing punctuation. However, a
   dot that's not followed by whitespace is considered part of a token.

 - Splits words at hyphens, unless there's a number in the token, in which
   case the whole token is interpreted as a product number and is not split.

 - Recognizes email addresses and internet hostnames as one token.

Found here. This explains why your word is being split.
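As a quick check of those rules, here is a sketch using the same Lucene.Net 2.9-era attribute API as the code in the question; the expected tokens follow from the documentation above:

    string[] samples = { "semi-final", "X-5000", "user@example.com" };
    foreach (string s in samples)
    {
        TokenStream ts = new StandardAnalyzer().TokenStream("f", new StringReader(s));
        TermAttribute term = (TermAttribute)ts.AddAttribute(typeof(TermAttribute));
        Console.Write(s + " ->");
        while (ts.IncrementToken())
            Console.Write(" [" + term.Term() + "]");
        // expected: semi-final       -> [semi] [final]     (split at the hyphen)
        //           X-5000           -> [x-5000]           (digit present, hyphen kept)
        //           user@example.com -> [user@example.com] (email kept whole)
        Console.WriteLine();
    }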

This is probably the hardest thing to correct: human error. If an individual types in "semifinal", it is theoretically not the same as searching for "semi-final". So if you had numerous words that could be typed in different ways, e.g.:

St-Constant

Saint Constant

Saint-Constant

you're stuck with the task of verifying both "st" and "saint", as well as the hyphenated and non-hyphenated forms. Your tokens would be huge, and each word would need to be compared to see if they matched.

I'm still looking to see if there is a good way of approaching this. Otherwise, if you don't have a lot of words you wish to use, have all the possibilities stored and tested, or have a loop that splits the word at each letter position, forming two words, and tests both halves the whole way through to see if they match (see the sketch below). But again, who's to say you only have two words? If you are verifying more than two words, then you have the problem of splitting the word into multiple sections.

For example:

saint-jean-sur-richelieu
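A minimal sketch of that brute-force two-way split (the helper name and the isKnownWord delegate are mine, not from any library):

    // Requires .NET 4 (Tuple); using System; using System.Collections.Generic;
    static class SplitHelper
    {
        // Try every split point of a concatenated word and return the pairs
        // where both halves pass the caller-supplied dictionary test.
        public static IEnumerable<Tuple<string, string>> TwoWordSplits(
            string word, Func<string, bool> isKnownWord)
        {
            for (int i = 1; i < word.Length; i++)
            {
                string left = word.Substring(0, i);
                string right = word.Substring(i);
                if (isKnownWord(left) && isKnownWord(right))
                    yield return Tuple.Create(left, right);
            }
        }
    }

For instance, TwoWordSplits("saintconstant", dictionary.Contains) would yield ("saint", "constant") if both halves are in the dictionary; as noted, anything with more than two parts would need a recursive version of the same idea.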

If I come up with anything else, I will let you know.

Justin Gregoire
I added the code. I read that also; I'm just wondering what to use instead, then.
Thanks for your effort, but the other answer provided me with the info I needed.
A: 

You can write your own tokenizer that will produce, for words with a hyphen, all possible combinations of tokens, like this:

  • semifinal
  • semi
  • final

You will need to set proper token offsets (and a position increment of 0 for the stacked token) to tell Lucene that "semi" and "semifinal" actually start at the same place in the document.
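A minimal sketch of the idea as a TokenFilter (the class name is mine; written against the Lucene.Net 2.9-style attribute API; offsets are left untouched for brevity, and multi-hyphen remainders are not re-split):

    // Requires: using System.Collections.Generic; using Lucene.Net.Analysis;
    //           using Lucene.Net.Analysis.Tokenattributes;
    public class HyphenExpandFilter : TokenFilter
    {
        private readonly TermAttribute termAttr;
        private readonly PositionIncrementAttribute posIncrAttr;
        private readonly Queue<KeyValuePair<string, int>> pending =
            new Queue<KeyValuePair<string, int>>();

        public HyphenExpandFilter(TokenStream input) : base(input)
        {
            termAttr = (TermAttribute)AddAttribute(typeof(TermAttribute));
            posIncrAttr = (PositionIncrementAttribute)AddAttribute(typeof(PositionIncrementAttribute));
        }

        public override bool IncrementToken()
        {
            if (pending.Count > 0)
            {
                var tok = pending.Dequeue();
                termAttr.SetTermBuffer(tok.Key);
                posIncrAttr.SetPositionIncrement(tok.Value);
                return true;
            }

            if (!input.IncrementToken())
                return false;

            string term = termAttr.Term();
            int dash = term.IndexOf('-');
            if (dash > 0 && dash < term.Length - 1)
            {
                // Current token becomes the first part ("semi"); queue the
                // concatenation at the same position (increment 0), then the
                // remainder one position later.
                termAttr.SetTermBuffer(term.Substring(0, dash));
                pending.Enqueue(new KeyValuePair<string, int>(term.Replace("-", ""), 0));
                pending.Enqueue(new KeyValuePair<string, int>(term.Substring(dash + 1), 1));
            }
            return true;
        }
    }

Note that it has to sit on top of something like a WhitespaceTokenizer, e.g. new HyphenExpandFilter(new WhitespaceTokenizer(reader)), because the StandardTokenizer would already have discarded the hyphen before the filter could see it.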

Yaroslav
Is Lucene OK with having several terms with the same offset? Do all the searches handle this correctly?
Having offsets for terms makes sense for phrase search, and phrase search handles it correctly. Term search just ignores offsets, and therefore a document containing 'semi-final' will score higher than a document containing 'semifinal' for the search query 'semifinal OR final', because it contains two of the searched terms while the latter contains just one.
Yaroslav
A: 

Hello,

I would recommend you use the WordDelimiterFilter from Solr (you can use it in a plain Lucene application as a TokenFilter added to your analyzer; just grab the Java file for this filter from Solr and add it to your application).

This filter is designed to handle cases just like this: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Robert Muir
A: 

If you're looking for a port of the WordDelimiterFilter, then I advise googling WordDelimiter.cs. I found such a port here:

http://osdir.com/ml/attachments/txt9jqypXvbSE.txt

I then created a very basic WordDelimiterAnalyzer:

// Requires: using System.IO; using Lucene.Net.Analysis;
//           using Lucene.Net.Analysis.Standard;
public class WordDelimiterAnalyzer : Analyzer
{
    #region Overrides of Analyzer

    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream result = new WhitespaceTokenizer(reader);

        result = new StandardFilter(result);
        result = new LowerCaseFilter(result);
        result = new StopFilter(true, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // The int flags mirror the Solr filter's options (in the ported
        // signature's order, presumably generateWordParts, generateNumberParts,
        // catenateWords, catenateNumbers, catenateAll).
        result = new WordDelimiterFilter(result, 1, 1, 1, 1, 0);

        return result;
    }

    #endregion
}

I said it was basic :)
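For what it's worth, a quick way to see what it produces (hypothetical output; with catenateWords on, the concatenated form should be emitted too, which is what makes "semifinal" match "semi-final"):

    var analyzer = new WordDelimiterAnalyzer();
    TokenStream ts = analyzer.TokenStream("f", new StringReader("semi-final"));
    TermAttribute term = (TermAttribute)ts.AddAttribute(typeof(TermAttribute));
    while (ts.IncrementToken())
        Console.WriteLine(term.Term()); // expect something like: semi, semifinal, final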

If anyone has a more complete implementation, I would be keen to see it!

Joe