ansaurus

Question

How to use wildcards and fuzzy search with Solr?

Answer 1


Your fieldType name="text" is missing a lot of filters. For reference, here's the text fieldType from the default schema.xml:

<!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
    words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
    so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
    Synonyms and stopwords are customized by external files, and stemming is enabled.
    -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
      add enablePositionIncrements=true in both the index and query
      analyzers to leave a 'gap' for more accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>

For example, the SnowballPorterFilterFactory is the one that enables stemming.

I recommend building your schema based on the default schema.xml, tweaking and modifying as necessary (as opposed to starting from scratch).
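
As a rough sketch of that kind of tweaking, assuming a language for which no Snowball stemmer is available, a cut-down variant of the type above could keep only stop word removal and lowercasing. The name "text_nostem" is made up for illustration, and a single analyzer element without a type attribute is used for both indexing and querying:

<!-- Sketch only: "text_nostem" is a hypothetical name, not part of the default schema.xml.
     Same tokenizer as the default "text" type, without synonyms, WordDelimiterFilter or stemming. -->
<fieldType name="text_nostem" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>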

Here's the reference for analyzers, tokenizers and filters.

Mauricio Scheffer 2010-01-22 12:29:43

Thanks, Mauricio. I use LetterTokenizer instead of WhitespaceTokenizer, because WhitespaceTokenizer leaves punctuation characters attached to the end of a word. Everything else you listed is fine and I will use it, but I preferred to begin with a stripped-down set. For instance, I cannot use the Snowball stemmer right now, as there is not one yet for my language. Doesn't query parsing have something to do with SolrQueryParser? http://lucene.apache.org/solr/api/org/apache/solr/search/SolrQueryParser.html

fifigyuri 2010-01-22 13:41:14

It looks like Hungarian stemming can be bought: http://www.lucidimagination.com/search/document/CDRG_ch05_5.6.16. Also, why do you ask about SolrQueryParser? Are you looking to extend Solr? Normally you don't need to change code in Solr, as it is highly extensible and configurable.

Mauricio Scheffer 2010-01-22 14:43:21
