I am using Solr with NGramTokenizerFactory to create search tokens for substrings of words.

The NGramTokenizer is configured with a minimum word length of 3.

This means that I can search for e.g. "unb" and then match the word "unbelievable".

However, I have a problem with short words like "I" and "in". These are not indexed by Solr (I suspect it is because of the NGramTokenizer), and therefore I cannot search for them.

I don't want to reduce the minimum word length to 1 or 2, since this creates a huge search index. But I would like Solr to include whole words whose length is already below this minimum.

How can I do that?

/Carsten

+2  A: 

First of all, try to understand why your words don't get indexed by Solr, using the "Analysis Tool":

http://localhost:8080/solr/admin/analysis.jsp

Just enter the field and the text you are searching for, and see which analyser is filtering out your short term. I suggest doing this because you said you only "suspect" the cause, and you need to be certain which analyser filters your data.

Then why not simply copy the term into another field that doesn't use that analyser?

This way your terms are indexed twice, and appear both as exact words and as n-grams. You then have to deal with the scores of the two different fields.
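As a rough sketch of that idea, a schema.xml fragment could look something like the following. The field and type names (`title`, `title_exact`, `text_ngram`, `text_exact`) and the gram sizes are made up for illustration, and the analyser chains are only an assumption about a typical setup, not your actual configuration:

```xml
<!-- schema.xml fragment (not a complete schema); all names are hypothetical -->
<types>
  <!-- Substring matching: only grams of at least minGramSize characters are emitted,
       so words shorter than 3 characters produce no tokens here -->
  <fieldType name="text_ngram" class="solr.TextField">
    <analyzer type="index">
      <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- Whole-word matching: no n-gram step, so "I" and "in" are kept -->
  <fieldType name="text_exact" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>

<fields>
  <field name="title" type="text_ngram" indexed="true" stored="true"/>
  <field name="title_exact" type="text_exact" indexed="true" stored="false"/>
</fields>

<!-- Index the same source text a second time, without the n-gram analyser -->
<copyField source="title" dest="title_exact"/>
```

A query would then need to cover both fields, for example `title:unb OR title_exact:in`, or list both in the `qf` parameter if you use the dismax handler.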

I hope this has helped you in some way.

Some links on aggregation and the copyField tag:

Indexing data in multiple fields

Using copy field tag

volothamp
Thanks for your suggestion. I have run the analysis against two words: a normal case, "jeudan", and the 1-letter word "j". Here are the results: http://pastie.org/1000520 As you can see, it IS actually the NGramTokenizer that is filtering out the 1-letter word (or in this case the EdgeNGramTokenizer, but I have tested with both). I could try what you suggest, but I would rather let Solr do all the text-munging. I do a lot of field-specific searches, so your suggestion would mean rewriting those queries to look in two text fields instead of one. Possible, but counter-intuitive.
Carsten Gehling
Consider that it's typical in Solr to have an aggregation field that you query against, plus a series of fields with different types and analysers. Simply use the copyField tag to copy all your source fields to the target. You don't have to change your queries.
volothamp
Well, your answer actually solved this and other problems that I faced. I didn't know about the analysis tool. I tried a few other filters and tokenizers through the analyser, and ended up using the PhoneticFilter on both the index and query side. Very neat - thanks a lot!
Carsten Gehling