Hi,

I'm using Solr's faceting and I've run into a problem that I was hoping I could get around using filters.

Basically, sometimes a town name will come through to Solr as

"CAMBRIDGE"

and sometimes it will come through as

"Cambridge"

I wanted to use a filter in Solr to stop the SCREAMING CAPS version of the town name. It seems there is a filter to make all the text lower case:

<!-- A text field that only sorts out casing for faceting -->
<fieldType name="text_facet" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I was wondering if anyone knew of a filter that leaves the first character of each word alone and lower-cases the rest of the characters, e.g.

  • CAMBRIDGE >> Cambridge
  • KingsTON Upon HULL >> Kingston Upon Hull

etc

Alternatively, if it's easy to write your own filters, some help on how to do that would be appreciated. I'm not a Java person.

Thanks

+2  A: 

AFAIK there is no built-in filter like that. If you want to write one, see LowerCaseFilterFactory and LowerCaseFilter for reference; it doesn't seem to be very hard.
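As a rough idea of the per-token logic such a custom filter would need, here is a minimal sketch in plain Java. It is not a Lucene TokenFilter, and the class and method names (TitleCaseToken, titleCase) are made up for illustration. It also upper-cases the first character rather than leaving it untouched, which is an assumption about what is wanted for input like "cambridge".

```java
import java.util.Locale;

// Hypothetical helper: the casing logic only, with no Solr/Lucene dependencies.
class TitleCaseToken {

    // Upper-cases the first character of a token and lower-cases the rest.
    static String titleCase(String token) {
        if (token == null || token.isEmpty()) {
            return token;
        }
        // Split after the first code point so surrogate pairs are not cut in half.
        int split = token.offsetByCodePoints(0, 1);
        // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
        String head = token.substring(0, split).toUpperCase(Locale.ROOT);
        String tail = token.substring(split).toLowerCase(Locale.ROOT);
        return head + tail;
    }

    public static void main(String[] args) {
        System.out.println(titleCase("CAMBRIDGE")); // prints "Cambridge"
        System.out.println(titleCase("KingsTON"));  // prints "Kingston"
    }
}
```

In a real filter this method would be applied to each token's text, in the place where LowerCaseFilter lower-cases it.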

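Or you could do this client-side, e.g. in SolrNet you could write an ISolrOperations decorator that applies the necessary transformations to the results after the real query, using ToTitleCase. (Lower-case the value first, because ToTitleCase leaves all-uppercase words such as "CAMBRIDGE" unchanged.)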
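The client-side idea is not specific to C#. The sketch below shows it in Java, to match the filter code discussed in this thread: after the query returns, normalise each facet value and add up the counts of values that collapse to the same name. The class name, method names and counts are hypothetical, and nothing here is SolrNet or SolrJ API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical post-processing step applied to facet counts returned by a query.
class FacetMerge {

    // Title-cases every whitespace-separated word in a facet value.
    static String titleCaseWords(String value) {
        StringBuilder out = new StringBuilder(value.length());
        boolean startOfWord = true;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (Character.isWhitespace(c)) {
                startOfWord = true;
                out.append(c);
            } else if (startOfWord) {
                out.append(Character.toUpperCase(c));
                startOfWord = false;
            } else {
                out.append(Character.toLowerCase(c));
            }
        }
        return out.toString();
    }

    // Merges counts of facet values that differ only in casing.
    static Map<String, Long> mergeFacets(Map<String, Long> rawCounts) {
        Map<String, Long> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : rawCounts.entrySet()) {
            merged.merge(titleCaseWords(e.getKey()), e.getValue(), Long::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> raw = new LinkedHashMap<>();
        raw.put("CAMBRIDGE", 3L);          // made-up counts
        raw.put("Cambridge", 5L);
        raw.put("KingsTON Upon HULL", 2L);
        System.out.println(mergeFacets(raw));
    }
}
```

The trade-off is that Solr still stores and counts the two spellings separately, so facet limits and sorting are applied before the merge.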

Mauricio Scheffer
I'm using a very old version of SolrNet, so I'll see if I can use ISolrOperations. However, I think it might be about time to learn Java. I know C#, so the syntax shouldn't be a problem. Thanks Mauricio
CraftyFella
ISolrOperations has been around since revision 1 :-) Anyway I recommend upgrading to the latest version...
Mauricio Scheffer
Nice, I'll definitely do that.
CraftyFella
+1  A: 

Perhaps you could make use of the solr.PatternReplaceCharFilterFactory?

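<fieldType name="textCharNorm" class="solr.TextField">
  <analyzer>
    <!-- Char filters always run first, before the tokenizer and token filters. -->
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="([^\s]{1})([^\s]*)" replaceWith="\U$1\L$2"/>
    <!-- An analyzer needs a tokenizer; this one matches the question's field type. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Untested: this filter runs last, so it may undo any capitalisation above. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>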

Note that I haven't tested this configuration or solr.PatternReplaceCharFilterFactory, so I'm not sure if it works. If you need to build your own filter, this guide might be useful:

http://robotlibrarian.billdueber.com/building-a-solr-text-filter-for-normalizing-data/
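One caveat with the pattern approach: as far as I know, Java's regex replacement strings have no \U / \L case-conversion operators (those come from Perl-style tools), so the case change has to happen in code. A minimal sketch using the same pattern with java.util.regex follows; the class and method names are made up for illustration, and this is plain Java, not a Solr char filter.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: regex-driven title casing done in Java code.
class RegexTitleCase {

    // Group 1: first non-whitespace character; group 2: the rest of the word.
    private static final Pattern WORD = Pattern.compile("([^\\s])([^\\s]*)");

    static String titleCase(String input) {
        Matcher m = WORD.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = m.group(1).toUpperCase(Locale.ROOT)
                    + m.group(2).toLowerCase(Locale.ROOT);
            // quoteReplacement keeps any '$' or '\' in the text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(titleCase("KingsTON Upon HULL")); // prints "Kingston Upon Hull"
    }
}
```

The same loop could sit inside a custom filter, applied to each token or to the whole field value.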

// John

John P
Thanks, I'll check that out.
CraftyFella