views: 140

answers: 2

i have a long list of words that i put into a very simple SOLR / Lucene database. my goal is to find 'similar' words from the list for single-term queries, where 'similarity' is specifically understood as (damerau) levenshtein edit distance. i understand SOLR provides such a distance for spelling suggestions.

in my SOLR schema.xml, i have configured a field type string:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

which i use to define a field

<field name='term' type='string' indexed='true' stored='true' required='true'/>

i want to search this field and have results returned according to their levenshtein edit distance. however, when i run a query like webspace~0.1 against SOLR with debugging and explanations on, the report shows that a whole bunch of considerations went into calculating the scores, e.g.:

"1582":"
1.1353534 = (MATCH) sum of:
  1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
    0.08618848 = queryWeight(term:webpage^0.8148148), product of:
      0.8148148 = boost
      13.172914 = idf(docFreq=1, maxDocs=386954)
      0.008029869 = queryNorm
    13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
      1.0 = tf(termFreq(term:webpage)=1)
      13.172914 = idf(docFreq=1, maxDocs=386954)
      1.0 = fieldNorm(field=term, doc=1581)

clearly, for my application, term frequencies, idfs and so on are meaningless, as each document only contains a single term. i tried to use the spelling suggestions component, but didn't manage to make it return the actual similarity scores.

can anybody provide hints on how to configure SOLR to perform levenshtein / jaro-winkler / n-gram searches with scores returned, and without additional factors like tf, idf and boost included? is there a bare-bones configuration sample for SOLR somewhere? i find the number of options truly daunting.

+2  A: 

If you're using a nightly build, then you can sort results based on levenshtein distance using the strdist function:

q=term:webspace~0.1&sort=strdist("webspace", term, edit) desc

More details here and here.
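The query above can also be issued programmatically. A minimal Python sketch of building that request URL (the host, port and `/solr/select` handler are assumptions based on a default Solr install; adjust for your setup):

```python
import urllib.parse

# Sort the fuzzy matches by true edit distance instead of Lucene's tf-idf
# score: the fuzzy query term:webspace~0.1 selects the candidate set, and
# strdist(..., edit) re-ranks it by Levenshtein distance to "webspace".
params = {
    "q": 'term:webspace~0.1',
    "sort": 'strdist("webspace", term, edit) desc',
    "fl": "term,score",   # return the term and its score
    "wt": "json",
}
url = "http://localhost:8983/solr/select?" + urllib.parse.urlencode(params)
print(url)
```

Fetching `url` (e.g. with `urllib.request.urlopen`) then returns the matches in order of decreasing string similarity.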

Karl Johansson
A: 

Solr/Lucene doesn't appear to be a good fit for this application. You are likely better off with the SimMetrics library. It offers a comprehensive set of string-distance metrics, incl. Jaro-Winkler, Levenshtein, etc.

Mikos
this is a very interesting link indeed. i wish there was a standard library as comprehensive as this for python as well. unfortunately, since i have to search over hundreds of thousands of words, a solution without indexing will likely be too slow (but i would have to try first). also, i am not quite sure how to integrate a java library into my python project. maybe via HTTP.
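as a first cut in pure python (a brute-force sketch with no indexing, so it may well be too slow at this scale), the damerau-levenshtein distance in its optimal-string-alignment variant is a short dynamic program:

```python
def damerau_levenshtein(a: str, b: str) -> int:
    """Optimal-string-alignment variant of Damerau-Levenshtein:
    edits are insertion, deletion, substitution, and transposition
    of two adjacent characters."""
    # d[i][j] = distance between a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i  # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution / match
            )
            # transposition of adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]
```

e.g. `damerau_levenshtein("webspace", "webpage")` gives 2. a linear scan with this over hundreds of thousands of words is O(n * m^2), which is exactly the cost that an index (or a trie / BK-tree) would avoid.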
flow