ansaurus

Question

Answer 1

A:

You can get the match values with:

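import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
// collector is assumed to be a TopDocsCollector (for example a TopScoreDocCollector)
// that has already been passed to searcher.search(query, collector).
TopDocs topDocs = collector.topDocs();
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}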

fgb 2010-07-29 00:11:56

Of what type should the collector be? I get no output when I run this.

2010-07-29 14:24:48

Answer 2

+1 A:

You do not need Lucene to get the score. Take a look at the SimMetrics library; it is exceedingly simple to use. Just add the jar and use it like this:

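import uk.ac.shef.wit.simmetrics.similaritymetrics.Levenshtein;

Levenshtein ld = new Levenshtein();
float sim = ld.getSimilarity(string1, string2); // 0.0 (no similarity) to 1.0 (identical)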

Also note that, depending on the type of data (e.g. longer strings, amount of whitespace), you might want to look at other algorithms such as Jaro-Winkler or Smith-Waterman.

You could use the above to decide whether to collapse fuzzy duplicate strings into one "master" string, and then index that.

Mikos 2010-07-29 13:18:07

I'm looking at the SimMetrics library and it does look very promising. I wanted to use Lucene because of its indexing abilities, since I am searching a database of 60K or more company names. Is SimMetrics compatible with Lucene on any level?

2010-07-29 14:20:20

Hmm, not sure why you need SimMetrics to be compatible, whatever that means. Write an app to loop through the DB rows and cluster the names by similarity using SimMetrics; you can play around with various thresholds to determine the best fit. So you create a lookup table such as "Widget Makers XYZ" -> "Widget Maker XYZ", "Widgt Maker XYZ", "Widget Makers XY", and so on, where "Widget Makers XYZ" becomes the master string, which is what you write to the index.

Mikos 2010-07-29 14:46:12
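
The clustering idea in the comment above could be sketched roughly as follows. This is only an illustration, not the commenter's code: the class `NameClusterer`, its methods, and the greedy "first master that is close enough" strategy are all assumptions, and a hand-rolled normalized Levenshtein score stands in for a SimMetrics metric so that the example runs without any extra jar.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only: groups near-duplicate names under a "master" string. */
class NameClusterer {

    /** Classic dynamic-programming Levenshtein edit distance, using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows for the next iteration
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]: 1 minus the distance divided by the longer length. */
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f; // two empty strings are identical
        }
        return 1.0f - (float) editDistance(a, b) / longest;
    }

    /**
     * Greedy single pass: each name joins the first master it resembles
     * closely enough (case-insensitively), otherwise it becomes a new master.
     * Returns master -> list of variants, in insertion order.
     */
    static Map<String, List<String>> cluster(List<String> names, float threshold) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String name : names) {
            String master = null;
            for (String candidate : clusters.keySet()) {
                if (similarity(name.toLowerCase(), candidate.toLowerCase()) >= threshold) {
                    master = candidate;
                    break;
                }
            }
            if (master == null) {
                clusters.put(name, new ArrayList<>());
            } else {
                clusters.get(master).add(name);
            }
        }
        return clusters;
    }
}
```

Calling `NameClusterer.cluster(names, 0.8f)` on the widget names from the comment would group the three misspelled variants under "Widget Makers XYZ". The greedy pass compares every name against every existing master, so it is quadratic in the worst case; for 60K names that is one reason to narrow the candidates first, and the result also depends on input order and on the threshold chosen.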

Sorry for being unclear. I meant: can SimMetrics read from the index that Lucene creates? I'd rather not create any unneeded or temporary tables unless I have to, and I want a fast match time. My layout for the program was:
1) Index all companies by name with Lucene, and store the index in RAM.
2) Each company name that is to be inserted has to meet a certain algorithmic requirement that is TBD, but is going to rely on Levenshtein's algorithm and then (if needed) the double metaphone algorithm, and possibly some from the SimMetrics library now.

2010-07-29 15:13:49

Not sure if SimMetrics can read from Lucene; probably not. You can use the following 2-step approach:
1. Create an index, and for each company that needs to be inserted, query the index. This should give you a workable subset (say, 10 results) to run the string distance comparison on.
2. Compare the new company name to the 10 results and see whether it passes the threshold or is a dupe.
BTW, Levenshtein is included in SimMetrics, so you need not implement it yourself.

Mikos 2010-07-29 16:07:04
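
The 2-step approach above might look something like this in outline. Everything named here is hypothetical: `DuplicateChecker`, `candidateSearch` (a stand-in for a Lucene query that returns the top hits as strings), and `similarity` (a stand-in for a SimMetrics metric returning a score in [0, 1]) are assumptions made so the sketch stays independent of either library.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch of the two-step duplicate check; all names are hypothetical. */
class DuplicateChecker {
    // Step 1: narrows the full set of names to a small candidate subset
    // (in the thread's setup this would be a Lucene query, e.g. the top 10 hits).
    private final Function<String, List<String>> candidateSearch;
    // Step 2: string similarity in [0, 1] (in the thread's setup, a SimMetrics metric).
    private final ToDoubleBiFunction<String, String> similarity;
    private final double threshold;

    DuplicateChecker(Function<String, List<String>> candidateSearch,
                     ToDoubleBiFunction<String, String> similarity,
                     double threshold) {
        this.candidateSearch = candidateSearch;
        this.similarity = similarity;
        this.threshold = threshold;
    }

    /** Returns the best-matching existing name if its score reaches the threshold, else empty. */
    Optional<String> findDuplicate(String newName) {
        String best = null;
        double bestScore = -1.0;
        for (String candidate : candidateSearch.apply(newName)) {
            double score = similarity.applyAsDouble(newName, candidate);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return (best != null && bestScore >= threshold) ? Optional.of(best) : Optional.empty();
    }
}
```

A new company name would be inserted only when `findDuplicate` returns an empty result. Passing the search and the metric in as functions keeps the threshold logic testable on its own, and lets the metric be swapped (Levenshtein, Jaro-Winkler, and so on) without touching the index code.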

Fuzzy Queries in Lucene