views:

95

answers:

2

I am using Lucene in JAVA and indexing a table in our database based on company name. After the index I wish to do a fuzzy match (Levenshtein distance) on a value we wish to input into the database. The reason is that we do not want to be entering dupes because of spelling errors.

For example if I have the company name "Widget Makers XYZ" I don't want to insert "Widget Maker XYZ".

From what I've read Lucene's fuzzy match algorithm should give me a number between 0 and 1, I want to do some testing and then determine and adequate value for us determine what is valid or invalid.

The problem is I am stuck, and after searching what seems like everywhere on the internet, need the StackOverflow community's help.

Like I said I have indexed the database on company name, and then have the following code:

IndexSearcher searcher = new IndexSearcher(directory);  

new QueryParser(Version.LUCENE_30, "company", analyzer);

Query fuzzy_query = new FuzzyQuery(new Term("company", "Center"));

I encounter the problem afterwards, basically I do not know how to get the fuzzy match value. I know the code must look something like the following, however no collectors seem to fit my needs. (As you can see right now I am only able to count the number of matches, which is useless to me)

TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);

searcher.search(fuzzy_query, collector);

System.out.println("\ncollector.getTotalHits() = " + collector.getTotalHits());

Also I am unable to use the ComplexPhraseQueryParser class which is shown in the Lucene documentation. I am doing:

import org.apache.lucene.queryParser.*;

Does anybody have an idea as to why its inaccessible or what I am doing wrong? Apologies for the length of the question.

A: 

You can get the match values with:

TopDocs topDocs = collector.topDocs();
for(ScoreDoc scoreDoc : topDocs.scoreDocs) {
    System.out.println(scoreDoc.score);
}
fgb
Of what type should the collector be? I get no output when I run this.
+1  A: 

You do not need Lucene to get the score. Take a look at Simmetrics library, it is exceedingly simple to use. Just add the jar and use it thus:

Levenstein ld = new Levenstein ();
float sim = ld.GetSimilarity(string1, string2);

Also do note, depending on the type of data (i.e. longer strings, # whitespaces etc.), you might want to look at other algorithms such as Jaro-Winkler, Smith-Waterman etc.

You could use the above to determine to collapse fuzzy duplicate strings into one "master" string and then index.

Mikos
I'm looking at the Simmetrics library and it does look very promising. I wanted to use Lucene because of its indexing abilities since I am searching a database of 60K or more company names. Is Simmetrics compatible with Lucene on any level?
Hmm not sure why you need SimMetrics to be compatible - whatever that means. Write an app to loop through the db rows and cluster the names by similarity using Simmetrics - you can play around with various thresholds to determine best fit. So you create a lookup table"Widget Makers XYZ", -< "Widget Maker XYZ", "Widgt Maker XYZ", Widget Makers XY".... and so on...where Widget Makers XYZ becomes the master string, which is what you write to index.
Mikos
Sorry for being unclear, I meant can SimMetrics read from the index that Lucene creates?I'd rather not create any unneeded or temporary tables, unless I have to. And want a fast match time.My layout for the program was:1) Index all companies by name with Lucene, store the index in RAM.2) Each company name that wants to be inserted has to meet a certain algorithmic requirement that is TBD, but is going to rely on Leinshtein's algorithm and then (if needed) the double metaphone algorithm. And possibly some from the SimMetrics library now.
Not sure if Simmetrics can read from Lucene, prolly not. You can have the following 2-step approach:1. create an index and for each company that needs to be inserted query the index (this should give you a workable subset say 10 results to run the string dist comparison)2. Compare the new co. name to the 10 results and see if new passes the threshold or is a dupe.BTW Levenstein is included in SimMetrics so you need not implement it yourself.
Mikos