I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings and abbreviations, as well as simply invalid data, making it hard to cross-reference them against the canonical data. Any suggestions on methods to do this?

This does not have to be done in real time, and in this case accuracy is more important than speed.

Current ideas for this are:

  1. Do a fuzzy search for the user-entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
  2. Cross-reference on the SOUNDEX hash (which is supposedly computed from the sound of the name rather than its spelling) instead of on the actual name.
  3. Some combination of the above (a rough sketch of 1 and 2 follows this list).
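
To make 1 and 2 concrete, here is roughly what I have in mind in plain Python: a hand-rolled Levenshtein distance and a simplified Soundex encoding. A real Lucene/Sphinx setup would obviously replace this, and the sample names are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute, cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (ca != cb)))        # substitution
        previous = current
    return previous[-1]

# Simplified American Soundex: letters sharing a code are grouped together.
SOUNDEX_CODES = {c: d for d, letters in
                 [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                  ("4", "L"), ("5", "MN"), ("6", "R")]
                 for c in letters}

def soundex(name: str) -> str:
    """4-character Soundex code, e.g. 'Robert' and 'Rupert' both give 'R163'."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    codes = [SOUNDEX_CODES.get(c, "") for c in name]
    out = [name[0]]
    for prev, code in zip(codes, codes[1:]):
        if code and code != prev:
            out.append(code)
    return ("".join(out) + "000")[:4]

# Usage: rank canonical candidates by edit distance, compare Soundex codes.
canonical = ["Jonathan Smith", "John Smyth", "Joan Smithe"]
query = "Jon Smith"
print(sorted(canonical, key=lambda c: levenshtein(query.lower(), c.lower())))
print(soundex("Smith"), soundex("Smyth"))  # both 'S530'
```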

Anyone have any feedback on any of these or ideas of their own?

One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me toward some machine learning methods for actually searching on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
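
For reference, the crude non-ML baseline I can think of is a hand-maintained abbreviation map applied to the user-entered name before any fuzzy or Soundex comparison; the entries below are purely illustrative.

```python
ABBREVIATIONS = {          # illustrative entries only
    "st":   "saint",       # or "street", depending on the domain
    "wm":   "william",
    "jos":  "joseph",
    "chas": "charles",
    "dr":   "doctor",
}

def expand(name: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations; leave unknown tokens untouched."""
    tokens = name.lower().replace(".", " ").split()
    return " ".join(table.get(t, t) for t in tokens)

print(expand("Dr. Wm. Jones"))  # -> "doctor william jones"
```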

+1  A: 

First, I'd add to your list the techniques discussed in Peter Norvig's post on spelling correction.
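
To sketch the idea (the canonical set below is a placeholder): Norvig generates every string within one edit of the input and ranks the surviving candidates by corpus frequency. A stripped-down version that treats the canonical name list as the dictionary, without the frequency model, might look like this.

```python
import string

CANONICAL = {"jonathan", "smith", "margaret", "jones"}   # placeholder data

def edits1(word: str) -> set[str]:
    """All strings one insert/delete/replace/transpose away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(word: str) -> str:
    """Exact match first, then any canonical word one edit away."""
    if word in CANONICAL:
        return word
    candidates = edits1(word) & CANONICAL
    return min(candidates) if candidates else word  # Norvig would rank by frequency here

print(correct("smoth"))  # -> "smith"
```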

Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
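
As a toy illustration of that point, with a made-up position-based rule (an assumption for illustration, not a recommendation):

```python
def expand_dr(tokens: list[str], domain: str) -> list[str]:
    """Expand 'Dr' differently depending on the domain and token position."""
    out = []
    for i, t in enumerate(tokens):
        if t.lower().rstrip(".") == "dr":
            if domain == "person" and i == 0:
                out.append("Doctor")                  # "Dr Jane Smith"
            elif domain == "street" and i == len(tokens) - 1:
                out.append("Drive")                   # "Maple Hill Dr"
            else:
                out.append(t)
        else:
            out.append(t)
    return out

print(expand_dr(["Dr", "Jane", "Smith"], "person"))
print(expand_dr(["Maple", "Hill", "Dr"], "street"))
```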

Third, I'd look at a combined approach, using testing to establish the coefficients for weighting the results of the various techniques.
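
For example, something along these lines, reusing the soundex() helper from the sketch in the question; the 0.7/0.3 weights are made-up defaults, and the testing is what would actually establish them.

```python
from difflib import SequenceMatcher

def combined_score(query: str, candidate: str,
                   w_edit: float = 0.7, w_sound: float = 0.3) -> float:
    """Higher is better; the weights are what labelled test data would tune."""
    edit_sim = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    sound_sim = 1.0 if soundex(query) == soundex(candidate) else 0.0
    return w_edit * edit_sim + w_sound * sound_sim

candidates = ["John Smyth", "Joan Smithe", "Jonathan Smith"]
print(max(candidates, key=lambda c: combined_score("Jon Smith", c)))
```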

joel.neely
Thanks, I feel like there really is no perfect answer to this. I've decided to go with using Lucene as the main way of cross-referencing and to use different/custom Analyzers to expand abbreviations and to do the fuzzy searching.
Gordon