I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings and abbreviations, as well as simply invalid data, making it hard to cross-reference them against the canonical data. Any suggestions on methods to do this?

This does not have to be done in real time, and in this case accuracy is more important than speed.

Current ideas for this are:

  1. Do a fuzzy search for the user-entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
  2. Cross-reference on the SOUNDEX hash (which is supposedly computed from the sound of the name rather than its spelling) instead of on the actual name.
  3. Some combination of the above (a rough sketch of 1 and 2 follows this list).
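
To make 1 and 2 concrete, here is roughly what I have in mind in plain Python: a hand-rolled Levenshtein distance and a simplified Soundex encoding. A real Lucene/Sphinx setup would obviously replace this, and the sample names are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute, cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(previous[j] + 1,                      # deletion
                               current[j - 1] + 1,                   # insertion
                               previous[j - 1] + (ca != cb)))        # substitution
        previous = current
    return previous[-1]

# Simplified American Soundex: letters sharing a code are grouped together.
SOUNDEX_CODES = {c: d for d, letters in
                 [("1", "BFPV"), ("2", "CGJKQSXZ"), ("3", "DT"),
                  ("4", "L"), ("5", "MN"), ("6", "R")]
                 for c in letters}

def soundex(name: str) -> str:
    """4-character Soundex code, e.g. 'Robert' and 'Rupert' both give 'R163'."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    codes = [SOUNDEX_CODES.get(c, "") for c in name]
    out = [name[0]]
    for prev, code in zip(codes, codes[1:]):
        if code and code != prev:
            out.append(code)
    return ("".join(out) + "000")[:4]

# Usage: rank canonical candidates by edit distance, compare Soundex codes.
canonical = ["Jonathan Smith", "John Smyth", "Joan Smithe"]
query = "Jon Smith"
print(sorted(canonical, key=lambda c: levenshtein(query.lower(), c.lower())))
print(soundex("Smith"), soundex("Smyth"))  # both 'S530'
```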

Anyone have any feedback on any of these or ideas of their own?

One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me toward some machine learning methods for actually searching on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
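
For reference, the crude non-ML baseline I can think of is a hand-maintained abbreviation map applied to the user-entered name before any fuzzy or Soundex comparison; the entries below are purely illustrative.

```python
ABBREVIATIONS = {          # illustrative entries only
    "st":   "saint",       # or "street", depending on the domain
    "wm":   "william",
    "jos":  "joseph",
    "chas": "charles",
    "dr":   "doctor",
}

def expand(name: str, table: dict[str, str] = ABBREVIATIONS) -> str:
    """Replace whole-word abbreviations; leave unknown tokens untouched."""
    tokens = name.lower().replace(".", " ").split()
    return " ".join(table.get(t, t) for t in tokens)

print(expand("Dr. Wm. Jones"))  # -> "doctor william jones"
```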

+1  A: 

First, I'd add to your list the techniques discussed in Peter Norvig's post on spelling correction.
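
To sketch the idea (the canonical set below is a placeholder): Norvig generates every string within one edit of the input and ranks the surviving candidates by corpus frequency. A stripped-down version that treats the canonical name list as the dictionary, without the frequency model, might look like this.

```python
import string

CANONICAL = {"jonathan", "smith", "margaret", "jones"}   # placeholder data

def edits1(word: str) -> set[str]:
    """All strings one insert/delete/replace/transpose away from `word`."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in letters}
    inserts = {l + c + r for l, r in splits for c in letters}
    return deletes | transposes | replaces | inserts

def correct(word: str) -> str:
    """Exact match first, then any canonical word one edit away."""
    if word in CANONICAL:
        return word
    candidates = edits1(word) & CANONICAL
    return min(candidates) if candidates else word  # Norvig would rank by frequency here

print(correct("smoth"))  # -> "smith"
```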

Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
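
As a toy illustration of that point, with a made-up position-based rule (an assumption for illustration, not a recommendation):

```python
def expand_dr(tokens: list[str], domain: str) -> list[str]:
    """Expand 'Dr' differently depending on the domain and token position."""
    out = []
    for i, t in enumerate(tokens):
        if t.lower().rstrip(".") == "dr":
            if domain == "person" and i == 0:
                out.append("Doctor")                  # "Dr Jane Smith"
            elif domain == "street" and i == len(tokens) - 1:
                out.append("Drive")                   # "Maple Hill Dr"
            else:
                out.append(t)
        else:
            out.append(t)
    return out

print(expand_dr(["Dr", "Jane", "Smith"], "person"))
print(expand_dr(["Maple", "Hill", "Dr"], "street"))
```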

Third, I'd look at a combined approach, using testing to establish the coefficients for weighting the results of the various techniques.
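
For example, something along these lines, reusing the soundex() helper from the sketch in the question; the 0.7/0.3 weights are made-up defaults, and the testing is what would actually establish them.

```python
from difflib import SequenceMatcher

def combined_score(query: str, candidate: str,
                   w_edit: float = 0.7, w_sound: float = 0.3) -> float:
    """Higher is better; the weights are what labelled test data would tune."""
    edit_sim = SequenceMatcher(None, query.lower(), candidate.lower()).ratio()
    sound_sim = 1.0 if soundex(query) == soundex(candidate) else 0.0
    return w_edit * edit_sim + w_sound * sound_sim

candidates = ["John Smyth", "Joan Smithe", "Jonathan Smith"]
print(max(candidates, key=lambda c: combined_score("Jon Smith", c)))
```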

joel.neely
Thanks, I feel like there really is no perfect answer to this. I've decided to go with using Lucene as the main way of cross-referencing and to use different/custom Analyzers to expand abbreviations and to do the fuzzy searching.
Gordon