I am currently working on project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contains numerous misspellings, abbreviations, as well as simply invalid data, making it hard to do a cross-reference with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real-time and in this case accuracy is more important than speed.
Current ideas for this are:
- Do a fuzzy search for the user entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
- Cross-reference on the SOUNDEX hash (which is supposedly computed on the sound of the name rather than spelling) instead of using the actual name.
- Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in a direction for some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.