views:

62

answers:

2

I have a list of strings (company names, in this case), and a Java program that extracts a list of things that look like company names out of mostly-unstructured text. I need to match each element of extracted text to a string in the list. Caveat: the unstructured text has typos, things like "Blah, Inc." referred to as "Blah," etc. I've tried Levenshtein Edit Distance, but that fails for predictable reasons. Are there known best-practices ways of tackling this problem? Or am I back to manual data-entry?

+3  A: 

This is not a simple problem, and there are entire companies built around trying to solve it (even for reduced matching sets like company names versus the general case).

If you can identify a discrete number of patterns that valid company names fall into, and that noise does not fall into, then you could tackle this with a series of regular expression matches.

If the patterns are difficult or too numerous, then you could try developing a probabilistic model, perhaps something like a Bayesian network. You would take a subset of your data for training, and perhaps a second subset for a quick validation, and grow the network. Techniques might include genetic programming or setting up a neural network. This approach is obviously not lightweight, and you'd probably want to consider your need carefully before going down this road.

Greg Harman
+1  A: 

In the work we do at my company we deal with this type of problem all the time. The most successful efforts I have seen use just a few pages of Python code. Python is excellent for string dissection and analysis, and you can call a Python routine from your Java program. Like Greg said, the right answer is highly dependent on the quality of your unstructured text. A good way to start is to quantitatively characterize how it aligns with your golden text. ( E.g. You may find you can match 80% of it by just adding some common alternate match strings like "Blah" and "BLAH INC" instead of just "Blah Inc.")

Pete