I have a list of strings (company names, in this case), and a Java program that extracts a list of things that look like company names out of mostly-unstructured text. I need to match each element of extracted text to a string in the list. Caveat: the unstructured text has typos, things like "Blah, Inc." referred to as "Blah," etc. I've tried Levenshtein Edit Distance, but that fails for predictable reasons. Are there known best-practices ways of tackling this problem? Or am I back to manual data-entry?
This is not a simple problem, and there are entire companies built around trying to solve it (even for a restricted domain like company names, let alone the general case).
If you can identify a manageable number of patterns that valid company names fall into, and that noise does not, then you could tackle this with a series of regular-expression matches.
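A minimal sketch of that idea, in Java since your extractor is already Java. The specific patterns here are hypothetical placeholders; you would derive the real ones from your own data.

```java
import java.util.List;
import java.util.regex.Pattern;

public class CompanyNamePatterns {

    // Hypothetical patterns for common corporate-name shapes.
    private static final List<Pattern> PATTERNS = List.of(
            // "Acme, Inc." / "Acme Inc" / "Acme Incorporated"
            Pattern.compile("^[A-Z][\\w&' .-]+,?\\s+(Inc\\.?|Incorporated)$"),
            // "Acme LLC" / "Acme L.L.C."
            Pattern.compile("^[A-Z][\\w&' .-]+\\s+(LLC|L\\.L\\.C\\.)$"),
            // "Acme Corp." / "Acme Corporation" / "Acme Co."
            Pattern.compile("^[A-Z][\\w&' .-]+\\s+(Corp\\.?|Corporation|Co\\.?)$")
    );

    /** Returns true if the candidate matches any of the known name patterns. */
    public static boolean looksLikeCompanyName(String candidate) {
        return PATTERNS.stream().anyMatch(p -> p.matcher(candidate.trim()).matches());
    }

    public static void main(String[] args) {
        System.out.println(looksLikeCompanyName("Blah, Inc."));     // true
        System.out.println(looksLikeCompanyName("blah blah blah")); // false
    }
}
```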
If the patterns are difficult or too numerous, then you could try developing a probabilistic model, perhaps something like a Bayesian network. You would take a subset of your data for training, and perhaps a second subset for a quick validation, and grow the network. Techniques might include genetic programming or setting up a neural network. This approach is obviously not lightweight, and you'd probably want to consider your need carefully before going down this road.
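To make the train/validate idea concrete without committing to a full Bayesian network or neural net, here is a toy naive-Bayes-style token scorer in Java. It is far simpler than what the paragraph above describes, and the training data, smoothing, and class names are all made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenScorer {

    private final Map<String, Integer> positiveCounts = new HashMap<>();
    private final Map<String, Integer> negativeCounts = new HashMap<>();
    private int positiveTotal = 0;
    private int negativeTotal = 0;

    private static String[] tokens(String s) {
        return s.toLowerCase().split("\\W+");
    }

    /** Train on a labeled subset: known company names vs. known noise strings. */
    public void train(List<String> companyNames, List<String> noise) {
        for (String name : companyNames) {
            for (String t : tokens(name)) {
                positiveCounts.merge(t, 1, Integer::sum);
                positiveTotal++;
            }
        }
        for (String junk : noise) {
            for (String t : tokens(junk)) {
                negativeCounts.merge(t, 1, Integer::sum);
                negativeTotal++;
            }
        }
    }

    /** Log-likelihood ratio: positive values mean "looks like a company name". */
    public double score(String candidate) {
        double score = 0.0;
        for (String t : tokens(candidate)) {
            // Laplace smoothing so unseen tokens do not zero out the score.
            double pPos = (positiveCounts.getOrDefault(t, 0) + 1.0) / (positiveTotal + 1.0);
            double pNeg = (negativeCounts.getOrDefault(t, 0) + 1.0) / (negativeTotal + 1.0);
            score += Math.log(pPos / pNeg);
        }
        return score;
    }

    public static void main(String[] args) {
        TokenScorer scorer = new TokenScorer();
        scorer.train(
                List.of("Blah Inc", "Acme Corporation", "Foo Bar LLC"),
                List.of("meeting notes", "quarterly results attached"));
        System.out.println(scorer.score("Blah, Inc."));         // positive: likely a name
        System.out.println(scorer.score("see attached notes")); // negative: likely noise
    }
}
```

A held-out second subset, as suggested above, would then be used to check how well the scores separate names from noise before trusting the model.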
In the work we do at my company, we deal with this type of problem all the time. The most successful efforts I have seen use just a few pages of Python code: Python is excellent for string dissection and analysis, and you can call a Python routine from your Java program. Like Greg said, the right answer is highly dependent on the quality of your unstructured text. A good way to start is to quantitatively characterize how well it aligns with your golden list (e.g., you may find you can match 80% of it just by adding a few common alternate match strings such as "Blah" and "BLAH INC" alongside "Blah Inc.").
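A minimal sketch of that "alternate match strings" measurement, shown in Java to match the asker's existing program rather than the Python this answer recommends. It normalizes both sides, indexes each golden name under a couple of variants (including one with a common corporate suffix stripped), and reports what fraction of the extracted strings match exactly. All names and data are made up.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AlternateStringMatcher {

    private static final String SUFFIX = " (INC|LLC|CORP|CORPORATION|CO|LTD)$";

    // Map from normalized variant -> canonical golden name.
    private final Map<String, String> index = new HashMap<>();

    private static String normalize(String s) {
        return s.toUpperCase()
                .replaceAll("[^A-Z0-9 ]", " ")  // drop punctuation
                .replaceAll("\\s+", " ")
                .trim();
    }

    public AlternateStringMatcher(List<String> goldenNames) {
        for (String name : goldenNames) {
            String norm = normalize(name);
            index.put(norm, name);
            // Also index the name with a common corporate suffix stripped.
            index.put(norm.replaceAll(SUFFIX, ""), name);
        }
    }

    /** Returns the canonical golden name, or null if nothing matches. */
    public String match(String extracted) {
        String norm = normalize(extracted);
        String hit = index.get(norm);
        if (hit != null) return hit;
        return index.get(norm.replaceAll(SUFFIX, ""));
    }

    public static void main(String[] args) {
        AlternateStringMatcher matcher =
                new AlternateStringMatcher(List.of("Blah Inc.", "Acme Corporation"));
        List<String> extracted = List.of("Blah", "BLAH INC", "Acme Corp", "random text");
        long hits = extracted.stream().filter(s -> matcher.match(s) != null).count();
        System.out.printf("Matched %d of %d (%.0f%%)%n",
                hits, extracted.size(), 100.0 * hits / extracted.size());
    }
}
```

Running this kind of tally against a sample of your real extractions tells you how much of the problem plain normalization already solves, and therefore how much effort the fuzzier techniques above actually need to justify.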