tags:

views:

1039

answers:

7

I have a database of companies. My application receives data that references a company by name, but the name may not exactly match the value in the database. I need to match the incoming data to the company it refers to.

For instance, my database might contain a company with name "A. B. Widgets & Co Ltd." while my incoming data might reference "AB Widgets Limited", "A.B. Widgets and Co", or "A B Widgets".

Some words in the company name (A B Widgets) are more important for matching than others (Co, Ltd, Inc, etc). It's important to avoid false matches.

The number of companies is small enough that I can maintain a map of their names in memory, ie. I have the option of using Java rather than SQL to find the right name.

How would you do this in Java?

A: 

Your database may suport the use of Regular Expressions (regex) - see below for some tutorials in Java - here's the link to the MySQL documentation (as an example):

http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp

You would probably want to store in the database a fairly complex regular express statement for each company that encompassed the variations in spelling that you might anticipate - or the sub-elements of the company name that you would like to weight as being significant.

You can also use the regex library in Java

JDK 1.4.2
http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

JDK 1.5.0
http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/Matcher.html

Using Regular Expressions in Java
http://www.regular-expressions.info/java.html

The Java Regex API Explained
http://www.sitepoint.com/article/java-regex-api-explained/

You might also want to see if your database supports Soundex capabilities (for example, see the following link to MySQL)
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_soundex

Kelvin Meeks
+1  A: 

You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

Dustin
A: 

Lucene might be useful?

Andrew Swan
+1  A: 

Have a look at Lucene. It's an open source full text search Java library with 'near match' capabilities.

Nerdfest
+2  A: 

You could standardize the formats as much as possible in your DB/map & input (i.e. convert to upper/lowercase), then use the Levenshtein (edit) distance metric from dynamic programming to score the input against all your known names.

You could then have the user confirm the match & if they don't like it, give them the option to enter that value into your list of known names (on second thought--that might be too much power to give a user...)

Drew Hall
I only recently found out about this algorithm about 6 months ago, but it has served me incredibly well since! Also it makes me look smart when I say "oh just use a Levenshtein Distance'. :-)
Aidos
+1  A: 

I'd do LCS ignoring spaces, punctuation, case, and variations on "co", "llc", "ltd", and so forth.

Adam Jaskiewicz
A: 

vote up 1 vote down

You can use an LCS algorithm to score them.

I do this in my photo album to make it easy to email in photos and get them to fall into security categories properly.

* LCS code
* Example usage (guessing a category based on what people entered)

to be more precise, better than Least Common Subsequence, Least Common Substring should be more precise as the order of characters is important.

charpentier damien