views:

495

answers:

7

I need to write an algorithm that returns the closest match for a contact based on the name and address entered by the user. Both of these are troubling, since there are so many ways to enter a company name and address, for instance:

Company A, 123 Any Street Suite 200, Anytown, AK 99012
Comp. A, 123 Any St., Suite 200, Anytown, AK 99012
CA, 123 Any Street Ste 200, Anytown, AK 99012

I have looked at doing a Levenshtein distance on the Name, but that doesn't seem a great tool, since they could abbreviate the name. I am looking for something that matches on the most information possible.

My initial attempt was to limit the results first by the first 5 digits of the postal code and then try to filter down to one based on other information, but there must be a more standard approach to getting this done. I am working in .NET but will look at any code you can provide to get an idea on how to accomplish this.

+1  A: 

I don't exactly now how this is accomplished, but all major delivery companies (FedEx, USPS, UPS) seem to have a way of matching an address you input against their database and transforming it to a normalized form. As I've seen this happen on multiple websites (Amazon comes to mind), I am assuming that there is an API to this functionality, but I don't know where to look for it and whether it is suitable for your purposes.

Just a thought though.

EDIT: I found the USPS API

A: 

I think filtering based on zip code first would be the easiest, as finding it is fairly unambiguous. From there you can probably extract the city and street. I'm not sure how you would go about finding the name, but it seems matching it against the address if you already have a database of (name, address) pairs is feasible.

thekidder
A: 

Dun & Bradstreet do this. They charge money because it's really hard. There's no "standard" solution. It's mostly a painful choice between a service like D&B or roll your own.

S.Lott
Actually, this sounds like a *fun* problem ... so I'd go with the later :-)
Ryan Delucchi
A: 

This seems like a good answer.

A: 

As a start, I'd probably do a word-indexed search. That would mean two stages:

Offline stage: Generate an index of all the addresses by their keywords. For example, "Company", "A" and "123" would all become an keywords for the address you provided above. You could do some stemming, which would mean for words like "street" you'd also add a word "st" into its index.

Online stage: The user gives you a search query. Break down the search query into all its keywords, and find all possible matches of each keyword in the database. Tally the number of matched keywords on each address. Then sort the results by the number of matched keywords. This should be able to be done quite quickly if there aren't too many matches, as its just a few sorted list merges and increments, followed finally by a sort.

Given that you know the domain of your problem, you could specialise the algorithm to use knowledge about the domain - for example the zip code filtering mentioned before.

Also just to enable me to provide you with a better answer, are you using an SQL database at all? I ask because the way I would do it is I'd store the keyword index in the SQL database, and then the SQL query to search by keyword becomes quite easy, since the database does all the work.

Ray Hidayat
A: 

Maybe instead of using Levenshtein for the name only, it could be useful when used with the entire string representation of a contact. For instance, the distance of your first example to the second is 7 and to the third 9. Considering the strings have lengths 54, 50 and 45, this seems to be a relatively useful and quite simple similarity measure.

Fabian Steeg
A: 

This is what I would do. I am not aware of algorithms, so I just use what makes sense.

I am assuming that the person would provide name, street address, city name, state name, and zipcode.

If the zipcode is provided in 9 numbers, or has a hyphen, I would strip it down to 5 numbers. I would search the database for all of the addresses that has that zipcode.[query 1] Then I would compare the state letter with the one from the database. If it's not a match, then I would tell that to the user. Same goes for the city name.

From what I understand, a street name is not in numbers, only the house on a street had numbers in it. Further more, the house number is usually at the beginning unless it is house or suite number.

So I would do regex to search for the numbers and the next space or comma next to it. Then find position of the first word that does not has a period(.) or ends in comma. I have part of the street name, so I could do a comparison against the rows fetched earlier, or I would change the query to have the street name LIKE %streetName%.

I am guessing the database has a beginning number and ending number of the house on a block. I would check against that street row to see if the provided street number is on that street. By now you would know the correct data to show, and could look up in a different table as to which name is associated with that house number. I am not sure why you want to compare it. Only use for name comparing would be if you want to find people whose address was not provided. You can look here for comparing string ways http://stackoverflow.com/questions/451884/similar-string-algorithm