views:

394

answers:

15

I have a Persons table in SQL Server 2008.

My goal is to find Persons who have almost similar addresses.

The address is described with columns state, town, street, house, apartment, postcode and phone.

Due to some specific differences in some states (not US) and the human factor (mistakes in addresses etc.), the address fields are not filled in a consistent pattern.

Most common mistakes in addresses

  1. Case sensitivity
  2. Someone wrote "apt.", another one "apartment" or "ap." (although addresses aren't written in English)
  3. Spaces, dots, commas
  4. Differences in writing street names, like "Dr. Jones str." or "Doctor Jones street" or "D. Jon. st." or "Dr Jones st" etc.

The main problem is that data isn't in the same pattern, so it's really difficult to find similar addresses.

Is there any algorithm for this kind of issue?

Thanks in advance.

UPDATE

  1. As I mentioned, the address is separated into different columns. Should I generate a string by concatenating the columns, or do your steps for each column? I assume I shouldn't concatenate the columns, but if I compare the columns separately, how should I organize it? Should I find similarities for each column and then union them, intersect them, or something else?
  2. Should I collect some statistics, or use some kind of learning algorithm?
A: 

The possibilities of such variations are countless, and even if such an algorithm exists, it can never be fool-proof. You can't have a spell checker for nouns, after all. What you can do is provide a drop-down list of previously entered field values, so that users can select one if a particular name already exists. It's better to have separate fields for each value, like apartment and so on.

moonpie187
How does Google's "Did you mean" work? Isn't that an algorithm for finding similarities?
hgulyan
+5  A: 

I would try to do the following:

  • split up the address into multiple words, getting rid of punctuation at the same time
  • check all the words for patterns that are typically written differently and replace them with a common form (e.g. replace apartment, ap., ... by apt; replace Doctor by Dr., ...)
  • put all the words back into one string, alphabetically sorted
  • compare all the addresses using a fuzzy string comparison algorithm, e.g. Levenshtein
  • tweak the parameters of the Levenshtein algorithm (e.g. you may want to allow more differences on longer strings)
  • finally do a manual check of the strings
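
The steps above can be sketched as follows. This is a minimal sketch: the abbreviation map is a made-up example, and the real one would hold the Armenian variants found in the data.

```python
import re

# Hypothetical abbreviation map - the real table would hold the
# variants ("str.", "st.", "apt.", ...) actually found in the data.
CANONICAL = {"apartment": "apt", "ap": "apt", "doctor": "dr", "street": "st", "str": "st"}

def normalize(address):
    # Lowercase, drop punctuation, split into words.
    words = re.findall(r"\w+", address.lower())
    # Map known variants to a common form.
    words = [CANONICAL.get(w, w) for w in words]
    # Alphabetical order removes word-order differences.
    return " ".join(sorted(words))

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

a = normalize("Dr. Jones str., apt 5")
b = normalize("Doctor Jones Street, apartment 5")
print(a, "|", b, "|", levenshtein(a, b))  # identical after normalisation, so distance 0
```

The key point is that the fuzzy comparison runs on the normalised strings, not the raw input.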

Of course, the solution to keep your data 'in shape' is to have explicit fields for each of your characteristics in your database. Otherwise, you will end up doing this exercise every few months.

Patrick
The main problem is that I can't just normalize common names; there are too many variants. One person wrote "Jones", another "Jon.", "Jo." or "Jone.", and if someone wrote "D. Jones", I can't just remove the dot and space - I'd get "DJones" in that case. What if there's a street actually called "DJones"?
hgulyan
@Patrick Yeah, that won't work well. :) Keep in mind: FJones and DJones are very similar at the character level, but they are definitely different persons (or not... a typo?). DoktorJones and DJones may be the same person - but Levenshtein will tell you FJones is more likely the one you want. Replacing the common abbreviations is absolutely necessary to find a working solution.
InsertNickHere
Take a look at the Levenshtein algorithm mentioned -- it's not just a simple string comparison!
James Anderson
You already have street and town separated out nicely; don't scrunch them back together.
James Anderson
@hgulyan, @InsertNickHere, there is no miracle way to allow some differences (like Jones vs. Jon.) but reject others like DJones vs. FJones. Suppose that a certain William Jones shortens his name to Bill Jones; then suddenly WJones and BJones are the same. Don't expect miracles from this algorithm. It's just to give you a little help, not to automate everything.
Patrick
@James, of course. Things separated should be kept separated. There's no need to merge them again. In the end, everything should be separated in the database: First Name, Middle Names, Last Name, Street, House Number, Apartment Number, ...
Patrick
@hgulyan, your example will be taken care of by the distance-measuring algorithms. You cannot know whether D. Jones is actually Davy Jones street or Dwight Jones street; the algorithm will only give the score. Check Mikos's answer.
Unreason
A: 

You could throw all addresses at a web service like Google Maps (I don't know whether this one is suitable, though) and see whether they come up with identical GPS coordinates.

Thomas
They're written in Armenian, so it won't work. Anyway, I don't think that connecting to Google Maps from a Windows application on every search is a good choice; most users have access only to the local network.
hgulyan
A: 

A possibility is to have a dictionary table in the database that maps all the variants to the 'proper' version of the word:

*Value* | *Meaning*
Apt.    | Apartment
Ap.     | Apartment
St.     | Street

Then you run each word through the dictionary before you compare.
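
A minimal sketch of the idea, using an in-memory dict in place of the database table (the mappings shown are just examples):

```python
# The dictionary table as an in-memory dict; in practice the variants
# would be loaded from the database table described above.
DICTIONARY = {"apt.": "apartment", "ap.": "apartment", "st.": "street"}

def canonicalize(text):
    # Run each word through the dictionary, replacing known variants
    # with their 'proper' version before any comparison is made.
    return " ".join(DICTIONARY.get(w.lower(), w.lower()) for w in text.split())

print(canonicalize("Ap. 5 Jones St."))  # → "apartment 5 jones street"
```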

Edit: this alone is too naive to be practical (see comment).

Mau
What if someone abbreviated a street name the same way? I gave the example of "Dr. Jones". That's a good solution for "apartment", "street" and that kind of word, but not for arbitrary names.
hgulyan
You're right, a bit too naive.
Mau
Not naive -- just the first step in a very complex process!
James Anderson
+1  A: 

I think the amount of data could affect what approach works best for you.
I had a similar problem when indexing music from compilation albums with various artists. Sometimes the artist came first, sometimes the song name, with various separator styles.

What I did was to count the number of occurrences in other entries with the same value, to make an educated guess whether it was the song name or an artist.

Perhaps you can use soundex or a similar algorithm to find things that are similar.

EDIT: (maybe I should clarify that I assumed that artist names were more likely to be more frequently reoccurring than song names.)
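
The frequency-based guess can be sketched like this (the sample entries and the " / " separator are made-up examples; real compilation data would use various separators):

```python
from collections import Counter

# Hypothetical entries where the order of artist and title varies.
entries = [
    "The Chemical Brothers / Galvanize",
    "Galvanize / The Chemical Brothers",
    "The Chemical Brothers / Hey Boy Hey Girl",
]

# Count how often each value appears anywhere in the collection.
counts = Counter()
for e in entries:
    for part in e.split(" / "):
        counts[part] += 1

def guess_artist(entry):
    left, right = entry.split(" / ")
    # Educated guess: the value seen more often across entries is the artist.
    return left if counts[left] >= counts[right] else right

print(guess_artist("Galvanize / The Chemical Brothers"))  # → "The Chemical Brothers"
```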

MattBianco
How do you do that "educated guessing"? The soundex algorithm is very interesting, but I don't need to find similarly pronounced words; that's not my goal.
hgulyan
I thought of soundex as a potential hash algorithm for reducing the amount of data to compare. Chances are that DJones and Dr.Jones hash to the same (or similar) value. Your question title is about *almost similar values*. Educated guess: If I have 76 *The Chemical Brothers* in my collection, and only 2 *Galvanize*, I simply assume the most frequent occurrence was the artist name.
MattBianco
+2  A: 

The main problem I see here is exactly defining equality. Even if someone writes "Jon." and another writes "Jone.", you will never be able to say whether they are the same. (Jon - Jonathan, Joneson, Jonedoe, whatever ;)

I work in a firm where we have to handle exactly this problem - I'm afraid I have to tell you that this kind of checking of address lists for navigation systems is done "by hand" most of the time. Abbreviations are sometimes context-dependent, and there are other things that make this difficult. Of course, replacing strings etc. is done with Python - but telling you the MEANING of such an abbreviation can only be done by a script in a few cases. ("St." can be "Saint" or "Street". How to decide? Impossible... this is human work.)

Another big problem is, as you said: is "DJones" a street or a person? Or both? Which one is meant here? Is this DJones the same as Dr Jones, or the same as Don Jones? It's impossible to decide!

You can do some work with lists, as presented in another answer here - but it will still give you plenty of "false positives".

InsertNickHere
You're right, we can't know which street "Jon." is, but by comparing all the other data, like apartment, house and phone number, we can find the most similar rows. There can't be an absolutely correct similarity match; that's why I wrote "almost similar". :) Good point that there are cases where you can't just replace one string with another, but you can try each case.
hgulyan
I'm not searching for a person's name, I'm only searching in a person's address. It doesn't matter if there's a person with a street name. I meant that if you remove dots, commas, spaces etc., you'll get "DJones" from "D. Jones", and it can be similar to a street called "DJones" if one exists.
hgulyan
@hgulyan Hm. I'm trying to follow you. :) Are your addresses all American? German addresses always have a "PLZ", a street name and a number. Maybe it would be good for a start to find such similarities in US addresses? For a beginning, you could just take "house number" and "country id" (if there is such a thing). But this requires that all addresses contain these keys.
InsertNickHere
My addresses are Armenian, written with Armenian letters. My columns are separate; I don't have one address field. It's house, apartment, street etc., but people put "str." or "st." etc. before the street name, and if a street name is very long, they write just the first 5-6 letters.
hgulyan
+6  A: 

Suggest approaching it thus:

  1. Create word-level n-grams (a trigram/4-gram might do it) from the various entries

  2. Do a many-to-many string comparison and cluster the entries by string distance. Someone suggested Levenshtein; there are better ones for this kind of task - Jaro-Winkler distance and Smith-Waterman work better. A library such as SimMetrics would make life a lot easier

  3. Once you have clusters of n-grams, you can resolve the whole string using the constituent subgrams i.e. D.Jones St => Davy Jones St. => DJones St.

Should not be too hard, this is an all-too-common problem.

Update: Based on your update above, here are the suggested steps

  1. Concatenate your columns into a single string, perhaps by creating a db "view". For example,

    create view vwAddress as
    select top 10000
        state, town, street, house, apartment, postcode,
        state + town + street + house + apartment + postcode as Address
    from ...

  2. Write a separate application (say in Java or C#/VB.NET) and use an algorithm like Jaro-Winkler to estimate the string distance for the combined address, creating a many-to-many comparison. Write the results into a separate table: address1 | address n | similarity

You can use SimMetrics to get the similarity like this:

    JaroWinkler objJw = new JaroWinkler();
    double sim = objJw.GetSimilarity(address1, addressN);

  3. You could also trigram it, so that an address such as "1 Jones Street, Sometown, SomeCountry" becomes "1 Jones Street", "Jones Street Sometown", and so on... and compare the trigrams (or even 4-grams) for higher accuracy.

  4. Finally, you can order by similarity to get a cluster of the most similar addresses and decide on an appropriate threshold. Not sure why you are stuck.
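
A rough sketch of the word-level n-gram step. SimMetrics isn't available here, so difflib's ratio is used as a stand-in for the Jaro-Winkler score you would get from such a library:

```python
from difflib import SequenceMatcher
from itertools import combinations

def word_ngrams(text, n=3):
    # Word-level n-grams: sliding windows of n consecutive words.
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]

addresses = [
    "1 Jones Street Sometown SomeCountry",
    "1 Jon. St Sometown SomeCountry",
]

# Many-to-many comparison of the n-grams; difflib's ratio stands in
# for Jaro-Winkler, which a library such as SimMetrics would provide.
for a, b in combinations(addresses, 2):
    for ga in word_ngrams(a):
        for gb in word_ngrams(b):
            score = SequenceMatcher(None, ga, gb).ratio()
            if score > 0.6:
                print(f"{score:.2f}  {ga!r} ~ {gb!r}")
```

Clusters of high-scoring n-gram pairs then let you resolve the full strings, as in step 3.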

Mikos
This is the most interesting answer, but also the hardest one. Can you please give some examples, or any additional links or articles? :)
hgulyan
@hgulyan - What specifically are you having trouble with? Google "n-grams" and you should see plenty of material. Once you have (say) tri-grammed your entries, put them in a temp table and do a many-to-many comparison of the rows. Voila.
Mikos
SimMetrics offers a huge library of string comparison algos, so you can focus on the implementation. Just write some code in Java or .NET and you are good to go.
Mikos
+1 n-grams seem to be the way to go here
Unreason
+2  A: 

You have a postcode field!!!

So, why don't you just buy a postcode table for your country and use that to clean up your street/town/region/province information?

I_LOVE_MONEY
Unfortunately, the human factor entered 0 instead of the real postcode :(
hgulyan
A postcode of zero, or a postcode that is not in your postcode table, should be manually checked. But the majority of your postcodes will be valid. If not, then you should start firing your cross-eyed data entry workers. You should also integrate the postcode table into your app, so that users don't have to type in any address.
I_LOVE_MONEY
That's great advice, thank you. I'll try to find something like that for Armenia. There's also the problem that most people don't know or remember their postcode. Don't laugh, it's true :)
hgulyan
I assume that people do know in which province/town/street they live :D Based on that information, you can retrieve the postcode for that person. Use the longest word in the street field when searching the postcode table. Example: in "Dr. Jones Str.", the longest word is "Jones", so simply do a LIKE '%Jones%' on the postcode table. Along with the province/town information, you will very likely get the proper postcode. Once you have all the postcodes, you can do a total cleanup based on the postcode field.
I_LOVE_MONEY
Let's hope that Armenia has a good postal service, because they are the ones that mostly track/create postcode tables.
I_LOVE_MONEY
A: 

Another idea is to use learning. For example, you could learn, for each abbreviation and its position in the address, what the abbreviation means.

3 Jane Dr. -> Dr (in 3rd position (or last)) means Drive
Dr. Jones St -> Dr (in 1st position) means Doctor

You could, for example, use decision trees and have a user train the system. Probably a few examples of each use would be enough. You wouldn't classify single-letter abbreviations (like the D. in D.Jones, which could be David Jones or Dr. Jones) as likely matches. But after a first level of translation, you could look up a street index of the town and see whether you can expand the D. into a street name.

Again, you would run each address through the decision tree before storing it.
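
A toy sketch of the position-based idea. The rules here are hand-written examples standing in for what a trained decision tree would learn:

```python
# Hand-written stand-in for a learned rule: the expansion of "Dr"
# depends on where it appears in the address.
def expand_dr(words):
    out = []
    for i, w in enumerate(words):
        if w.rstrip(".").lower() == "dr":
            # First position → honorific; otherwise → street type.
            out.append("Doctor" if i == 0 else "Drive")
        else:
            out.append(w)
    return out

print(expand_dr(["Dr.", "Jones", "St"]))  # → ['Doctor', 'Jones', 'St']
print(expand_dr(["3", "Jane", "Dr."]))    # → ['3', 'Jane', 'Drive']
```

A real system would learn such rules from user-labelled examples rather than hard-coding them.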

It feels like there should be some commercial products doing this out there.

Mau
A: 

One method could be to apply the Levenshtein distance algorithm to the address fields. This will allow you to compare the strings for similarity.

Edit: After looking at the kinds of address differences you are dealing with, this may not be helpful after all.

Neil Aitken
Thank you in any case. Any help is much appreciated :)
hgulyan
+2  A: 

I did a project like this in the last century. Basically it was a consolidation of two customer files after a merger, and it involved names and addresses from three different sources.

Firstly, as many posters have suggested, convert all the common words, abbreviations and spelling mistakes to a common form: "Apt.", "Apatment" etc. to "Apt".

Then look through the name and identify the first letter of the first name plus the first surname. (Not that easy - consider "Dr. Med. Sir Henry de Baskerville Smythe" - but don't worry, where there are ambiguities just take both!) So if you're lucky you get HBASKERVILLE and HSMYTHE. Now get rid of all the vowels, as that's where most spelling variations occur, so now you have HBSKRVLL and HSMTH.

You would also get these strings from "H. Baskerville" and "Sir Henry Baskerville Smith", and unfortunately from "Harold Smith", but we are talking fuzzy matching here!

Perform a similar exercise on the street, apartment and postcode fields. But do not throw away the original data!

Now you come to the interesting bit: first you compare each of the original strings and give, say, 50 points for each string that matches exactly. Then go through your "normalised" strings and give, say, 20 points for each one that matches exactly. Then go through all the strings and give, say, 5 points for each substring of four characters or more that they have in common. For each pair compared, you will end up with some scores over 150, which you can consider a certain match, some scores below 50, which you can consider not matched, and some in between, which have some probability of matching.

You need some more tweaking to improve this, adding various rules like "subtract 20 points for a surname of 'Smith'". You really have to keep running and tweaking until you are happy with the resulting matches, but once you look at the results you get a pretty good feel for which score to consider a "match" and which are the false positives you need to get rid of.
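
The scoring scheme above can be sketched like this. Field values are assumed to come as (original, normalised) pairs; shared 4-character substrings approximate the "four characters or more" rule, and the point values are the ones suggested:

```python
import re

def strip_vowels(s):
    # Most spelling variation happens in the vowels, so drop them.
    return re.sub(r"[aeiou]", "", s.lower())

def common_4grams(a, b):
    # Count the 4-character substrings the two strings share - an
    # approximation of the "four characters or more" rule.
    grams_a = {a[i:i + 4] for i in range(len(a) - 3)}
    grams_b = {b[i:i + 4] for i in range(len(b) - 3)}
    return len(grams_a & grams_b)

def score(rec1, rec2):
    # Each record is a list of (original, normalised) field pairs.
    total = 0
    for (orig1, norm1), (orig2, norm2) in zip(rec1, rec2):
        if orig1 == orig2:
            total += 50                           # exact match on raw data
        elif norm1 == norm2:
            total += 20                           # match after normalisation
        total += 5 * common_4grams(norm1, norm2)  # partial credit
    return total

a = [("Sir Henry Baskerville", strip_vowels("hbaskerville"))]
b = [("H. Baskerville", strip_vowels("hbaskerville"))]
print(score(a, b))  # 20 (normalised match) + 25 (five shared 4-grams) = 45
```

With all fields scored, the totals separate certain matches, non-matches and the doubtful middle band described above.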

James Anderson
Nice technique!
Mau
+1  A: 

One important thing that you mention in the comments is that you are going to do this interactively.

This allows you to parse user input, validate guesses on any abbreviations at the same time, and correct a lot of mistakes (the way, for example, phone number entry works in some contact management systems - the system makes its best effort to parse and correct the country code, area code and number, but ultimately the user is presented with the guess and has the chance to correct the input).

If you want to do it really well, then keeping databases/dictionaries of postcodes, towns, streets, abbreviations and their variations can improve data validation and pre-processing.

That way you would at least have a fully qualified address. If you can do this for all the input, you will have all the data categorized, and matches can then be strict on certain fields and less strict on others, with a matching score calculated according to the weights you assign.

After you have consistently pre-processed the input then n-grams should be able to find similar addresses.

Unreason
You're certainly right, we need to improve the data, but there are more than a million rows and it's really hard to make all these changes. I'm thinking of a way to automate them, though.
hgulyan
I was commenting on adding new data mostly. For cleaning up existing data you should look at http://stackoverflow.com/questions/682093/address-validation-api as well.
Unreason
+1  A: 

Have you looked at SQL Server Integration Services for this? The Fuzzy Lookup component allows you to find 'Near matches': http://msdn.microsoft.com/en-us/library/ms137786.aspx

For new input, you could call the package from .NET code, passing the row to be checked as a set of parameters; you'd probably need to persist the token index for this to be fast enough for user interaction, though.

There's an example of address matching here: http://msdn.microsoft.com/en-us/magazine/cc163731.aspx

Meff
Will this work for non-Unicode Armenian?
hgulyan
+1  A: 

I'm assuming that response time is not critical and that the problem is finding an existing address in a database, not merging duplicates. I'm also assuming the database contains a large number of addresses (say 3 million), rather than a number that could be cleaned up economically by hand or by Amazon's Mechanical Turk.

Pre-computation - Identify address fragments with high information content.

  1. Identify all the unique words used in each database field and count their occurrences.
  2. Eliminate very common words and abbreviations. (Street, st., appt, apt, etc.)

When presented with an input address,

  1. Identify the most unique word and search (Street LIKE '%Jones%') for existing addresses containing those words.
  2. Use the pre-computed statistics to estimate how many addresses will be in the results set
  3. If the estimated results set is too large, select the second-most unique word and combine it in the search (Street LIKE '%Jones%' AND Town LIKE '%Anytown%')
  4. If the estimated results set is too small, select the second-most unique word and combine it in the search (Street LIKE '%Aardvark%' OR Town LIKE '%Anytown%')
  5. If the actual results set is too large/small, repeat the query, adding further terms as before.

The idea is to find enough fragments with high information content in the address which can be searched for to give a reasonable number of alternatives, rather than to find the most optimal match. For more tolerance to misspelling, trigrams, tetra-grams or soundex codes could be used instead of words.
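
A toy sketch of the word-rarity idea. The sample street data and the Persons/Street names are assumptions, and the statistics would really come from a pre-computed table; a real query must also be parameterized rather than concatenated:

```python
from collections import Counter

# Toy stand-in for the pre-computed statistics table.
streets = ["Jones st", "Jones ave", "Aardvark st", "Main st", "Main ave"]
word_counts = Counter(w for s in streets for w in s.split())

# Very common words and abbreviations carry little information.
STOPWORDS = {"st", "ave", "street", "apt"}

def most_unique_words(address):
    # Rank the input's words by rarity in the database, rarest first.
    words = [w for w in address.split() if w not in STOPWORDS]
    return sorted(words, key=lambda w: word_counts[w])

query_words = most_unique_words("Aardvark st")
# Start the search from the rarest fragment; widen with AND/OR terms
# if the estimated results set is too large or too small.
sql = "SELECT * FROM Persons WHERE Street LIKE '%" + query_words[0] + "%'"
print(sql)
```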

Obviously, if you have lists of actual states / towns / streets, then some data clean-up can take place both in the database and in the search address. (I'm very surprised the Armenian postal service does not make such a list available, but I know that some postal services charge excessive amounts for this information.)

As a practical matter, most systems I see in use try to look up people's accounts by their phone number if possible: obviously whether that is a practical solution depends upon the nature of the data and its accuracy.

(Also consider the lateral-thinking approach: could you find a mail-order mail-list broker company which will clean up your database for you? They might even be willing to pay you for use of the addresses.)

MZB
A: 

I've found a great article.

By adding some DLLs as SQL user-defined functions, we can use the string comparison algorithms from the SimMetrics library.

Check it out:

http://anastasiosyal.com/archive/2009/01/11/18.aspx

hgulyan