I have a table "City" that contains city names, and another table that I just created, which contains cities from different sources. When I run a query to match the cities between the two tables, I find about 5000 mismatches.

Please suggest some queries I can use to match cities, because users sometimes enter city names that differ by one or two characters. I have written a query that works, but I need one that matches more of them.

Please suggest what to do in such a situation.

-- Match source cities whose name contains a known city name, within the same country
SELECT DISTINCT hsm.countryname, co.countryname, hsm.city, co.city
FROM   HotelSourceMap AS hsm
INNER  JOIN
    (  SELECT c.*, cu.countryName
       FROM   city c
       INNER  JOIN country cu ON c.countryid = cu.countryId
    ) co
ON  CHARINDEX(co.city, hsm.city) > 0
AND hsm.countryid = co.countryid
AND hsm.cityid IS NULL  -- only source rows not yet mapped to a city
+14  A: 

If you implement the Levenshtein Distance algorithm as a user-defined function, it will return the number of single-character edits (insertions, deletions or substitutions) needed to turn string_1 into string_2. You can then compare the result against a fixed threshold, or against a percentage of the length of string_1 or string_2.

You would simply use it as follows:

WHERE dbo.LD(city_1, city_2) < 4;  -- scalar UDFs in SQL Server need the schema prefix

Using Full-Text Search may be another option, especially since an implementation of Levenshtein Distance would require a full table scan. This decision may depend on how frequently you intend to do this comparison.

You may want to check out the following Levenshtein Distance implementation for SQL Server:

Daniel Vassallo
I created the Levenshtein function as you suggested, but it gives me an error when I use it: Cannot find either column "dbo" or the user-defined function or aggregate "dbo.MIN3", or the name is ambiguous. I am using it this way: SELECT hsm.* FROM HotelSourceMap hsm, city c WHERE dbo.LEVENSHTEIN(hsm.city, c.city) < 4
Rajesh Rolen- DotNet Developer
Yes, I forgot to mention that. You need to define the MIN3 function. Use this small script: http://www.tek-tips.com/viewthread.cfm?qid=1194707. Call it MIN3 instead of fnMin3.
Daniel Vassallo
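For reference, a helper like the MIN3 mentioned above could look roughly like this. It is only a sketch of a three-way minimum, not the script from the linked thread, and the parameter names are made up for illustration.

```sql
-- Returns the smallest of three integers.
-- Sketch only: the parameter names @a, @b, @c are assumptions.
CREATE FUNCTION dbo.MIN3 (@a INT, @b INT, @c INT)
RETURNS INT
AS
BEGIN
    RETURN CASE
               WHEN @a <= @b AND @a <= @c THEN @a
               WHEN @b <= @c              THEN @b
               ELSE @c
           END;
END;
```

The Levenshtein function calls this helper internally, so queries only ever call the Levenshtein function itself.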
Which one do I need to call, fnMin3 or Levenshtein? Please explain clearly.
Rajesh Rolen- DotNet Developer
OK, I got it. Thanks.
Rajesh Rolen- DotNet Developer
It's not matching correctly with < 4, so now I'm trying < 2. Thanks a lot.
Rajesh Rolen- DotNet Developer
How long will it take to finish?
Rajesh Rolen- DotNet Developer
Wow, it's giving me results. Thanks a lot.
Rajesh Rolen- DotNet Developer
I'm glad it helped. You may need to tweak the threshold a bit, as you noted. You may also want to consider using a variable threshold based on the average length of the two cities, so that a long city name can have more typos than a short one.
Daniel Vassallo
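The variable threshold suggested in the last comment might be sketched as follows. It assumes the dbo.LEVENSHTEIN function from the earlier comments already exists and returns an INT, and it reuses the countryid and cityid columns from the question. The 20% figure is an arbitrary starting point, not a recommendation.

```sql
-- Allow more edits for longer names: at most 20% of the average length.
-- Assumes dbo.LEVENSHTEIN(@s1, @s2) is a user-defined function returning INT.
SELECT hsm.city AS source_city,
       c.city   AS candidate_city
FROM   HotelSourceMap AS hsm
INNER  JOIN city AS c
       ON c.countryid = hsm.countryid        -- compare within one country only
WHERE  hsm.cityid IS NULL                    -- only rows not yet mapped
  AND  dbo.LEVENSHTEIN(hsm.city, c.city)
       <= 0.2 * (LEN(hsm.city) + LEN(c.city)) / 2.0;
```

With this factor, names averaging fewer than five characters only match exactly, so the percentage probably needs tuning against real data. Restricting the join to the same country also cuts down the number of distance calculations.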
A: 

You must fix the names in the database. Databases are meant for exact matches, not "looks mostly like". The simplest fix is probably to export the table in CSV format, load it in Excel (two columns: primary key and city name) and then use a spell checker to fix the names. After all names have been fixed, import the table again.

Aaron Digulla
+8  A: 

You could use Soundex to compare two strings that are spelled differently but have a similar pronunciation.

It depends on how they are misspelled. If it is just typos, probably use the Levenshtein Distance that Daniel Vassallo recommends. If it is misspellings by people who weren't sure how the city was spelled, use Soundex.

Maybe use both!

Mongus Pong
+1 Soundex is another option. I forgot that in my answer.
Daniel Vassallo
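Using both could look something like the query below. This is a rough sketch: it assumes a user-defined dbo.LEVENSHTEIN function returning an INT (it is not built in) and reuses the table and column names from the question.

```sql
-- Accept a pair if the names sound alike OR are at most one edit apart.
-- Assumes dbo.LEVENSHTEIN(@s1, @s2) is a user-defined function returning INT.
SELECT DISTINCT
       hsm.city AS source_city,
       c.city   AS candidate_city
FROM   HotelSourceMap AS hsm
INNER  JOIN city AS c
       ON c.countryid = hsm.countryid
WHERE  hsm.cityid IS NULL
  AND (SOUNDEX(hsm.city) = SOUNDEX(c.city)          -- same phonetic code
       OR dbo.LEVENSHTEIN(hsm.city, c.city) < 2);   -- distance 0 or 1
```

Combining the two with OR widens the match set, so the results are candidates to review rather than confirmed matches.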
+1  A: 

The best solution is to use SOUNDEX. I ran some tests: it matches Waterland and Witerland, but not Wiperland. I think this should fulfill your requirements. SOUNDEX converts an alphanumeric string to a four-character code so that similar-sounding words or names can be found.

select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Waterland')

==> Match

select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Witerland')

==> Match

select * from HotelSourceMap where SOUNDEX([city]) = SOUNDEX('Wiperland')

==> No Match

Thunder
Soundex works well for single words but not for multi-word names, for example: select SOUNDEX('new york'), SOUNDEX('new delhi')
Sharique
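One possible workaround for multi-word names, offered as an untested sketch, is to strip the spaces before calling SOUNDEX so that the later words also contribute to the code. The column aliases are made up for illustration, and exact SOUNDEX results can depend on the database compatibility level.

```sql
-- Compare the codes with and without spaces for the two names from the comment.
SELECT SOUNDEX('new york')                    AS york_with_space,
       SOUNDEX('new delhi')                   AS delhi_with_space,
       SOUNDEX(REPLACE('new york',  ' ', '')) AS york_no_space,
       SOUNDEX(REPLACE('new delhi', ' ', '')) AS delhi_no_space;

-- The same idea applied to the tables from the question.
SELECT hsm.city AS source_city,
       c.city   AS candidate_city
FROM   HotelSourceMap AS hsm
INNER  JOIN city AS c
       ON  c.countryid = hsm.countryid
       AND SOUNDEX(REPLACE(c.city, ' ', '')) = SOUNDEX(REPLACE(hsm.city, ' ', ''))
WHERE  hsm.cityid IS NULL;
```

The code is still only four characters long, so it mostly reflects the beginning of the name. Long multi-word names may need a different approach.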
+1  A: 

The SOUNDEX function would be the best option for such scenarios, but it only works when the vowels in a word are incorrect or absent. If the consonants differ, it will not work. Another approach is to write some simple logic that defines an acceptable mismatch limit between two words. It would not give 100% accuracy, but it might serve the purpose. A simple scalar-valued function that uses SOUNDEX internally should be sufficient.

Tathagat Verma
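SQL Server already ships a function along these lines: DIFFERENCE compares the SOUNDEX codes of two strings and returns an integer from 0 (little or no similarity) to 4 (identical codes). A mismatch limit could be sketched as below; the cut-off of 3 is an assumption to tune, and the table and column names are taken from the question.

```sql
-- Keep pairs whose SOUNDEX codes are identical or nearly so.
SELECT hsm.city AS source_city,
       c.city   AS candidate_city,
       DIFFERENCE(hsm.city, c.city) AS soundex_similarity   -- 0 to 4
FROM   HotelSourceMap AS hsm
INNER  JOIN city AS c
       ON c.countryid = hsm.countryid
WHERE  hsm.cityid IS NULL
  AND  DIFFERENCE(hsm.city, c.city) >= 3                    -- assumed cut-off
ORDER  BY soundex_similarity DESC;
```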
A: 

I've had good luck with the Double Metaphone algorithm for fuzzy matches on names. The concept is similar to Soundex in that it boils a word down to a code, but it's much more sophisticated. In my database, I'll have a 'name' field and a 'nameDoubleMetaphone' field that I compute on insert. This makes searches and joins pretty quick.

Wikipedia is a good place to start: http://en.wikipedia.org/wiki/Double_Metaphone

Andrew
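The "compute the code once and store it" idea can be sketched in SQL Server as below. SQL Server has no built-in Double Metaphone, so this example swaps in SOUNDEX purely as a stand-in. With a real Double Metaphone function, the stored column would hold its output instead. The column and index names are hypothetical.

```sql
-- Store the phonetic code next to the name so it is computed on write,
-- not on every search. SOUNDEX stands in for Double Metaphone here.
ALTER TABLE city
    ADD citySoundex AS SOUNDEX(city) PERSISTED;   -- hypothetical column name
GO

-- Index the stored code so joins and lookups on it can seek.
CREATE INDEX IX_city_citySoundex ON city (citySoundex);  -- hypothetical index name
GO

SELECT hsm.city AS source_city,
       c.city   AS candidate_city
FROM   HotelSourceMap AS hsm
INNER  JOIN city AS c
       ON  c.countryid   = hsm.countryid
       AND c.citySoundex = SOUNDEX(hsm.city)
WHERE  hsm.cityid IS NULL;
```

GO is the batch separator used by SSMS and sqlcmd. It is there so the new column exists before the statements that reference it are compiled.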