I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes the same school appears in multiple tables, again, with small differences in its name or address.

My task is to create a single table with an id, name and town/city id field which would contain the data from the 7 tables. There will be a separate table for towns with an id and name field.

This is complicated by the fact that the original tables have no separate field for the city; it has to be extracted from the address field, which holds data in wildly different formats.

I realize that most of this has to be done manually, I've hardened my soul and I am ready to deal with the living hell this will bring upon me. My questions are:

  • How would you start such a task? What would be an intelligent strategy to automate as much of it as possible?
  • Are there any tools available that could make this faster? Something that can compare strings, determine their 'likeness' and suggest possible duplications?

Thanks!

+2  A: 

I did something like this once, although using Oracle rather than MySQL. 17000 is a manageable number of records, but enough to make it worthwhile building tools.

Wikipedia has lists of schools for pretty much all of the world, but alas not for Hungary. This is a pity, because it would be better to have a list of the data you should have in your system than to try to reconstruct it from your fractured databases. If you can get hold of such a list it would be useful, because you could compare your records against it rather than against each other. Even lists of postcodes and city names would still help.

When it comes to fuzzy matching strings there are a number of statistical tricks available. A common one is the Levenshtein distance. This gives a score indicating the similarity between two strings, expressed as the number of single-character changes needed to get from string A to string B (also known as edit distance). Code Janitor has an implementation for MySQL, but it is just the top hit in Google and comes with no warranty from me. Jaro-Winkler is another matching algorithm, but there seem to be fewer implementations kicking around the Internet.
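As a sketch of the Levenshtein idea, here is a plain-Python implementation (not the MySQL one mentioned above); for 17000 rows you could run this in a preprocessing script rather than inside the database:

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # ensure a is the longer string
    # prev holds the previous row of the dynamic-programming matrix
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

print(levenshtein("THOMAS", "TH0MAS"))  # → 1
print(levenshtein("kitten", "sitting")) # → 3
```

Pairs of school names with a small distance relative to their length are good candidates for the duplicate list.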

General processing hints

  • Extract all the school names and addresses into a single table, with metadata indicating each row's provenance (database, table, primary key).
  • Add columns to hold search strings for the school name and address. For instance, have one column which strips out punctuation and numbers (in some matching algorithms THMAS is closer to THOMAS than TH0MAS is).
  • In another column allow for common typos (in a British address a string SW!^ is likely to represent SW16, the postcode for Streatham).
  • Build FullText indexes on those search columns to help pick out occurrences of common words like city names.
  • If you had a huge amount of data and some patience, building a thesaurus to identify common contractions like Rd, St, Blvd might be a useful exercise, but it probably isn't worth it for 17000 rows.
  • Use regex to match patterns, such as postcodes.
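The search-string and regex hints above might be sketched like this in Python; the function names, the shifted-key mapping, and the postcode pattern are illustrative assumptions, not taken from the original databases:

```python
import re

# Shift-key slips on a UK keyboard: typing "SW!^" instead of "SW16".
# This mapping is an illustrative assumption, not an exhaustive list.
SHIFTED = str.maketrans("!\"£$%^&*()", "1234567890")

def name_key(s: str) -> str:
    """Search column: upper-cased, punctuation and digits stripped."""
    s = re.sub(r"[^A-Z ]", "", s.upper())
    return re.sub(r" +", " ", s).strip()

def fix_typos(s: str) -> str:
    """Address column with common shifted-number typos undone."""
    return s.upper().translate(SHIFTED)

def postcode(s: str):
    """Pull a UK-style postcode out of an address with a regex."""
    m = re.search(r"\b[A-Z]{1,2}\d[A-Z\d]?(?:\s*\d[A-Z]{2})?\b", s.upper())
    return m.group(0) if m else None

print(name_key("St. Thomas' School, 12"))         # → ST THOMAS SCHOOL
print(fix_typos("SW!^"))                          # → SW16
print(postcode("12 High St, Streatham SW16 1AA")) # → SW16 1AA
```

Hungarian addresses would need their own patterns (four-digit postcodes, street suffixes like "utca" or "út"), but the same column-per-normalization approach applies.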
APC
Thank you for the excellent advice! I wrote a quick Perl script to download a list of all Hungarian schools from the Ministry of Education website and generate SQL insert statements from the data. This saved me about 2 weeks. Thanks again!!
Botond Balázs