I have 7 MySQL tables that contain partly overlapping and redundant data in approximately 17000 rows. All tables contain names and addresses of schools. Sometimes the same school is duplicated in a table with a slightly different name, and sometimes the same school appears in multiple tables, again, with small differences in its name or address.
My task is to create a single table with an id, name and town/city id field which would contain the data from the 7 tables. There will be a separate table for towns with an id and name field.
This is complicated by the fact that the original tables have no separate field for the city, it has to be extracted from the address field which has data in wildly different formats.
I realize that most of this has to be done manually, I've hardened my soul and I am ready to deal with the living hell this will bring upon me. My questions are:
- how would you start such a task? what would be an intelligent strategy to automate as much of it as possible?
- are there any tools available that could make this faster? like something that can compare strings, determine their 'likeness' and suggest possible duplications?
Thanks!