I have a MySQL database consisting of one huge table with 80 columns and 10 million rows. The data may contain inconsistencies.

I would like to normalize the database in an automated and efficient way.

I could do it using Java, C++, or similar, but I would like to do as much as possible inside the database. I suspect that any work done outside the database will slow things down considerably.

Suggestions on how to do it? What are good resources/tutorials to start with?

I am not looking for explanations of what normalization is (I found plenty of that using Google)!

+3  A: 

I can't think of any way you can automate it. You would have to create the tables that you want, and then go through and replace each piece of data with manual queries.

e.g.,

INSERT INTO contact (first_name, last_name, phone)
SELECT DISTINCT first_name, last_name, phone
FROM massive_table;

Then you could drop those columns from the massive table and replace them with a contact_id column.

You would have a similar process when pulling out rows that go into a one-to-many table.
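To make the steps above concrete, here is a rough sketch of the whole round trip for the contact example. The column types, the index name, and the assumption that massive_table really has first_name, last_name and phone columns are illustrative guesses, not something known about the actual schema. The NULL-safe `<=>` operator is used so that rows with NULL in one of the columns still match their contact row.

```sql
-- 1. Create the target table with a surrogate key (column types are assumptions).
CREATE TABLE contact (
    contact_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    first_name VARCHAR(100),
    last_name  VARCHAR(100),
    phone      VARCHAR(32),
    KEY idx_contact_lookup (first_name, last_name, phone)
);

-- 2. Copy each distinct combination exactly once.
INSERT INTO contact (first_name, last_name, phone)
SELECT DISTINCT first_name, last_name, phone
FROM massive_table;

-- 3. Add the reference column and backfill it from the new table.
ALTER TABLE massive_table ADD COLUMN contact_id INT UNSIGNED NULL;

UPDATE massive_table AS m
JOIN contact AS c
  ON  c.first_name <=> m.first_name
  AND c.last_name  <=> m.last_name
  AND c.phone      <=> m.phone
SET m.contact_id = c.contact_id;

-- 4. Check that every row got a contact_id (this should return 0) ...
SELECT COUNT(*) FROM massive_table WHERE contact_id IS NULL;

-- ... and only then drop the columns that now live in contact.
ALTER TABLE massive_table
    DROP COLUMN first_name,
    DROP COLUMN last_name,
    DROP COLUMN phone;
```

With 10 million rows the UPDATE in step 3 is one large statement; it may be worth running it in ranges of the table's primary key if that turns out to be too heavy.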

Brian Ramsay
+2  A: 

You need to study the columns to identify 'like' entities and break them out into separate tables. At best, an automated tool might identify groups of rows with identical values for some of the columns, but a person who understands the data would have to decide whether those truly belong together as a separate entity.

Here's a contrived example - suppose your columns were first name, last name, address, city, state, zip. An automated tool might identify rows of people who were members of the same family with the same last name, address, city, state, and zip and incorrectly conclude that those five columns represented an entity. It might then split the tables up:

First Name, ReferenceID

and another table

ID, Last Name, Address, City, State, Zip

See what I mean?
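If it helps with the manual review, queries along these lines can at least show how often a candidate group of columns repeats, so a person can look at the groups and decide. The table name massive_table and the snake_case column names are assumptions based on the example columns above.

```sql
-- Compare the total row count with the number of distinct combinations
-- of the candidate columns. Note: COUNT(DISTINCT a, b, ...) skips any
-- combination in which at least one of the columns is NULL.
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT last_name, address, city, state, zip) AS distinct_groups
FROM massive_table;

-- List the most frequently shared combinations for a human to inspect.
SELECT last_name, address, city, state, zip, COUNT(*) AS num_rows
FROM massive_table
GROUP BY last_name, address, city, state, zip
HAVING COUNT(*) > 1
ORDER BY num_rows DESC
LIMIT 50;
```

A low distinct_groups count only says that the values repeat; it does not say that the columns form a real entity, which is exactly the judgment call described above.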

n8wrl
+2  A: 

When cleaning up messy data, I like to create user-defined MySQL functions for typical data-scrubbing tasks, so that they can be reused later. This approach also lets you check whether existing UDFs have already been written that you can use (with or without modification), for example at mysqludf.org.
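As a small illustration of the idea, here is a sketch of a scrubbing helper written as a plain SQL stored function. This is not one of the compiled UDFs from mysqludf.org; the function name clean_phone, its parameter, and the phone column on massive_table are all made up for the example.

```sql
-- Hypothetical helper: trim a phone number and strip common punctuation.
-- A single RETURN statement is the whole body, so no DELIMITER change is needed.
CREATE FUNCTION clean_phone(p_phone VARCHAR(64))
RETURNS VARCHAR(64)
DETERMINISTIC
NO SQL
RETURN REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(TRIM(p_phone),
       ' ', ''), '-', ''), '(', ''), ')', ''), '.', '');

-- Quick check on a literal value.
SELECT clean_phone('(555) 123-4567');  -- returns '5551234567'

-- Preview the effect on real data before updating anything.
SELECT phone, clean_phone(phone) AS cleaned
FROM massive_table
LIMIT 20;
```

Running the cleanup before a SELECT DISTINCT extraction means that differently formatted copies of the same value collapse into one row in the new table.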

codemonkey
Thank you, mysqludf is the kind of resource I am looking for.
CL23