views:

98

answers:

4

Hey Guys,

I'm looking at accepting a project that would require me to clean up an existing e-commerce website. Its been relatively successful and has over 100,000 individual products - loaded both by the client and its publishers.

The site wasn't originally designed for this many products and has become fairly disorganized.

SO, the client has asked I look at a more robust search option - filterable and so forth. I completely agree it needs to be improved, but after looking at the database, I can tell that there are dozens and dozens of categories and not everything is labeled correctly etc.

Is there any database management software that could help me clean up 100,000 entries quickly? Make categories consistent - fix uppercase/lowercase problems etc.

Are there any companies out there that I can source just this particular part of the project to?

Its a massive amount of data-entry. If I spent 2 minutes per product, it would take me 6 months full time to just to complete the database cleanup. I either need to get it down to a matter of seconds per product or find a company that specializes in this type of work.

I don't even know what to search for on Google.

Thanks guys!

--

Thanks everyone for your ideas! I have a lot of options now so I feel a lot more comfortable heading in to this project. Right now I think the direction we will go is to build a tool that allows the client to hire data entry people that can update it as necessary. Then I will work as a consultant, taking care of any UPDATE-WHERE type functions as necessary.

Thanks again!

A: 

I would implement the new search system or whatever and build them a tool that would allow them to easily go through and cleanup the listings, re-categorize, etc. This task requires domain knowledge, so they're the best ones to do it.

Do some number crunching so they can prioritize the list and clean in order of importance.

Marcus Adams
I think this may be the way to go. Considering its such a large database I'm a little wary of hiring temp folks - who could mistype/misunderstand/be sloppy, and while I have experience in database management, this is above the general scale of work I've done before. By building the tool for them, the data upkeep is now their responsibility - and I become a consultant, making minor adjustments as necessary.
Will D. White
+3  A: 

If there are inconsistencies like you are describing, it sounds like the problem may be more an issues of a bad data model (i.e. lack of normalization) than just dirty data. If good normalization is in place, cleaning up categories should be as simple as updating a single record per each category - but if category name is used instead of a foreign key, then you will most likely need to perform a series of UPDATE WHERE statements to clean up the text.

You may want to look into an ETL (extract, transform, load) tool that can help with bulk data transformation. I'm not familiar with ETL tools for mysql, but I'm sure they exist. SQL Server has a build in service called SQL Integration Services that provides the ability to extract data from an existing data source, perform bulk changes or transformations, and then reload the data back into a destination database. Tools like this may help speed up the process of standardizing capitalization, punctuation, changing categories etc.

Even still, don't overlook the possibility that the data model may need tweaking to help prevent this type of situation in the future.

Edit: Wikipedia has a list of opensource ETL products that you may want to investigate.

Ben Elder
Thanks for this. I'll definitely look in to ETL software - Sounds like it could save a whole lot of man-hours.
Will D. White
A: 

Keep in mind that one or your options is to build a crappy interface that somebody can use to edit records, hire half a dozen data-entry people from a temp agency, spend two days training them, and let them go to town.

mjfgates
A: 

In any case you'll probability need to do more than "clean the data", which means you'll need to build new normalized tables. So start there, build a new database that is fully normalized, import the data "as is", with all the duplicate categories, etc.

for example, new tables:

Items
ItemID     int identity/auto number
ItemName   string
CategoryID int
....

Categories
CategoryID    int identity/auto number
CategoryName  string
....

import the bad data into the new system:

Items
ItemID  ItemName CategoryID
1       thing A  1
2       thing B  2
3       thing C  3
4       thing D  1

Categories
CategoryID CategoryName
1          Game
2          food
3          games

now, you can consolidate the data using the PKs

UPDATE Items
    SET CategoryID=1
    WHERE CategoryID=3

DELETE Categories
    WHERE CategoryID=3

You might just write an application where the customer can do the consolidation. Let them select the duplicates on a screen and merge to a selected parent category. you have this application do the merge sql from above.

If there are issues of needing to have a clean cut over date, create an application that generates a series of "Map" tables, where you store the CategoryNameOld="games" and the CategoryNameNew="Game" and use these when you do the conversion/load of the bad data into the new system's tables.

KM