I recently found and fixed a bug in a site I was working on that resulted in millions of duplicate rows of data in a table that will be quite large even without them (still in the millions). I can easily find these duplicate rows and can run a single delete query to kill them all. The problem is that trying to delete this many rows in one shot locks up the table for a long time, which I would like to avoid if possible. The only ways I can see to get rid of these rows, without taking down the site (by locking up the table) are:
- Write a script that will execute thousands of smaller delete queries in a loop. This will theoretically get around the locked table issue because other queries will be able to make it into the queue and run in between the deletes. But it will still spike the load on the database quite a bit and will take a long time to run.
- Rename the table and recreate the existing table (it'll now be empty). Then do my cleanup on the renamed table. Rename the new table, name the old one back and merge the new rows into the renamed table. This is way takes considerably more steps, but should get the job done with minimal interruption. The only tricky part here is that the table in question is a reporting table, so once it's renamed out of the way and the empty one put in its place all historic reports go away until I put it back in place. Plus the merging process could be a bit of a pain because of the type of data being stored. Overall this is my likely choice right now.
I was just wondering if anyone else has had this problem before and, if so, how you dealt with it without taking down the site and, hopefully, with minimal if any interruption to the users? If I go with number 2, or a different, similar, approach, I can schedule the stuff to run late at night and do the merge early the next morning and just let the users know ahead of time, so that's not a huge deal. I'm just looking to see if anyone has any ideas for a better, or easier, way to do the cleanup.