views: 27

answers: 3

I have a cron that runs through many rows, deleting the "bad" ones (according to my criteria). I'm just wondering what would be the best way to optimize the script. I can do one of the following:

  1. Have the same cron instantly delete the "bad" rows upon finding them.

  2. Have the same cron instantly update the "bad" rows to status "1", meaning bad. Then, I can set up another cron that runs once an hour deleting all rows with status "1" (both approaches are sketched below). To make it faster, I guess I'd have an index on "status", but that might also hurt performance.
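
In SQL terms, the two options would look roughly like this (the table name `items` and the `expires < NOW()` criterion are purely illustrative):

```sql
-- Option 1: the main cron deletes the bad rows directly
DELETE FROM items WHERE expires < NOW();

-- Option 2: the main cron only flags the bad rows...
UPDATE items SET status = 1 WHERE expires < NOW();
-- ...and a second, hourly cron purges whatever has been flagged
DELETE FROM items WHERE status = 1;
```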

Any suggestions?

+1  A: 

I'm not experienced with MySQL, but on the other DBMSs I have worked with, an update followed by a delete does not help. Just try it with a huge amount of data and measure the time of the plain delete versus the update+delete. It helps if the columns that act as criteria for the "bad" rows have an index.
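
For example, assuming the "bad" criterion is a simple expiry-style column (the `expires` column here is hypothetical), an index along these lines lets the delete locate its rows without scanning the whole table:

```sql
-- Index the column(s) used in the delete's WHERE clause
CREATE INDEX idx_items_expires ON items (expires);

DELETE FROM items WHERE expires < NOW();
```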

Jürgen Hollfelder
+1  A: 

An index on a field with two possible values is not as useful as you might think, particularly if you're continually changing the field that you're indexing. As an example, let's say you have a table with 100,000 rows of data, and initially 'status' is set to 0 for each row (after a delete cycle, and before an update cycle). At that point in time, using that index is equivalent to doing a sequential search of the table. If you update 1,000 rows, marking their status as 1, then your index will need to be updated (and possibly rebalanced) 1,000 times. Finally when you delete all the rows with status == 1, then you'll be able to take advantage of the index (you only look at 1% of the rows), but you'll need to update the index 1,000 times (in addition to deleting the rows).
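
One way to check whether MySQL would actually use such a status index (names again illustrative) is to run EXPLAIN on an equivalent select:

```sql
-- If this reports a full table scan rather than the index,
-- the status index is only adding write overhead
EXPLAIN SELECT * FROM items WHERE status = 1;
```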

IMO, you're better off selecting the 'bad' rows directly and deleting them immediately -- you eliminate the overhead of an index you're not using well, and the overhead of a second query.

Note: depending upon your database, deletes may be a very fast operation, or remarkably slow. Ultimately, deleting a row involves marking the row as unused, and then returning the space occupied by that row to the table so that new rows can be inserted. This is complicated by variable-length rows (as a result of variable-length datatypes), and by internal implementation details. As an example, PostgreSQL merely marks a row as deleted, and then uses a separate, manually invoked process (VACUUM) to return the space used by deleted rows to the table for use by new rows. I believe that PostgreSQL still treats row updates as a delete followed by an insert. MySQL, Oracle, and SQL Server all have different methods of achieving the same end result, each with more complicated side-effects on system performance.
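
For reference, in the PostgreSQL case reclaiming that space is the separate step mentioned above, roughly:

```sql
-- PostgreSQL: return space held by deleted rows to the table (table name is illustrative)
VACUUM items;
```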

You'll need to study your documentation and any performance guides to decide what would be best for your system.

Craig Trader
A: 

If you are thinking of updating a row as bad and then deleting it later, you are putting extra pressure on the server.

Deleting them directly is the better option.

If you think there are going to be a huge number of bad rows, set up the cron so that it does not delete more than 100 rows at a time (a sketch follows below). That should limit the server load to some extent.
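
A minimal sketch of that batching, relying on MySQL's DELETE ... LIMIT support (table and criterion are illustrative): the cron repeats a capped delete until no rows are affected:

```sql
-- Run from the cron in a loop until it reports 0 affected rows
DELETE FROM items WHERE expires < NOW() LIMIT 100;
```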

Starx