I am often writing datascrubs that update millions of rows of data. The data resides in a 24x7x365 OLTP MySQL database using InnoDB. The updates may scrub every row of the table (in which case InnoDB ends up holding a lock on every row, effectively locking the whole table) or may just be scrubbing 10% of the rows in a table (which could still be in the millions).

To avoid creating massive transaction sizes and minimize contention I usually end up trying to break up my one massive UPDATE statement into a series of smaller UPDATE transactions. So I end up writing a looping construct which restricts my UPDATE's WHERE clause like this:

(warning: this is just pseudo-code to get the point across)

@batch_size = 10000;
@max_primary_key_value = select max(pk) from table1;

for (int i = 0; i <= @max_primary_key_value; i = i + @batch_size)
{
    start transaction;

    update IGNORE table1
    set col2 = "flag set"
    where col2 = "flag not set"
    and pk >= i                    -- inclusive lower bound so batch-boundary rows aren't skipped
    and pk < i + @batch_size;      -- each batch covers [i, i + @batch_size)

    commit;
}

This approach just plain sucks for so many reasons.

I would like to issue an UPDATE statement without the database trying to group all of the records being updated into a single transaction unit. I don't want the UPDATE to succeed or fail as a single unit of work. If half the rows fail to update... no problem, just let me know. Essentially, each row is its own unit of work, but batching or cursoring is the only way I can figure out how to represent that to the database engine.

I looked at setting isolation levels for my session, but that doesn't appear to help me in this specific case.
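
(For reference, setting the session isolation level looks like this; it changes locking and read visibility, but an UPDATE statement still commits or rolls back as one atomic unit regardless:)

-- READ COMMITTED reduces InnoDB gap locking, but it does not change
-- statement atomicity: the whole UPDATE still succeeds or fails as
-- a single unit of work.
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;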

Any other ideas out there?

+1  A: 

Maybe not the answer you are looking for, but you could simplify your code a bit by using LIMIT in the update.

Pseudo-code:

do {
  update table1 set col2 = 'flag set' where col2 = 'flag not set' LIMIT 10000
} while (ROW_COUNT() > 0)
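
(A runnable MySQL version of that loop as a stored procedure; batch_flag_update is just an illustrative name, and table1/col2 are the names from the question:)

DELIMITER //
CREATE PROCEDURE batch_flag_update()
BEGIN
    DECLARE rows_changed INT DEFAULT 1;
    -- With autocommit on, each UPDATE commits as its own transaction.
    WHILE rows_changed > 0 DO
        UPDATE table1
        SET col2 = 'flag set'
        WHERE col2 = 'flag not set'
        LIMIT 10000;
        SET rows_changed = ROW_COUNT();  -- rows changed by the UPDATE above
    END WHILE;
END //
DELIMITER ;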
Eric Petroelje
Interesting approach Eric. I like the idea, but it only works for updates whose SET clause changes a column tested in the WHERE clause (such as the example I gave), so each pass shrinks the set of matching rows. However, I have the same problem with large updates that don't fit this pattern (e.g. set col1="foo" where col2="bar").
Matthew Quinlan
@Matthew - if that's the case, you could always modify your query to be something like "set col1='foo' where col2='bar' and col1 != 'foo'" which would avoid the issue of a batch of non-modifying updates kicking you out of the loop.
Eric Petroelje
A: 

Yes Eric... you're right (for simple cases that don't include where clauses like "where not in"). I've written a small procedure that allows me to provide an SQL update statement and a limit number as parameters.

create procedure mass_update (IN updatestmt TEXT, IN batchsiz INT)
BEGIN
-- PURPOSE: break down large update statements into batches to limit transaction size and reduce contention
-- LIMITATIONS: only works with UPDATEs that would give "0 rows affected" when executed TWICE!

SET @sql = CONCAT( updatestmt," LIMIT ", batchsiz );
-- had to use CONCAT because "PREPARE stmt FROM" cannot accept dynamic LIMIT parameter
-- reference: http://forums.mysql.com/read.php?98,75640,75640#msg-75640
PREPARE stmt FROM @sql;

select @sql; -- display SQL to screen (MySQL requires a space after "--")
SET @cumrowcount=0;
SET @batchnum=0;
SET @now := now(); -- @now is a STRING variable... not a datetime

    increment: repeat
        SET @batchnum=@batchnum+1;
        EXECUTE stmt;
        set @rowcount = ROW_COUNT();
        set @cumrowcount = @cumrowcount + @rowcount;
        select @batchnum as "Iteration",
               @cumrowcount as "Cumulative Rows",
               TIMESTAMPDIFF(SECOND,STR_TO_DATE(@now,"%Y-%m-%d %H:%i:%s"),now()) as "Cumulative Seconds",
               now() as "Timestamp";
        until @rowcount <= 0
    end repeat increment;

    DEALLOCATE PREPARE stmt;  -- REQUIRED
END
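
A call looks like this (using the example UPDATE from the question; the statement is passed without its own LIMIT clause, since the procedure appends one):

call mass_update("update IGNORE table1 set col2 = 'flag set' where col2 = 'flag not set'", 10000);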

This seems to work fairly well and I can run it with any old UPDATE statement which adheres to the "running twice results in 0 rows affected" rule.

Thanks for the idea Eric!

Matthew Quinlan