views:

320

answers:

2

I am using mySQL from their C API, but that shouldn't be relevant. My code must process records from a table that match some criteria, and then update the said records to flag them as processed. The lines in the table are modified/inserted/deleted by another process I don't control. I am afraid in the following, the UPDATE might flag some records erroneously since the set of records matching might have changed between step 1 and step 3.

SELECT * FROM myTable WHERE <CONDITION>; # step 1
<iterate over the selected set of lines. This may take some time.> # step 2
UPDATE myTable SET processed=1 WHERE <CONDITION> # step 3

What's the smart way to ensure that the UPDATE updates all the lines processed, and only them? A transaction doesn't seem to fit the bill as it doesn't provide isolation of that sort: a recently modified record not in the originally selected set might still be targeted by the UPDATE statement. For the same reason, SELECT ... FOR UPDATE doesn't seem to help, though it sounds promising :-)

The only way I can see is to use a temporary table to memorize the set of rows to be processed, doing something like:

CREATE TEMPORARY TABLE workOrder (jobId INT(11));
INSERT INTO workOrder SELECT myID as jobId FROM myTable WHERE <CONDITION>;
SELECT * FROM myTable WHERE myID IN (SELECT * FROM workOrder);
<iterate over the selected set of lines. This may take some time.>
UPDATE myTable SET processed=1 WHERE myID IN (SELECT * FROM workOrder);
DROP TABLE workOrder;

But this seems wasteful and not very efficient.

Is there anything smarter?

Many thanks from a SQL newbie.

+1  A: 

There are several options:

  1. You could lock the table

  2. You could add an AND foo_id IN (all_the_ids_you_processed) as the update condition.

  3. you could update before selecting and then only selecting the updated rows (i.e. by processing date)

dbemerlin
1- Locking the table means preventing the other process from working while I am processing. It would not be happy...2- Without a temporary table, how can you express all_the_ids_you_processed?3- I feel a bit uncomfortable timestamping the processed records ahead of time so that I can select them later as it's invasive: a new "timeStamp" column is needed. Also, timeStamps are tricky, as time is not alway monotonous: eg it can regress (when Daylight Saving Time goes off for example).Then again, is the temporary table a really bad idea?
Jean-Denis Muys
Instead of timestamping, use a multi-value status field. Let unprocessed rows have status 0 (say); you do an Update to set status to 1 (meaning in-process) on your selected rows, then do the processing, then do another Update to set the status to 2 (meaning inished processing).
vincebowdren
yes I guess that can work in the problem as I described it. But I finally decided to memorize the ids of the processed records in the client app.
Jean-Denis Muys
A: 

I eventually solved this issue by using a column in that table that flags lines according to their status. This column let's me implement a simple state machine. Conceptually, I have two possible values for this status:

kNoProcessingPlanned = 0; #default "idle" value
kProcessingUnderWay = 1;

Now my algorithm does something like this:

UPDATE myTable SET status=kProcessingUnderWay WHERE <CONDITION>; # step 0

SELECT * FROM myTable WHERE status=kProcessingUnderWay; # step 1
    <iterate over the selected set of lines. This may take some time.> # step 2
UPDATE myTable SET processed=1, status=kNoProcessingPlanned WHERE status=kProcessingUnderWay # step 3

This idea of having rows in several states can be extended to as many states as needed.

Jean-Denis Muys