I have a huge list of URLs in a MySQL InnoDB table, and worker processes that query MySQL for a set of URLs to process. The URLs should immediately be marked as being processed, so that other worker processes do not waste resources by starting to process the same ones.

Currently I first do this to get some URLs:

SELECT DISTINCT url FROM urls WHERE task_assigned is NULL ORDER BY id LIMIT 100

Then in code I naively loop through each of those urls to mark it as being processed:

UPDATE urls SET task_assigned = NOW() WHERE url = ? COLLATE utf8_bin

I'm perfectly aware how silly and inefficient this is. More importantly, there is no guarantee that another worker process won't fetch a list in the middle of my UPDATEs. What's a clean way to do this? Should I wrap it in a transaction, and if so, how?

A: 

Maybe you should just select all the URLs first and then use threads to parse them asynchronously?

Sergej Andrejev
Actually there are several computers processing the URLs, and I'm using HTTP requests to pass the lists.
Bemmu
+2  A: 

The following appears (from a quick glance at the MySQL 5 manual) to be available in MySQL. I'm not sure it's the best approach, but it is one I have used before in PostgreSQL:

START TRANSACTION;  -- MySQL syntax; plain BEGIN also works, but BEGIN TRANSACTION does not
SELECT DISTINCT url FROM urls WHERE task_assigned IS NULL ORDER BY id LIMIT 100 FOR UPDATE;
UPDATE urls SET task_assigned = NOW() WHERE url COLLATE utf8_bin IN ([list of URLs]);
COMMIT;

Actually in PostgreSQL I would use a single UPDATE statement with the RETURNING clause of UPDATE taking the place of the SELECT, but that is a PostgreSQL-specific extension.
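
For reference, the PostgreSQL variant mentioned above could look roughly like the sketch below. It is PostgreSQL-only and will not run on MySQL; the table and column names are taken from the question, and selecting by id rather than url is an assumption made here to sidestep the duplicate-URL issue.

```sql
-- PostgreSQL only: claim up to 100 unassigned rows and get them back
-- in a single statement, so no separate SELECT is needed.
UPDATE urls
SET task_assigned = NOW()
WHERE id IN (
    SELECT id
    FROM urls
    WHERE task_assigned IS NULL
    ORDER BY id
    LIMIT 100
    FOR UPDATE          -- lock the candidate rows while they are being claimed
)
RETURNING id, url;      -- the rows this worker just claimed
```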

One potential problem I see with your approach is duplicate URLs: if the URL http://www.example.com/ appears twice in your table, say with IDs 23 and 42, the SELECT DISTINCT will return it once, but the UPDATE will affect both rows. I don't know if that behavior makes sense in your application; I would probably put some kind of UNIQUE constraint on the URLs so it couldn't happen, and then use a list of IDs, not URLs, in the IN clause (which ought to be faster).
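
A minimal sketch of that suggestion, assuming MySQL, that `id` is the primary key, and that the table currently contains no duplicate URLs. The index name `uniq_url` and the literal IDs are made up for illustration.

```sql
-- One-time schema change. This fails if duplicates already exist, and a
-- long or TEXT url column may need a prefix length, e.g. (url(255)).
ALTER TABLE urls ADD UNIQUE INDEX uniq_url (url);

-- Per batch: lock the rows, hand them to the worker, mark them by id.
START TRANSACTION;
SELECT id, url
FROM urls
WHERE task_assigned IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE;

-- The application fills in the ids returned by the SELECT above.
UPDATE urls SET task_assigned = NOW() WHERE id IN (23, 42, 57);
COMMIT;
```

Matching on the integer key also removes the need for the `COLLATE utf8_bin` comparison in the UPDATE.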

kquinn
Thanks. However, can you think of a pure SQL way of doing it without having to create the comma-separated [list of URLs] in code first?
Bemmu
Well, you can always just replace that bit with a subquery (copy and paste the SELECT statement). I don't know how well that would perform... probably better than the code version, actually.
kquinn
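
One caveat when trying the subquery idea on MySQL: pasting the SELECT directly into `IN (...)` is rejected, because MySQL does not support LIMIT inside an IN subquery and does not let an UPDATE read its own target table in a plain subquery. A commonly used workaround is to join against a derived table, sketched here under the assumption that `id` is the primary key; the alias `batch` is made up.

```sql
START TRANSACTION;

-- Return the batch to the worker and lock those rows until COMMIT.
SELECT id, url
FROM urls
WHERE task_assigned IS NULL
ORDER BY id
LIMIT 100
FOR UPDATE;

-- Mark the batch without building a list in application code.
-- The derived table repeats the SELECT and is intended to pick the
-- same rows, since they stay locked for the rest of the transaction.
UPDATE urls
JOIN (
    SELECT id
    FROM urls
    WHERE task_assigned IS NULL
    ORDER BY id
    LIMIT 100
) AS batch ON batch.id = urls.id
SET urls.task_assigned = NOW();

COMMIT;
```

Whether this performs better than the list built in code depends on the table size and indexes, so it is worth checking with EXPLAIN on real data.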