views:

148

answers:

2

I just accepted a similar question (http://stackoverflow.com/questions/1690748/php-mysql-queue), but I realized that it wasn't the correct question for my problem, but was the correct answer for my question :)

I have a MySQL (MyISAM type) table of sites to be scraped by workers.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_pop` int(13) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

What I need is to scrap one site per worker without repeat. So, if I have 3 sites and 2 workers the system needs to work like this:

      ID URL   LAST_POP
t4    1  site1 t1         <- worker1 scrap site1
t4    2  site2 t2         <- worker2 scrap site2
t5    3  site3 t3         <- worker1 scrap site3
t6    1  site2 t4         <- worker2 scrap site2
t6    2  site1 t4         <- worker1 scrap site1
t7    3  site3 t5         <- worker2 scrap site3
....

It's like a cycled queue orderer by last_pop ASC.

How can I do that?

A: 

Why not add an extra boolean column STATUS so that you can order by LAST_POP. So when a site is selected by a worker for scrapping run a second query with UPDATE site SET status = '1'. And when the next worker selects the next site run a query with SELECT * FROM site WHERE status = '0' ORDER BY last_pop ASC.

Justin Lucas
+1  A: 

You'll probably want to keep track of two pieces of information for each site: when it was last scraped and if it is currently being scraped.

Using the answer for your other question, set the scraping field to the id of the worker to lock it from other workers. When the worker is finished with the task set the scraping field back to null and the last_scrape date to the current time.

CREATE TABLE `site` (
  `id` int(11) NOT NULL auto_increment,
  `url` text,
  `last_scrape` TIMESTAMP,
  `scraping` tinyint(1) default NULL,
  `md5` varchar(32) default NULL,
  `disabled` tinyint(1) default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM  DEFAULT CHARSET=latin1;

Lock and retrieve the next job (the site that was last scraped the longest time ago):

Update site
  set `scraping` = '$worker_id' 
  where `scraping` is null 
  order by `last_scrape` ASC limit 1;

$job = 
  Select * from site
  where `scraping` = '$worker_id'

Release job back to queue:

Update site
  set `scraping` = NULL,
  `last_scrape` = NOW()
  where `scraping` = '$worker_id';
GApple
you could potentially combine the use of the timestamp and scraping status, as long as their values won't conflict. If you may have 100 workers, just select the job where `last_scape` > 101 and set `last_scrape` to the worker id up to 100 to specify it's being processed. Make sure you initialize the field to a larger value as well, or they could appear already locked to all workers.
GApple
that's good enough. thanks.
inakiabt