I'm working on a multi-process spider in Python. It should start scraping one page for links and work from there. Specifically, the top-level page contains a list of categories, the second-level pages list events in those categories, and the final, third-level pages list participants in the events. I can't predict how many categories, events or participants there'll be.

I'm at a bit of a loss as to how best to design such a spider, and in particular, how to know when it's finished crawling (it's expected to keep going till it has discovered and retrieved every relevant page).

Ideally, the first scrape would be synchronous, and everything else async to maximise parallel parsing and adding to the DB, but I'm stuck on how to figure out when the crawling is finished.

How would you suggest I structure the spider, in terms of parallel processes and particularly the above problem?

+1  A: 

I presume you are putting items to visit in a queue, exhausting the queue with workers, and the workers find new items to visit and add them to the queue.

It's finished when all the workers are idle, and the queue of items to visit is empty.

If the workers use the queue's task_done() method after processing each item, the main thread can join() the queue to block until every item that was put on it has been processed.
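
A minimal sketch of that pattern using multiprocessing.JoinableQueue (fetch_links() and the worker count here are just placeholders for your own parser and configuration):

import multiprocessing

def fetch_links(url):
    # Hypothetical stand-in for your parser: fetch the page, write
    # results to the DB, and return any newly discovered URLs.
    return []

def worker(queue):
    while True:
        url = queue.get()
        # Enqueue new pages *before* calling task_done(), so the
        # queue's unfinished-task count never hits zero too early.
        for new_url in fetch_links(url):
            queue.put(new_url)
        queue.task_done()

if __name__ == '__main__':
    queue = multiprocessing.JoinableQueue()
    queue.put('http://example.com/categories')  # the top-level page

    for _ in range(4):
        p = multiprocessing.Process(target=worker, args=(queue,))
        p.daemon = True  # workers are killed when the main process exits
        p.start()

    queue.join()  # blocks until every put() has a matching task_done()
    print('crawl finished')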

Joe Koberg
Hmm. How would I know if the queue is empty because everything's finished, or because, say, there are fewer categories than worker processes, which would empty the queue even though it's far from finished?
wbg
Sorry, I posted too soon. I've thought more about what you said, and multiprocessing.JoinableQueue.task_done() and .join() are just what I'm looking for. I just need to be sure to add new tasks to the queue before calling task_done(). Thanks!
wbg
+1  A: 

You might want to look into Scrapy, an asynchronous web scraper based on Twisted. For a task like yours, the XPath rules for the spider look like they'd be pretty easy to define!

Good luck!

(If you really want to do it yourself, maybe consider keeping a small sqlite db that tracks whether each page has been hit or not... or if it's a reasonable size, just do it in memory... Twisted in general might be your friend for this.)
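
A rough sketch of the sqlite idea, if you go that route (the filename, table and function names are just placeholders):

import sqlite3

conn = sqlite3.connect('crawl_state.db')
conn.execute('CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, hit INTEGER)')

def already_hit(url):
    row = conn.execute('SELECT hit FROM pages WHERE url = ?', (url,)).fetchone()
    return row is not None and row[0] == 1

def mark_hit(url):
    # INSERT OR REPLACE keeps the table keyed on url, so re-marking is harmless.
    conn.execute('INSERT OR REPLACE INTO pages VALUES (?, 1)', (url,))
    conn.commit()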

Gregg Lind
I already have the component modules and classes (parsers, db etc.), but I'm stuck as to how to glue them together. If I keep track of pages I've hit, how do I know when I've finished the /last/ page?
wbg
I'm imagining that (in a synchronous system) you'd keep a queue or stack (adding pages whenever you parse a group page, or whatever), and when it gets to empty, you're done.
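
Something like this sketch, say (get_links() is a stand-in for whichever parser applies to the page):

from collections import deque

def crawl(start_url, get_links):
    to_visit = deque([start_url])
    seen = {start_url}
    while to_visit:  # an empty queue means the crawl is finished
        url = to_visit.popleft()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
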
Gregg Lind
Synchronous is easy. I think I've got it licked, thanks. I hadn't understood task_done() properly.
wbg