I am writing a crawler in Python. To keep Ctrl+C from making the crawler start over on the next run, I need to save the processing deque to a text file (one item per line) and update it on every iteration, and the update operation needs to be super fast. Rather than reinvent the wheel, I am asking whether there is an established module that does this.

+4  A: 

As an alternative, you could set up an exit function, and pickle the deque on exit.

  • Exit function (the atexit module)
  • Pickle (the pickle module)

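For illustration, here is a minimal sketch of that idea, assuming the work queue is a collections.deque and that "crawler_state.pickle" is just a filename chosen for the example:

    import atexit
    import pickle
    from collections import deque

    # Hypothetical queue of URLs still to be crawled.
    queue = deque(["http://example.com/a", "http://example.com/b"])

    def save_state():
        # Runs on normal interpreter shutdown, which includes the
        # KeyboardInterrupt raised when Ctrl+C is pressed.
        with open("crawler_state.pickle", "wb") as f:
            pickle.dump(queue, f)

    atexit.register(save_state)
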
gnud
+1  A: 

You should be able to use pickle to serialize your lists.

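A rough sketch of that approach, assuming a deque of strings and an example filename "queue.pickle": write the deque out with pickle.dump and restore it with pickle.load on the next start.

    import pickle
    from collections import deque

    STATE_FILE = "queue.pickle"  # example filename, not prescribed anywhere

    def load_queue():
        # Restore the previous run's queue if a saved copy exists,
        # otherwise start fresh with an empty deque.
        try:
            with open(STATE_FILE, "rb") as f:
                return pickle.load(f)
        except (IOError, EOFError):
            return deque()

    def save_queue(queue):
        with open(STATE_FILE, "wb") as f:
            pickle.dump(queue, f)
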
codelogic
A: 

Some things that come to my mind:

  • leave the file handle open (don't close the file every time you write something)
  • or write the file out every n items and catch a termination signal so you can flush the items that have not been written yet (see the sketch below)
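A minimal sketch of the second idea; the batch size of 100 and the filename "queue_state.txt" are assumptions for the example, and the queue is assumed to hold plain URL strings:

    import signal
    import sys
    from collections import deque

    queue = deque()          # hypothetical work queue of URL strings
    BATCH = 100              # assumed flush interval

    # One handle kept open for the whole run instead of reopening per write.
    state_file = open("queue_state.txt", "w")

    def flush_state(signum=None, frame=None):
        # Rewrite the remaining items and force them out to disk.
        state_file.seek(0)
        state_file.truncate()
        state_file.write("\n".join(queue))
        state_file.flush()
        if signum is not None:   # called as a signal handler: exit cleanly
            state_file.close()
            sys.exit(0)

    signal.signal(signal.SIGINT, flush_state)   # Ctrl+C
    signal.signal(signal.SIGTERM, flush_state)

    processed = 0
    while queue:
        url = queue.popleft()
        # ... fetch and parse url here ...
        processed += 1
        if processed % BATCH == 0:
            flush_state()
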
daddz
+1  A: 

I am not sure if I understood the question right; I am just curious, so here are a few questions and suggestions:

Are you planning to catch the Ctrl+C interrupt and dump the deque then? What happens if the crawler dies for some arbitrary reason, like an unhandled exception or a crash? You lose the queue status and start over again? From the atexit documentation:

Note

The exit function is not called when the program is killed by a signal, when a Python fatal internal error is detected, or when os._exit() is called.

What happens when you visit the same URI again? Are you maintaining a visited list or something?

I think you should be maintaining some kind of visit and session information/status for each URI you crawl. You can use the visit information to decide whether to crawl a URI when you come across it again. The session information for the last visit to that URI helps in picking up only the incremental changes: if the page has not changed, there is no need to fetch it again, which saves some DB I/O costs, avoids duplicates, and so on.

That way you won't have to worry about Ctrl+C or a crash. If the crawler goes down for any reason, let's say after crawling 60K posts with 40K still left, then the next time the crawler fills in the queue (and the queue may be huge), it can check whether it has already visited each URI and what the state of the page was when it was crawled; as an optimization, it can decide whether the page needs to be fetched again because it has changed.
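
One rough sketch of such a per-URI store, using the standard-library sqlite3 module; the table layout, column names, and helper functions are only illustrative, not something prescribed by the question or this answer:

    import sqlite3

    conn = sqlite3.connect("crawl_state.db")   # example database file
    conn.execute("""
        CREATE TABLE IF NOT EXISTS uri_state (
            uri          TEXT PRIMARY KEY,
            last_visited REAL,   -- timestamp of the last crawl
            etag         TEXT,   -- validator returned by the server, if any
            content_hash TEXT    -- hash of the body seen last time
        )
    """)

    def already_visited(uri):
        row = conn.execute(
            "SELECT 1 FROM uri_state WHERE uri = ?", (uri,)).fetchone()
        return row is not None

    def record_visit(uri, last_visited, etag, content_hash):
        conn.execute(
            "INSERT OR REPLACE INTO uri_state VALUES (?, ?, ?, ?)",
            (uri, last_visited, etag, content_hash))
        conn.commit()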

I hope that is of some help.

JV
As I've tested, the atexit-registered function gets called in both cases, Ctrl+C and unhandled exceptions, and I have figured out that the visited set also needs to be pickled in order to restore the program state. Your advice is helpful, thank you very much.
btw0