views: 162
answers: 2

Hi there,

I'm currently writing a web crawler (using the Python framework Scrapy).
Recently I had to implement a pause/resume system.
The solution I implemented is of the simplest kind: basically, it stores links when they get scheduled and marks them as 'processed' once they actually have been crawled.
That way I'm able to fetch those links back when resuming the spider (obviously a little more than just the URL is stored: the depth value, the domain the link belongs to, etc.), and so far everything works well.

Right now I've just been using a MySQL table to handle those storage actions, mostly for fast prototyping.
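(For illustration, a minimal sketch of what such MySQL bookkeeping might look like; the table and column names here are made up:)

```python
# Hypothetical MySQL-backed bookkeeping: insert a link when it is scheduled,
# flag it when it has actually been crawled, and list what is still pending.
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="crawler", passwd="secret", db="crawl")
cur = conn.cursor()
cur.execute("""CREATE TABLE IF NOT EXISTS links (
                   url VARCHAR(255) NOT NULL PRIMARY KEY,
                   domain VARCHAR(255),
                   depth INT,
                   processed TINYINT NOT NULL DEFAULT 0)""")

def schedule(url, domain, depth):
    cur.execute("INSERT IGNORE INTO links (url, domain, depth) VALUES (%s, %s, %s)",
                (url, domain, depth))
    conn.commit()

def mark_processed(url):
    cur.execute("UPDATE links SET processed = 1 WHERE url = %s", (url,))
    conn.commit()

def pending_links():
    cur.execute("SELECT url, domain, depth FROM links WHERE processed = 0")
    return cur.fetchall()
```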

Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option here. By optimize, I mean using a very simple and lightweight system that can still handle a large amount of data written in a short time.

For now, it should be able to handle the crawling of a few dozen domains, which means storing a few thousand links a second...

Thanks in advance for suggestions

+1  A: 

There was a talk at PyCon 2009 that you may find interesting: "Precise state recovery and restart for data-analysis applications" by Bill Gribble.

Another quick way to save your application state may be to use pickle to serialize it to disk.
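(A minimal sketch of that, assuming the state you want to keep is plain picklable data such as URL sets; the file name is made up:)

```python
import pickle

# Hypothetical crawl state: sets of pending and completed URLs.
state = {
    "pending": {"http://example.com/a", "http://example.com/b"},
    "done": {"http://example.com/"},
}

# Save the state to disk...
with open("crawl_state.pkl", "wb") as f:
    pickle.dump(state, f, pickle.HIGHEST_PROTOCOL)

# ...and restore it when resuming.
with open("crawl_state.pkl", "rb") as f:
    state = pickle.load(f)
```

Note that this only works if everything reachable from the state is picklable, which is exactly the limitation pointed out in the comment below.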

John Paulett
I'm pretty sure pickle can't be used because of some objects (from the Twisted library). Thanks for the link, I'll try to have a look at it ASAP.
Sylvain
I finally took some time to look at the talk. It was interesting, but I think it's a bit beyond my simple needs :-)
Sylvain
+2  A: 

The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.

Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.

I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites, you could probably just keep a set with all their URLs; no need for Bloom filters or anything fancy). If it didn't, you might have to keep in memory only a set of recent entries and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I won't go into excruciating detail about those options since it doesn't appear you will require them.
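A minimal sketch of that log-replay pattern (the file name, helper functions, and the 'TODO url' / 'DONE url' line format below are purely illustrative, not Scrapy API):

```python
# Append-only crawl log: schedule/complete events are appended as lines,
# and the in-memory sets are rebuilt from the log on restart.
import os

LOG_PATH = "crawl.log"         # hypothetical log file name
logfile = open(LOG_PATH, "a")  # buffered by default; no fsync on every write

scheduled = set()
done = set()

def schedule(url):
    if url not in scheduled:
        scheduled.add(url)
        logfile.write("TODO %s\n" % url)

def mark_done(url):
    done.add(url)
    logfile.write("DONE %s\n" % url)

def rebuild():
    """On resume: replay the log to rebuild the in-memory sets."""
    if os.path.exists(LOG_PATH):
        for line in open(LOG_PATH):
            verb, url = line.rstrip("\n").split(" ", 1)
            if verb == "TODO":
                scheduled.add(url)
            elif verb == "DONE":
                done.add(url)

def pending():
    """Links scheduled but not yet crawled."""
    return scheduled - done
```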

Alex Martelli
Dozens of sites crawled in parallel, but I will need to keep track of every crawl job done in the past, I guess.
Sylvain
Also, if writing sequentially to a file, how will I 'flag' a link as downloaded?
Sylvain
@Sylvain, then you definitely need to periodically "dump" the in-memory lookaside `set` to a more persistent form of lookup -- and Berkeley DB may or may not scale up smoothly to millions or billions... you'll need to benchmark, but I suspect PostgreSQL (or some ambitious non-relational key/value store, but I have little experience of those apart from Google's own Bigtable) would indeed be your best approach if your scale is sufficiently gigantic. Key point is, you don't need to be updating that DB all the time -- use memory and logs to make DB updates be just "once in a while"!
Alex Martelli
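(A minimal sketch of that "flush once in a while" idea; sqlite3 is used here only as a stand-in for Berkeley DB or PostgreSQL, and all names and thresholds are made up:)

```python
# Keep recently-seen URLs in an in-memory set and flush them to a persistent
# store only periodically, instead of hitting the database on every write.
import sqlite3

conn = sqlite3.connect("seen_urls.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (url TEXT PRIMARY KEY)")

recent = set()       # in-memory lookaside set
FLUSH_EVERY = 10000  # illustrative threshold

def flush():
    conn.executemany("INSERT OR IGNORE INTO seen (url) VALUES (?)",
                     [(u,) for u in recent])
    conn.commit()
    recent.clear()

def saw(url):
    recent.add(url)
    if len(recent) >= FLUSH_EVERY:
        flush()

def already_seen(url):
    if url in recent:
        return True
    return conn.execute("SELECT 1 FROM seen WHERE url = ?", (url,)).fetchone() is not None
```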
@Sylvain, you append to your logfile(s) lines like 'TODO http://a.b.c' or 'DONE http://a.b.c' (or use shorter "verbs" than `'TODO '` and `'DONE '`, of course ;-).
Alex Martelli