views: 67 · answers: 3
This was the closest existing question to mine, and in my opinion it wasn't answered very well:

http://stackoverflow.com/questions/2022030/web-scraping-etiquette

I'm looking for the answer to #1:

How many requests/second should you be doing to scrape?

Right now I pull from a queue of links. Every site that gets scraped has its own thread and sleeps for 1 second between requests. I ask for gzip compression to save bandwidth.
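To make the setup concrete, here is a minimal sketch of the per-site worker described above: one thread per site draining that site's URL queue with a fixed politeness delay between requests. The `fetch` callable is injected so the pacing logic is shown without network code; a real fetch would send an `Accept-Encoding: gzip` header as described. All names here are illustrative, not from the original script.

```python
import queue
import time

def scrape_site(url_queue, fetch, delay=1.0):
    """Drain url_queue, calling fetch(url) with `delay` seconds between requests."""
    results = []
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return results
        results.append(fetch(url))
        time.sleep(delay)  # politeness sleep between requests to one site
```

In the real script, each site would get its own thread running `scrape_site` over that site's queue.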

Are there standards for this? Surely all the big search engines have some set of guidelines they follow in this regard.

+1  A: 

There is no set standard for this; it depends on how much load the scraping causes. As long as you aren't noticeably affecting the site's speed for other users, it should be an acceptable scraping rate.

Since the number of users and the load on a website fluctuate constantly, it'd be a good idea to adjust your scraping speed dynamically.

Monitor the latency of each page download, and if latency starts to increase, decrease your scraping speed. Essentially, your scraping speed should be inversely proportional to the website's load/latency.
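The inverse-proportionality idea above can be sketched as a small helper. The baseline latency, scaling, and clamping bounds here are my own assumptions, not part of the answer:

```python
def adaptive_delay(latency, baseline=0.5, base_delay=1.0,
                   min_delay=1.0, max_delay=30.0):
    """Scale the politeness delay with observed download latency.

    A download taking `baseline` seconds yields `base_delay`; slower
    downloads yield proportionally longer delays, clamped to a sane range.
    """
    delay = base_delay * (latency / baseline)
    return max(min_delay, min(max_delay, delay))
```

Measure each download (e.g. with `time.monotonic()`) and sleep for `adaptive_delay(latency)` before the next request to that site.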

Nick
I really like the latency idea! That IS a good idea!
feydr
+1  A: 

When my clients or boss ask me to do something like this, I usually look for a public API before resorting to scraping the public site. Also, contacting the site owner or a technical contact and asking permission will keep the "cease and desist" letters to a minimum.

Geek Num 88
Assume no API exists and that the owner won't respond.
feydr
In that scenario I would make the scraping script mimic a user. For example, a user would not usually click through 20 pages in under 3 seconds. Typically, in my own use, I stay around 1 request per site every 3 seconds.
Geek Num 88
+1  A: 

The Wikipedia article on web crawling has some info about what others are doing:

Cho[22] uses 10 seconds as an interval for accesses, and the WIRE crawler [28] uses 15 seconds as the default. The MercatorWeb crawler follows an adaptive politeness policy: if it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page.[29] Dill et al. [30] use 1 second.
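The adaptive politeness policy quoted above (the MercatorWeb crawler's) comes down to a one-liner: if the last download from a server took t seconds, wait 10t seconds before the next request.

```python
def politeness_wait(t, factor=10):
    """Seconds to wait after a download from a server that took t seconds."""
    return factor * t
```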

I generally try 5 seconds with a bit of randomness, so it looks less suspicious.
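A sketch of the "5 seconds with a bit of randomness" approach; the ±2-second jitter range is my own assumption, not from the answer:

```python
import random

def jittered_delay(base=5.0, jitter=2.0):
    """A delay of `base` seconds plus uniform random jitter of up to ±`jitter`."""
    return base + random.uniform(-jitter, jitter)
```

Sleeping for `jittered_delay()` between requests avoids the perfectly regular timing that makes a bot easy to spot.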

Plumo