views: 114
answers: 2

We are using a web scraper with a sleep function whose delay is randomized (so the interval between scrapes isn't always the same), but we are still getting blocked by Yahoo after 20-30 requests.

Does anyone know if there is a limit (e.g. 20 requests per minute, 200 an hour)? Right now our average delay between requests is around 3-6 seconds. Thanks for any help.

A: 

One request every 3-6 seconds is quite a low rate, so perhaps there is another problem with your crawler.

A few ideas:

  • set the User-Agent to something non-suspicious
  • set the Referer header to the same domain
  • try running your crawler from a different IP in case your current IP is blacklisted
  • try maintaining cookies

This will all be easier if you use a higher level library like Mechanize.
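The first, second, and fourth suggestions above can be sketched with Python's standard library alone; this is an illustrative example, not the asker's actual crawler, and the header values and example URL are placeholders:

```python
import random
import time
import urllib.request
from http.cookiejar import CookieJar

def make_opener():
    """Build an opener that maintains cookies across requests and sends
    browser-like headers (the header values here are illustrative)."""
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar())
    )
    opener.addheaders = [
        ("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"),
        ("Referer", "https://search.yahoo.com/"),
    ]
    return opener

def polite_delay(low=3.0, high=6.0):
    """Random delay so requests are not evenly spaced."""
    return random.uniform(low, high)

if __name__ == "__main__":
    opener = make_opener()
    # Placeholder URL; substitute the pages you actually need to fetch.
    with opener.open("https://search.yahoo.com/", timeout=10) as resp:
        print(resp.status)
    time.sleep(polite_delay())
```

A library like Mechanize wraps all of this (cookies, headers, even form handling) in one object, which is why it tends to be less error-prone than hand-rolled `urllib` code.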

Plumo
Thanks for your suggestions...I know that in the past we have used software that randomized our IP addresses...and that seemed to work.
bvandrunen
Might be worth testing different IPs to try to isolate the problem. Also try slowing the request rate to see whether that gets you past 30 requests. If you use multiple IPs, you can afford to slow the per-IP request rate by crawling in parallel.
Plumo
A: 

So the answer is 5000 queries. Taken from

http://forums.digitalpoint.com/showthread.php?t=736784

http://developer.yahoo.com/search/rate.html

bvandrunen
That rate limit is for their web services. Scraping their results isn't allowed at all; they're not going to document a limit for that, but rest assured it's a lot lower than 5000.
Aaronaught