Hi,

I'm making a little bot to crawl a few websites. I'm just testing it out right now, and I tried two types of settings:

  1. About 10 requests every 3 seconds - the IP got banned, so I said: OK, that's too fast.

  2. 2 requests every 3 seconds - the IP got banned after 30 minutes and 1000+ links crawled.

Is that still too fast? I mean, we're talking about close to 1,000,000 links. Should I take it as a message that "we just don't want to be crawled", or is that still too fast?

Thanks.

Edit

Tried again - 2 requests every 5 seconds - 30 minutes and 550 links later, I got banned.

I'll go with 1 request every 2 seconds, but I suspect the same will happen. I guess I'll have to contact an admin - if I can find one.

+7  A: 

Here are some guidelines for web crawler politeness.

Typically, if a page takes x seconds to download, it is polite to wait at least 10x-15x that before re-downloading.

Also make sure you are honoring robots.txt.
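
For illustration, here is a minimal Python sketch of both rules; the URL list, the user-agent string, and the timeout are made-up placeholders, not anything from the original post:

```python
# A rough sketch: check robots.txt and scale the delay to the observed
# download time. URLs and the user agent are placeholders.
import time
import urllib.request
import urllib.robotparser

USER_AGENT = "my-little-bot"                             # hypothetical bot name
urls = ["http://example.com/a", "http://example.com/b"]  # placeholder crawl queue

robots = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
robots.read()

for url in urls:
    if not robots.can_fetch(USER_AGENT, url):            # honor robots.txt
        continue
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=30) as resp:
        resp.read()
    elapsed = time.monotonic() - start
    time.sleep(max(10 * elapsed, 1.0))                   # wait at least 10x the download time
```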

yx
A: 

Yes. It is too fast.

Generally, crawlers keep a rate of 1 request per minute.

Honestly, it is a low crawling rate. But after a few minutes you will have a queue of URLs (a long list :) ). You can rotate over this list until the next turn for a particular URL comes around.

If you have the option of some sort of distributed architecture (multiple nodes with different network connections, even Hyper-Vs or VMs), you may think about a higher speed. The different hosts in the grid can grab the content more effectively.
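
As a sketch of the queue-rotation idea (hosts, URLs and the 60-second delay below are illustrative values, not part of the answer):

```python
# Rotate over a queue of URLs so each host is hit at most once per interval,
# while other hosts are crawled in the meantime.
import time
from collections import deque
from urllib.parse import urlparse

PER_HOST_DELAY = 60.0          # roughly the 1 request/minute rate from the answer
queue = deque([
    "http://site-a.example/1",
    "http://site-b.example/1",
    "http://site-a.example/2",
])
last_hit = {}                  # host -> time of the last request to it

while queue:
    url = queue.popleft()
    host = urlparse(url).netloc
    now = time.monotonic()
    if now - last_hit.get(host, float("-inf")) < PER_HOST_DELAY:
        queue.append(url)      # not this host's turn yet; rotate it to the back
        time.sleep(0.1)        # small pause so a single cooling-down host does not spin
        continue
    last_hit[host] = now
    print("fetching", url)     # stand-in for the actual download
```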

Chathuranga Chandrasekara
1 minute? Damn, that's extremely slow - I'll be finished by... next year?
sirrocco
Yes. But then again, I can see different behaviour with offline browsers like WinHTTrack. They are too aggressive, but I can't explain the difference.
Chathuranga Chandrasekara
A: 

One of the most important considerations to take into account is the site owners' wishes. As others have mentioned, the robots.txt file is the standard way for sites to express them.

In short, there are 3 directives in robots.txt that are used to limit request speed.

Crawl-delay: #, an integer which represents the number of seconds to wait between requests.

Request-rate: #/#, the numerator representing how many pages and the denominator how many seconds, e.g. 1/3 = 1 page every 3 seconds.

Visit-time: ####-####, two 4-digit numbers separated by a hyphen, representing the hours (HHMM-HHMM, GMT) during which you should crawl their site.

Given these suggestions/requests, you may find some sites do not have any of these in their robots.txt, in which case it is up to you. I would suggest keeping it to a reasonable rate of no more than 1 page per second, while also limiting how many pages you consume in a day.
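
As a sketch, Python's standard library (3.6+) can read two of these directives for you; note that urllib.robotparser understands Crawl-delay and Request-rate but not Visit-time, which would have to be parsed by hand. The URL, user agent and fallback value below are placeholders:

```python
# Read Crawl-delay / Request-rate from robots.txt and derive a per-page delay.
import urllib.robotparser

USER_AGENT = "my-little-bot"             # hypothetical bot name
rp = urllib.robotparser.RobotFileParser("http://example.com/robots.txt")
rp.read()

delay = rp.crawl_delay(USER_AGENT)       # Crawl-delay in seconds, or None
rate = rp.request_rate(USER_AGENT)       # named tuple (requests, seconds), or None

if delay is not None:
    seconds_per_page = float(delay)
elif rate is not None:
    seconds_per_page = rate.seconds / rate.requests
else:
    seconds_per_page = 1.0               # fallback: no faster than 1 page per second

print("waiting", seconds_per_page, "seconds between requests")
```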

Pat