I'm building a search engine (for fun), and it has just struck me that my little project might wreak havoc by clicking on ads and causing all sorts of problems.

So what are the guidelines for good webcrawler 'Etiquette'?

Things that spring to mind:

  1. Observe robots.txt instructions (a rough sketch of this, plus rate limiting, follows this list)
  2. Limit the number of simultaneous requests to the same domain
  3. Don't follow ad links?
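
For the first two points, here's a minimal sketch in Python (assuming the crawler is in Python; the crawler name, info URL, and helper names like `fetch_politely` are made up for illustration) that checks robots.txt with the standard library and spaces out requests per host:

```python
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import Request, urlopen

USER_AGENT = "MyFunCrawler/0.1 (+http://example.com/crawler-info)"  # hypothetical UA
_robots = {}        # cached robots.txt parsers, keyed by scheme://host
_last_fetch = {}    # time of the last request per host, used for throttling

def allowed(url):
    """Check robots.txt (cached per host) before fetching a URL."""
    parts = urlparse(url)
    base = f"{parts.scheme}://{parts.netloc}"
    if base not in _robots:
        rp = urllib.robotparser.RobotFileParser(base + "/robots.txt")
        rp.read()
        _robots[base] = rp
    return _robots[base].can_fetch(USER_AGENT, url)

def fetch_politely(url, min_delay=5.0):
    """Fetch a page, respecting robots.txt and a minimum delay per host."""
    if not allowed(url):
        return None
    host = urlparse(url).netloc
    wait = min_delay - (time.time() - _last_fetch.get(host, 0))
    if wait > 0:
        time.sleep(wait)      # space out requests to the same host
    _last_fetch[host] = time.time()
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req) as resp:
        return resp.read()
```

Serialising requests per host like this also covers point 2 for a single-threaded crawler; a multithreaded one would need a lock or per-host queue around the same bookkeeping.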

Stopping the crawler from clicking on ads - this one is particularly on my mind at the moment... how do I stop my bot from 'clicking' on ads? If it goes straight to the URL in the ad, does that count as a click?

+2  A: 

Don't follow links marked as rel="nofollow".

Also, you don't have to worry much about ads. If you spider only the HTML text of a page, then in most cases you won't get ad links there - they are generated on the client using JavaScript.
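
A minimal sketch of skipping nofollow links, assuming you parse the raw HTML with Python's standard-library parser (no JavaScript execution):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect hrefs from <a> tags, skipping any marked rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" in rel:
            return                      # the site asked crawlers not to follow this link
        if attrs.get("href"):
            self.links.append(attrs["href"])

html_text = '<a href="/article">read</a> <a rel="nofollow" href="/ad">ad</a>'
parser = LinkExtractor()
parser.feed(html_text)
print(parser.links)    # ['/article'] - the nofollow link is dropped
```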

Michał Chaniewski
+3  A: 

Don't read only the robots.txt instructions. You should also check the meta robots tags for noindex and nofollow.
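
For example, a page can opt out via `<meta name="robots" content="noindex, nofollow">`. A rough way to check for it (a regex sketch that assumes the name attribute comes before content; a real crawler would use its HTML parser instead):

```python
import re

def robots_meta_directives(html_text):
    """Return the directives from a <meta name="robots" content="..."> tag, if any."""
    m = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        html_text, re.IGNORECASE)
    return [d.strip().lower() for d in m.group(1).split(",")] if m else []

directives = robots_meta_directives('<meta name="robots" content="noindex, nofollow">')
if "noindex" in directives:
    print("don't add this page to the index")
if "nofollow" in directives:
    print("don't follow any links found on this page")
```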

About the ad question, I'm not sure, but I'd guess that if you just collect the link and visit the page some other time, the request carries no information about how you got that address, so it can't end up charging the site for the "pseudo-click".

Samuel Carrijo
+2  A: 

A couple of things that will help webmasters, and therefore make your webcrawler 'nicer', are:

  • Support sitemaps - a low-resource/bandwidth way to check what has changed.
  • Support If-Modified-Since GETs - don't waste webmasters' bandwidth! (see the sketch after this list)
  • Support content compression if the server supports it.
  • Crawler webpage in the UA - make sure your webcrawler's User-Agent includes a link to a webpage explaining what your engine is, why it crawls, and how it works.
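
A rough sketch combining the If-Modified-Since, compression, and User-Agent points using only the standard library (the crawler name and info URL are, again, made up):

```python
import gzip
from email.utils import formatdate
from urllib.error import HTTPError
from urllib.request import Request, urlopen

USER_AGENT = "MyFunCrawler/0.1 (+http://example.com/crawler-info)"  # link to your 'about the bot' page

def fetch_if_changed(url, last_crawled=None):
    """Fetch url only if it changed since last_crawled (a Unix timestamp), accepting gzip."""
    headers = {
        "User-Agent": USER_AGENT,        # lets webmasters see who you are and why
        "Accept-Encoding": "gzip",       # let the server compress the response
    }
    if last_crawled is not None:
        headers["If-Modified-Since"] = formatdate(last_crawled, usegmt=True)
    try:
        with urlopen(Request(url, headers=headers)) as resp:
            body = resp.read()
            if resp.headers.get("Content-Encoding") == "gzip":
                body = gzip.decompress(body)
            return body
    except HTTPError as e:
        if e.code == 304:
            return None                  # not modified since the last crawl - nothing to refetch
        raise
```

Sitemap support is largely a matter of fetching the sitemap named in robots.txt (often /sitemap.xml) and reading its lastmod dates, so the same conditional-fetch idea applies there too.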