views: 611
answers: 9
I just started thinking about creating/customizing a web crawler today, and know very little about web crawler/robot etiquette. A majority of the writings on etiquette I've found seem old and awkward, so I'd like to get some current (and practical) insights from the web developer community.

I want to use a crawler to walk over "the web" for a super simple purpose - "does the markup of site XYZ meet condition ABC?".

This raises a lot of questions for me, but I think the two main questions I need to get out of the way first are:

  • It feels a little "iffy" from the get go -- is this sort of thing acceptable?
  • What specific considerations should the crawler take to not upset people?
+1  A: 

I'd say that it is very important to consider how much load you are causing. For instance, if your crawler requests every object on a single site more or less at once, it might cause load problems for that particular site.

In other words, make sure your crawler is not too aggressive.
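For example, a minimal per-host politeness delay in Python might look something like this (the one-second delay and the module-level dictionary are placeholders, not recommendations):

    import time
    from urllib.parse import urlparse
    from urllib.request import urlopen

    # Hypothetical minimum gap between requests to the same host; some crawlers
    # also honour a Crawl-delay directive from robots.txt when one is present.
    MIN_DELAY_SECONDS = 1.0
    last_fetch = {}  # host -> time of the previous request to that host

    def polite_fetch(url):
        host = urlparse(url).netloc
        elapsed = time.time() - last_fetch.get(host, 0)
        if elapsed < MIN_DELAY_SECONDS:
            time.sleep(MIN_DELAY_SECONDS - elapsed)
        last_fetch[host] = time.time()
        return urlopen(url, timeout=10).read()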

Einar
+4  A: 

Obey robots.txt (and don't be too aggressive, as has already been said).

You might want to think about your user-agent string - it's a good place to be up-front about what you're doing and how you can be contacted.
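A rough sketch of both points using Python's standard library (the bot name and info URL are made up - substitute your own):

    from urllib import robotparser
    from urllib.parse import urljoin
    from urllib.request import Request, urlopen

    # Identify yourself and give people a way to find out more or contact you.
    USER_AGENT = "MarkupCheckBot/0.1 (+http://example.com/bot-info)"

    def fetch_if_allowed(url):
        # Ask the site's robots.txt whether this URL is off-limits to us.
        rp = robotparser.RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        rp.read()
        if not rp.can_fetch(USER_AGENT, url):
            return None
        req = Request(url, headers={"User-Agent": USER_AGENT})
        return urlopen(req, timeout=10).read()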

Will Dean
+1  A: 

It's perfectly acceptable to do - just make sure it only visits each page once per session. As you're technically creating a searchbot, you must obey robots.txt and no-cache rules. People can still block your bot specifically if needed by blocking IPs.

As far as I can tell you're only looking at source code, so you'll want to build something to follow <link>s for stylesheets and <script src="..."></script> tags for JavaScript files.
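If it helps, here is one way to pull those URLs out with the standard-library HTML parser - a sketch only; real-world markup is messy enough that you may want a more forgiving parser:

    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class AssetCollector(HTMLParser):
        """Collects stylesheet and external script URLs referenced by a page."""
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.assets = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "link" and attrs.get("rel", "").lower() == "stylesheet":
                self.assets.append(urljoin(self.base_url, attrs.get("href", "")))
            elif tag == "script" and attrs.get("src"):
                self.assets.append(urljoin(self.base_url, attrs["src"]))

    # Usage: collector = AssetCollector("http://example.com/"); collector.feed(html)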

Ross
+1  A: 

Load is a big consideration. Put limits on how often you crawl a particular site, and stick to the most basic info you need to accomplish your goal. If you are looking for text, don't download all the images - stuff like that.

Of course obey robots.txt, but also make sure your user-agent string includes accurate contact info and maybe a link to a web page describing what you are doing and how you do it. If a web admin sees a lot of requests from you and gets curious, an informative web page can answer a lot of their questions for you.
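One way to avoid downloading images and other binaries you don't need is to check the Content-Type header before reading the body; a rough sketch (the user-agent string is invented, and you could equally issue a HEAD request first):

    from urllib.request import Request, urlopen

    USER_AGENT = "MarkupCheckBot/0.1 (+http://example.com/about-this-bot)"  # made up

    def fetch_html_only(url):
        req = Request(url, headers={"User-Agent": USER_AGENT})
        resp = urlopen(req, timeout=10)
        if not resp.headers.get("Content-Type", "").startswith("text/html"):
            resp.close()
            return None  # skip images, PDFs, video and anything else you don't need
        return resp.read()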

Sean
+2  A: 

Besides WillDean's and Einar's good answers, I would really recommend you take some time to read about the meaning of the HTTP response codes and what your crawler should do when it encounters each one, since it will make a big difference to your performance and to whether or not you get banned from some sites.

Some useful links:

HTTP/1.1: Status Code Definitions

Aggregator client HTTP tests

Wikipedia
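As a very rough illustration of the kind of branching this implies (the retry and back-off numbers are arbitrary placeholders):

    import time
    from urllib.error import HTTPError
    from urllib.request import Request, urlopen

    USER_AGENT = "MarkupCheckBot/0.1 (+http://example.com/bot-info)"  # made-up identity

    def fetch_with_status_handling(url):
        """Rough sketch of reacting differently to different HTTP status codes."""
        req = Request(url, headers={"User-Agent": USER_AGENT})
        try:
            return urlopen(req, timeout=10).read()  # 2xx; redirects are followed for you
        except HTTPError as e:
            if e.code == 404:
                return None                         # genuinely missing; don't keep retrying
            if e.code in (429, 503):
                # Server is overloaded or throttling you; honour Retry-After when it's seconds.
                retry_after = e.headers.get("Retry-After", "60")
                time.sleep(int(retry_after) if retry_after.isdigit() else 60)
                return None
            if e.code in (401, 403):
                return None                         # not welcome; drop the URL
            raise                                   # anything else deserves a closer look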

Ricardo Reyes
+1  A: 

Please be sure to include a URL in your user-agent string that explains who/what/why your robot is crawling.

ceejayoz
+2  A: 

Also do not forget to obey the bot meta tags: http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.2

Another thing to think about - when spidering pages, don't be too hasty in deciding that something doesn't exist or has errors. Some pages are offline due to maintenance work or errors that get corrected within a short period.
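A quick sketch of honouring that meta tag - it only looks for the noindex/nofollow/none directives, which covers the common cases:

    from html.parser import HTMLParser

    class RobotsMetaParser(HTMLParser):
        """Looks for <meta name="robots" content="noindex,nofollow"> directives."""
        def __init__(self):
            super().__init__()
            self.noindex = False
            self.nofollow = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "meta" and attrs.get("name", "").lower() == "robots":
                directives = attrs.get("content", "").lower()
                self.noindex = "noindex" in directives or "none" in directives
                self.nofollow = "nofollow" in directives or "none" in directives

    # Usage: p = RobotsMetaParser(); p.feed(html_text)
    # Don't record the page if p.noindex, and don't follow its links if p.nofollow.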

Boaz
+1  A: 

All good points, the ones made here. You will also have to deal with dynamically generated Java and JavaScript links, parameters and session IDs, escaping of single and double quotes, failed attempts at relative links (using ../../ to go past the root directory), case sensitivity, frames, redirects, cookies....
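For the relative-link and session-ID part of that list, a small URL-normalisation sketch - the parameter names treated as session IDs here are guesses, and real sites vary widely:

    from urllib.parse import urljoin, urlparse, urlunparse, parse_qsl, urlencode

    # Query parameters that usually identify a session rather than content (guesswork).
    SESSION_PARAMS = {"phpsessid", "jsessionid", "sid", "sessionid"}

    def normalise(base_url, href):
        # Resolve relative links, including ../../ paths, against the page they came from.
        absolute = urljoin(base_url, href)
        parts = urlparse(absolute)
        # Drop likely session-ID parameters so the same page isn't queued many times over.
        query = [(k, v) for k, v in parse_qsl(parts.query)
                 if k.lower() not in SESSION_PARAMS]
        # Hostnames are case-insensitive; paths generally are not, so leave the path alone.
        return urlunparse(parts._replace(netloc=parts.netloc.lower(),
                                         query=urlencode(query),
                                         fragment=""))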

I could go on for days, and kinda have. I have a Robots Checklist that covers most of this, and I'm happy to answer what I can.

You should also think about using open-source robot crawler code, because it gives you a huge leg up on all these issues. I have a page on that as well: open source robot code. Hope that helps!

A: 

You will need to add some capability to blacklist sites / domains or other things (IP ranges, ASN, etc) to avoid your spider getting bogged down with spam sites.

You'll need an HTTP implementation with a lot of control over timeouts and behaviour. Expect a lot of sites to send back invalid responses, huge responses, rubbish headers, or to just leave the connection open indefinitely with no response, etc.

Also, don't trust a 200 status to mean "the page exists". In my experience, quite a large proportion of sites send back 200 for "Not found" or other errors (along with a large HTML document).
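A defensive-fetching sketch along those lines - a hard timeout, a cap on how much you will read, and a crude check that a 200 really carried a page (the size limit and the "not found" phrases are arbitrary examples):

    from urllib.request import Request, urlopen

    MAX_BYTES = 2 * 1024 * 1024  # arbitrary cap; guards against huge or never-ending bodies
    SOFT_404_HINTS = ("page not found", "404 error")  # crude, purely illustrative heuristics

    def defensive_fetch(url, user_agent="MarkupCheckBot/0.1"):  # made-up user-agent
        req = Request(url, headers={"User-Agent": user_agent})
        resp = urlopen(req, timeout=15)          # never wait forever on a silent socket
        body = resp.read(MAX_BYTES + 1)
        if len(body) > MAX_BYTES:
            return None                          # suspiciously large; skip it
        text = body.decode("utf-8", errors="replace").lower()
        if resp.getcode() == 200 and any(hint in text for hint in SOFT_404_HINTS):
            return None                          # a "soft 404": 200 status, error-page body
        return body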

MarkR