I'm considering writing a simple web scraping application to extract information from a website that does not seem to specifically prohibit this.

I've checked for alternative ways to get this information (e.g. RSS, a web service), but none are available at this stage.

That said, I've also developed and maintained a few websites myself, so I realize that web scraping done naively or greedily can slow things down for other users and generally become a nuisance.

So, what etiquette is involved in terms of:

  1. Number of requests per second/minute/hour.
  2. HTTP User Agent content.
  3. HTTP Referer content.
  4. HTTP Cache settings.
  5. Buffer size for larger files/resources.
  6. Legalities and licensing issues.
  7. Good tools or design approaches to use.
  8. robots.txt: is this relevant for web scraping, or just for crawlers/spiders?
  9. Compression such as GZip in requests.

Update

Found this relevant question on Meta: Etiquette of Screen Scraping StackOverflow. Jeff Atwood's answer has some helpful recommendations.

Other related StackOverflow questions:

Options for html scraping

Legalities of screen scraping

+2  A: 

This really depends on what you're scraping, and how much of it you're scraping. For instance, I had to write a scraper about a week ago to crawl several hundred pages. To be generous, I placed a one-second wait after each page. It took a few minutes to get the data back, but I'm sure the owner of the site would appreciate any slack I can leave in the process.
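A fixed pause like this could be sketched roughly as follows. This is only an illustration: `fetch_politely`, its parameters, and the injected `fetch` and `sleep` callables are hypothetical names, not part of the scraper described above.

```python
import time
from typing import Callable, Iterable, List


def fetch_politely(
    urls: Iterable[str],
    fetch: Callable[[str], bytes],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bytes]:
    """Fetch each URL in turn, one request at a time, pausing between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        pages.append(fetch(url))
    return pages
```

With the standard library, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`. Passing `sleep` in as a parameter just makes the delay easy to test without actually waiting.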

Jonathan Sampson
Fair point, but why 1 second? Any reason?
Ash
I was being very generous. One short request at a time.
Jonathan Sampson
I've seen requests for scrapers to wait 15 seconds between GETs.
Charles Stewart
+5  A: 

I would suggest emailing the webmaster: tell them you are writing a non-malicious script, and ask what they are happy for you to hit and how often.

We run a domain crawler which picks up PDF/Word docs etc. from friendly domains, and the most we've been asked for is a 5-second gap between requests and only running at night.
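A "night only" restriction could be checked with something like the sketch below. The 23:00 to 05:00 window and the name `in_crawl_window` are assumptions made up for illustration; the real hours would be whatever the webmaster asks for, in the site's local time.

```python
from datetime import time as clock_time


def in_crawl_window(
    now: clock_time,
    start: clock_time = clock_time(23, 0),  # assumed start of the allowed window
    end: clock_time = clock_time(5, 0),     # assumed end of the allowed window
) -> bool:
    """Return True if `now` falls inside the allowed crawl window."""
    if start <= end:
        # Window does not cross midnight, e.g. 09:00 to 17:00.
        return start <= now < end
    # Window crosses midnight, e.g. 23:00 to 05:00.
    return now >= start or now < end
```

A crawler would call this with `datetime.now().time()` before each request and go back to sleep when it returns False.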

runrunraygun
Good suggestion regarding sending an email, if you even get a response. Also, what different considerations are there between writing a crawler and simple web scraping?
Ash
What do you normally place in the UserAgent? It often matters a lot to sites when they decide how to handle your request.
Ash
I don't think there is too much of a difference from your target's point of view. The difference is like that between a considerate driver who knows where they're going and a considerate driver who doesn't. From an implementation point of view it's again not massively different: it's just a question of identifying hyperlinks and building up your target list as you go, rather than scraping a predefined list of hrefs.
runrunraygun
For UserAgent we just use our company name, but we're crawling/scraping people who are expecting our traffic, so I don't know how well that generalizes.
runrunraygun
+11  A: 

robots.txt is relevant: look at it to get an idea of the site's attitude to non-human readers. Showing some awareness of its contents when you email the webadmin will reassure them that you will take care to respect the site when you scrape.

Charles Stewart
+12  A: 

Do conform to the site's robots.txt. This is probably one of the best and most ethical ways of coming to an agreement without speaking to anyone at the site.
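As a rough sketch, Python's standard `urllib.robotparser` can answer "may I fetch this URL?" from a robots.txt file. The sample rules, the `ExampleScraper` agent name, and the `load_rules` helper below are made up for illustration; a real scraper would read the file from the target site (for example with `set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Hypothetical robots.txt content, not taken from any real site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rules = load_rules(SAMPLE_ROBOTS_TXT)

# Check each URL before requesting it.
allowed = rules.can_fetch("ExampleScraper", "http://example.com/index.html")
blocked = rules.can_fetch("ExampleScraper", "http://example.com/private/data.html")

# Crawl-delay is a non-standard directive; crawl_delay() returns None if absent.
delay = rules.crawl_delay("ExampleScraper")
```

If the site gives a `Crawl-delay`, it is a sensible lower bound for the pause between requests.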

Do identify yourself appropriately in the User-Agent header. That way the site can see who you are and explicitly allow or restrict areas of the site for you. For example, look at the big players' user agents (Google's is listed below) and devise a similar one that points to a page describing who you are and how to contact you about your bot's crawling.

Google's user-agent string: Googlebot/1.0 ([email protected] http://googlebot.com/)
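Setting a descriptive User-Agent with the standard library might look like the sketch below. The bot name, info page URL, and contact address are placeholders, and `build_request` is a hypothetical helper.

```python
import urllib.request

# Placeholder identity: replace with your own bot name, info page and contact.
USER_AGENT = "ExampleScraper/0.1 (+http://example.com/bot.html; bot-admin@example.com)"


def build_request(url: str, user_agent: str = USER_AGENT) -> urllib.request.Request:
    """Build a request that identifies the scraper to the site."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})
```

The request can then be passed to `urllib.request.urlopen()`. Without this, urllib sends a generic `Python-urllib/x.y` agent, which tells the site nothing about who is crawling.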

Do use gzip/deflate compression if the site supports it. This saves you time and saves the site bandwidth.
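With plain `urllib`, compression has to be requested and undone by hand, roughly as sketched below. The helper names are hypothetical, and many HTTP client libraries handle this automatically.

```python
import gzip
import urllib.request
import zlib


def build_compressed_request(url: str) -> urllib.request.Request:
    """Tell the server that gzip and deflate responses are acceptable."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the compression named in the response's Content-Encoding header."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "deflate":
        try:
            # "deflate" is normally zlib-wrapped data...
            return zlib.decompress(body)
        except zlib.error:
            # ...but some servers send a raw deflate stream instead.
            return zlib.decompress(body, -zlib.MAX_WBITS)
    # No (or unknown) encoding: return the body unchanged.
    return body
```

After `response = urllib.request.urlopen(request)`, the body would be decoded with `decode_body(response.read(), response.headers.get("Content-Encoding", ""))`.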

You should be OK from a legal standpoint (although I am neither an attorney nor a legal expert) if you follow both their robots.txt and their terms of service.

In the end, however, I think the best advice came from runrunraygun, considering it's a lone site. Contacting the administrator, finding out what would be acceptable, and respecting their wishes will get you far.

Pat