views: 296
answers: 3

I was going to crawl a site for some research I was collecting, but apparently the terms of service are quite clear on the topic. Is it illegal to not "follow" the terms of service? And what can the site normally do about it?

Here is an example clause from the ToS. Also, what about sites that don't include this particular clause?

Restrictions: "use any robot, spider, site search application, or other automated device, process or means to access, retrieve, scrape, or index the site"

What if it is just research?

Edit: "OK, from the standpoint of designing an efficient crawler: should I provide some form of natural-language engine to read terms of service and then abide by them?"

+2  A: 

To answer your revised question: as others have said, robots.txt is the programmatic means of determining what you can and can't scrape. That said, if you're thinking of crawling a single/particular website that you know has limiting terms of service, then you should check with the site owner.
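As a quick sketch of the robots.txt route, Python's standard library already ships a parser (the robots.txt content and the user-agent name below are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# In a real crawler you'd call set_url(...) and read() to fetch the live file;
# here we parse a sample robots.txt directly to show the behavior.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

print(parser.can_fetch("MyResearchBot", "http://example.com/private/data.html"))  # False
print(parser.can_fetch("MyResearchBot", "http://example.com/public/page.html"))   # True
print(parser.crawl_delay("MyResearchBot"))  # 10 (crawl_delay requires Python 3.6+)
```

Honoring `can_fetch` and `crawl_delay` covers the mechanical side; the ToS question is separate, as discussed above.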

Another rule of thumb I've heard somewhere is that you should space your requests based on the response time of the previous request (i.e. if request one has a response time of 1000ms, then you should pause an additional 1000ms before issuing request #2). It's a simple metric for throttling bandwidth, and it will at least help prevent your spider from hogging the server.
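That rule of thumb fits in a few lines. A minimal sketch (the class and method names are mine, and a real crawler would wrap its actual HTTP calls around this):

```python
import time

class ResponseTimeThrottle:
    """Space requests by the previous request's response time."""

    def __init__(self):
        self._last_elapsed = 0.0  # no pause before the very first request

    def wait(self):
        """Call before issuing a request: sleep as long as the last request took."""
        time.sleep(self._last_elapsed)

    def record(self, elapsed_seconds):
        """Call after a request completes, with its measured duration."""
        self._last_elapsed = elapsed_seconds
```

A crawl loop would call `wait()`, time the request (e.g. with `time.monotonic()`), then pass the measured duration to `record()`. The effect is that a slow, loaded server automatically gets longer pauses between your requests.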

STW
A: 

When you make a crawler, be thoughtful of the resources of the scraped site's server...

I mean, do it in a mild manner: do not open 1000 connections per second.

On the legality issue, do as Yoooder suggests :)

Gaby
+1  A: 

OK, from the standpoint of designing an efficient crawler: should I provide some form of natural-language engine to read terms of service and then abide by them?

Of course not! You should write your web crawler to parse and obey the restrictions set out in the robots.txt file. And you should make sure that your crawler doesn't inconvenience the site by crawling too vigorously, visiting repeatedly, et cetera.

It is not entirely clear what the legal situation is; for example, if the "terms of use" say one thing but robots.txt says something different. For a legal opinion on that question, you'd need to talk to a lawyer.

But you probably should plan to have a configuration file where you can manually list sites that should not be crawled. And if you come across a site whose Terms of Service seem to say that you should not crawl it, you would be well advised not to crawl it ... no matter what the site's robots.txt file says. Especially if the site owners have brought the ToS to your attention!
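A minimal sketch of that manual do-not-crawl list (the domain names and the function name are invented for illustration; in practice you'd load the set from the configuration file mentioned above):

```python
from urllib.parse import urlparse

# Sites whose Terms of Service forbid crawling, maintained by hand.
DO_NOT_CRAWL = {"example.com", "tos-restricted.example.org"}

def allowed_by_policy(url):
    """Return False for any host on the manual blocklist, regardless of robots.txt."""
    host = urlparse(url).hostname or ""
    # Block the listed domain and any subdomain of it.
    return not any(host == d or host.endswith("." + d) for d in DO_NOT_CRAWL)

print(allowed_by_policy("http://example.com/page"))        # False
print(allowed_by_policy("http://www.example.com/page"))    # False
print(allowed_by_policy("http://other.example.net/page"))  # True
```

The crawler would check this list first and only then consult robots.txt, so a ToS-based exclusion always wins.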

Stephen C
Basically, do you have to follow the terms of service? Why would they have terms of service if crawlers/robots don't follow them?
Berlin Brown
@Berlin - I'd say yes ... especially if they have been brought to your attention. But you need to ask a lawyer if you want legal advice.
Stephen C
You just said "of course not"
Berlin Brown
@Berlin - "Why would they have terms of service if crawlers/robots don't follow them?" Why do we have speed limits on the roads if people don't follow them???
Stephen C
@Berlin - I said "of course not" to the (impractical) idea of parsing / understanding ToS documents.
Stephen C
impractical but not impossible. and I liked your answer the best.
Berlin Brown
@Berlin - also consider that, ToS or not, if any webmaster sees your IP guzzling data you're likely to be banned from the site (along with anybody sharing your public IP).
STW
@Yoooder - "likely" is too strong, IMO. If your crawler ignores the robots.txt protocol or if you generate too much traffic, banning is likely. But if your web crawler behaves itself, most webmasters are likely to turn a blind eye. (They have better things to do with their time than to act as traffic cops for web crawlers. If they care at all, the FIRST thing they are likely to do is set up a robots.txt file.)
Stephen C