ansaurus

Question

Screen scraping: getting around "HTTP Error 403: request disallowed by robots.txt"

Answer 1

A:

Set your User-Agent header to match some real IE/FF User-Agent.

Here's my IE8 useragent string:

Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; InfoPath.3; AskTB5.6)
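For illustration, here is a minimal sketch of attaching a User-Agent header to a request, assuming Python 3's standard urllib.request rather than whatever library the asker is using. The URL, the shortened User-Agent value, and the build_request helper are all placeholders made up for this example.

```python
from urllib.request import Request

# Shortened, illustrative User-Agent value (an assumption, not a required string).
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object carrying the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL; nothing is sent over the network here.
req = build_request("http://www.example.com/")

# urllib stores header names capitalized, hence "User-agent" on lookup.
print(req.get_header("User-agent"))
```

Passing req to urllib.request.urlopen would actually send the request; the sketch stops short of that.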

Stefan Kendall 2010-05-17 00:39:51

Answer 2

+7 A:

You can try lying about your user agent (e.g., by trying to make believe you're a human being and not a robot) if you want to get in possible legal trouble with Barnes & Noble. Why not instead get in touch with their business development department and convince them to authorize you specifically? They're no doubt just trying to avoid getting their site scraped by some classes of robots such as price comparison engines, and if you can convince them that you're not one, sign a contract, etc, they may well be willing to make an exception for you.

A "technical" workaround that just breaks their policies as encoded in robots.txt is a high-legal-risk approach that I would never recommend. BTW, how does their robots.txt read?

Alex Martelli 2010-05-17 00:40:27

Their robots.txt only disallows "/reviews/reviews.asp" - is this what you are scraping?
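To check what a given robots.txt actually permits, something like the following sketch may help. It assumes Python 3's standard urllib.robotparser; the robots.txt body is a made-up stand-in modelled on the single rule quoted above, and the host, the "my-scraper" agent name, and the is_allowed helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body, modelled on the one rule mentioned above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /reviews/reviews.asp
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Placeholder host and agent name, used only for illustration.
reviews_ok = is_allowed(ROBOTS_TXT, "my-scraper",
                        "http://www.example.com/reviews/reviews.asp")
index_ok = is_allowed(ROBOTS_TXT, "my-scraper",
                      "http://www.example.com/index.html")
print(reviews_ok, index_ok)
```

For a live site, RobotFileParser also offers set_url() and read() to download the file instead of parsing a string.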

fmark 2010-05-17 02:43:44

Thanks Alex, I agree... after reading more about robots.txt, this is the best approach. Cheers... @fmark I'm scraping the video portion... http://video.barnesandnoble.com/robots.txt

Diego 2010-05-18 00:38:27

Answer 3

A:

Without debating the ethics of this, you could modify the headers to look like Googlebot, for example. Or is Googlebot blocked as well?

Steve Robillard 2010-05-17 00:40:48

I don't see any _ethical_ problem, but the _legal_ ones could get even worse (whoever you're impersonating could detect you and sue the expletive-deleted out of you, not just B&N).

Alex Martelli 2010-05-17 00:51:07

A legal issue is an ethical issue in this case: do you follow it or not?

Steve Robillard 2010-05-17 00:53:37

Answer 4

A:

It seems you actually have to do less work to bypass robots.txt, at least according to this article. So you might have to remove some code in order to ignore the filter.

BrunoLM 2010-05-17 00:41:33

Answer 5

+3 A:

Mechanize automatically follows robots.txt, but this can be disabled, assuming you have permission or have thought the ethics through.

Set a flag in your browser:

browser.set_handle_robots(False)

This ignores robots.txt.

Also, make sure you throttle your requests, so you don't put too much load on their site. (Note, this also makes it less likely that they will detect and ban you).

wisty 2010-05-17 01:16:23

Hey wisty, what do you mean by throttle your requests?

Diego 2010-05-18 00:39:31

I mean, add a short delay after each request (e.g. time.sleep(1)), and don't use many threads. I'd use a few threads (in case some get bogged down), and a few seconds of sleep.
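A minimal single-threaded sketch of that kind of throttling, assuming Python 3. The fetch_all helper and its parameters are hypothetical names for this example; fetch stands in for whatever function actually downloads a page (for instance a wrapper around a mechanize browser's open call).

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests.

    The sleep function is injectable so the pacing can be checked
    without actually waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Usage with stand-ins: a fake fetch and a recorder in place of time.sleep.
recorded_pauses = []
pages = fetch_all(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: url.upper(),
    delay_seconds=2.0,
    sleep=recorded_pauses.append,
)
print(pages, recorded_pauses)
```

In real use, leave sleep at its default so the pauses actually happen, and pick a delay the target site can comfortably handle.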

wisty 2010-05-18 01:21:58

Answer 6

A:

The error you're receiving is not related to the user agent. By default, mechanize checks robots.txt directives automatically when you use it to navigate to a site. Use the .set_handle_robots(False) method of mechanize.Browser to disable this behavior.

Tom 2010-07-11 23:17:11

Answer 7

A:

You need to ignore the robots.txt:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip mechanize's robots.txt check

Gunslinger_ 2010-10-03 13:02:38
