views: 3327

answers: 6

I'm trying to fetch a Wikipedia article with Python's urllib:

import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page I get the following response (Error - Wikimedia Foundation):

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to () Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT

Wikipedia seems to block requests that do not come from a standard browser.

Anybody know how to work around this?

+1  A: 

Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
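
For example (a sketch, assuming Python 2's urllib2; the exact user-agent string doesn't matter much as long as it isn't blank):

import urllib2

# Pass a custom User-Agent header via a Request object; the URL is the one from the question
req = urllib2.Request(
    "http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes",
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US) Firefox/3.0.1'})
html = urllib2.urlopen(req).read()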

Vasil
A: 

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

Gurch
urllib and urllib2 both send a user agent
Teifion
+13  A: 

This is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be much easier, especially since you get the article content directly, which removes the need to parse the HTML.

I have used it myself for two projects, and it works very well.
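
A rough sketch of what that looks like (exact method names vary between mwclient versions; older releases use site.Pages[...] and page.edit() instead of page.text()):

import mwclient

site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()  # the article source as wikitext, no HTML parsing needed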

kigurai
Using third-party libraries for what can easily be done with built-in libraries in a couple of lines of code isn't good advice.
Florian Bösch
Since mwclient uses the MediaWiki API, it requires no parsing of the content. And I am guessing the original poster wants the content, not the raw HTML with menus and all.
kigurai
+13  A: 

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples

import urllib2

# Build an opener that sends a custom User-Agent header instead of urllib2's default
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
Florian Bösch
Gurch, Teifion: holding a grudge because you got downvoted for answers that are wrong or stupid, and therefore downvoting the one entirely correct answer and solution to the question, is extremely counterproductive. Consider yourselves reported.
Florian Bösch
Why would I hold a grudge for being downvoted? The voting is for the community to decide what they feel the best answer is; I am still just as free to use whichever method I want. -2 reputation is not a big enough thing to get angry over, in my opinion.
Teifion
It's not about the number of downvotes; it's that you should downvote things that are misinformation. Quoting http://stackoverflow.com/faq: "Above all, be honest. If you see misinformation, vote it down."
Florian Bösch
Since I **know** my answer contains no misinformation, there's no legitimate reason to vote it down, and I'm assuming the same insight from others. Therefore I assume malice, and that's what gets me riled up.
Florian Bösch
Florian, saying that misinformation should be voted down is not the same thing as saying that misinformation is the only thing that should be voted down. Stop being so self-righteous. You aren't perfect. Your answer is fragile and inflexible compared with kigurai's.
Jim
Wikipedia attempts to block screen scrapers for a reason. Their servers have to do a lot of work to convert wikicode to HTML, when there are easier ways to get the article content. http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Chris S
+1  A: 

The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.

In your program (in this case in Python) you should try to send an HTTP request as similar as possible to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
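
A sketch of what replaying those captured headers might look like with urllib2 (the header values below are placeholders for whatever Firebug recorded):

import urllib2

req = urllib2.Request("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; rv:1.9.0.1) Firefox/3.0.1')
req.add_header('Referer', 'http://en.wikipedia.org/')
# req.add_header('Cookie', '...')  # only if the captured request actually needed cookies
html = urllib2.urlopen(req).read()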

Liam
+2  A: 

Rather than trying to trick Wikipedia, you should consider using their High-Level API.
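
For instance, the MediaWiki API (api.php) can return the article source directly with a single query (a sketch; the user-agent string here is just a placeholder, and the response still needs JSON decoding):

import urllib
import urllib2

# Ask the MediaWiki API for the latest revision's wikitext instead of scraping the HTML page
params = urllib.urlencode({
    'action': 'query',
    'titles': 'Albert Einstein',
    'prop': 'revisions',
    'rvprop': 'content',
    'format': 'json',
})
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyWikiTool/0.1')]
data = opener.open('http://en.wikipedia.org/w/api.php?' + params).read()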

sligocki