views: 3327

answers: 6

I'm trying to fetch a Wikipedia article with Python's urllib:

import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page I get the following response (Error - Wikimedia Foundation):

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to () Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT

Wikipedia seems to block requests that do not come from a standard browser.

Anybody know how to work around this?

+1  A: 

Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)
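
For example (a sketch, assuming Python 2's urllib2; the exact user-agent string doesn't matter much as long as it isn't blank):

import urllib2

# Pass a custom User-Agent header via a Request object; the URL is the one from the question
req = urllib2.Request(
    "http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes",
    headers={'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US) Firefox/3.0.1'})
html = urllib2.urlopen(req).read()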

Vasil
A: 

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

Gurch
urllib and urllib2 both send a user agent
Teifion
+13  A: 

This is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be much easier, especially since you get the article content directly, which removes the need to parse the HTML.

I have used it myself for two projects, and it works very well.
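
A rough sketch of what that looks like (exact method names vary between mwclient versions; older releases use site.Pages[...] and page.edit() instead of page.text()):

import mwclient

site = mwclient.Site('en.wikipedia.org')
page = site.pages['Albert Einstein']
wikitext = page.text()  # the article source as wikitext, no HTML parsing needed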

kigurai
Using third-party libraries for what can easily be done with built-in libraries in a couple of lines of code isn't good advice.
Florian Bösch
Since mwclient uses the MediaWiki API, it requires no parsing of the content. And I am guessing the original poster wants the content, not the raw HTML with menus and all.
kigurai
+13  A: 

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples

import urllib2

# Build an opener that sends a custom User-Agent header instead of urllib2's default
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
Florian Bösch
Gurch, Teifion: holding a grudge because you got downvoted for answers that are wrong or stupid, and therefore downvoting the one entirely correct answer and solution to the question, is extremely counterproductive. Consider yourselves reported.
Florian Bösch
Why would I hold a grudge for being downvoted? The voting is for the community to decide what they feel the best answer is; I am still just as free to use whichever method I want. -2 reputation is not a big enough thing to get angry over, in my opinion.
Teifion
It's not about the number of downvotes; it's that you should downvote things that are misinformation. Quoting http://stackoverflow.com/faq: "Above all, be honest. If you see misinformation, vote it down."
Florian Bösch
Since I **know** my answer contains no misinformation, there's no legitimate reason to vote it down, and I'm assuming the same insight from others. Therefore I assume malice, and that's what gets me riled up.
Florian Bösch
Florian, saying that misinformation should be voted down is not the same thing as saying that misinformation is the only thing that should be voted down. Stop being so self-righteous. You aren't perfect. Your answer is fragile and inflexible compared with kigurai's.
Jim
Wikipedia attempts to block screen scrapers for a reason. Their servers have to do a lot of work to convert wikicode to HTML, when there are easier ways to get the article content. http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler
Chris S
+1  A: 

The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.

In your program (in this case in Python) you should try to send an HTTP request as similar as possible to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
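
A sketch of what replaying those captured headers might look like with urllib2 (the header values below are placeholders for whatever Firebug recorded):

import urllib2

req = urllib2.Request("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
req.add_header('User-Agent', 'Mozilla/5.0 (X11; U; Linux i686; rv:1.9.0.1) Firefox/3.0.1')
req.add_header('Referer', 'http://en.wikipedia.org/')
# req.add_header('Cookie', '...')  # only if the captured request actually needed cookies
html = urllib2.urlopen(req).read()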

Liam
+2  A: 

Rather than trying to trick Wikipedia, you should consider using their High-Level API.
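
For instance, the MediaWiki API (api.php) can return the article source directly with a single query (a sketch; the user-agent string here is just a placeholder, and the response still needs JSON decoding):

import urllib
import urllib2

# Ask the MediaWiki API for the latest revision's wikitext instead of scraping the HTML page
params = urllib.urlencode({
    'action': 'query',
    'titles': 'Albert Einstein',
    'prop': 'revisions',
    'rvprop': 'content',
    'format': 'json',
})
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyWikiTool/0.1')]
data = opener.open('http://en.wikipedia.org/w/api.php?' + params).read()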

sligocki