ansaurus

Question

Which is best in Python: urllib2, PycURL or mechanize?

Answer 1

+9 A:

urllib2 is found in every Python install everywhere, so is a good base upon which to start.
PycURL is useful for people already used to using libcurl, exposes more of the low-level details of HTTP, plus it gains any fixes or improvements applied to libcurl.
mechanize is used to persistently drive a connection much like a browser would.

It's not a matter of one being better than the other, it's a matter of choosing the appropriate tool for the job.

Ignacio Vazquez-Abrams 2010-03-05 10:21:12

Answer 2

+5 A:

I think this talk (at pycon 2009), has the answers for what you're looking for (Asheesh Laroia has lots of experience on the matter). And he points out the good and the bad from most of your listing

Scrape the Web: Strategies for programming websites that don't expect it (Part 1 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 2 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 3 of 3)

From the PYCON 2009 schedule:

Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?

We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable.

You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.

Atendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.

Update: Asheesh Laroia has updated his presentation for pycon 2010

PyCon 2010: Scrape the Web: Strategies for programming websites that don't expected it

* My motto: "The website is the API."
* Choosing a parser: BeautifulSoup, lxml, HTMLParse, and html5lib.
* Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
* Automatic template reverse-engineering tools.
* Submitting to forms.
* Playing with XML-RPC
* DO NOT BECOME AN EVIL COMMENT SPAMMER.
* Countermeasures, and circumventing them:
      o IP address limits
      o Hidden form fields
      o User-agent detection
      o JavaScript
      o CAPTCHAs 
* Plenty of full source code to working examples:
      o Submitting to forms for text-to-speech.
      o Downloading music from web stores.
      o Automating Firefox with Selenium RC to navigate a pure-JavaScript service. 
* Q&A; and workshopping
* Use your power for good, not evil.

Diego Castro 2010-03-05 10:48:30

I wish I could also accept this answer. Great presentation!

bigredbob 2010-03-05 21:08:48

Two or three sentences summarizing the talk's recommendations would be great, for those without the time to listen to it. :-)

Brandon Craig Rhodes 2010-10-27 01:05:27

Answer 3

A:

Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)

urllib2 is best (as it's inbuilt), then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2 - they have similar methods etc. Using Firefox cookies means you can get things from sites (like say StackOverflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).

PycURL is for people who need all the low level stuff in libcurl. I would try the other libraries first.

wisty 2010-03-05 11:09:02

Answer 4

+1 A:

Urllib2 only supports HTTP GET and POST, there might be workarounds, but If your app depends on other HTTP verbs, you will probably prefer a different module.

mikerobi 2010-03-05 14:10:29

Answer 5

A:

Every python library that speaks HTTP has its own advantages.

Use the one that has the minimum amount of features necessary for a particular task.

Your list is missing at least urllib3 - a cool third party HTTP library which can reuse a HTTP connection, thus speeding up greatly the process of retrieving multiple URLs from the same site.

jedi_coder 2010-08-04 03:27:06

ansaurus

tags:

views:

answers:

Which is best in Python: urllib2, PycURL or mechanize?

related questions