urllib2

http_proxy setting

I know this is simple... I am just missing something... I give up!!

    #!/bin/sh
    export http_proxy='http://unblocksitesnow.info'
    rm -f index.html*
    strace -Ff -o /tmp/mm.log -s 200 wget 'http://slashdot.org'

I have used different proxy servers, to no avail; I get some default page. In /etc/wgetrc: use_proxy = on. Actually I am trying to us...
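
For reference, urllib2 honours the same http_proxy environment variable: when ProxyHandler is built with no explicit mapping, it reads the proxies from the environment, much as wget does. A minimal sketch, assuming a hypothetical proxy address (the proxy URL in the script above has no port, which may itself be the problem):

    import os
    import urllib2

    # Hypothetical proxy; unlike the script above, it names a port.
    os.environ['http_proxy'] = 'http://proxy.example.com:3128'

    # With no mapping passed in, ProxyHandler picks up http_proxy,
    # https_proxy, ... from the environment.
    opener = urllib2.build_opener(urllib2.ProxyHandler())
    print opener.open('http://slashdot.org/').read(100)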

Retrieve cookie created using javascript in python

I've had a look at many tutorials regarding cookiejar, but my problem is that the web page I want to scrape creates the cookie using JavaScript, and I can't seem to retrieve the cookie. Does anybody have a solution to this problem? ...
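
Since urllib2 never executes JavaScript, a cookiejar only ever captures cookies delivered in Set-Cookie headers. One common workaround is to replicate the page's JavaScript logic in Python and attach the computed cookie by hand; a minimal sketch, where the cookie name and value are hypothetical stand-ins:

    import urllib2

    req = urllib2.Request('http://example.com/page')
    # 'jsid' stands in for whatever the page's JavaScript computes;
    # that logic has to be re-implemented in Python first.
    req.add_header('Cookie', 'jsid=value-computed-in-python')
    response = urllib2.urlopen(req)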

Source interface with Python and urllib2

How do I set the source IP/interface with Python and urllib2? ...
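
urllib2 exposes no option for this directly, but one workable approach is to bind the underlying socket yourself through a custom connection class. A sketch, with a hypothetical local interface address:

    import socket
    import httplib
    import urllib2

    SOURCE_IP = '192.168.1.100'  # hypothetical address of the interface to use

    class BoundHTTPConnection(httplib.HTTPConnection):
        # Bind the outgoing socket to SOURCE_IP before connecting.
        def connect(self):
            self.sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            self.sock.bind((SOURCE_IP, 0))  # port 0: let the OS choose
            self.sock.connect((self.host, self.port))

    class BoundHTTPHandler(urllib2.HTTPHandler):
        def http_open(self, req):
            return self.do_open(BoundHTTPConnection, req)

    opener = urllib2.build_opener(BoundHTTPHandler())
    print opener.open('http://example.com/').read(100)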

Is it possible to fetch a https page via an authenticating proxy with urllib2 in Python 2.5?

I'm trying to add authenticating proxy support to an existing script; as it is, the script connects to an https URL (with urllib2.Request and urllib2.urlopen), scrapes the page and performs some actions based on what it has found. Initially I had hoped this would be as easy as simply adding a urllib2.ProxyHandler({"http": MY_PROXY}) as an ...
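
The usual pattern is to register the proxy for both schemes, with the credentials embedded in the proxy URL. A sketch with hypothetical proxy details; note that stock urllib2 in Python 2.5 lacks the CONNECT tunnelling needed for https through a proxy, so this only behaves as expected on later versions:

    import urllib2

    proxy = 'http://user:password@proxy.example.com:3128'  # hypothetical

    opener = urllib2.build_opener(
        urllib2.ProxyHandler({'http': proxy, 'https': proxy}))
    print opener.open('https://example.com/').read(200)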

How do I get urllib2 to log ALL transferred bytes

I'm writing a web app that uses several 3rd-party web APIs, and I want to keep track of the low-level requests and responses for ad-hoc analysis. So I'm looking for a recipe that will get Python's urllib2 to log all bytes transferred via HTTP. Maybe a sub-classed Handler? ...
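
Short of a fully custom handler, the built-in handlers accept a debuglevel that makes the underlying httplib connection print the raw request line, headers, sent bytes, and reply status to stdout. A minimal sketch; to persist the output, redirect stdout to a log file:

    import urllib2

    # debuglevel=1 dumps the wire-level traffic to stdout.
    opener = urllib2.build_opener(
        urllib2.HTTPHandler(debuglevel=1),
        urllib2.HTTPSHandler(debuglevel=1))
    opener.open('http://example.com/').read()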

Spoofing the origination IP address of an HTTP request

This only needs to work on a single subnet and is not for malicious use. I have a load-testing tool written in Python that basically blasts HTTP requests at a URL. I need to run performance tests against an IP-based load balancer, so the requests must come from a range of IPs. Most commercial performance tools provide this function...

limit downloaded page size

Is there a way to limit the amount of data downloaded by Python's urllib2 module? Sometimes I encounter broken sites that serve something like /dev/random as a page, and it turns out that they use up all the memory on the server. ...
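
There is no built-in cap, but the response object supports bounded reads, so you can stop after a fixed number of bytes. A sketch with an arbitrary 1 MiB limit:

    import urllib2

    MAX_BYTES = 1024 * 1024  # arbitrary 1 MiB cap

    response = urllib2.urlopen('http://example.com/')
    data = response.read(MAX_BYTES + 1)  # never reads more than this
    response.close()
    if len(data) > MAX_BYTES:
        raise ValueError('response exceeded %d bytes' % MAX_BYTES)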

pywikipedia bot with https and http authentication

I'm having trouble getting my bot to log in to a MediaWiki install on the intranet. I believe it is due to the HTTP authentication protecting the wiki. Facts: the wiki root is https://local.example.com/mywiki/. When visiting the wiki with a web browser, a popup comes up asking for enterprise credentials (I assume this is basic access ...
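
Setting pywikipedia's own configuration aside, the plain-urllib2 half of this (HTTP basic auth against the wiki root) would look something like the following sketch, with hypothetical credentials and page path:

    import urllib2

    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None,  # None: match any realm
                              'https://local.example.com/mywiki/',
                              'enterprise_user', 'secret')  # hypothetical
    opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(password_mgr))
    opener.open('https://local.example.com/mywiki/index.php')  # hypothetical page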

How to make python urllib2 follow redirect and keep post method

I am using urllib2 to post data to a form. The problem is that the form replies with a 302 redirect. According to the documentation for Python's HTTPRedirectHandler, the redirect handler will take the request, convert it from POST to GET, and follow the 301 or 302. I would like to preserve the POST method and the data passed to the opener. I made an unsuccessf...
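
A sketch of the usual workaround: subclass HTTPRedirectHandler and re-issue the original POST, body and all, instead of letting it degrade to GET. Be aware that automatically re-POSTing on 301/302 is exactly what the cautious default is avoiding:

    import urllib2

    class PostRedirectHandler(urllib2.HTTPRedirectHandler):
        # Rebuild the redirected request as a POST with the original body.
        def redirect_request(self, req, fp, code, msg, headers, newurl):
            if code in (301, 302) and req.get_method() == 'POST':
                return urllib2.Request(newurl,
                                       data=req.get_data(),
                                       headers=req.headers,
                                       origin_req_host=req.get_origin_req_host(),
                                       unverifiable=True)
            return urllib2.HTTPRedirectHandler.redirect_request(
                self, req, fp, code, msg, headers, newurl)

    opener = urllib2.build_opener(PostRedirectHandler())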

Python urllib2 problem?

I installed Python 2.6.2 earlier on a Windows XP machine and ran the following code:

    import urllib2
    import urllib

    page = urllib2.Request('http://www.python.org/fish.html')
    urllib2.urlopen(page)

I get the following error:

    Traceback (most recent call last):
      File "C:\Python26\test3.py", line 6, in <...
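
The truncated traceback is most likely an HTTPError: /fish.html does not exist on python.org, so the server answers 404 and urlopen raises instead of returning a page. A sketch of catching it:

    import urllib2

    try:
        urllib2.urlopen('http://www.python.org/fish.html')
    except urllib2.HTTPError, e:
        print 'server returned', e.code, e.msg  # e.g. 404 Not Found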

Is it possible to peek at the data in a urllib2 response?

I need to detect the character encoding in HTTP responses. To do this I look at the headers; if it's not set in the Content-Type header, I have to peek at the response body and look for a "<meta http-equiv='content-type'>" tag. I'd like to be able to write a function that looks and works something like this: response = urllib2.urlopen("...
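
One simple way to peek is to read a bounded prefix, sniff it, and then stitch it back onto the remainder. A minimal sketch:

    import urllib2

    response = urllib2.urlopen('http://example.com/')
    head = response.read(4096)      # peek at the first 4 KiB
    # ... sniff 'head' for a <meta http-equiv='content-type'> tag ...
    body = head + response.read()   # reassemble the full document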

cURL: https through a proxy

I need to make a cURL request to an https URL, but I have to go through a proxy as well. Is there some problem with doing this? I have been having so much trouble doing this with cURL and PHP that I tried doing it with urllib2 in Python, only to find that urllib2 cannot POST to https when going through a proxy. I haven't been able to ...

Is the implementation of response.info().getencoding() broken in urllib2?

I would expect the output of getencoding() in the following Python session to be "ISO-8859-1":

    >>> import urllib2
    >>> response = urllib2.urlopen("http://www.google.com/")
    >>> response.info().plist
    ['charset=ISO-8859-1']
    >>> response.info().getencoding()
    '7bit'

This is with Python version 2.6 ('2.6 (r26:66714, Aug 17 2009, 16:01:07) \n[G...
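
It is not broken so much as misnamed for this purpose: getencoding() reports the MIME Content-Transfer-Encoding, which defaults to '7bit'; the charset lives in the Content-Type parameters. A sketch:

    import urllib2

    response = urllib2.urlopen('http://www.google.com/')
    # getparam() pulls a named parameter out of the Content-Type header.
    charset = response.info().getparam('charset')
    print charset  # 'ISO-8859-1' for this response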

urlopen, BeautifulSoup and UTF-8 Issue

I am just trying to retrieve a web page, but somehow a foreign character is embedded in the HTML file. This character is not visible when I use "View Source."

    isbn = 9780141187983
    url = "http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=%s" % isbn
    opener = urllib2.build_opener()
    url_opener = opener.open(url)
    page = url_ope...
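
If the page is UTF-8 (an assumption here), telling BeautifulSoup 3 the encoding explicitly, rather than letting it guess, usually clears this up. A sketch:

    import urllib2
    from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3.x

    url = 'http://search.barnesandnoble.com/booksearch/isbninquiry.asp?ean=9780141187983'
    page = urllib2.urlopen(url).read()
    # fromEncoding overrides BeautifulSoup's charset guessing.
    soup = BeautifulSoup(page, fromEncoding='utf-8')  # assuming UTF-8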

Fixing broken urls

Does anyone know of a library for fixing "broken" URLs? When I try to open a URL such as

    http://www.domain.com/../page.html
    http://www.domain.com//page.html
    http://www.domain.com/page.html#stuff

urllib2.urlopen chokes and gives me an HTTPError traceback. Does anyone know of a library that can fix these sorts of things? ...
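
Nothing in the standard library does this in one call, but urlparse plus posixpath.normpath covers the cases above. A sketch (the fragment is dropped because it is never sent to the server anyway):

    import urlparse
    import posixpath

    def sanitize(url):
        parts = urlparse.urlsplit(url)
        path = parts.path or '/'
        # collapse '//' and resolve '..' segments
        path = posixpath.normpath(path.replace('//', '/'))
        return urlparse.urlunsplit(
            (parts.scheme, parts.netloc, path, parts.query, ''))

    print sanitize('http://www.domain.com/../page.html')
    # -> http://www.domain.com/page.html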

Proxy with urllib2

I open URLs with

    site = urllib2.urlopen('http://google.com')

What I want to do is connect the same way through a proxy. Somewhere I found something telling me to use

    site = urllib2.urlopen('http://google.com', proxies={'http':'127.0.0.1'})

but that hasn't worked either. I know urllib2 has something like a proxy handler, but I can't recall that function ...
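
The proxies keyword belongs to urllib.urlopen, not urllib2. In urllib2 the equivalent is ProxyHandler, and the proxy URL needs a port; a sketch assuming a local proxy on the hypothetical port 8080:

    import urllib2

    opener = urllib2.build_opener(
        urllib2.ProxyHandler({'http': 'http://127.0.0.1:8080'}))
    site = opener.open('http://google.com')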

Why I get urllib2.HTTPError with urllib2 and no errors with urllib?

I have the following simple code:

    import urllib2
    import sys
    sys.path.append('../BeautifulSoup/BeautifulSoup-3.1.0.1')
    from BeautifulSoup import *

    page = 'http://en.wikipedia.org/wiki/Main_Page'
    c = urllib2.urlopen(page)

This code generates the following error messages:

    c=urllib2.urlopen(page)
      File "/usr/lib64/python2.4/urllib2....
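
The likely cause: Wikipedia rejects urllib2's default User-Agent (Python-urllib/2.x) with a 403, while urllib identifies itself differently and slips through. Supplying your own header (the string below is just an example) typically fixes it:

    import urllib2

    req = urllib2.Request('http://en.wikipedia.org/wiki/Main_Page',
                          headers={'User-Agent': 'MyScraper/0.1 (example)'})
    c = urllib2.urlopen(req)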

How to download any(!) webpage with correct charset in python?

Problem: when screen-scraping a web page with Python, one has to know the character encoding of the page. If you get the character encoding wrong, your output will be messed up. People usually use some rudimentary technique to detect the encoding: they either use the charset from the header or the charset defined in the meta tag, or t...
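
A sketch of layering those techniques, falling back to byte-level detection with the third-party chardet library when the header is silent:

    import urllib2
    import chardet  # third-party byte-level encoding detector

    response = urllib2.urlopen('http://example.com/')
    raw = response.read()

    charset = response.headers.getparam('charset')    # 1. HTTP header
    if not charset:
        charset = chardet.detect(raw)['encoding']     # 2. statistical guess
    text = raw.decode(charset or 'utf-8', 'replace')  # 3. last-resort default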

urllib2: submitting a form and then redirecting

My goal is to come up with a portable urllib2 solution that would POST a form and then redirect the user to what comes out. The POSTing part is simple:

    request = urllib2.Request('https://some.site/page', data=urllib.urlencode({'key':'value'}))
    response = urllib2.urlopen(request)

Providing data sets the request type to POST. Now, what I su...
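
For reference, urllib2 already follows the redirect transparently, and response.geturl() reveals where it landed, which covers the "redirect the user" half:

    import urllib
    import urllib2

    request = urllib2.Request('https://some.site/page',
                              data=urllib.urlencode({'key': 'value'}))
    response = urllib2.urlopen(request)
    print response.geturl()  # the URL after any redirects were followed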

urlopen error 10045, 'address already in use' while downloading in Python 2.5 on Windows

I'm writing code that will run on Linux, OS X, and Windows. It downloads a list of approximately 55,000 files from the server, then steps through the list of files, checking if the files are present locally. (With SHA hash verification and a few other goodies.) If the files aren't present locally or the hash doesn't match, it downloads t...