urllib2

Stream large binary files with urllib2 to a file

I use the following code to stream large files from the Internet into a local file:

    fp = open(file, 'wb')
    req = urllib2.urlopen(url)
    for line in req:
        fp.write(line)
    fp.close()

This works but it downloads quite slowly. Is there a faster way? (The files are large so I don't want to keep them in memory.) ...
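A hedged sketch of the usual fix: iterating a binary response line by line makes Python scan for newline bytes, so copying fixed-size chunks is typically much faster. The url and path names below are placeholders.

    import shutil
    import urllib2

    url = 'http://example.com/big.bin'       # placeholder
    path = 'big.bin'                         # placeholder

    req = urllib2.urlopen(url)
    fp = open(path, 'wb')
    shutil.copyfileobj(req, fp, 16 * 1024)   # copy in 16 KB chunks
    fp.close()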

how to apply "catch-all" exception clause to complex python web-scraping script?

Hi, I've got a list of 100 websites in CSV format. All of the sites have the same general format, including a large table with 7 columns. I wrote this script to extract the data from the 7th column of each of the websites and then write this data to file. The script below partially works, however: opening the output file (after running ...
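A minimal sketch of the catch-all pattern, assuming the URLs sit in the first CSV column; extract_column7() is a hypothetical stand-in for the question's parsing code. Wrapping the per-site work in try/except lets a bad site be logged and skipped instead of killing the whole run.

    import csv
    import urllib2

    out = open('output.txt', 'a')                     # placeholder output file
    for row in csv.reader(open('urls.csv')):          # placeholder input file
        url = row[0]                                  # assumes URL in first column
        try:
            html = urllib2.urlopen(url).read()
            out.write(extract_column7(html) + '\n')   # hypothetical helper
        except Exception, e:                          # catch-all: log and keep going
            print 'failed: %s (%s)' % (url, e)
    out.close()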

Why is urllib2 missing table fields which I can see in the Firefox source?

The HTML that I am receiving from urllib2 is missing dozens of fields of data that I can see when I view the source of the URL in Firefox. Any advice would be much appreciated. Here is what it looks like, from the Firefox view-source:

    ...<td class=td6>as</td></tr></thead>|ManyFields|<br></div><div id="c1">...

from the urllib2-returned html...

How do I request an authenticated URL directly with Python?

I want to get to an authenticated page using urllib2. I'm hoping there's a hack to do it directly, something like:

    urllib2.urlopen('http://username:pwd@server/page')

If not, how do I use authentication? ...
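urllib2 ignores credentials embedded in the URL, so the direct hack does not work; here is a sketch of the supported route, with placeholder names. Using HTTPPasswordMgrWithDefaultRealm sidesteps needing to know the realm at all.

    import urllib2

    mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, 'http://server/', 'username', 'pwd')   # placeholders
    opener = urllib2.build_opener(urllib2.HTTPBasicAuthHandler(mgr))
    page = opener.open('http://server/page').read()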

How do I find the realm and URI of a site?

I want to use Python's urllib2 with authentication, and I need the realm and URI of a URL. How do I get them? Thanks ...
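One way to discover them, sketched with a placeholder URL: request the page without credentials and read the realm out of the WWW-Authenticate header on the 401 response; the uri is simply the URL you requested.

    import urllib2

    try:
        urllib2.urlopen('http://server/page')   # placeholder URL
    except urllib2.HTTPError, e:
        if e.code == 401:
            # prints e.g. 'Basic realm="Secure Area"'
            print e.info().getheader('WWW-Authenticate')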

non-blocking read/log from an http stream

I have a client that connects to an HTTP stream and logs the text data it consumes. I send the streaming server an HTTP GET request... The server replies and continuously publishes data... It will either publish text or send a ping (text) message regularly... and will never close the connection. I need to read and log the data it c...
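A sketch, assuming the server's messages are newline-terminated text: readline() returns as each line arrives, so the logger never waits for the end of a response that never comes. The URL and log path are placeholders.

    import urllib2

    stream = urllib2.urlopen('http://streaming.example.com/feed')   # placeholder
    log = open('stream.log', 'a')
    while True:
        line = stream.readline()   # blocks only until the next line/ping
        if not line:               # empty string means the server hung up
            break
        log.write(line)
        log.flush()                # persist each message immediately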

Trace/BPT trap when calling urllib.urlopen

For some reason I'm getting a Trace/BPT trap error when calling urllib.urlopen. I've tried both urllib and urllib2 with identical results. Here is the code which throws the error:

    def get_url(url):
        from urllib2 import urlopen
        if not url or not url.startswith('http://'):
            return None
        return urlopen(url).read()  # FIXME! I sho...

I am downloading a file using Python urllib2. How do I check how large the file is?

And if it is large... then stop the download? I don't want to download files that are larger than 12MB.

    request = urllib2.Request(ep_url)
    request.add_header('User-Agent', random.choice(agents))
    thefile = urllib2.urlopen(request).read()

...
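A hedged sketch: check the Content-Length header before reading the body, and cap the read as a fallback for servers that omit or misreport the header. ep_url and agents stand in for the question's variables.

    import random
    import urllib2

    ep_url = 'http://example.com/file.bin'     # placeholder
    agents = ['Mozilla/5.0 (compatible)']      # placeholder
    MAX_BYTES = 12 * 1024 * 1024               # the 12MB limit

    request = urllib2.Request(ep_url)
    request.add_header('User-Agent', random.choice(agents))
    response = urllib2.urlopen(request)

    length = response.info().getheader('Content-Length')
    if length is not None and int(length) > MAX_BYTES:
        thefile = None                         # declared too large: skip
    else:
        thefile = response.read(MAX_BYTES + 1)
        if len(thefile) > MAX_BYTES:           # header missing or wrong
            thefile = None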

urlretrieve returns an empty file

I'm trying to use urlretrieve to download files from URLs that take the form: http://example.com/download.php?id=6456&name=foo yet for some reason I just get an empty response. I've tried the method suggested in this question, but it didn't seem to help because remotefile.info() doesn't contain the key 'content-disposition', only ['...
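A hedged guess at the cause, sketched below: some download.php-style scripts serve an empty body to the default Python-urllib User-Agent, and urlretrieve gives no way to change it, so fetching with urllib2 and an explicit header is worth trying. The header value and output filename are placeholders.

    import urllib2

    url = 'http://example.com/download.php?id=6456&name=foo'
    req = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0 (compatible)'})
    data = urllib2.urlopen(req).read()
    f = open('foo.bin', 'wb')   # placeholder filename
    f.write(data)
    f.close()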

Get JSON data via URL and use it in Python (simplejson)

I imagine this must have a simple answer, but I am struggling: I want to take a URL (which outputs JSON) and get the data in a usable dictionary in Python. I am stuck on the last step.

    >>> import urllib2
    >>> import simplejson
    >>> req = urllib2.Request("http://vimeo.com/api/v2/video/38356.json", None, {'user-agent': 'syncstream/vimeo'})

...
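A sketch of the missing last step: the response object is file-like, so it can go straight into simplejson.load(), which returns plain Python lists and dicts.

    import urllib2
    import simplejson

    req = urllib2.Request("http://vimeo.com/api/v2/video/38356.json",
                          None, {'user-agent': 'syncstream/vimeo'})
    response = urllib2.urlopen(req)
    data = simplejson.load(response)   # now an ordinary list/dict structure
    print data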

Grab some OFX data with Python

I was trying to use http://www.jongsma.org/gc/scripts/ofx-ba.py to grab my bank account information from Wachovia. Having no luck, I decided that I would just try to manually construct some request data using this example. So, I have this file that I want to use as the request data. Let's call it req.ofxsgml:

    OFXHEADER:100
    DATA:OFXSGML
    ...
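A hedged sketch of the manual route, assuming the bank accepts the file as a POST body with the conventional application/x-ofx content type; the endpoint URL is a placeholder.

    import urllib2

    body = open('req.ofxsgml', 'rb').read()
    req = urllib2.Request('https://ofx.example.com/ofx', body,   # placeholder URL
                          {'Content-Type': 'application/x-ofx'})
    print urllib2.urlopen(req).read()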

Python fetching <title>

I want to fetch the title of a webpage which I open using urllib2. What is the best way to do this, to parse the HTML and find what I need (for now only the <title> tag, but I might need more in the future)? Is there a good parsing lib for this purpose? ...
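BeautifulSoup is one common choice here; a sketch (version 3 import style, placeholder URL) that tolerates broken HTML and makes the title a one-liner:

    import urllib2
    from BeautifulSoup import BeautifulSoup   # BeautifulSoup 3

    html = urllib2.urlopen('http://example.com/').read()
    soup = BeautifulSoup(html)
    print soup.title.string   # the text inside <title>...</title>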

Open web page with custom cookies in Python

Hi everyone. For example, I have cookies my_cookies = {'name': 'Albert', 'uid': '654897897564'} and I want to open the page http://website.com

    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (compatible)'))
    opener.open('http://website.com').read()

How can I do this with...
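A sketch of one direct route: format the dict as an explicit Cookie header on the opener, skipping cookielib entirely.

    import urllib2

    my_cookies = {'name': 'Albert', 'uid': '654897897564'}
    cookie_str = '; '.join('%s=%s' % kv for kv in my_cookies.items())

    opener = urllib2.build_opener()
    opener.addheaders.append(('User-agent', 'Mozilla/5.0 (compatible)'))
    opener.addheaders.append(('Cookie', cookie_str))
    html = opener.open('http://website.com').read()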

Python: appengine urllib2 headers from a 302

A normal urllib2 works fine:

    >>> import urllib2
    >>> r = urllib2.urlopen(u"http://bit.ly/4ovTZw")
    >>> r.geturl()
    'http://www.writing.com/main/handler/action/show_document/item_id/933413.mp3'
    >>> r.headers.get("Content-Type")
    'audio/mpeg'

But in App Engine, the same code shows text/html.

    def get(self):
        r = urllib2.urlopen(u"http://b...
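On App Engine, urllib2 is backed by urlfetch, which follows the bit.ly redirect itself; a hedged sketch of inspecting the 302 directly with follow_redirects=False:

    from google.appengine.api import urlfetch

    result = urlfetch.fetch(u"http://bit.ly/4ovTZw", follow_redirects=False)
    print result.status_code                 # the 301/302 from bit.ly
    print result.headers.get('Location')     # the redirect target URL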

Convert gzipped data fetched by urllib2 to HTML

I currently use mechanize to read a gzipped web page as below:

    br = mechanize.Browser()
    br.set_handle_gzip(True)
    response = br.open(url)
    data = response.read()

I wonder how to decompress gzipped data fetched by urllib2 to HTML text?

    req = urllib2.Request(url)
    opener = urllib2.build_opener()
    response = opener.open(req)
    data = response.r...
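A sketch of the urllib2 equivalent, assuming the server honours Accept-Encoding and returns gzipped bytes: wrap them in a file-like object and let the gzip module inflate them. url is a placeholder.

    import gzip
    import urllib2
    from StringIO import StringIO

    url = 'http://example.com/'   # placeholder
    req = urllib2.Request(url)
    req.add_header('Accept-Encoding', 'gzip')
    data = urllib2.urlopen(req).read()
    html = gzip.GzipFile(fileobj=StringIO(data)).read()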

Getting a myriad of socket issues while writing a desktop Python bot

The issue stems from the OAuth authentication portion of my code. I truncated a bunch of it and cut at the part where I get my error. My specific error is "gaierror: (11001, 'getaddrinfo failed')". I really have no idea why. I'm using Leah Culver's OAuth library (http://oauth.googlecode.com/svn/code/python/oauth/). Pretty much following t...

Client Digest Authentication in Python with urllib2 will not remember Authorization header information

I am trying to use Python to write a client that connects to a custom HTTP server that uses digest authentication. I can connect and pull the first request without problem. Using tcpdump (I am on Mac OS X--I am both a Mac and a Python noob) I can see the first request is actually two HTTP requests, as you would expect if you are famili...
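A hedged sketch of the usual remedy: build one opener with an HTTPDigestAuthHandler and reuse it (or install it globally), so the digest negotiation is handled on every request rather than only the first. Host and credentials are placeholders.

    import urllib2

    mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
    mgr.add_password(None, 'http://server/', 'user', 'secret')   # placeholders
    opener = urllib2.build_opener(urllib2.HTTPDigestAuthHandler(mgr))
    urllib2.install_opener(opener)   # plain urlopen() now uses it too

    first = urllib2.urlopen('http://server/page1').read()
    second = urllib2.urlopen('http://server/page2').read()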

I set a proxy server on urllib2, and then I can't change it.

Like the title says, my code basically does this: set proxy, test proxy, do some cool stuff. But after the proxy is set the first time, it sticks that way, never changing. This is the failing code:

    # Pick proxy
    r = random.randint(0, len(proxies) - 1)
    proxy = proxies[r]
    print proxy
    # Setup proxy
    l_proxy_support ...
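A sketch of the usual fix: install_opener() mutates global state once, so instead build a fresh opener per proxy and call its open() directly. proxies is assumed to be a list of 'host:port' strings as in the question.

    import random
    import urllib2

    def fetch_via_random_proxy(url, proxies):
        proxy = random.choice(proxies)   # pick a new proxy on every call
        print proxy
        opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
        return opener.open(url).read()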

Python: urllib2 multipart/form-data and proxies

The Objective: A script which cycles through a list of proxies and sends a POST request, containing a file, to a PHP page on my server, which then calculates delivery time. It's a pretty useless script, but I am using it to teach myself about urllib2. The Problem: So far I have got multipart/form-data sending correctly using Poster, but ...
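A hedged sketch of combining the two: poster's register_openers() returns the opener it installs, so a ProxyHandler can be added to that same opener before posting. The URL, proxy address, and filename are placeholders.

    import urllib2
    from poster.encode import multipart_encode
    from poster.streaminghttp import register_openers

    opener = register_openers()
    opener.add_handler(urllib2.ProxyHandler({'http': '1.2.3.4:8080'}))

    datagen, headers = multipart_encode({'file': open('upload.bin', 'rb')})
    req = urllib2.Request('http://example.com/receive.php', datagen, headers)
    print urllib2.urlopen(req).read()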

I just want to download this URL...but it is giving me an error! ...unicode.. (Python)

    theurl = 'http://bit.ly/6IcCtf/'
    urlReq = urllib2.Request(theurl)
    urlReq.add_header('User-Agent', random.choice(agents))
    urlResponse = urllib2.urlopen(urlReq)
    htmlSource = urlResponse.read()
    if unicode == 1:
        #print urlResponse.headers['content-type']
        #encoding=urlResponse.headers['content-type'].split('charset=')[-1]
        #htmlSour...
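A sketch of the decode step the commented-out lines were reaching for: pull the charset out of the Content-Type header and decode the bytes with it, falling back to UTF-8.

    import urllib2

    theurl = 'http://bit.ly/6IcCtf/'
    urlResponse = urllib2.urlopen(urllib2.Request(theurl))
    htmlSource = urlResponse.read()

    content_type = urlResponse.info().getheader('Content-Type') or ''
    encoding = 'utf-8'                                  # fallback assumption
    if 'charset=' in content_type:
        encoding = content_type.split('charset=')[-1]
    htmlText = htmlSource.decode(encoding, 'replace')   # now a unicode object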