How do I download a file with a progress report using Python, but without supplying a filename?

I have tried urllib.urlretrieve, but it seems I have to supply a filename for the download to be saved as.

So for example:

I don't want to supply this:

urllib.urlretrieve("http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US", "/tmp/firefox.exe")

just this:

urllib.urlretrieve("http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US", "/tmp/")

but if I do I get this error:

IOError: [Errno 21] Is a directory: '/tmp'

I am also unable to get the filename from some URLs, for example:

http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US

+2  A: 

There is urlopen, which creates a file-like object that can be used to read the data without saving it to a local file:

from urllib2 import urlopen

f = urlopen("http://example.com/")
for line in f:
  print len(line)
f.close()

(I'm not really sure if this is what you're asking for.)
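
Since the question also asks for a progress report, here is a minimal sketch of how the file-like object could be copied to a local file in chunks while reporting how much has been read. The names copy_with_progress, report and chunk_size are made up for illustration, and the sketch is written for Python 3 (the snippet above is Python 2).

```python
def copy_with_progress(src, dst, total=None, report=None, chunk_size=8192):
    """Copy src to dst in chunks, calling report(done, total) after each chunk.

    src and dst are any file-like objects opened in binary mode.
    total is the expected size in bytes, or None if it is unknown.
    (All names here are hypothetical, chosen for this sketch.)
    """
    done = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # empty read means end of stream
            break
        dst.write(chunk)
        done += len(chunk)
        if report is not None:
            report(done, total)
    return done
```

With a response from urlopen as src, the total would probably come from the Content-Length header when the server sends one; it may be missing, which is why total is allowed to be None.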

sth
Not quite. I have just edited my question with an example; I hope this helps. Thanks for the reply.
Samuel Taylor
+1  A: 

edited after the question was clarified...

urlparse.urlsplit will take the url that you are opening and split it into its component parts, then you can take the path portion and use the last /-delimited chunk as the filename.

import urllib, urlparse

split = urlparse.urlsplit(url)
filename = "/tmp/" + split.path.split("/")[-1]
urllib.urlretrieve(url, filename)
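
A slightly more defensive variant of the same idea, as a sketch only: it decodes percent-escapes such as %20 and falls back to a default name when the path ends in a slash. It uses the Python 3 module name urllib.parse rather than the Python 2 urlparse above; filename_from_url and the default value are hypothetical names.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def filename_from_url(url, default="download"):
    """Return the last path segment of url, percent-decoded.

    Falls back to `default` when the path has no final segment
    (for example "http://example.com/").
    """
    path = urlsplit(url).path  # query string and fragment are dropped here
    name = unquote(posixpath.basename(path))
    return name or default
```

Note that for the URL in the question this would give "download.html", because the product name is in the query string rather than in the path.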
teepark
Samuel Taylor
A: 

The URL you're specifying doesn't refer to a file at all. It leads to a web page that runs some JavaScript, which in turn causes your web browser to download the file. The actual address my browser was directed to (a mirror) from the URL in question is:

http://mozilla.mirrors.evolva.ro//firefox/releases/3.6.3/win32/en-US/Firefox%20Setup%203.6.3.exe

I believe there are two ways that web servers specify the name of the file for downloads:

  1. The final segment of the URL path
  2. The Content-Disposition header, which can specify some other filename to use

For the file you want to download I think you only need the last path segment of the URL (but using the actual URL of the file, not the web page that chooses which mirrored file to use). But for some downloads you'd need to get the filename to use from the Content-Disposition header.
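
The two sources could be combined roughly like this: prefer the filename from Content-Disposition when there is one, otherwise use the last path segment. This is a sketch, not a complete header parser. It leans on the standard library's email.message.Message to parse the header value, is written for Python 3, and filename_from_headers and its parameters are made-up names.

```python
import posixpath
from email.message import Message
from urllib.parse import urlsplit, unquote


def filename_from_headers(content_disposition, final_url, default="download"):
    """Pick a filename from a Content-Disposition value, else from the URL.

    content_disposition is the raw header value, or None if the header
    was not sent. final_url is the URL of the file itself.
    """
    if content_disposition:
        msg = Message()
        msg["Content-Disposition"] = content_disposition
        name = msg.get_filename()  # None when there is no filename parameter
        if name:
            # keep only the part after the last "/" so the server
            # cannot place the file in another directory that way
            return posixpath.basename(name) or default
    # fall back to the last path segment, percent-decoded
    name = unquote(posixpath.basename(urlsplit(final_url).path))
    return name or default
```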

Matt Anderson
A: 

A quick look at the JavaScript on the Firefox page reveals:

// 2. Build download.mozilla.org URL out of those vars.
download_url = "http://download.mozilla.org/?product=";
download_url += product + '&os=' + os + '&lang=' + lang;

So just change your url from:

http://www.mozilla.com/products/download.html?product=firefox-3.6.3&os=win&lang=en-US

to

http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-US

So now I will check the headers to see what we really get...

$ curl -I "http://download.mozilla.org/?product=firefox-3.6.3&os=win&lang=en-US"
HTTP/1.1 302 Found
Server: Apache
X-Backend-Server: pp-app-dist09
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0, private
Content-Type: text/html; charset=UTF-8
Date: Sat, 08 May 2010 21:02:50 GMT
Location: http://mozilla.mirror.ac.za/firefox/releases/3.6.3/win32/en-US/Firefox Setup 3.6.3.exe
Pragma: no-cache
Transfer-Encoding: chunked
Connection: Keep-Alive
Set-Cookie: dmo=10.8.84.200.1273352570769772; path=/; expires=Sun, 08-May-11 21:02:50 GMT
X-Powered-By: PHP/5.1.6

So this is actually a 302 redirect. Use the value of the Location header as your new URL to get the actual file. You'll need to figure out how to make the request and read the headers on your own (sorry, I don't have much time). Once you have the Location header, you can use a regex to strip off everything up to the last slash and get the filename to save the file as:

>>> import re
>>> location = 'http://mozilla.mirror.ac.za/firefox/releases/3.6.3/win32/en-US/Firefox Setup 3.6.3.exe'
>>> re.match('^.*/(.*?)$', location).groups()[0]
'Firefox Setup 3.6.3.exe'

So to get the actual filename you will need to follow the 302 yourself. The code necessary for this I will leave up to you, but hopefully this will point you in the right direction.
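
One possible shortcut, as a sketch: urlopen normally follows 3xx redirects by itself, and the response's geturl() reports the URL it ended up at, so the filename can be taken from that. The code below uses the Python 3 module names; download_to_dir and default are hypothetical names, and I have not checked how it copes with the unescaped spaces in the Location header shown above.

```python
import os
import posixpath
import shutil
from urllib.parse import urlsplit, unquote
from urllib.request import urlopen


def download_to_dir(url, directory, default="download"):
    """Download url into directory, naming the file after the final URL."""
    with urlopen(url) as response:
        final_url = response.geturl()  # URL after any redirects were followed
        name = unquote(posixpath.basename(urlsplit(final_url).path)) or default
        target = os.path.join(directory, name)
        with open(target, "wb") as out:
            shutil.copyfileobj(response, out)  # stream the body to disk
    return target
```

Called as download_to_dir(url, "/tmp/"), this would only need the directory, which is what the question asked for.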

Kekoa
A: 

urlgrabber.urlgrab() will use the basename of the URL passed to it as the filename. Note that it will ignore the Content-Disposition header.

Ignacio Vazquez-Abrams