views: 2977

answers: 4

urllib.urlretrieve returns silently even if the file doesn't exist on the remote HTTP server; it just saves an HTML page to the named file. For example:

urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')

returns silently, even if abc.jpg doesn't exist on the google.com server; the generated abc.jpg is not a valid JPEG file but actually an HTML page. I guess the returned headers (an httplib.HTTPMessage instance) can be used to tell whether the retrieval succeeded or not, but I can't find any documentation for httplib.HTTPMessage.
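Something like this is what I have in mind, although I don't know whether checking the content type this way is reliable:

import urllib

filename, headers = urllib.urlretrieve('http://google.com/abc.jpg', 'abc.jpg')
# headers is the httplib.HTTPMessage; gettype() returns the Content-Type,
# so a successful image download probably shouldn't report text/html
if headers.gettype() == 'text/html':
    print 'probably got an error page instead of the image'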

Can anybody provide some information about this problem?

+3  A: 

According to the documentation, it is undocumented.

To get access to the message, it looks like you do something like:

a, b=urllib.urlretrieve('http://google.com/abc.jpg', r'c:\abc.jpg')

b is the message instance

Since I have been learning Python, I have found it is always useful to use Python's ability to be introspective. When I type

dir(b)

I see lots of methods or functions to play with

And then I started doing things with b

for example

b.items()

lists lots of interesting things; I suspect that playing around with these will allow you to get at the attribute you want to manipulate.
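For instance, here is a quick way to dump what's in there, using methods that show up in dir(b) (getheader comes from the rfc822.Message base class, if I remember right):

# b holds the response headers as (name, value) pairs
for name, value in b.items():
    print name, ':', value

# individual headers can also be fetched by name
print b.getheader('content-type')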

Sorry this is such a beginner's answer, but I am trying to master how to use the introspection abilities to improve my learning, and your question just popped up.

Well, I tried something interesting related to this: I was wondering if I could automatically get the output from each of the things that showed up in dir() that did not need parameters, so I wrote:

needparam = []
for each in dir(b):
    attr = getattr(b, each)
    if not callable(attr):
        continue  # plain attribute, nothing to call
    try:
        attr()        # try calling it with no arguments
        print each
    except Exception:
        needparam.append(each)  # this one needs parameters
PyNEwbie
+3  A: 

Consider using urllib2 if it is possible in your case. It is more advanced and easier to use than urllib.

You can detect any HTTP errors easily:

>>> import urllib2
>>> resp = urllib2.urlopen("http://google.com/abc.jpg")
Traceback (most recent call last):
<<MANY LINES SKIPPED>>
urllib2.HTTPError: HTTP Error 404: Not Found

resp is actually an HTTPResponse object that you can do a lot of useful things with:

>>> resp = urllib2.urlopen("http://google.com/")
>>> resp.code
200
>>> resp.headers["content-type"]
'text/html; charset=windows-1251'
>>> resp.read()
"<<ACTUAL HTML>>"
Alex Lebedev
Can urllib2 provide the caching behavior of urlretrieve though? Or would we have to reimplement it?
Kiv
See this awesome recipe from ActiveState: http://code.activestate.com/recipes/491261/ We're using it in our current project; it works flawlessly.
Alex Lebedev
urlopen does not provide a hook function (to show a progress bar, for example) like urlretrieve does.
Sridhar Ratnakumar
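One way to approximate the reporthook is to read from urlopen in chunks and call a callback yourself. A minimal sketch (the hook takes the same (block count, block size, total size) arguments as urlretrieve's hook):

import urllib2

def urlopen_with_hook(url, filename, reporthook=None, chunk_size=8192):
    resp = urllib2.urlopen(url)
    # Content-Length may be missing, in which case we report -1
    total = int(resp.headers.get('content-length', -1))
    f = open(filename, 'wb')
    try:
        block = 0
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)
            block += 1
            if reporthook:
                reporthook(block, chunk_size, total)
    finally:
        f.close()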
A: 

I ended up with my own retrieve implementation. With the help of pycurl it supports more protocols than urllib/urllib2; I hope it can help other people.

import os
import tempfile

import pycurl

def get_filename_parts_from_url(url):
    # strip any #fragment and ?query, then split off the extension
    fullname = url.split('/')[-1].split('#')[0].split('?')[0]
    t = list(os.path.splitext(fullname))
    if t[1]:
        t[1] = t[1][1:]  # drop the leading dot from the extension
    return t

def retrieve(url, filename=None):
    if not filename:
        # no target given: derive a suffix from the URL and download to a temp file
        garbage, suffix = get_filename_parts_from_url(url)
        f = tempfile.NamedTemporaryFile(suffix='.' + suffix, delete=False)
        filename = f.name
    else:
        f = open(filename, 'wb')
    c = pycurl.Curl()
    c.setopt(pycurl.URL, str(url))
    c.setopt(pycurl.WRITEFUNCTION, f.write)
    # fail on HTTP 4xx/5xx instead of silently saving the error page
    c.setopt(pycurl.FAILONERROR, True)
    try:
        c.perform()
    except pycurl.error:
        # the download failed; signal this by returning None
        filename = None
    finally:
        c.close()
        f.close()
    return filename
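Usage is then simply (these URLs are only examples):

print retrieve('http://example.com/logo.gif')   # saves to a temp file and prints its path
print retrieve('http://example.com/abc.jpg', 'abc.jpg')   # prints 'abc.jpg', or None on failure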
btw0
A: 

You can create a new URLopener (inheriting from FancyURLopener) and raise exceptions or handle errors any way you want. Unfortunately, FancyURLopener ignores 404 and other errors by default. See this question:

http://stackoverflow.com/questions/1308542/how-to-catch-404-error-in-urllib-urlretrieve
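A minimal sketch of such a subclass (the class name is my own), restoring the raising behaviour that plain URLopener has:

import urllib

class RaisingURLopener(urllib.FancyURLopener):
    def http_error_default(self, url, fp, errcode, errmsg, headers):
        # plain URLopener raises IOError here; FancyURLopener swallows it
        fp.close()
        raise IOError('http error', errcode, errmsg)

opener = RaisingURLopener()
opener.retrieve('http://google.com/abc.jpg', 'abc.jpg')  # now raises on 404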

Christian Davén