views: 45
answers: 1

I want to have a robot fetch a URL every hour, but if the site's operator is malicious he could have his server send me a 1 GB file. Is there a good way to limit downloading to, say, 100 KB and stop after that limit?

I can imagine writing my own connection handler from scratch, but I'd like to use urllib2 if at all possible, just specifying the limit somehow.

Thanks!

+4  A: 

This is probably what you're looking for:

import urllib

def download(url, max_bytes=1024):
    """Copy up to max_bytes of the file at the given URL
    to a local file named after the last path component.
    """
    webFile = urllib.urlopen(url)
    localFile = open(url.split('/')[-1], 'wb')  # binary mode, so non-text content isn't mangled
    localFile.write(webFile.read(max_bytes))    # read() accepts a maximum byte count
    webFile.close()
    localFile.close()
KushalP
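As a quick usage note (not from the original answer), a call capping the fetch at the 100 KB mentioned in the question might look like the following; the URL is only a placeholder.

# Hypothetical call: fetch at most 100 KB of the hourly URL and save it
# locally under its last path component ('feed.xml' here).
download('http://example.com/feed.xml', max_bytes=100 * 1024)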
read() takes a bytes argument? That's fantastic, it's exactly what I wanted, thank you very much! I haven't been able to find it in the docs...
Stavros Korokithakis
http://docs.python.org/library/stdtypes.html#file.read (The most important methods of Python's file objects are implemented by pretty much all file-like objects in Python.)
Forest
Thanks, I knew about file.read() but didn't realize that the same semantics are implemented in url.read()...
Stavros Korokithakis
Last time I tried this technique it failed, because it was actually impossible to read only the specified amount of data from the HTTP server; i.e., you implicitly read the whole HTTP response and only then read the first N bytes out of it. So in the end you ended up downloading the whole 1 GB malicious response.
Konstantin
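If read(n) alone turns out not to be enough in a given setup (as Konstantin describes), a common workaround is to read the response in small chunks and close the connection as soon as a byte cap is reached. Below is a minimal sketch along those lines using urllib2 (which the question prefers), assuming Python 2 as in the answer above; fetch_limited, MAX_BYTES and CHUNK are illustrative names, not anything from the thread.

import urllib2

MAX_BYTES = 100 * 1024   # 100 KB cap, as in the question
CHUNK = 8192             # read the body in small pieces

def fetch_limited(url, limit=MAX_BYTES):
    """Return at most `limit` bytes of the response body, then close the connection."""
    response = urllib2.urlopen(url)
    pieces = []
    received = 0
    try:
        while received < limit:
            chunk = response.read(min(CHUNK, limit - received))
            if not chunk:
                break            # server finished before the limit was reached
            pieces.append(chunk)
            received += len(chunk)
    finally:
        response.close()         # drop the connection even if the cap was hit
    return ''.join(pieces)

Whether any data beyond the cap ever crosses the wire depends on how much the server has already pushed into the OS socket buffers, so this bounds what the script keeps and reads rather than giving a hard network-level guarantee.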