Is there a way to limit the amount of data downloaded by Python's urllib2 module? Sometimes I encounter broken sites that serve something like /dev/random as a page, and they end up using all the memory on my server.
A:
urllib2.urlopen returns a file-like object, and you can (at least in theory) .read(N) from such an object to limit the amount of data returned to N bytes at most.
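For example, a minimal sketch of that idea (the URL and the 1 MB cap are just illustrative values):

    import urllib2

    MAX_BYTES = 1024 * 1024  # illustrative cap: read at most 1 MB

    response = urllib2.urlopen('http://example.com/')  # hypothetical URL
    data = response.read(MAX_BYTES)  # returns at most MAX_BYTES bytes
    response.close()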
This approach is not entirely fool-proof, because an actively hostile site may go to quite some lengths to fool a reasonably trusting receiver, like urllib2's default opener; in that case, you'll need to implement and install your own opener that knows how to guard itself against such attacks (for example, by reading no more than a MB at a time from the open socket, etc, etc).
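A simpler variant of that defensive reading, without a custom opener, is to pull the body in small chunks and stop once a total cap is reached. The helper below is a hypothetical sketch, not part of urllib2 itself; the names, chunk size, and cap are assumptions:

    import urllib2

    def bounded_read(url, max_bytes=1024 * 1024, chunk_size=8192):
        # Read at most max_bytes from url, in chunk_size pieces,
        # so a never-ending response cannot exhaust memory.
        response = urllib2.urlopen(url)
        chunks = []
        total = 0
        try:
            while total < max_bytes:
                chunk = response.read(min(chunk_size, max_bytes - total))
                if not chunk:  # server closed the connection normally
                    break
                chunks.append(chunk)
                total += len(chunk)
        finally:
            response.close()
        return ''.join(chunks)

Reading in bounded chunks keeps peak memory proportional to the cap rather than to whatever the server decides to send.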
Alex Martelli
2009-08-03 22:34:24