I am trying to retrieve a 500 MB file using Python, and I have a script which uses urllib.urlretrieve(). There seems to be some network problem between me and the download site, as this call consistently hangs and fails to complete. However, using wget to retrieve the file tends to work without problems. What difference between urlretrieve() and wget could cause this behavior?

+1  A: 

If you are using:

page = urllib.urlretrieve('http://example.com/really_big_file.html')

you are creating a 500 MB string, which may well tax your machine, make it slow, and cause the connection to time out. If so, you should be using:

(filename, headers) = urllib.urlretrieve('http://...', 'local_outputfile.html')

which won't tax the interpreter.

It is worth noting that urllib.urlretrieve() uses urllib.urlopen(), which is now deprecated.
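
If it helps to see where the transfer stalls, urlretrieve() also accepts a reporthook callback that is called after each block. A minimal sketch along the lines of the examples above (the URL and output filename are just placeholders):

import urllib

def report(block_count, block_size, total_size):
    # Called by urlretrieve() after each block; handy for spotting
    # where a transfer stalls.
    print "%d of %d bytes" % (block_count * block_size, total_size)

filename, headers = urllib.urlretrieve(
    'http://example.com/really_big_file.html',
    'local_outputfile.html',
    reporthook=report)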

msw
Good point, although I am using the longer version that saves to a file.
jrdioko
+6  A: 

The answer is quite simple: Python's urllib and urllib2 are nowhere near as mature and robust as they could be. Even better than wget, in my experience, is cURL. I've written code that downloads gigabytes of files over HTTP, with file sizes ranging from 50 KB to over 2 GB. To my knowledge, cURL is the most reliable piece of software on the planet right now for this task. I don't think Python, wget, or even most web browsers can match it in terms of correctness and robustness of implementation. On a modern enough Python, urllib2 used in exactly the right way can be made pretty reliable, but I still run a curl subprocess, and that is absolutely rock solid.
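
What "exactly the right way" means is open to interpretation; one common approach is to stream the response to disk in fixed-size chunks with a socket timeout, rather than reading the whole body at once. A rough sketch under that assumption (the URL and filename are placeholders):

import urllib2

# Stream the body to disk 64 KB at a time with a socket timeout,
# instead of buffering the whole download in memory.
response = urllib2.urlopen('http://example.com/really_big_file.html', timeout=60)
with open('local_outputfile.html', 'wb') as out:
    while True:
        chunk = response.read(64 * 1024)
        if not chunk:
            break
        out.write(chunk)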

Another way to state this is that cURL does one thing only and it does it better than any other software because it has had much more development and refinement. Python's urllib2 is serviceable and convenient and works well enough for small to average workloads, but cURL is way ahead in terms of reliability.

Also, cURL has numerous options for tuning its reliability behavior, including retry counts, timeout values, and so on.
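
For instance, a curl subprocess can be launched from Python with retries, redirect following, resume, and an overall time limit. This is just one way to wire it up, not necessarily how it is done above; the URL and output name are placeholders:

import subprocess

# Retry transient failures, treat HTTP errors as failures, follow redirects,
# resume a partial download if one exists, and cap the total transfer time.
returncode = subprocess.call([
    'curl',
    '--retry', '5',
    '--fail',
    '--location',
    '--continue-at', '-',
    '--max-time', '3600',
    '--output', 'really_big_file.html',
    'http://example.com/really_big_file.html',
])
if returncode != 0:
    raise RuntimeError('curl exited with status %d' % returncode)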

Peter Lyons