I am trying to retrieve a 500 MB file using Python, and I have a script which uses urllib.urlretrieve(). There seems to be some network problem between me and the download site, as this call consistently hangs and fails to complete. However, using wget to retrieve the file tends to work without problems. What difference between urlretrieve() and wget could cause this behavior?

+1  A: 

If you are using:

page = urllib.urlretrieve('http://example.com/really_big_file.html')

you are creating a 500 MB string, which may well tax your machine, make it slow, and cause the connection to time out. If so, you should be using:

(filename, headers) = urllib.urlretrieve('http://...', 'local_outputfile.html')

which won't tax the interpreter.

It is worth noting that urllib.urlretrieve() uses urllib.urlopen(), which is now deprecated.
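
If it helps to see where the transfer stalls, urlretrieve() also accepts a reporthook callback that is called after each block. A minimal sketch along the lines of the examples above (the URL and output filename are just placeholders):

import urllib

def report(block_count, block_size, total_size):
    # Called by urlretrieve() after each block; handy for spotting
    # where a transfer stalls.
    print "%d of %d bytes" % (block_count * block_size, total_size)

filename, headers = urllib.urlretrieve(
    'http://example.com/really_big_file.html',
    'local_outputfile.html',
    reporthook=report)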

msw
Good point, although I am using the longer version that saves to a file.
jrdioko
+6  A: 

The answer is quite simple: Python's urllib and urllib2 are nowhere near as mature and robust as they could be. Even better than wget, in my experience, is cURL. I've written code that downloads gigabytes of files over HTTP, with file sizes ranging from 50 KB to over 2 GB. To my knowledge, cURL is the most reliable piece of software on the planet right now for this task. I don't think Python, wget, or even most web browsers can match it in terms of correctness and robustness of implementation. On a modern enough Python, urllib2 used in exactly the right way can be made pretty reliable, but I still run a curl subprocess, and that is absolutely rock solid.
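
What "exactly the right way" means is open to interpretation; one common approach is to stream the response to disk in fixed-size chunks with a socket timeout, rather than reading the whole body at once. A rough sketch under that assumption (the URL and filename are placeholders):

import urllib2

# Stream the body to disk 64 KB at a time with a socket timeout,
# instead of buffering the whole download in memory.
response = urllib2.urlopen('http://example.com/really_big_file.html', timeout=60)
with open('local_outputfile.html', 'wb') as out:
    while True:
        chunk = response.read(64 * 1024)
        if not chunk:
            break
        out.write(chunk)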

Another way to state this is that cURL does one thing only and it does it better than any other software because it has had much more development and refinement. Python's urllib2 is serviceable and convenient and works well enough for small to average workloads, but cURL is way ahead in terms of reliability.

Also, cURL has numerous options for tuning its reliability behavior, including retry counts, timeout values, and so on.
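
For instance, a curl subprocess can be launched from Python with retries, redirect following, resume, and an overall time limit. This is just one way to wire it up, not necessarily how it is done above; the URL and output name are placeholders:

import subprocess

# Retry transient failures, treat HTTP errors as failures, follow redirects,
# resume a partial download if one exists, and cap the total transfer time.
returncode = subprocess.call([
    'curl',
    '--retry', '5',
    '--fail',
    '--location',
    '--continue-at', '-',
    '--max-time', '3600',
    '--output', 'really_big_file.html',
    'http://example.com/really_big_file.html',
])
if returncode != 0:
    raise RuntimeError('curl exited with status %d' % returncode)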

Peter Lyons