I'm trying to download (and save) a binary file from the web using Python 2.6 and urllib.

As I understand it, read(), readline(), and readlines() are the three ways to read a file-like object. Since binary files aren't really broken into newlines, read() and readlines() read the whole file into memory.

Is choosing a random read() buffer size the most efficient way to limit memory usage during this process?

i.e.

import urllib
import os

title = 'MyFile'
downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
mydirpath = os.path.join('c:', os.sep, 'mydirectory',
                         downloadurl.split('/')[-1])

if not os.path.exists(mydirpath):
    print "Downloading...%s" % title
    localFile = open(mydirpath, 'wb')
    data = webFile.read(1000000) #1MB at a time
    while data:
        localFile.write(data)
        data = webFile.read(1000000) #1MB at a time
    webFile.close()
    localFile.close()
    print "Finished downloading: %s" % title
else:
    print "%s already exists." % mydirypath

I chose read(1000000) arbitrarily because it worked and kept RAM usage down. I assume that if I were working with a raw network buffer, choosing an arbitrary amount would be bad, since the buffer might run dry if the transfer rate were too low. But it seems urllib is already handling the lower-level buffering for me.

With that in mind, is choosing an arbitrary number fine? Is there a better way?

Thanks.

+1  A: 

You should use urllib.urlretrieve for this. It will handle everything for you.
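
A minimal sketch of what that could look like, reusing the URL and target path from the question:

import os
import urllib

downloadurl = 'http://somedomain.com/myfile.avi'
mydirpath = os.path.join('c:', os.sep, 'mydirectory',
                         downloadurl.split('/')[-1])

if not os.path.exists(mydirpath):
    # urlretrieve streams the response to disk in fixed-size chunks,
    # so the whole file is never held in memory.
    urllib.urlretrieve(downloadurl, mydirpath)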

Paolo Bergantino
A: 

Instead of writing your own read-write loop, you should probably check out the shutil module. The copyfileobj() function lets you set the buffer size. The most efficient chunk size varies from situation to situation; even copying the same source file to the same destination can differ because of network conditions.
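
A rough sketch of how that might look with the urlopen handle from the question (the 1 MB chunk size is only an illustrative choice; copyfileobj's default is 16 KB):

import shutil
import urllib

downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
localFile = open(downloadurl.split('/')[-1], 'wb')

# copyfileobj runs the read()/write() loop for you; the optional
# third argument sets the chunk size.
shutil.copyfileobj(webFile, localFile, 1024 * 1024)

webFile.close()
localFile.close()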

Dingo