I'm trying to download (and save) a binary file from the web using Python 2.6 and urllib.
As I understand it, read(), readline() and readlines() are the 3 ways to read a file-like object. Since the binary files aren't really broken into newlines, read() and readlines() read teh whole file into memory.
Is choosing a random read() buffer size the most efficient way to limit memory usage during this process?
i.e.
import urllib
import os
title = 'MyFile'
downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
mydirpath = os.path.join('c:', os.sep,'mydirectory',\
downloadurl.split('/')[-1])
if not os.path.exists(mydirpath):
print "Downloading...%s" % title
localFile = open(mydirpath, 'wb')
data = webFile.read(1000000) #1MB at a time
while data:
localFile.write(data)
data = webFile.read(1000000) #1MB at a time
webFile.close()
localFile.close()
print "Finished downloading: %s" % title
else:
print "%s already exists." % mydirypath
I chose read(1000000) arbitrarily because it worked and kept RAM usage down. I assume if I was working with a raw network buffer choosing a random amount would be bad since the buffer might run dry if the transfer rate was too low. But it seems urllib is already handling lower level buffering for me.
With that in mind, is choosing an arbitrary number fine? Is there a better way?
Thanks.