views: 92

answers: 1

I currently use WebFaction for my hosting with the basic package that gives us 80MB of RAM. This is more than adequate for our needs at the moment, apart from our backups. We do our own backups to S3 once a day.

The backup process is this: dump the database, tar.gz all the files into a single archive named with the date of the backup, then upload it to S3 using the Python library provided by Amazon.
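
For context, a minimal sketch of that kind of pipeline, assuming a MySQL database and hypothetical paths and names (the real script will differ):

import os
import subprocess
import tarfile
from datetime import date

DB_NAME = 'mydb'                       # hypothetical database name
SITE_ROOT = '/home/me/webapps/mysite'  # hypothetical path to the site files

dump_path = '/tmp/%s.sql' % DB_NAME
archive_path = '/tmp/backup_%s.tar.gz' % date.today().strftime('%Y-%m-%d')

# 1. Dump the database to a file on disk
dump_file = open(dump_path, 'wb')
subprocess.check_call(['mysqldump', DB_NAME], stdout=dump_file)
dump_file.close()

# 2. Pack the dump and the site files into one dated tar.gz archive
archive = tarfile.open(archive_path, 'w:gz')
archive.add(dump_path, arcname=os.path.basename(dump_path))
archive.add(SITE_ROOT, arcname='site')
archive.close()

# 3. The archive is then uploaded to S3 (see the upload code below)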

Unfortunately, it appears (although I don't know this for certain) that either my code for reading the file or the S3 code is loading the entire file into memory. As the file is approximately 320MB (for today's backup), it is using about 320MB of RAM just for the backup. This causes WebFaction to quit all our processes, meaning the backup doesn't happen and our site goes down.

So this is the question: Is there any way to avoid loading the whole file into memory, or are there any other Python S3 libraries that are much better with RAM usage? Ideally it needs to use about 60MB at the most! If this can't be done, how can I split the file and upload it in separate parts?

Thanks for your help.

This is the section of code (in my backup script) that caused the processes to be quit:

filedata = open(filename, 'rb').read()  # reads the entire backup file into memory at once
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(filedata),
                          {'x-amz-acl': 'public-read', 'Content-Type': content_type})
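
As an aside on the library question, a rough sketch of how an alternative library such as boto could stream the upload from disk instead of reading the whole file first. This is an assumption rather than the script's current approach, and ACCESS_KEY and SECRET_KEY are placeholders:

from boto.s3.connection import S3Connection
from boto.s3.key import Key

conn = S3Connection(ACCESS_KEY, SECRET_KEY)   # placeholder credentials
bucket = conn.get_bucket(BUCKET_NAME)
key = Key(bucket)
key.key = 'daily/%s' % filename
# boto sends the file in small chunks rather than reading it all into memory
key.set_contents_from_filename(filename,
                               headers={'Content-Type': content_type},
                               policy='public-read')
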
A: 

Don't read the whole file into your filedata variable. You could use a loop, read about 60 MB at a time, and submit each chunk to Amazon.

backup = open(filename, 'rb')
while True:
    part_of_file = backup.read(60000000) # not exactly 60 MB....
    if not part_of_file:
        break # nothing left to read
    response = connection.put() # submit part_of_file here to amazon
flurin
Thanks for the response. Will this put the file up in separate parts, and do I have to change the filename each loop, or will the data all be concatenated together on the Amazon end? Is there anything special I have to do in connection.put() to make it do this?
danpalmer
This would submit each part of the file as a single file. I don't know the Amazon API, but maybe there is a way to concatenate all the file parts into a single file on the Amazon side. In the connection.put() call you could increment a number inside the filename (backup_20100317_part1, part2, part3...).
flurin
Thank you for the help, I know this is the way I want to do the backups, but I have a few questions about this method. Firstly, if I read(60000000) and there is not that much data left to read, what will happen? Also, can I just have this: x = file.read(60000000); upload x; while x: x = file.read(60000000); upload x? The reason I ask is because it is difficult for me to test this script on my machine, and I can't test it on the server since if it fails it could bring the sites down. Thanks for the help.
danpalmer
You can find info on read() here: http://docs.python.org/library/stdtypes.html#file-objects. read() should return "" if there is no more data to read, and if there is less than 60000000 bytes left in the file you get the rest. I strongly suggest you test it! Why not submit some file from your computer to Amazon just to check whether it works (test the restore too!):

backup = open(filename, 'rb')
while True:
    part_of_file = backup.read(60000000) # not exactly 60 MB....
    if part_of_file == "":
        break # we have read it all
    # submit part_of_file to amazon here
flurin
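
Pulling the comments together, a chunked upload along these lines might look like the sketch below. connection.put, BUCKET_NAME and S3.S3Object come from the question's code; the _partN key suffix follows the naming suggested above, and the Content-Type for the parts is an assumption:

backup = open(filename, 'rb')
part_num = 0
while True:
    part_of_file = backup.read(60000000)  # roughly 60 MB per chunk
    if part_of_file == "":
        break  # the whole archive has been read
    part_num += 1
    part_key = 'daily/%s_part%d' % (filename, part_num)  # e.g. backup_20100317.tar.gz_part1
    response = connection.put(BUCKET_NAME, part_key,
                              S3.S3Object(part_of_file),
                              {'x-amz-acl': 'public-read',
                               'Content-Type': 'application/octet-stream'})  # assumed type for raw parts
backup.close()
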
Ok. I have the script backing up in parts perfectly. I did a dry run on my computer with no issues. However, I am having difficulty re-combining the files. My current process is this: open a new file for writing, open parts 1, 2, n..., backup.write(part1.read() + part2.read()...), backup.close(). I can decompress the file, but when I un-tar the other files inside it, they are corrupt and the directories show as files. I can confirm that the backups that were not split work perfectly, so it is because of the splitting. Any thoughts?
danpalmer
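
For reference, re-combining the parts should just be a binary concatenation in upload order. A minimal sketch, assuming the parts have been downloaded locally under the hypothetical names backup_<date>.tar.gz_part1, _part2, ..., that also avoids holding everything in memory:

import glob
import re

def part_number(path):
    # sort numerically so part10 comes after part9, not after part1
    return int(re.search(r'part(\d+)$', path).group(1))

parts = sorted(glob.glob('backup_*.tar.gz_part*'), key=part_number)

combined = open('backup_combined.tar.gz', 'wb')
for part in parts:
    source = open(part, 'rb')
    while True:
        chunk = source.read(1024 * 1024)  # copy 1 MB at a time
        if not chunk:
            break
        combined.write(chunk)
    source.close()
combined.close()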