views: 358
answers: 6

What's the fastest way to download a large number of relatively small files (10-50 kB each) from Amazon S3 using Python? (On the order of 200,000 to a million files.)

At the moment I am using boto to generate signed URLs and PyCURL to fetch the files one by one.
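Roughly, the current code looks like this (a minimal sketch; the bucket name is illustrative):

    import boto
    import pycurl

    # Connect using credentials from the environment / boto config.
    conn = boto.connect_s3()
    bucket = conn.get_bucket('my-bucket')  # illustrative bucket name

    for key in bucket.list():
        url = key.generate_url(3600)  # signed URL, valid for one hour
        c = pycurl.Curl()
        c.setopt(pycurl.URL, str(url))
        # (assumes flat key names; nested keys would need directories)
        f = open(key.name, 'wb')
        c.setopt(pycurl.WRITEDATA, f)
        c.perform()  # blocking: one file at a time
        c.close()
        f.close()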

Would some type of concurrency help, such as a pycurl.CurlMulti object?
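Something along these lines is what I have in mind (a rough sketch adapted from the retriever-multi example that ships with PyCURL; `urls` would be the signed URLs generated above):

    import pycurl

    NUM_CONN = 20                      # transfers kept in flight at once
    queue = list(enumerate(urls))      # urls: the signed URLs from above
    freelist = [pycurl.Curl() for _ in range(NUM_CONN)]
    m = pycurl.CurlMulti()
    num_active = 0

    while queue or num_active:
        # Start new transfers while there are spare handles and work left.
        while queue and freelist:
            i, url = queue.pop(0)
            c = freelist.pop()
            c.fp = open('file_%06d' % i, 'wb')
            c.setopt(pycurl.URL, str(url))
            c.setopt(pycurl.WRITEDATA, c.fp)
            m.add_handle(c)
            num_active += 1
        # Drive all active transfers forward.
        while m.perform()[0] == pycurl.E_CALL_MULTI_PERFORM:
            pass
        # Reap finished transfers and recycle their handles.
        while True:
            num_q, ok_list, err_list = m.info_read()
            for c in ok_list + [h for h, errno, errmsg in err_list]:
                c.fp.close()
                m.remove_handle(c)
                freelist.append(c)
                num_active -= 1
            if num_q == 0:
                break
        m.select(1.0)  # wait for network activity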

I am open to all suggestions. Thanks!

+2  A: 

I don't know anything about Python, but in general you would want to break the task down into smaller chunks so they can be run concurrently. You could partition the files by type or alphabetically by name, and then run a separate script for each portion of the breakdown, along the lines of the sketch below.
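In Python terms, the split-and-run-separately idea might look something like this (keys.txt, fetch_chunk.py, and the chunk count are all hypothetical):

    import subprocess

    # Split the full key list into 8 chunks and hand each chunk to a
    # separate worker process running a (hypothetical) fetch_chunk.py.
    keys = [line.strip() for line in open('keys.txt')]
    chunk_size = len(keys) // 8 + 1
    procs = []
    for i in range(0, len(keys), chunk_size):
        chunk_file = 'chunk_%d.txt' % i
        with open(chunk_file, 'w') as f:
            f.write('\n'.join(keys[i:i + chunk_size]))
        procs.append(subprocess.Popen(['python', 'fetch_chunk.py', chunk_file]))
    for p in procs:
        p.wait()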

gburgoon
A: 

I've been using txaws with Twisted for S3 work, though what you'd probably want is just to get the authenticated URL and use twisted.web.client.downloadPage (which by default will happily stream straight to a file without much intervention).

Twisted makes it easy to run at whatever concurrency you want. For something on the order of 200,000 downloads, I'd probably create a generator, use a cooperator to set my concurrency, and just let the generator produce every required download request.
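A minimal sketch of that generator-plus-cooperator pattern (the URL list, filenames, and concurrency level are illustrative):

    from twisted.internet import defer, reactor, task
    from twisted.web.client import downloadPage

    urls = []  # fill with pre-authenticated S3 URLs

    def fetch_all(urls, concurrency=50):
        # The generator lazily produces one download Deferred per URL;
        # downloads only start as workers pull from it.
        work = (downloadPage(str(url), 'file_%06d' % i)
                for i, url in enumerate(urls))
        coop = task.Cooperator()
        # Each coiterate() task pulls from the shared generator, so at
        # most `concurrency` downloads are outstanding at any moment.
        return defer.DeferredList([coop.coiterate(work)
                                   for _ in range(concurrency)])

    def main():
        d = fetch_all(urls)
        d.addBoth(lambda _: reactor.stop())

    reactor.callWhenRunning(main)
    reactor.run()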

If you're not familiar with Twisted, you'll find the model takes a bit of time to get used to, but it's oh so worth it. In this case, I'd expect it to incur minimal CPU and memory overhead, but you would have to worry about file descriptors. It's quite easy to mix in Perspective Broker and farm the work out to multiple machines should you find yourself needing more file descriptors, or if you have multiple connections over which you'd like to pull the data down.

Dustin
+1  A: 

In the case of Python, since this is I/O bound, multiple threads will make use of the CPU, but they will probably only use one core. If you have multiple cores, you might want to consider the new multiprocessing module. Even then, you may want to have each process use multiple threads. You would have to do some tweaking of the number of processes and threads.

If you do use multiple threads, this is a good candidate for the Queue class.
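A minimal sketch of the threads-plus-Queue approach (urllib2 stands in for whatever HTTP client you prefer, and `signed_urls` is assumed to come from boto):

    import threading
    import urllib2
    from Queue import Queue  # `queue` on Python 3

    NUM_WORKERS = 20
    signed_urls = []  # assumed: signed URLs generated with boto

    def worker(q):
        while True:
            url, dest = q.get()
            try:
                data = urllib2.urlopen(url).read()
                with open(dest, 'wb') as f:
                    f.write(data)
            except Exception:
                pass  # a real version would log or retry the failure
            finally:
                q.task_done()  # mark the item done even on failure

    q = Queue()
    for _ in range(NUM_WORKERS):
        t = threading.Thread(target=worker, args=(q,))
        t.daemon = True  # don't block interpreter exit
        t.start()

    for i, url in enumerate(signed_urls):
        q.put((url, 'file_%06d' % i))

    q.join()  # block until every queued download has finished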

Kathy Van Stone
A: 

What about threads + Queue? I love this article: Practical threaded programming with Python.

sunqiang
+1  A: 

You might consider using s3fs, and just running concurrent file system commands from Python.
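For example, with the bucket mounted via s3fs, a minimal sketch (the mount point, output directory, and pool size are illustrative):

    import os
    import shutil
    from multiprocessing.dummy import Pool  # a thread pool

    SRC = '/mnt/s3'    # assumed s3fs mount point
    DST = '/tmp/out'   # illustrative output directory

    def copy_one(name):
        shutil.copy(os.path.join(SRC, name), os.path.join(DST, name))

    # Run the plain filesystem copies concurrently with 20 threads.
    Pool(20).map(copy_one, os.listdir(SRC))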

Brent Newey
A: 

Every job is best done with the appropriate tool :)

You want to use Python for stress-testing S3 :), so I suggest finding a large-volume downloader program and passing the links to it.

On Windows, I have experience installing the ReGet program (shareware, from http://reget.com) and creating download tasks via its COM interface.

Of course, other programs with a usable interface may exist.

Regards!

Denis Barmenkov