So I know this is a common question, but there just don't seem to be any good answers for it.

I have a bucket with gobs of files in it (I have no clue how many). They are each 2k or smaller.

1) How do I figure out how many of these files I have WITHOUT listing them? I've used the s3cmd.rb, aws/s3, and jets3t stuff and the best I can find is a command to count the first 1000 records (really performing GETS on them).

I've been using jets3t's applet as well because it's really nice to work with, but even there I can't list all my objects because I run out of heap space (presumably because it is performing GETs on all of them and keeping them in memory).

2) How can I just delete a bucket? The best thing I've seen is a parallelized delete loop, and that has problems because sometimes it tries to delete the same file. This is what all the 'deleteall' commands that I've run across do.

What do you guys do who have boasted about hosting millions of images/txts?? What happens when you want to remove it?

3) Lastly, are there alternate answers to this? All of these files are txt/xml files so I'm not even sure S3 is such a concern -- maybe I should move this to a document database of sorts??

What it boils down to is that the amazon S3 API is just straight out missing 2 very important operations -- COUNT and DEL_BUCKET. (actually there is a delete bucket command but it only works when the bucket is empty) If someone comes up with a method that does not suck to do these two operations I'd gladly give up lots of bounty.

UPDATE

Just to answer a few questions: the reason I ask is that for the past year or so I have been storing hundreds of thousands, more like millions, of 2k txt and xml documents. The last time I wanted to delete the bucket, a couple of months ago, it literally took DAYS because the bucket has to be empty before you can delete it. This was such a pain in the ass that I dread ever having to do it again without API support for it.

UPDATE

this rocks the house!

http://github.com/SFEley/s3nuke/

I rm'd a good couple of gigs' worth of 1-2k files within minutes. Thank you, Steve Eley.

+1  A: 

I am most certainly not one of those 'guys who have boasted about hosting millions of images/txts', as I only have a few thousand, and this may not be the answer you are looking for, but I looked at this a while back.

From what I remember, there is an API command called HEAD, which returns information about an object without retrieving the complete object the way GET does. That may help in counting the objects.

As far as deleting Buckets, at the time I was looking, the API definitely stated that the bucket had to be empty, so you need to delete all the objects first.

But I never used either of these commands, because I was using S3 as a backup. In the end I wrote a few routines that uploaded the files I wanted to S3 (so that part was automated), but never bothered with the restore/delete/file management side of the equation. For that I used Bucket Explorer, which did all I needed. In my case, it wasn't worth spending the time when for $50 I could get a program that does all I need. There are probably others that do the same (e.g. CloudBerry).

In your case, with Bucket Explorer, you can right click on a bucket and select delete, or right click and select properties, and it will count the number of objects and the size they take up. It certainly does not download the whole object. (E.g. the last bucket I looked at was 12GB and around 500 files; it would take hours to download 12GB, whereas the size and count are returned in a second or two.) And if there is a limit, then it certainly isn't 1000.

Hope this helps.

sgmoore
A: 

1) Regarding your first question, you can list the items in a bucket without actually retrieving them. You can do that with both the SOAP and the REST API. As you can see, you can define the maximum number of items to list per request (max-keys in the REST API) and the position to start the listing from (the marker). Read more about it here.

I do not know of any implementation of the paging, but especially for the REST interface it would be very easy to implement it in any language.

2) I believe the only way to delete a bucket is to first empty it of all items. See also this question.

3) I would say that S3 is very well suited for storing a large number of files. It depends, however, on what you want to do. Do you plan to also store binary files? Do you need to perform any queries, or is just listing the files enough?
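
To make the paging from point 1 concrete, here is a minimal shell sketch of walking a bucket by marker. `list_page` is a hypothetical helper, not part of any existing tool: it is assumed to issue one signed GET Bucket request with the given marker and print the returned keys one per line. The loop only shows the marker bookkeeping, and it stops on an empty page rather than reading the IsTruncated flag from the response.

```shell
# count_all_keys: page through a bucket listing and print the total key count.
# Assumes a caller-supplied function `list_page MARKER` (hypothetical) that
# prints the keys sorted after MARKER, one per line, for a single request.
count_all_keys() {
    marker=""
    total=0
    while :; do
        page=$(list_page "$marker")
        # An empty page means the listing is exhausted.
        [ -z "$page" ] && break
        n=$(printf '%s\n' "$page" | awk 'END { print NR }')
        total=$((total + n))
        # The last key of this page becomes the marker for the next request.
        marker=$(printf '%s\n' "$page" | tail -n 1)
    done
    echo "$total"
}
```

Only keys travel over the wire here, so memory use stays at one page no matter how large the bucket is.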

kgiannakakis
Even listing the keys 1000 at a time, or whatever the number was, took forever: more than an afternoon. I finally killed it after I got bored and noticed that my heap was way too full.
feydr
I don't think there is an API call to just get the number of items. Probably you've used a tool that also gets the contents of the files - that's why it took so long. Just use Fiddler or some other tool to send the GET bucket request (see the REST API link in my answer). It shouldn't take long to get the xml back. I am afraid that I don't have such a big bucket to test it myself.
kgiannakakis
A: 

"List" won't retrieve the data. I use s3cmd (a python script) and I would have done something like this:

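# Write the key list in chunks of 10,000 ("-" makes split read stdin), then delete each chunk in the background.
s3cmd ls s3://foo | awk '{print $4}' | split -a 5 -l 10000 - bucketfiles_
for i in bucketfiles_*; do xargs -n 1 s3cmd rm < "$i" & done
wait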

But first check how many bucketfiles_ files you get. There will be one s3cmd running per file.

It will take a while, but not days.
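
Since the listing does not fetch object data, the same `s3cmd ls` output can also be used to count the objects. A rough sketch, assuming your s3cmd version supports `--recursive` for `ls` and prints one line per object with date, time, size and URI. It still has to page through the entire listing, so on a very large bucket it will not be instant.

```shell
# count_keys: print the number of objects under an s3:// URI.
# Object lines carry four or more fields (date, time, size, URI);
# "DIR" lines for prefixes carry only two and are skipped.
count_keys() {
    s3cmd ls --recursive "$1" | awk 'NF >= 4 { n++ } END { print n + 0 }'
}
```

Usage would be something like `count_keys s3://foo`.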

Thomas
I actually tried this method. I've just come to the conclusion that S3 cannot support deleting buckets right now, and that, along with its horrendous access speed, leaves an extremely bitter taste in my mouth.
feydr
A: 

I've had the same problem with deleting hundreds of thousands of files from a bucket. It may be worthwhile to fire up an EC2 instance to run the parallel delete because the latency to S3 is low. I think there's some money to be made hosting a bunch of EC2 servers and charging people to delete buckets quickly. (At least until Amazon gets around to changing the API)
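
Whichever machine runs it, one way to keep parallel workers from attempting the same key twice is to build the key list once and let xargs hand each key to exactly one worker. A minimal sketch, assuming GNU or BSD xargs (for -0 and -P) and a key file with one s3:// URI per line, such as the chunks written by split in the earlier answer. The S3CMD variable is only an illustrative hook so the command can be swapped out for a dry run; it is not something s3cmd itself reads.

```shell
# parallel_delete KEYFILE [WORKERS]
# Deletes every URI listed in KEYFILE using a bounded pool of workers.
parallel_delete() {
    keyfile=$1
    workers=${2:-8}
    # tr turns newlines into NULs so keys containing spaces survive xargs.
    # Each key is passed to exactly one "s3cmd rm" invocation.
    tr '\n' '\0' < "$keyfile" | xargs -0 -n 1 -P "$workers" ${S3CMD:-s3cmd} rm
}
```

Running `S3CMD=echo parallel_delete bucketfiles_aaaaa 4` prints the arguments each worker would receive instead of deleting anything. Unlike one background job per chunk file, the worker count here stays fixed regardless of how many keys there are.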

bertrandom