I am going to be hosting files that users submit. I need to extract some data from each file and then move it to a directory.

There are two points of interest in the lifetime of a file. The first is when the data is being extracted, and the second is when the file is archived so that it can be shared.

While the data is being extracted, I've thought of renaming the file to something unique, or appending a unique string to the filename, to keep it from overwriting existing files.

When the file is going to be archived, I've thought of three strategies. One is to keep all files uploaded on a certain date in one folder (2006/sept/04, 2008/jan/05). Another is to keep filling a folder until it reaches some maximum number of files and then create a new one (/folder001/, /folder002/, /folder003/, etc.). The third is to create subfolders once a folder reaches some threshold, like /j/jd/jde/jdelator. I've seen this on Unix but I'm not sure how to explain it.
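To make the three layouts concrete, here is a minimal sketch of how each path could be computed. The function names, the month abbreviations, and the default of 1000 files per folder are assumptions chosen for illustration, not something fixed by the question.

```python
from datetime import date
from pathlib import PurePosixPath

# Fixed abbreviations so the result does not depend on the system locale.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def date_path(uploaded, filename):
    """Date layout: <year>/<month>/<day>/<filename>."""
    return PurePosixPath(str(uploaded.year),
                         MONTHS[uploaded.month - 1],
                         f"{uploaded.day:02d}",
                         filename)


def bucket_path(sequence_number, filename, max_per_folder=1000):
    """Counter layout: folder001 holds uploads 0..999, folder002 the next 1000, and so on."""
    folder = sequence_number // max_per_folder + 1
    return PurePosixPath(f"folder{folder:03d}", filename)


def prefix_path(name, depth=3):
    """Prefix layout: one directory level per leading prefix of the name."""
    prefixes = [name[:i] for i in range(1, depth + 1)]
    return PurePosixPath(*prefixes, name)


print(date_path(date(2006, 9, 4), "report.txt"))
print(bucket_path(1000, "report.txt"))
print(prefix_path("jdelator"))
```

Note that this prefix version always creates a fixed number of levels, which is simpler than splitting a directory only once it crosses a threshold, because no file ever has to move.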

My question is: what strategies have you found useful or used?

+1  A: 

I've used a relational database that maps integer IDs to UUIDs, and the UUIDs are the names of the files on disk. This way it doesn't matter how the files are laid out on disk, and it helps obfuscate them. I can also use JOINs to "rename" a file arbitrarily, or give one file several different "names." It all depends on your app and where it is running.
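A rough sketch of that mapping, using SQLite from the Python standard library. The table and column names are made up for the example, and a real application would likely carry more metadata.

```python
import sqlite3
import uuid


def create_schema(conn):
    # Hypothetical schema: integer ID -> UUID used as the on-disk name,
    # plus the name the user originally gave the file.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " stored_name TEXT NOT NULL UNIQUE,"
        " original_name TEXT NOT NULL)"
    )


def register_upload(conn, original_name):
    """Record an upload and return (id, stored_name)."""
    stored_name = uuid.uuid4().hex  # random name, unrelated to the user's filename
    cur = conn.execute(
        "INSERT INTO files (stored_name, original_name) VALUES (?, ?)",
        (stored_name, original_name),
    )
    conn.commit()
    return cur.lastrowid, stored_name


def lookup(conn, file_id):
    """Return (stored_name, original_name) for an ID, or None if unknown."""
    return conn.execute(
        "SELECT stored_name, original_name FROM files WHERE id = ?",
        (file_id,),
    ).fetchone()


conn = sqlite3.connect(":memory:")
create_schema(conn)
file_id, stored = register_upload(conn, "report.pdf")
print(file_id, lookup(conn, file_id))
```

Because only the database knows which UUID belongs to which upload, the directory layout can change later without touching any public identifier.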

Sargun Dhillon
+1  A: 

Though it depends on your application, I would suggest keeping the file repository scheme very simple for now and deciding on a more elaborate strategy later. In other words, live with a kind of "managed chaos" for a while; structure and strategy will emerge once you know all the requirements and domain specifics. By keeping it simple, you can change everything easily.

Change is inevitable anyway; the best thing you can do now is choose some strategy and document everything.

+2  A: 

I'd vote for a GUID in a database, then use the Content-Disposition header to name the file back to its original filename if necessary. One thing I would advocate is storing the folders you use outside of the web root; you don't want users uploading files into your application folders.
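As an illustration of the header part, the sketch below builds a Content-Disposition value from the original filename. The helper name is hypothetical, and the ASCII fallback rule (replacing quotes, backslashes, and non-ASCII or non-printable characters with an underscore) is just one reasonable choice.

```python
from urllib.parse import quote


def content_disposition(original_name):
    """Build a Content-Disposition value that restores the original filename."""
    # Plain ASCII fallback for older clients: replace anything that could
    # break the quoted string or is not printable ASCII.
    fallback = "".join(
        c if c.isascii() and c.isprintable() and c not in '"\\' else "_"
        for c in original_name
    )
    # Percent-encoded UTF-8 form (filename*) for clients that support it.
    encoded = quote(original_name, safe="")
    return f"attachment; filename=\"{fallback}\"; filename*=UTF-8''{encoded}"


print(content_disposition("my report.pdf"))
```

The file itself can stay on disk under its GUID; only the response header carries the name the user sees when saving.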

blowdart
+2  A: 

When data is being extracted, I would choose something like filename + millisec(). It is unlikely that two calls to millisec() will return the same value, and keeping the filename is more user-friendly when accessing the file.
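A small sketch of the filename + millisec() idea follows. Since two uploads can still land in the same millisecond, it also opens the file in exclusive mode so a collision raises an error instead of silently overwriting. Both function names are invented for the example.

```python
import os
import time
from pathlib import PurePosixPath


def timestamped_name(filename, now_ms=None):
    """Turn 'report.pdf' into 'report_<milliseconds>.pdf'."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000  # current time in milliseconds
    # Keep only the last path component so the submitted name cannot add directories.
    name = PurePosixPath(filename).name
    stem = PurePosixPath(name).stem
    suffix = PurePosixPath(name).suffix
    return f"{stem}_{now_ms}{suffix}"


def save_exclusive(directory, name, data):
    """Write data to directory/name, failing if that file already exists."""
    path = os.path.join(directory, name)
    with open(path, "xb") as f:  # "x" raises FileExistsError on a name clash
        f.write(data)
    return path


print(timestamped_name("report.pdf"))
```

On a FileExistsError the caller could simply generate a new name and retry.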

The date strategy can be convenient if you decide to remove old and unused files: you only have to take the 2006 folder and remove everything that, according to your logs, has not been accessed in the last year. It can also be a good indication for your users, as they will know whether a file is fresh or not. The folderXYZ scheme is only a variant of this one, replacing the date with a new tag every N files.

The threshold subfolders help you keep the number of entries in each directory low, so access is faster. Note that this solution sometimes requires moving files (and therefore breaking some URLs, if they are not mapped) when a particular directory grows.

Another possibility is to use a DB with a UID corresponding to the file's location, and to access the file through http://server.com/UID/filename.txt. This way, the user saves the file as "filename.txt", which is convenient for them, and you know from the URL where to find the file (using the DB to translate the UID to a location). Note that the UID can be a checksum (MD5, SHA-1) to handle duplicates of the same file.
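Here is a minimal sketch of the checksum-as-UID idea, with an in-memory dictionary standing in for the DB. The class and method names are assumptions; SHA-1 is used because it is mentioned above, and a stronger hash such as SHA-256 from the same module could be swapped in.

```python
import hashlib


def content_uid(data):
    """UID derived from the file content, so identical files get the same UID."""
    return hashlib.sha1(data).hexdigest()


class DedupIndex:
    """Toy stand-in for the DB table that maps a UID to a storage location."""

    def __init__(self):
        self._locations = {}

    def add(self, data, location):
        uid = content_uid(data)
        # Only the first copy is recorded; later duplicates reuse its location.
        self._locations.setdefault(uid, location)
        return uid

    def locate(self, uid):
        return self._locations[uid]


index = DedupIndex()
uid_a = index.add(b"same bytes", "2008/jan/05/a.txt")
uid_b = index.add(b"same bytes", "2008/jan/06/b.txt")
print(uid_a == uid_b, index.locate(uid_a))
```

The URL would then combine the UID with whatever display name the user chose, while the index resolves the UID to the single stored copy.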

ofaurax