views:

235

answers:

5

I'm writing a file-sharing host, and I want to rename every file on upload to a unique name, while somehow keeping track of the original names in the database. Since two or more files obviously cannot share the same name, I'm looking for an algorithm which, based on a key or something similar, generates random names for me.

Moreover, I don't want to generate a name and then search the database to see whether that file already exists. I want to be sure, 100% (or at least 99%), that the generated filename has never been produced by my application before.

Any idea how I can write such an application?

+3  A: 

GUIDs are one way. You're basically guaranteed not to get any repeats (provided you have a proper random generator).
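A minimal sketch of this approach in Python (the uuid module and the .mp3 extension here are just illustrative; any language with a GUID/UUID library works the same way):

    import uuid

    def unique_filename(extension):
        # uuid4() carries 122 bits of randomness; with a proper
        # random source, a repeat is astronomically unlikely.
        return uuid.uuid4().hex + "." + extension

    print(unique_filename("mp3"))  # e.g. 3f2c9a0d41e84b7f9c1de02a6b58f3a1.mp3

You would then store the generated name in the database alongside the original file name.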

sysrqb
+1  A: 

The best way is to simply use a counter. The first file is 1, the next is 2, another is 3, and so on...

But, it seems you want random. A quick way to do this is to make sure the new random number is greater than the last file's ID. You can cache the last file and simply offset the new random number from its ID:

file = last_file + random(1 through 10)
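A rough sketch of that idea (Python for illustration; in a real application last_file would be persisted, e.g. in your database, rather than held in a module-level variable):

    import random

    last_file = 0  # in practice, load/store this in the database

    def next_file_id():
        global last_file
        # Offsetting by 1..10 keeps the new ID strictly greater
        # than the previous one, so it can never repeat.
        last_file += random.randint(1, 10)
        return last_file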
carl
+1, but I think you want "random(1 through 10)" -- if you got a 0, you'd be allocating the same ID as last time.
j_random_hacker
yes, you are right. I have edited your suggestion in.
carl
+2  A: 

You could also append the time since the epoch.
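For instance (a Python sketch; note that the timestamp alone can collide if two uploads arrive in the same second, so it is best used as a suffix on some other component):

    import time

    def timestamped_name(base, extension):
        # Append seconds since the epoch to the base name.
        return "%s_%d.%s" % (base, int(time.time()), extension)

    print(timestamped_name("upload", "dat"))  # e.g. upload_1234567890.dat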

bedwyr
+10  A: 

You could produce a hash based on the file contents themselves. There are two good reasons to do this:

  1. It allows you to never store the same file twice - for example, if you have two copies of a music file that are identical in content, you can check whether that file is already stored and keep just a single copy.

  2. It separates the meta-data (the file name is just meta-data) from the blob. You end up with a storage system indexed by the hash of the file contents, and you associate the file's meta-data with that hash lookup code.

Depending on the size of the hash, the risk of finding two files that produce the same hash without actually having the same contents is low, and you can effectively mitigate it by hashing the file in chunks (which could then lead to some interesting storage optimisation scenarios :P).
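A minimal sketch of such content-addressed storage (Python; the SHA-256 choice, the chunk size and the "blobs" directory are assumptions for illustration):

    import hashlib
    import os
    import shutil

    STORE = "blobs"  # hypothetical storage directory

    def store_file(path):
        # Hash the contents in chunks so large files never have
        # to fit in memory at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        digest = h.hexdigest()
        os.makedirs(STORE, exist_ok=True)
        dest = os.path.join(STORE, digest)
        # Identical contents produce an identical digest, so a
        # second copy of the same file is never stored twice.
        if not os.path.exists(dest):
            shutil.copy(path, dest)
        return digest  # store this key with the file's meta-data

The digest returned here is what you would associate with the user-visible file name in your database.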

Mitch Denny
Be sure to read the following article if you're going to do something like this: http://www.linuxworld.com/cgi-bin/mailto/x_linux.cgi?pagetosend=/export/home/httpd/linuxworld/news/2007/111207-hash.html
Brian Campbell
Is there a reason source control systems don't detect linkage in this manner?
ojblass
Some do. Git names all its files in the internal repository after their hashes.
Nick Johnson
+3  A: 

The best solutions have already been mentioned. I just want to add some thoughts.

The simplest solution is to have a counter and increment it on every new file. This works quite well as long as only one thread creates new files. If multiple threads, processes or even systems add new files, things get a bit more complicated: you must coordinate the creation of new IDs with locking or a similar synchronisation method. You could also assign ID ranges to every process to reduce the synchronisation work, or extend the file ID with a unique process ID.
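A sketch of the ID-range idea (Python for illustration; reserve_range stands in for whatever shared atomic allocator you actually have, e.g. a database sequence advanced by the block size):

    import threading

    class IdAllocator:
        """Hands out IDs from a pre-reserved block, so the shared
        counter only has to be touched once per BLOCK ids."""
        BLOCK = 1000  # illustrative block size

        def __init__(self, reserve_range):
            self._reserve = reserve_range  # returns the start of a fresh block
            self._lock = threading.Lock()
            self._next = self._end = 0

        def next_id(self):
            with self._lock:
                if self._next >= self._end:
                    self._next = self._reserve(self.BLOCK)
                    self._end = self._next + self.BLOCK
                result = self._next
                self._next += 1
                return result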

A better solution in this scenario might be to use GUIDs, so that you do not have to care about synchronisation between processes at all.

Finally, you can add some random data to every identifier to make the names harder to guess, if that is a requirement.

Also common is storing files in a directory structure where the location of a file depends on its name. The file abcdef1234.xyz might be stored as /ab/cd/ef/1234.xyz. This avoids directories with a huge number of files. I am not entirely sure why this is done - maybe file system limitations or performance issues - but it is quite common. I do not know whether similar schemes are common when the files are stored directly in the database.
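A sketch of that layout (Python; the two-character, three-level split just mirrors the example above and is only one common choice):

    import os

    def shard_path(root, name):
        # abcdef1234.xyz -> root/ab/cd/ef/1234.xyz
        return os.path.join(root, name[0:2], name[2:4], name[4:6], name[6:])

    print(shard_path("/data", "abcdef1234.xyz"))
    # /data/ab/cd/ef/1234.xyz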

Daniel Brückner