views:

235

answers:

5

I'm writing a file-sharing host, and I want to rename every file on upload to a unique name, while somehow keeping track of the original names in the database. Since two or more files obviously cannot share the same name, I'm looking for an algorithm which, based on a key or something similar, generates random names for me.

Moreover, I don't want to generate a name and then search the database to see whether that file already exists. I want to be sure, 100% (or at least 99%), that the generated filename has never been produced by my application before.

Any idea how I can write such an application?

+3  A: 

GUIDs are one way. You're basically guaranteed not to get any repeats (provided you have a proper random generator).
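A minimal sketch of this approach in Python (the uuid module and the .mp3 extension here are just illustrative; any language with a GUID/UUID library works the same way):

    import uuid

    def unique_filename(extension):
        # uuid4() carries 122 bits of randomness; with a proper
        # random source, a repeat is astronomically unlikely.
        return uuid.uuid4().hex + "." + extension

    print(unique_filename("mp3"))  # e.g. 3f2c9a0d41e84b7f9c1de02a6b58f3a1.mp3

You would then store the generated name in the database alongside the original file name.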

sysrqb
+1  A: 

The best way is to simply use a counter. The first file is 1, the next is 2, another is 3, and so on...

But, it seems you want random. A quick way to do this is to make sure the new random number is greater than the last file's ID. You can cache the last file and simply offset the new random number from its ID:

file = last_file + random(1 through 10)
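A rough sketch of that idea (Python for illustration; in a real application last_file would be persisted, e.g. in your database, rather than held in a module-level variable):

    import random

    last_file = 0  # in practice, load/store this in the database

    def next_file_id():
        global last_file
        # Offsetting by 1..10 keeps the new ID strictly greater
        # than the previous one, so it can never repeat.
        last_file += random.randint(1, 10)
        return last_file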
carl
+1, but I think you want "random(1 through 10)" -- if you got a 0, you'd be allocating the same ID as last time.
j_random_hacker
yes, you are right. I have edited your suggestion in.
carl
+2  A: 

You could also append the time since the epoch.
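For instance (a Python sketch; note that the timestamp alone can collide if two uploads arrive in the same second, so it is best used as a suffix on some other component):

    import time

    def timestamped_name(base, extension):
        # Append seconds since the epoch to the base name.
        return "%s_%d.%s" % (base, int(time.time()), extension)

    print(timestamped_name("upload", "dat"))  # e.g. upload_1234567890.dat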

bedwyr
+10  A: 

You could produce a hash based on the file contents themselves. There are two good reasons to do this:

  1. It allows you to never store the same file twice - for example, if you have two copies of a music file that are identical in content, you can check whether that file is already stored and keep just a single copy.

  2. It separates the meta-data (the file name is just meta-data) from the blob. You end up with a storage system indexed by the hash of the file contents, and you associate the file's meta-data with that hash lookup code.

Depending on the size of the hash, the risk of finding two files that produce the same hash without actually having the same contents is low, and you can effectively mitigate it by hashing the file in chunks (which could then lead to some interesting storage optimisation scenarios :P).
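A minimal sketch of such content-addressed storage (Python; the SHA-256 choice, the chunk size and the "blobs" directory are assumptions for illustration):

    import hashlib
    import os
    import shutil

    STORE = "blobs"  # hypothetical storage directory

    def store_file(path):
        # Hash the contents in chunks so large files never have
        # to fit in memory at once.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        digest = h.hexdigest()
        os.makedirs(STORE, exist_ok=True)
        dest = os.path.join(STORE, digest)
        # Identical contents produce an identical digest, so a
        # second copy of the same file is never stored twice.
        if not os.path.exists(dest):
            shutil.copy(path, dest)
        return digest  # store this key with the file's meta-data

The digest returned here is what you would associate with the user-visible file name in your database.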

Mitch Denny
Be sure to read the following article if you're going to do something like this: http://www.linuxworld.com/cgi-bin/mailto/x_linux.cgi?pagetosend=/export/home/httpd/linuxworld/news/2007/111207-hash.html
Brian Campbell
Is there a reason source control systems don't detect linkage in this manner?
ojblass
Some do. Git names all its files in the internal repository after their hashes.
Nick Johnson
+3  A: 

The best solutions have already been mentioned. I just want to add some thoughts.

The simplest solution is to have a counter and increment it on every new file. This works quite well as long as only one thread creates new files. If multiple threads, processes or even systems add new files, things get a bit more complicated: you must coordinate the creation of new IDs with locking or a similar synchronisation method. You could also assign ID ranges to every process to reduce the synchronisation work, or extend the file ID with a unique process ID.
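A sketch of the ID-range idea (Python for illustration; reserve_range stands in for whatever shared atomic allocator you actually have, e.g. a database sequence advanced by the block size):

    import threading

    class IdAllocator:
        """Hands out IDs from a pre-reserved block, so the shared
        counter only has to be touched once per BLOCK ids."""
        BLOCK = 1000  # illustrative block size

        def __init__(self, reserve_range):
            self._reserve = reserve_range  # returns the start of a fresh block
            self._lock = threading.Lock()
            self._next = self._end = 0

        def next_id(self):
            with self._lock:
                if self._next >= self._end:
                    self._next = self._reserve(self.BLOCK)
                    self._end = self._next + self.BLOCK
                result = self._next
                self._next += 1
                return result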

A better solution in this scenario might be to use GUIDs, so that you do not have to care about synchronisation between processes at all.

Finally, you can add some random data to every identifier to make the names harder to guess, if that is a requirement.

Also common is storing files in a directory structure where the location of a file depends on its name. The file abcdef1234.xyz might be stored as /ab/cd/ef/1234.xyz. This avoids directories with a huge number of files. I am not entirely sure why this is done - maybe file system limitations or performance issues - but it is quite common. I do not know whether similar schemes are common when the files are stored directly in the database.
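A sketch of that layout (Python; the two-character, three-level split just mirrors the example above and is only one common choice):

    import os

    def shard_path(root, name):
        # abcdef1234.xyz -> root/ab/cd/ef/1234.xyz
        return os.path.join(root, name[0:2], name[2:4], name[4:6], name[6:])

    print(shard_path("/data", "abcdef1234.xyz"))
    # /data/ab/cd/ef/1234.xyz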

Daniel Brückner