ansaurus

Question

Answer 1

+2 A:

The idea is not to change the file content, but rather its name (and path), by using a hash value.

Changing the content with a hash would be disastrous since a hash is normally not reversible.

I'm not sure of the motivation for using a hash rather than the file name (or even rather than a long random number), but here are a few advantages of the hash approach:

the file names on the disk are uniform
the upper or lower parts of the hash value can be used to name the directories, and hence distribute the files relatively uniformly
the name becomes a code, making it difficult for someone to a) guess a file name or b) categorize pictures (should someone steal the hard drive content)
the file name and location can be retrieved from the file contents itself, assuming the hash is computed from that content (not quite sure which use case would involve this... a bit contrived...)

The general interest of using a hash is that unlike a file name, a hash is meaningless, and therefore one would require the database to relate images and "bibliographic" type data (name of uploader, date of upload, tags, ...)

In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...

Furthermore... some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful: it makes the file names longer than they need to be, and hence puts more stress on the file system (bigger directories...)

mjv 2009-11-22 17:20:01

If you use a hash, then several identical copies of the same file are stored in the same location. With a random number, the copies will be stored in different locations. This may be an advantage or a disadvantage, depending on your case.

Juha Syrjälä 2009-11-22 17:44:31

what does someone like flickr or facebook do?

viatropos 2009-11-22 17:52:56

here is some interesting information about facebook's photo storage infrastructure http://www.facebook.com/note.php?note_id=76191543919

tosh 2009-11-22 18:28:35

thanks! http://stackoverflow.com/questions/1779609/how-big-of-a-team-does-it-take-to-make-this-huge-of-a-file-uploading-architecture

viatropos 2009-11-22 19:03:22

Answer 2

+1 A:

The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.

So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.

This has the additional benefit of detecting duplicate uploads.
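As a rough sketch of how the duplicate detection could fall out of this naming scheme (the function name `store_upload`, the two-level directory layout, and the `.jpg` extension are assumptions made for illustration, not something taken from the referenced answer):

```python
import hashlib
import os
from typing import Tuple


def store_upload(data: bytes, root: str) -> Tuple[str, bool]:
    """Store data under a path derived from its SHA-1 digest.

    Returns (path, was_new). was_new is False when a file with the
    same digest is already present, i.e. a duplicate upload.
    """
    digest = hashlib.sha1(data).hexdigest()
    # Assumed layout: first two byte pairs become directories,
    # the remainder of the digest becomes the file name.
    directory = os.path.join(root, digest[:2], digest[2:4])
    path = os.path.join(directory, digest[4:] + ".jpg")
    if os.path.exists(path):
        # Same digest already stored: treat it as the same content.
        return path, False
    os.makedirs(directory, exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path, True
```

Note that this treats equal digests as equal content; it does not try to handle hash collisions.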

DigitalRoss 2009-11-22 17:25:47

Answer 3

+2 A:

First of all, if the contents of the files are changing, the filename-from-SHA-digest approach is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.

Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.

When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and file name from it. You use the first few characters of the digest for the directory structure and the rest of the characters for the file name. For example:

 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt

This way you only need to store the SHA-1 digest of the file in the database. From the digest you can always work out the location and the name of the file.
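A minimal sketch of that mapping, assuming three two-character directory levels and a `.txt` extension as in the example above (the helper names `digest_of` and `path_for_digest` are made up for illustration):

```python
import hashlib
import posixpath


def digest_of(data: bytes) -> str:
    """Return the SHA-1 digest of the file contents as 40 hex characters."""
    return hashlib.sha1(data).hexdigest()


def path_for_digest(digest: str, root: str = "some/path",
                    levels: int = 3, ext: str = ".txt") -> str:
    """Turn a hex digest into a nested path.

    The first `levels` pairs of characters become directories and the
    remaining characters become the file name.
    """
    dirs = [digest[i * 2:i * 2 + 2] for i in range(levels)]
    name = digest[levels * 2:] + ext
    return posixpath.join(root, *dirs, name)


print(path_for_digest("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9"))
# some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
```

Because the path is a pure function of the digest, the database only has to keep the digest itself.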

Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also means that every directory ends up with about the same number of files, so you won't get into a situation where all your files are in the same directory.

Juha Syrjälä 2009-11-22 17:26:05

okay, then you'd store the file name and some tags with the hash and whatever else in the database, and you could get the file from the filesystem with the hash. that way you could store some human readable info with the file reference, but not have the file itself. the hash is just for making it uniform, optimized, and easy to program, it doesn't need to be human readable. got it, thanks!

viatropos 2009-11-22 17:32:15

@viatropos, yep, that's about it. You could also give every file a unique number from a sequence and use that instead of the SHA-1 digest.

Juha Syrjälä 2009-11-22 17:40:00

if you intend to replace the old file with the new version using the same path, make sure the operation is atomic. else you might get into trouble if someone requests the file while you are still writing down the new one. imho it would not hurt to save the new file to another location. and think about a way to remove the old/outdated versions from time to time if you run into storage-space problems :)
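One common way to get such an atomic replacement, sketched here under the assumption of a Python program on a filesystem where renaming within one directory is atomic (the helper name `atomic_write` is hypothetical): write the new version to a temporary file next to the target, then rename it over the old one.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at `path` with `data` without exposing a partial file."""
    directory = os.path.dirname(path) or "."
    # The temp file lives in the same directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```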

tosh 2009-11-22 18:02:22


SHA-1 hash for storing Files
