After reading this, it sounds like a great idea to store files using the SHA-1 for the directory.

I don't really know what this means, however; all I know is that SHA-1 and MD5 are hashing algorithms. If I calculate the SHA-1 hash using this Ruby script and then change the file's content (which changes the hash), how do I know where the file is stored?

My question is then, what are the basics of implementing a SHA-1/file-storage system?

If all of the files are changing content all the time, is there a better solution for storing them, or do you just have to keep updating the hash?

I'm just thinking about how to create a generic file storage system like Google Docs, Flickr, YouTube, Dropbox, etc., something that you could reuse in different environments (such as storing PubMed journal articles or Cramster homework assignments and tests, or just images like on Flickr). I'd probably store them on Amazon EC2. Just some system so I can say "this is how I'll store files 99% of the time from now on", so I can stop thinking about building a solid, consistent way to store files and get on to some real problems.

+2  A: 

The idea is not to change the file content, but rather its name (and path), by using a hash value.

Changing the content with a hash would be disastrous since a hash is normally not reversible.

I'm not sure of the motivation for using a hash rather than the file name (or even a long random number), but here are a few advantages of the hash approach:

  • the file names on disk are uniform
  • the upper or lower part of the hash value can be used to name the directories, and hence distribute the files relatively uniformly
  • the name becomes a code, making it difficult for someone to a) guess a file name or b) categorize pictures (should someone steal the hard drive's contents)
  • the file name and location can be retrieved from the file contents itself, assuming the hash is computed from that content (not quite sure which use case would involve this; it seems a bit contrived)

The general point about using a hash is that, unlike a file name, a hash is meaningless, and therefore one needs the database to relate images to "bibliographic" type data (name of uploader, date of upload, tags, ...)

In thinking about it, re-reading the referenced SO response, I don't really see much of an advantage of a hash, as compared to, say, a random number...

Furthermore, some hashes produce a numeric value, typically expressed in hexadecimal (as seen in the referenced SO question), and this could be seen as wasteful: it makes the file names longer than they need to be and hence puts more stress on the file system (bigger directories...)

mjv
If you use a hash, then identical copies of the same file are stored at the same location. With a random number, the copies are stored at different locations. This may be an advantage or a disadvantage, depending on your case.
Juha Syrjälä
what does someone like flickr or facebook do?
viatropos
here is some interesting information about facebook's photo storage infrastructure http://www.facebook.com/note.php?note_id=76191543919
tosh
thanks! http://stackoverflow.com/questions/1779609/how-big-of-a-team-does-it-take-to-make-this-huge-of-a-file-uploading-architecture
viatropos
+1  A: 

The idea is that you need to come up with a name for the photo, and you probably want to scatter the files among a number of directories. One easy way to come up with a unique name is to use the hash.

So the beginning of the hash was peeled off for a multi-level directory structure and the rest of the hash was used for a filename for the jpg.

This has the additional benefit of detecting duplicate uploads.
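As a rough sketch of how duplicate detection could fall out of this scheme: if the path is derived purely from the content hash, a second upload of the same bytes maps to a path that already exists. The function name `store_upload`, the two-level directory split, and the `.jpg` extension are assumptions made for illustration, not something taken from the referenced answer.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def store_upload(data: bytes, root: Path) -> Tuple[Path, bool]:
    """Store data under a path derived from its SHA-1 digest.

    Returns (path, was_duplicate). The layout (two directory levels of
    two hex characters each) is an illustrative assumption.
    """
    digest = hashlib.sha1(data).hexdigest()
    target = root / digest[:2] / digest[2:4] / (digest[4:] + ".jpg")

    if target.exists():
        # Same content was uploaded before: nothing new to write.
        return target, True

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target, False
```

Note that this only detects byte-for-byte identical uploads; a re-encoded or resized copy of the same picture produces a different digest.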

DigitalRoss
+2  A: 

First of all, if the contents of the files are changing, the filename-from-SHA-digest approach is not very suitable, because the name and location of the file in the filesystem must change whenever the contents of the file change.


Basically you first compute a SHA-1 or MD5 digest (= hash value) from the contents of the file.

When you have a digest, for example 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9, you generate a file location and filename from it. For example, you split the first few characters of the digest into a directory structure and use the rest of the characters as the file name:

 00e4f56c0de1c61fdb926e79e8a0a65bd12930c9 => some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt

This way you only need to store the SHA-1 digest of the file in the database. You can then always work out the location and the name of the file.
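A minimal Python sketch of the mapping described above, assuming three directory levels of two hex characters each and a `.txt` extension, as in the example path. The helper names `sha1_of_file` and `digest_to_path` and their parameters are made up for illustration.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size: int = 65536) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digest_to_path(digest: str, root: str = "some/path",
                   levels: int = 3, width: int = 2,
                   ext: str = ".txt") -> Path:
    """Split the leading characters of the digest into directories
    and use the remaining characters as the file name."""
    prefix_len = levels * width
    dirs = [digest[i:i + width] for i in range(0, prefix_len, width)]
    return Path(root, *dirs, digest[prefix_len:] + ext)


print(digest_to_path("00e4f56c0de1c61fdb926e79e8a0a65bd12930c9").as_posix())
# some/path/00/e4/f5/6c0de1c61fdb926e79e8a0a65bd12930c9.txt
```

Because the path is a pure function of the digest, the database only has to hold the digest (plus whatever human-readable metadata you want) to locate the file again.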

Directories usually also have a maximum number of entries they can contain, for example a maximum of 32000 subdirectories and files per directory. A directory structure based on this kind of hashing makes it unlikely that you store too many files in the same directory. Hashing like this also ensures that every directory holds about the same number of files, so you won't end up with all your files in the same directory.
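To get a feel for how evenly the files spread out, the following sketch hashes a batch of made-up file contents and counts how many land in each top-level directory, taking the first two hex characters of the digest as the directory name (256 possible directories). The contents and the function name `bucket_counts` are purely illustrative.

```python
import hashlib
from collections import Counter


def bucket_counts(n_files: int) -> Counter:
    """Count how many of n_files synthetic files fall into each
    top-level directory named by the first two hex digits."""
    counts = Counter()
    for i in range(n_files):
        # Hypothetical file contents, only used to produce distinct digests.
        content = f"hypothetical file contents #{i}".encode()
        digest = hashlib.sha1(content).hexdigest()
        counts[digest[:2]] += 1
    return counts


counts = bucket_counts(10_000)
# Number of directories used, then the smallest and largest directory size.
print(len(counts), min(counts.values()), max(counts.values()))
```

With 10,000 files over 256 directories the average is about 39 files per directory, and the smallest and largest counts should stay reasonably close to that.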

Juha Syrjälä
okay, then you'd store the file name and some tags with the hash and whatever else in the database, and you could get the file from the filesystem with the hash. that way you could store some human readable info with the file reference, but not have the file itself. the hash is just for making it uniform, optimized, and easy to program, it doesn't need to be human readable. got it, thanks!
viatropos
@viatropos, yep, that's about it. You could also give every file a unique number from a sequence and use that instead of the SHA-1 digest.
Juha Syrjälä
if you intend to replace the old file with the new version using the same path, make sure the operation is atomic. else you might get into trouble if someone requests the file while you are still writing down the new one. imho it would not hurt to save the new file to another location. and think about a way to remove the old/outdated versions from time to time if you run into storage-space problems :)
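One common way to get that kind of replacement, sketched here in Python: write the new version to a temporary file in the same directory, then rename it over the old one. `os.replace` is atomic on POSIX systems when source and destination are on the same filesystem; the helper name `atomic_write` is an assumption for illustration.

```python
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace the file at path with data so that readers see either
    the complete old version or the complete new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the partial temp file if anything went wrong.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```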
tosh