Suppose you wanted to make a file-hosting site where people upload their files and send a link to their friends to retrieve them later, and you want to ensure files aren't duplicated where you store them. Is PHP's sha1_file good enough for the task? Is there any reason not to use md5_file instead?

For the frontend, the hash will be obscured behind the original file name, which is stored in a database, but an additional concern is whether this would reveal anything about the original poster. Does a file carry any metadata with it, such as last-modified time or who posted it, or is that handled by the file system?

Also, is using a salt frivolous, since security against rainbow-table attacks means nothing here and the hash could later double as a checksum?

One last thing: scalability? Initially it's only going to be used for small files a couple of megabytes big, but eventually...

Edit 1: The point of the hash is primarily to avoid file duplication, not to create obscurity.

A: 

Both should be fine. sha1 is a safer hash function than md5, which also means it's slower, which probably means you should use md5 :). You still want to use salt to prevent plaintext/rainbow attacks in case of very small files (don't make assumptions about what people decide to upload to your site). The performance difference will be negligible. You can still use it as a checksum as long as you know the salt.
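A minimal sketch of what "salted hash that still works as a checksum" could look like, using PHP's incremental hashing API; the `$salt` value and the temp-file setup are illustrative assumptions, not part of the answer:

```php
<?php
// Sketch: salting a file hash so small files can't be matched against
// a precomputed table, while the digest remains usable as a checksum.
$salt = 'site-wide-secret'; // assumed app-level salt

// Demo file; in a real app this would be the uploaded file.
$path = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($path, 'hello world');

// Feed the salt first, then the file contents, into one digest.
$ctx = hash_init('md5');
hash_update($ctx, $salt);
hash_update_file($ctx, $path);
$hash = hash_final($ctx);

// Verifying later as a checksum: recompute with the same salt.
$check = md5($salt . file_get_contents($path));
$ok = ($hash === $check); // true

unlink($path);
```

Because the incremental digest of salt-then-contents equals the digest of the concatenation, anyone who knows the salt can re-verify the file later.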

With respect to scalability, I'd guess you're likely going to be IO-bound, not CPU-bound, so I don't think calculating the checksum would add much overhead, especially if you do it on the stream as it's being uploaded.
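Hashing "on the stream" could be sketched like this; the in-memory stream stands in for the real upload source (which is an assumption of this example), but the chunked `hash_update()` pattern is the point:

```php
<?php
// Sketch: hashing an upload incrementally as chunks arrive, so the
// file never needs a second full read just to compute the digest.
$source = fopen('php://memory', 'r+'); // stand-in for the upload stream
fwrite($source, str_repeat('chunk-data ', 1000));
rewind($source);

$dest = fopen('php://memory', 'r+'); // stand-in for final storage
$ctx  = hash_init('sha1');

while (!feof($source)) {
    $chunk = fread($source, 8192);
    if ($chunk === false || $chunk === '') {
        break;
    }
    hash_update($ctx, $chunk); // digest grows with each chunk
    fwrite($dest, $chunk);     // file is written in the same pass
}
$hash = hash_final($ctx);

fclose($source);
fclose($dest);
```

One pass over the bytes produces both the stored file and its digest, which is why the CPU cost of the hash tends to disappear behind the IO cost.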

ykaganovich
No - although the sha1 algorithm is more complex / has a higher order, the actual implementation in PHP creates sha1 hashes marginally faster than md5 (at least the last time I checked on PHP 5.1 or something)
symcbean
@symcbean you're probably right, I don't know PHP specifics.
ykaganovich
A: 

SHA should do just fine in any "normal" environment, although this is what Ben Lynn, the author of "Git Magic", has to say:

A.1. SHA1 Weaknesses As time passes, cryptographers discover more and more SHA1 weaknesses. Already, finding hash collisions is feasible for well-funded organizations. Within years, perhaps even a typical PC will have enough computing power to silently corrupt a Git repository. Hopefully Git will migrate to a better hash function before further research destroys SHA1.

You can always check SHA256, or other hashes which are even longer. Finding an MD5 collision is easier than finding a SHA1 collision.
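Swapping in a longer hash is a one-argument change with PHP's generic `hash_file()`; the demo file here is an assumption, and available algorithm names can be checked against `hash_algos()`:

```php
<?php
// Sketch: comparing digest lengths for sha1 vs sha256 on the same file.
$path = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($path, 'example contents');

$sha1   = hash_file('sha1',   $path); // 40 hex chars
$sha256 = hash_file('sha256', $path); // 64 hex chars

unlink($path);
```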

viraptor
+1  A: 

As per my comment on ykaganovich's answer, sha1 is (surprisingly) slightly faster than md5.

From your description of the problem, you are not trying to create a secure hash - merely hide the file in a large namespace - in which case use of a salt / rainbow tables is irrelevant - the only consideration is the likelihood of a false collision (where 2 different files give the same hash). The probability of this happening with md5 is very, very remote. It's even more remote with sha1. However, you do need to think about what happens when 2 independent users upload the same warez to your site. Who owns the file?

In fact, there doesn't seem to be any reason at all to use a hash - just generate a sufficiently long random value.
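If all you need is an unguessable identifier (no deduplication), a random token is enough; a minimal sketch, assuming PHP 7+ for `random_bytes()`:

```php
<?php
// Sketch: 16 random bytes = 128 bits of entropy, rendered as hex.
// Accidental collisions at that size are negligible in practice.
$token = bin2hex(random_bytes(16)); // 32 hex characters
```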

C.

symcbean
I assumed the added benefit of a checksum warrants the hash?
wag2639
+1 good point, just do a random value :) If you want a checksum, use CRC, although not clear why a checksum is needed.
ykaganovich
I wanted to avoid duplicate files. I'm going to have an SQL table to associate owners with files.
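That dedup-plus-ownership flow could be sketched as below. The `files` and `owners` table names, the storage path, and the in-memory SQLite database are all hypothetical stand-ins for the real schema and storage:

```php
<?php
// Sketch: store each unique file once (keyed by content hash) while
// recording a separate ownership row per uploader.
$db = new PDO('sqlite::memory:');
$db->exec('CREATE TABLE files  (hash TEXT PRIMARY KEY, path TEXT)');
$db->exec('CREATE TABLE owners (hash TEXT, user_id INTEGER, name TEXT)');

function store(PDO $db, string $tmpPath, int $userId, string $origName): string {
    $hash = sha1_file($tmpPath);

    // Only store the bytes once per unique content.
    $seen = $db->prepare('SELECT 1 FROM files WHERE hash = ?');
    $seen->execute([$hash]);
    if (!$seen->fetchColumn()) {
        $ins = $db->prepare('INSERT INTO files (hash, path) VALUES (?, ?)');
        $ins->execute([$hash, "/storage/$hash"]); // move the file here in real code
    }

    // Every uploader gets their own ownership row and display name.
    $own = $db->prepare('INSERT INTO owners (hash, user_id, name) VALUES (?, ?, ?)');
    $own->execute([$hash, $userId, $origName]);
    return $hash;
}

// Two users upload identical bytes under different names.
$tmp = tempnam(sys_get_temp_dir(), 'up');
file_put_contents($tmp, 'same bytes');
store($db, $tmp, 1, 'report.pdf');
store($db, $tmp, 2, 'copy.pdf');

$fileRows  = (int) $db->query('SELECT COUNT(*) FROM files')->fetchColumn();
$ownerRows = (int) $db->query('SELECT COUNT(*) FROM owners')->fetchColumn();
// one stored file, two ownership records
unlink($tmp);
```

This answers the "who owns the file?" question from the accepted answer: ownership lives in its own table, so deduplicating the bytes doesn't merge the uploaders.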
wag2639