Hey guys, I run a rather large site where my members add thousands of images every day. Obviously there is a lot of duplication, and I was wondering whether, during an image upload, I can generate a signature or hash of the image and store it. Every time someone uploads a picture, I would simply check whether that signature already exists and raise an error stating that the image already exists. I'm not sure whether this kind of technology already exists for ASP.NET, but I am aware of tineye.com, which already does something similar.

If you think you can help, I would appreciate your input.

Kris

+1  A: 

You can use any class derived from HashAlgorithm to generate a hash from the file's byte array. MD5 is the usual choice, but you could substitute any of the algorithms provided in the System.Security.Cryptography namespace. This works for any binary data, not just images.

Lots of sites publish MD5 hashes alongside their downloads so you can verify that a file arrived intact. For instance, an ISO CD/DVD image may be missing bytes even though the download appears to have completed. Once you've downloaded the file, you generate its hash and make sure it matches the one the site publishes. If they match, you've got an exact copy.

I would probably use something similar to this:

public static class Helpers
{
    //If you're running .NET 2.0 or lower, remove the 'this' keyword from the
    //method signature as 2.0 doesn't support extension methods.
    public static string GetHashString(this byte[] bytes, HashAlgorithm cryptoProvider)
    {
        byte[] hash = cryptoProvider.ComputeHash(bytes);
        return Convert.ToBase64String(hash);
    }
}

Requires:

using System;                       // Convert
using System.IO;                    // File
using System.Security.Cryptography; // HashAlgorithm and the providers

Call using:

byte[] bytes = File.ReadAllBytes("FilePath");
string filehash = bytes.GetHashString(new MD5CryptoServiceProvider());

or if you're running in .NET 2.0 or lower:

string filehash = Helpers.GetHashString(File.ReadAllBytes("FilePath"), new MD5CryptoServiceProvider());

If you decide to use a different hashing method instead of MD5 because of the minuscule probability of collisions:

string filehash = bytes.GetHashString(new SHA1CryptoServiceProvider());

This way your hash method isn't tied to a specific crypto provider; if you decide to change providers, you just inject a different one through the cryptoProvider parameter.

You can use any of the other hashing classes just by changing the service provider you pass in:

string md5Hash = bytes.GetHashString(new MD5CryptoServiceProvider());
string sha1Hash = bytes.GetHashString(new SHA1CryptoServiceProvider());
string sha256Hash = bytes.GetHashString(new SHA256CryptoServiceProvider());
string sha384Hash = bytes.GetHashString(new SHA384CryptoServiceProvider());
string sha512Hash = bytes.GetHashString(new SHA512CryptoServiceProvider());
BenAlabaster
@ChristopheD From my understanding there's a similar chance of collisions with GUID generation, but in practical use there's such a slim chance of it happening that it's not worth worrying about.
BenAlabaster
A: 

I don't know if it already exists or not, but I can't think of a reason you can't do this yourself. Something similar to this will get you a hash of the file.

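var fileStream = Request.Files[0].InputStream; // the uploaded file
// Plain MD5 is used here: HMACMD5 created without an explicit key generates a
// random key, so its output would differ from one instance to the next.
var hasher = System.Security.Cryptography.MD5.Create();
var theHash = hasher.ComputeHash(fileStream);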

System.Security.Cryptography

confusedGeek
+1  A: 

Look in the System.Security.Cryptography namespace. You have your choice of several hashing algorithms/implementations. Here's an example using MD5, but since you have a lot of these you might want something bigger like SHA1:

public byte[] HashImage(Stream imageData)
{
    return new MD5CryptoServiceProvider().ComputeHash(imageData);
}
Joel Coehoorn
The number of images doesn't matter - you aren't going to get a collision in MD5 unless someone does it on purpose.
Mark Byers
+1  A: 

Typically you'd just use MD5 or similar to create a hash. This isn't guaranteed to be unique though, so I'd recommend you use the hash as a starting point. Identify if the image matches any known hashes you stored, then individually load the ones that it does match and do a full byte comparison on the potential collisions to be sure.
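
To make the two-step check above concrete, here is a minimal sketch of it. The ImageStore class and its TryAdd method are hypothetical names made up for illustration, and the in-memory dictionary only stands in for whatever database and file storage the site really uses.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Hypothetical in-memory store. A real site would more likely keep the hash in
// an indexed database column and load candidate files only on a hash match.
public class ImageStore
{
    private readonly Dictionary<string, List<byte[]>> imagesByHash =
        new Dictionary<string, List<byte[]>>();

    // Returns true if the image was added, false if an identical one exists.
    public bool TryAdd(byte[] image)
    {
        string key;
        using (MD5 md5 = MD5.Create())
        {
            key = Convert.ToBase64String(md5.ComputeHash(image));
        }

        List<byte[]> candidates;
        if (!imagesByHash.TryGetValue(key, out candidates))
        {
            candidates = new List<byte[]>();
            imagesByHash[key] = candidates;
        }

        // Same hash: confirm with a full byte comparison before rejecting.
        foreach (byte[] existing in candidates)
        {
            if (BytesEqual(existing, image))
            {
                return false;
            }
        }

        candidates.Add(image);
        return true;
    }

    private static bool BytesEqual(byte[] a, byte[] b)
    {
        if (a.Length != b.Length)
        {
            return false;
        }
        for (int i = 0; i < a.Length; i++)
        {
            if (a[i] != b[i])
            {
                return false;
            }
        }
        return true;
    }
}
```

The byte comparison only runs for images that share a hash, so in the common case of a new image the cost is one hash computation and one lookup.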

Another, simpler technique is to pick a smallish number of bits and read just the first part of the image, then store those starting bits as if they were a hash. This still leaves you with a small number of potential collisions to check, but has much less overhead.
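
A rough sketch of that prefix idea, assuming the upload is available as a Stream. PrefixKey and FromStream are hypothetical names, and the prefix length is left to the caller because the right value depends on the image formats involved.

```csharp
using System;
using System.IO;

public static class PrefixKey
{
    // Reads up to prefixLength bytes from the start of the stream and returns
    // them as a Base64 string to use as a cheap lookup key.
    public static string FromStream(Stream image, int prefixLength)
    {
        byte[] buffer = new byte[prefixLength];
        int total = 0;
        int read;
        // Stream.Read may return fewer bytes than requested, so keep reading.
        while (total < prefixLength &&
               (read = image.Read(buffer, total, prefixLength - total)) > 0)
        {
            total += read;
        }
        return Convert.ToBase64String(buffer, 0, total);
    }
}
```

Keep in mind that files of the same format begin with the same signature bytes, so a very short prefix will match almost everything. The prefix probably needs to reach well into the image data to be useful, and any match still needs the full comparison.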

Stephen M. Redd
+1 for being the only person to point out that because a hash loses data, it can only be used as a starting point and cannot be used to determine whether two things are actually equal.
Greg Beech
@Greg - I understand this in theory, but in practice the chance of two files producing the same MD5 hash (outside a purely scientific, theoretical setting) is so small that this argument is moot. The hash may be a "lossy" algorithm, but you're not meant to be able to regenerate the file from the hash. It's designed to test whether two files are the same. If the files are the same, the hash is the same; if they're different, the hash is different. Because the chance of collision is one in billions, the chance of any two different pieces of data on your server having the same hash is minuscule.
BenAlabaster
Hashes are unique enough in most cases, but generating one requires reading the entire binary. With large images or high upload volumes that's a lot of overhead, which is why I advocated using just the first part of the file instead of a hash. But if you do that, the need to double-check is greater.
Stephen M. Redd
A: 

A keyword that might be of interest is perceptual hashing.
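
For a feel of what that means, here is a minimal sketch of one simple perceptual hash, usually called an average hash. It is only an illustration and not what any particular service uses. It assumes the image has already been decoded, shrunk to 8x8 and converted to grayscale elsewhere; AverageHash, Compute and HammingDistance are hypothetical names.

```csharp
using System;

public static class AverageHash
{
    // Expects an 8x8 grid of grayscale values (0-255). Producing that grid
    // (decode, shrink, convert to grayscale) is assumed to happen elsewhere.
    public static ulong Compute(byte[,] gray)
    {
        if (gray.GetLength(0) != 8 || gray.GetLength(1) != 8)
        {
            throw new ArgumentException("Expected an 8x8 grid.", "gray");
        }

        int sum = 0;
        foreach (byte value in gray)
        {
            sum += value;
        }
        double mean = sum / 64.0;

        // One bit per pixel: set when the pixel is brighter than the mean.
        ulong hash = 0;
        int bit = 0;
        foreach (byte value in gray)
        {
            if (value > mean)
            {
                hash |= 1UL << bit;
            }
            bit++;
        }
        return hash;
    }

    // Number of differing bits; small distances suggest similar images.
    public static int HammingDistance(ulong a, ulong b)
    {
        ulong diff = a ^ b;
        int count = 0;
        while (diff != 0)
        {
            diff &= diff - 1; // clear the lowest set bit
            count++;
        }
        return count;
    }
}
```

Unlike MD5, two such hashes are compared by distance rather than equality, so a resized or recompressed copy of an image can still land close to the original.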

TC