As part of a Java-based web app, I'm going to be accepting uploaded .xls and .csv files (and possibly other types). Each file will be uniquely renamed with a combination of parameters and a timestamp.

I'd like to be able to identify any duplicate files. By duplicate I mean the exact same file content, regardless of the name. Ideally, I'd like to detect duplicates as quickly as possible after the upload, so that the server can include this info in the response (provided the processing time for larger files doesn't cause too much of a lag).

I've read about running MD5 on the files and storing the result as unique keys, etc... but I've got a suspicion that there's a much better way. (Is there a better way?)

Any advice on how best to approach this is appreciated.

Thanks.

UPDATE: I have nothing at all against using MD5. I've used it a few times in the past with Perl (Digest::MD5). I thought that in the Java world another (better) solution might have emerged, but it looks like I was mistaken.

Thank you all for the answers and comments. I'm feeling pretty good about using MD5 now.

+1  A: 

You only need to add a method like this to your code and you're done. There's probably no better way: all the work is already done by the java.security.MessageDigest API.

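import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Returns the lowercase hex digest of the stream's contents, or null on failure.
public static String calc(InputStream is) {
    byte[] buffer = new byte[8192];
    int read;

    try {
        MessageDigest digest = MessageDigest.getInstance("SHA-256"); // or "MD5"
        // read() returns -1 at end of stream
        while ((read = is.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        byte[] hash = digest.digest();
        // Format each byte as two hex digits so leading zeros are preserved
        StringBuilder output = new StringBuilder(hash.length * 2);
        for (byte b : hash) {
            output.append(String.format("%02x", b));
        }
        return output.toString();
    } catch (IOException | NoSuchAlgorithmException e) {
        e.printStackTrace(System.err);
        return null;
    }
}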
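
Once you have the hex digest, spotting a duplicate is just a keyed lookup. Here is a rough sketch of that step, assuming an in-memory map is good enough for illustration (in a real app you would more likely use a unique column in your database). The class name DuplicateIndex and the method register are made up for this example; they are not part of any library.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical helper: maps a content digest to the name of the first file stored with it.
public class DuplicateIndex {

    private final ConcurrentMap<String, String> byDigest = new ConcurrentHashMap<>();

    /**
     * Records storedName under digestHex if that digest has not been seen yet.
     * Returns the name of the earlier file when the digest is already known,
     * or an empty Optional when this is the first file with that content.
     */
    public Optional<String> register(String digestHex, String storedName) {
        // putIfAbsent is atomic, so two concurrent uploads of the same content
        // cannot both be treated as "first".
        return Optional.ofNullable(byDigest.putIfAbsent(digestHex, storedName));
    }
}
```

If register returns a value, the upload is a duplicate of that earlier file and the server can say so in its response.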
stacker
+1 Great example. Thank you.
S.Jones
+3  A: 

While processing uploaded files, decorate the OutputStream with a DigestOutputStream so that you can calculate the digest of the file while writing. Store the final digest somewhere along with the unique identifier of the file (in hex as part of filename maybe?).
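
A minimal sketch of what that could look like, assuming the upload arrives as an InputStream and is saved to a Path. The class name DigestingUpload and the method saveAndDigest are invented for illustration, and SHA-256 is just one possible algorithm ("MD5" works the same way).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestOutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: writes an upload to disk and hashes it in the same pass.
public class DigestingUpload {

    /**
     * Copies the upload to target and returns the lowercase hex digest
     * of exactly the bytes that were written.
     */
    public static String saveAndDigest(InputStream upload, Path target)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");

        // Every byte written through the DigestOutputStream also updates the digest.
        try (OutputStream file = Files.newOutputStream(target);
             DigestOutputStream out = new DigestOutputStream(file, digest)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = upload.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
        }

        // Two hex digits per byte, so leading zeros are kept.
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

The file is only read once this way, so hashing adds very little on top of the write you are doing anyway.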

BalusC
+1 That's great. It looks like DigestOutputStream is just what I need. Thanks
S.Jones