tags:

views:

253

answers:

4

I'm trying to keep track of a set of files, which may have the same name and metadata. I'd like to use a hash to differentiate and use it as a unique ID, but I'm not sure which one to use? The files are relatively small (in the 100 kb range) and I'd like to be able to hash that in less than 10 seconds. Which hash (that comes built in in Java 1.5) would best suite my needs?

A: 

use a content based hash SHA1 is what I use. MD5 is weaker and faster but on modern processors speed isn't a concern.

fuzzy lollipop
+12  A: 

Note that a hash of this sort will never be unique though, with the use off an effective one you stand a very good chance of never having a collision.

If you are not concerned with security (i.e. someone deliberately trying to break your hashing) then simply using the MD5 hash will give you an excellent hash with minimal effort.

It is likely that you could do an SHA hash of 100Kb in well less than 10 second though and, though SHA-1 is still theoretically flawed it is of higher strength than MD5.

MessageDigest will get you an implementation of either.

Here are some examples of using it with streams.

Also I should note that this excellent answer from jarnbjo would indicate that even the supplied SHA hashing in Java are well capable of exceeding 20MB/s even on relatively modest x86 hardware. This would imply 5-10 millisecond level performance on 100KB of (in memory) input data so your target of under 10seconds is a massive overestimate of the effort involved. It is likely you will be entirely limited by the rate you can read the files from disk rather than any hashing algorithm you use.

If you have a need for strong crypto hashing you should indicate this in the question. Even then SHA of some flavour above 1 is still likely to be your best bet unless you wish to use an external library like Bouncy Castle since you should never try to roll your own crypto if a well established implementation exists.

For some reasonably efficient sample code I suggest this how to The salient points of which can be distilled into the following (tune the buffer size as you see fit):

import java.io.*;
import java.security.MessageDigest;

public class Checksum 
{    
    const string Algorithm = "SHA-1"; // or MD5 etc.

    public static byte[] createChecksum(String filename) throws
       Exception
    {
        InputStream fis =  new FileInputStream(filename);
        try
        {
             byte[] buffer = new byte[1024];
             MessageDigest complete = MessageDigest.getInstance("MD5"); 
             int numRead;
             do 
             {
                 numRead = fis.read(buffer);
                 if (numRead > 0) 
                 {
                     complete.update(buffer, 0, numRead);
                 }
             } while (numRead != -1);
             return complete.digest();
         }
         finally
         {
             fis.close();
         }
     }
}
ShuggyCoUk
+1 for the note that hashes will never be unique.
PSpeed
Excellent Answer all around, thanks.
C. Ross
+3  A: 

you could use MessageDigest with SHA1:

    MessageDigest messageDigest = MessageDigest.getInstance("SHA1");
    InputStream is = new FileInputStream(aFile);
    int res;

    while ((res = inputStream.read()) != -1) {
        digester.update((byte) res);
    }

    byte[] digest = messageDigest.digest();
dfa
@downvoter: please explain your downvote or it is pointless
dfa
no clue, but it's a reasonable answer so here's a +1 to offset
ShuggyCoUk
A: 

here is the way I do it, I think this should work fast, check if it completes in 10 seconds

package utils;

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * This class used to compute the hash value of any string  
 */
public class MyHasher {
private static final String ALGORITHM = "MD5";
static MessageDigest md = null;

static{
 try {
  md = MessageDigest.getInstance(ALGORITHM);
 } catch (NoSuchAlgorithmException e) {
  MyLogger.error("Can't find implementation of "+ALGORITHM+" algorithm", e);
 } 
}

/**
 * Compute hash value of any string
 * @param arg the string to compute hash value of.
 * @return the hex hash value as a string.
 */
public static String getHash(String arg) {
 md.update(arg.getBytes());
 byte[] hashValue = md.digest();

 return convertToHex(hashValue);
}
/**
 * Converts byte array to the human readable string of hex'es
 * @param data the byte array to convert
 * @return string representation of the hex'es of the byte array
 */
public static String convertToHex(byte[] data){
 StringBuffer buf = new StringBuffer();
 for(int i=0;i<data.length;i++){
  int halfbyte = (data[i]>>>3)&0x0F;
  int two_halfs = 0;
  do{
   if((0<=halfbyte) && (halfbyte <=9))
    buf.append((char) ('0'+halfbyte));
   else
    buf.append((char) ('a'+(halfbyte-10)));
   halfbyte = data[i] & 0x0F;
  }while(two_halfs++ <1);
 }  
 return buf.toString();
}
}
jutky
I would add the proviso to this that forcing the whole file to be read into memory via a string is likely to be inefficient, require more memory than necessary and have implications if you were wanting to hash an ascii fileas raw bytes rather than forcing it to be become wide characters first (say if you wanted an external standard tool to also be able to hash it)
ShuggyCoUk