I'm trying to keep track of a set of files, which may have the same name and metadata. I'd like to use a hash to differentiate and use it as a unique ID, but I'm not sure which one to use? The files are relatively small (in the 100 kb range) and I'd like to be able to hash that in less than 10 seconds. Which hash (that comes built in in Java 1.5) would best suite my needs?
use a content based hash SHA1 is what I use. MD5 is weaker and faster but on modern processors speed isn't a concern.
Note that a hash of this sort will never be unique though, with the use off an effective one you stand a very good chance of never having a collision.
If you are not concerned with security (i.e. someone deliberately trying to break your hashing) then simply using the MD5 hash will give you an excellent hash with minimal effort.
It is likely that you could do an SHA hash of 100Kb in well less than 10 second though and, though SHA-1 is still theoretically flawed it is of higher strength than MD5.
MessageDigest will get you an implementation of either.
Here are some examples of using it with streams.
Also I should note that this excellent answer from jarnbjo would indicate that even the supplied SHA hashing in Java are well capable of exceeding 20MB/s even on relatively modest x86 hardware. This would imply 5-10 millisecond level performance on 100KB of (in memory) input data so your target of under 10seconds is a massive overestimate of the effort involved. It is likely you will be entirely limited by the rate you can read the files from disk rather than any hashing algorithm you use.
If you have a need for strong crypto hashing you should indicate this in the question. Even then SHA of some flavour above 1 is still likely to be your best bet unless you wish to use an external library like Bouncy Castle since you should never try to roll your own crypto if a well established implementation exists.
For some reasonably efficient sample code I suggest this how to The salient points of which can be distilled into the following (tune the buffer size as you see fit):
import java.io.*;
import java.security.MessageDigest;
public class Checksum
{
const string Algorithm = "SHA-1"; // or MD5 etc.
public static byte[] createChecksum(String filename) throws
Exception
{
InputStream fis = new FileInputStream(filename);
try
{
byte[] buffer = new byte[1024];
MessageDigest complete = MessageDigest.getInstance("MD5");
int numRead;
do
{
numRead = fis.read(buffer);
if (numRead > 0)
{
complete.update(buffer, 0, numRead);
}
} while (numRead != -1);
return complete.digest();
}
finally
{
fis.close();
}
}
}
you could use MessageDigest with SHA1:
MessageDigest messageDigest = MessageDigest.getInstance("SHA1");
InputStream is = new FileInputStream(aFile);
int res;
while ((res = inputStream.read()) != -1) {
digester.update((byte) res);
}
byte[] digest = messageDigest.digest();
here is the way I do it, I think this should work fast, check if it completes in 10 seconds
package utils;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
/**
* This class used to compute the hash value of any string
*/
public class MyHasher {
private static final String ALGORITHM = "MD5";
static MessageDigest md = null;
static{
try {
md = MessageDigest.getInstance(ALGORITHM);
} catch (NoSuchAlgorithmException e) {
MyLogger.error("Can't find implementation of "+ALGORITHM+" algorithm", e);
}
}
/**
* Compute hash value of any string
* @param arg the string to compute hash value of.
* @return the hex hash value as a string.
*/
public static String getHash(String arg) {
md.update(arg.getBytes());
byte[] hashValue = md.digest();
return convertToHex(hashValue);
}
/**
* Converts byte array to the human readable string of hex'es
* @param data the byte array to convert
* @return string representation of the hex'es of the byte array
*/
public static String convertToHex(byte[] data){
StringBuffer buf = new StringBuffer();
for(int i=0;i<data.length;i++){
int halfbyte = (data[i]>>>3)&0x0F;
int two_halfs = 0;
do{
if((0<=halfbyte) && (halfbyte <=9))
buf.append((char) ('0'+halfbyte));
else
buf.append((char) ('a'+(halfbyte-10)));
halfbyte = data[i] & 0x0F;
}while(two_halfs++ <1);
}
return buf.toString();
}
}