Further to this question: Algorithm for determining a file’s identity
Recap: I'm looking for a cheap algorithm for determining a file's identity that works the vast majority of the time.
I went ahead and implemented an algorithm that gives me a "pretty unique" hash per file.
The way my algorithm works is (see the sketch after this list):
For files smaller than a certain threshold I use the file's full content for the identity hash.
For files larger than the threshold I take N random samples of X bytes each.
I include the file size in the hashed data (so any two files with different sizes produce different hashes).
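For reference, here is a minimal Python sketch of the approach as described above. The threshold value, SHA-1 as the underlying hash, and deriving the sample offsets deterministically from the file size are my own assumptions standing in for whatever the concrete implementation actually does:

```python
import hashlib
import os
import random
import struct

# Assumed parameter values; the question uses 4 samples of 8 KB each,
# the small-file threshold is a placeholder.
SMALL_FILE_THRESHOLD = 64 * 1024   # hash the whole file below this size
SAMPLE_COUNT = 4                   # N: number of random samples
SAMPLE_SIZE = 8 * 1024             # X: bytes per sample

def quick_identity_hash(path):
    """Build a 'pretty unique' identity hash from the file size plus either
    the full contents (small files) or N sampled blocks (large files)."""
    size = os.path.getsize(path)
    h = hashlib.sha1()
    # Mixing the size in guarantees files of different sizes never collide.
    h.update(struct.pack("<Q", size))

    with open(path, "rb") as f:
        if size <= SMALL_FILE_THRESHOLD:
            h.update(f.read())
        else:
            # Derive the sample offsets deterministically from the size so the
            # same file always yields the same hash; sorting them keeps the
            # reads in ascending order, which is cheaper than random seeking.
            rng = random.Random(size)
            offsets = sorted(rng.randrange(size - SAMPLE_SIZE + 1)
                             for _ in range(SAMPLE_COUNT))
            for offset in offsets:
                f.seek(offset)
                h.update(f.read(SAMPLE_SIZE))

    return h.hexdigest()
```

Sorting the offsets so the samples are read in ascending file order is one cheap way to reduce the seek cost mentioned in the first question below.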
Questions:
What values should I choose for N and X (how many random samples should I take, and how large should each be)? I went with 4 samples of 8 KB each and haven't been able to stump the algorithm. I found that increasing the number of samples quickly slows the algorithm down (because seeks are pretty expensive).
The maths one: how similar do two files need to be for this algorithm to break down, i.e. for two different files of the same length to end up with the same hash? (See the formulation after this list.)
The optimization one: are there any ways I can optimize my concrete implementation to improve throughput? (I currently manage about 100 files per second on my system.)
Does this implementation look sane? Can you think of any real-world cases where it will fail? (My focus is on media files.)
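To pin down the failure condition in the maths question (my own restatement, under the assumption that the N sample offsets o_1, …, o_N are derived only from the file size, so two equal-length files are sampled at the same positions): two distinct files A and B collide exactly when

$$|A| = |B| \quad\text{and}\quad A[o_i \,..\, o_i + X - 1] = B[o_i \,..\, o_i + X - 1] \quad \text{for all } i = 1, \dots, N,$$

i.e. they agree byte-for-byte on every sampled window (setting aside the negligible chance of a collision in the underlying hash function itself).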
Relevant information:
Thanks for your help!