Further to this question: Algorithm for determining a file’s identity

Recap: I'm looking for a cheap algorithm for determining a file's identity that works the vast majority of the time.

I went ahead and implemented an algorithm that gives me a "pretty unique" hash per file.

The way my algorithm works is:

  • For files smaller than a certain threshold I use the full file's content for the identity hash.

  • For files larger than the threshold I take N random samples of size X.

  • I include the file size in the hashed data, so all files with different sizes result in a different hash. (A sketch of the whole scheme is below.)
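Here is a minimal sketch of that scheme in C#, assuming MD5 as the underlying hash and a made-up 64 KB small-file threshold; N = 4 and X = 8 KB match the values mentioned in the questions below, and the class name is purely illustrative rather than the actual implementation:

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class SamplingHash
    {
        const long Threshold = 64 * 1024;   // assumed small-file cutoff
        const int SampleCount = 4;          // N
        const int SampleSize = 8 * 1024;    // X

        public static byte[] Compute(string path)
        {
            using (var md5 = MD5.Create())
            using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
            {
                long length = fs.Length;

                // Mix the file size into the hash so files of different sizes never collide.
                byte[] lengthBytes = BitConverter.GetBytes(length);
                md5.TransformBlock(lengthBytes, 0, lengthBytes.Length, null, 0);

                var buffer = new byte[SampleSize];
                if (length <= Threshold)
                {
                    // Small file: hash the entire content.
                    int read;
                    while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                        md5.TransformBlock(buffer, 0, read, null, 0);
                }
                else
                {
                    // Large file: sample N blocks at pseudo-random offsets. Seeding the RNG
                    // with the length keeps the offsets deterministic for a given file size,
                    // and sorting them means we only ever seek forward.
                    var rng = new Random(unchecked((int)length));
                    var offsets = new long[SampleCount];
                    for (int i = 0; i < SampleCount; i++)
                        offsets[i] = (long)(rng.NextDouble() * (length - SampleSize));
                    Array.Sort(offsets);

                    foreach (long offset in offsets)
                    {
                        fs.Seek(offset, SeekOrigin.Begin);
                        int read = fs.Read(buffer, 0, buffer.Length);
                        md5.TransformBlock(buffer, 0, read, null, 0);
                    }
                }

                md5.TransformFinalBlock(buffer, 0, 0);
                return md5.Hash;
            }
        }
    }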

Questions:

  • What values should I choose for N and X (how many random samples should I take, and of what size)? I went with 4 samples of 8K each and am not able to stump the algorithm. I found that increasing the number of samples quickly decreases the speed of the algorithm (because seeks are pretty expensive).

  • The maths one: how similar do my files need to be for this algorithm to blow up (i.e. for 2 different files with the same length to end up with the same hash)?

  • The optimization one: are there any ways I can optimize my concrete implementation to improve throughput? (I seem to be able to do about 100 files a second on my system.)

  • Does this implementation look sane? Can you think of any real-world examples where this will fail? (My focus is on media files.)

Relevant information:

The algorithm I implemented

Thanks for your help!

+1  A: 

I would avoid a solution like this. In practice it might be close to impossible for two media files to have the same size and the same data at corresponding positions in compressed formats. But if you have to deal with uncompressed images or wave files, the chance that small local changes go undetected grows.

So I think you should really hash the whole file. While this seems expensive, it might not be if you have access to all the files - for example if you build a file server or something like that. You could build the hash incrementally.

If you see a new file with a unique file length, just store the file length. If another file with the same length is added, calculate the hashes of both files block by block until they differ. Store the file length, the hash, and how many blocks of the file are included in the hash. Whenever you detect matching file lengths and hashes and you have not yet hashed the whole file, you extend the hash by adding more blocks.
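A rough sketch of that incremental idea, assuming MD5 and a made-up 64 KB block size (the class and method names are illustrative, not taken from this answer): compare two same-length files block by block to find how deep you have to go, then store the length, the prefix hash of that depth, and the block count.

    using System;
    using System.IO;
    using System.Linq;
    using System.Security.Cryptography;

    static class IncrementalIdentity
    {
        const int BlockSize = 64 * 1024;   // assumed block size

        // Hash the first `blocks` blocks of a file; this is what you would store
        // next to the file length and the number of blocks hashed.
        public static byte[] PrefixHash(string path, int blocks)
        {
            using (var md5 = MD5.Create())
            using (var fs = File.OpenRead(path))
            {
                var buffer = new byte[BlockSize];
                for (int i = 0; i < blocks; i++)
                {
                    int read = fs.Read(buffer, 0, buffer.Length);
                    if (read == 0) break;   // reached end of file
                    md5.TransformBlock(buffer, 0, read, null, 0);
                }
                md5.TransformFinalBlock(buffer, 0, 0);
                return md5.Hash;
            }
        }

        // For two files of equal length, find how many leading blocks must be hashed
        // before they differ (the whole file if their content is identical).
        // Simplification: assumes FileStream.Read returns full blocks until EOF,
        // which it does for local files.
        public static int DistinguishingDepth(string a, string b)
        {
            using (var fa = File.OpenRead(a))
            using (var fb = File.OpenRead(b))
            {
                var bufA = new byte[BlockSize];
                var bufB = new byte[BlockSize];
                int depth = 0;
                while (true)
                {
                    int readA = fa.Read(bufA, 0, bufA.Length);
                    int readB = fb.Read(bufB, 0, bufB.Length);
                    if (readA == 0 && readB == 0)
                        return depth;       // both ended: identical content
                    depth++;
                    if (readA != readB || !bufA.Take(readA).SequenceEqual(bufB.Take(readB)))
                        return depth;       // first block where the files differ
                }
            }
        }
    }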

Some thoughts about the performance: for small files, the chance of equal file lengths is quite high - there are not that many different small file lengths. But it is not expensive to hash small files.

For larger files the chance of a file length collision goes down, as there are more and more possible file lengths. For different media files the chances are very good that they differ right after the header, so you will only need to hash a short part of the beginning of the file.

Finally, you will be sure to detect different files (except for hash collisions), because you will hash the whole file if required.

UPDATE

For movies I would consider the file length practically unique, but files recoded to fit on a given medium probably render this idea void - (S)VCD movies will all be in a small range of file lengths around CD-ROM capacity.

But for movie files in general, I would just hash one block (maybe 512 bytes) from the middle of the file. Two different movies with the same image and sound at the same position? Practically impossible, unless you manipulate files to fail this test. But you could easily generate files that fail any deterministic sampling strategy - so this should not really matter.
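A minimal sketch of that middle-block check - hash the file length plus one 512-byte block from the middle of the file (the class name is made up, and MD5 is assumed):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class MiddleBlockId
    {
        public static byte[] Compute(string path)
        {
            using (var md5 = MD5.Create())
            using (var fs = File.OpenRead(path))
            {
                // One 512-byte block centred on the middle of the file.
                var block = new byte[512];
                long offset = Math.Max(0, fs.Length / 2 - block.Length / 2);
                fs.Seek(offset, SeekOrigin.Begin);
                int read = fs.Read(block, 0, block.Length);

                // Fold the file length in as well, as suggested in the question.
                byte[] lengthBytes = BitConverter.GetBytes(fs.Length);
                md5.TransformBlock(lengthBytes, 0, lengthBytes.Length, null, 0);
                md5.TransformFinalBlock(block, 0, read);
                return md5.Hash;
            }
        }
    }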

Daniel Brückner
RE: "If you see a new file with a unique file length" this is a really tricky problem, cause it may be the original file and it moved somewhere else. I agree that the algorithm is not 100% safe, but I am finding it literally impossible to make it fail with real videos (DVDs / AVIs etc...) I think this a good first level of hashing and is far more robust than length alone.
Sam Saffron
For movies I would consider the file length practically unique. Do you have two different files with the same size? Okay, maybe if recoded to fit on a given medium - (S)VCD movies will all be in a small range of file lengths. But for media files I would just hash one block (maybe 512 bytes) from the middle of the file. Two different movies with the same image and sound at the same position? Practically impossible, unless you manipulate files to fail this test.
Daniel Brückner
A: 
  1. Don't seek backward; open the file with FILE_FLAG_SEQUENTIAL_SCAN (on Windows).
    (Select X random numbers, then sort them.)
  2. Don't seek too far; there is usually some data in the read-ahead cache.
  3. If you have large files, format your partition to have a large sector size.
  4. You return a Guid for the Id, but most hash algorithms need more than 128 bits.
Shay Erlichmen
fixed the typo :) The solution sorts the positions so I'm not seeking backwards... how do I go about setting FILE_FLAG_SEQUENTIAL_SCAN in .Net? I don't really have access to the low-level info from C#...
Sam Saffron
Low-level (AFAIK): use CreateFile (pinvoke.net is your friend) and the FileStream ctor that accepts the IntPtr.
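For reference, a sketch of that route (my own illustration, not Shay's code): P/Invoke CreateFile with FILE_FLAG_SEQUENTIAL_SCAN and hand the resulting handle to a FileStream. On .NET 2.0 and later you can also skip the interop entirely by passing FileOptions.SequentialScan to the FileStream constructor, which maps to the same flag.

    using System;
    using System.IO;
    using System.Runtime.InteropServices;
    using Microsoft.Win32.SafeHandles;

    static class SequentialOpen
    {
        const uint GENERIC_READ = 0x80000000;
        const uint FILE_SHARE_READ = 0x00000001;
        const uint OPEN_EXISTING = 3;
        const uint FILE_FLAG_SEQUENTIAL_SCAN = 0x08000000;

        [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
        static extern SafeFileHandle CreateFile(
            string lpFileName, uint dwDesiredAccess, uint dwShareMode,
            IntPtr lpSecurityAttributes, uint dwCreationDisposition,
            uint dwFlagsAndAttributes, IntPtr hTemplateFile);

        public static FileStream Open(string path)
        {
            // Hint to the OS that the file will be read mostly sequentially,
            // so it can tune its read-ahead caching.
            SafeFileHandle handle = CreateFile(path, GENERIC_READ, FILE_SHARE_READ,
                IntPtr.Zero, OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, IntPtr.Zero);
            if (handle.IsInvalid)
                throw new IOException("CreateFile failed for " + path,
                    Marshal.GetHRForLastWin32Error());
            // SafeFileHandle replaces the raw IntPtr overload mentioned above.
            return new FileStream(handle, FileAccess.Read);
        }
    }

How much it buys you depends on the access pattern: the hint mainly pays off when most of the file is read in order, so with only a handful of scattered samples the gain may be modest.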
Shay Erlichmen
Ouch, the pain :) What kind of performance benefit would I get - is it 2x faster?
Sam Saffron
btw, an MD5 hash fits snugly in a Guid, which is really handy
Sam Saffron
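(For what it's worth, that works because an MD5 digest is exactly 16 bytes - the size the Guid(byte[]) constructor requires. A tiny illustration, not code from this thread:)

    using System;
    using System.IO;
    using System.Security.Cryptography;

    class GuidFromMd5
    {
        static void Main(string[] args)
        {
            // MD5 produces 16 bytes, exactly what Guid(byte[]) expects.
            byte[] digest = MD5.Create().ComputeHash(File.ReadAllBytes(args[0]));
            Console.WriteLine(new Guid(digest));
        }
    }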
I should open a side question about MD5. I know that in theory it's broken, but in the real world how broken is it really? Are there realistic cases, when looking at the 100K+ files on your PC, where MD5 lies?
Sam Saffron
Shay, regarding MD5 make sure you read this http://stackoverflow.com/questions/800685/which-hash-function-should-i-choose/817121#817121
Sam Saffron
+1  A: 
  • Always include the first and last blocks of the file in the hash.

This is because they're the most likely to differ from file to file. Consider BMP: it may have a fairly standard header (say an 800x600 image, 24-bit, the rest null), so you may want to overshoot the header a bit to get to the differentiating data. The problem is that headers vary wildly in size.

The last block is for file formats that append data to the original.

  • Read in blocks of a size that is native to the filesystem you use, or at least divisible by 512.
  • Always read blocks at an offset that is divisible by the block size.
  • If you get the same hash for files of the same size, do a deep scan of them (hash all the data) and memorize the file path so you don't scan it again.

Even then, unless you're lucky, you will misidentify some files as the same (for example a SQL Server database file and its 1:1 backup copy after only a few insertions; except that SQL Server does write a timestamp...).
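A sketch that folds those suggestions into the hash - the file length plus the first and last blocks, read at block-aligned offsets (the 4 KB block size and the class name are assumptions, and MD5 is again assumed as the hash):

    using System;
    using System.IO;
    using System.Security.Cryptography;

    static class EdgeBlockHash
    {
        const int BlockSize = 4 * 1024;   // assumed; ideally the filesystem's native block size

        public static byte[] Compute(string path)
        {
            using (var md5 = MD5.Create())
            using (var fs = File.OpenRead(path))
            {
                byte[] lengthBytes = BitConverter.GetBytes(fs.Length);
                md5.TransformBlock(lengthBytes, 0, lengthBytes.Length, null, 0);

                var buffer = new byte[BlockSize];

                // First block (offset 0 is trivially block-aligned).
                int read = fs.Read(buffer, 0, buffer.Length);
                md5.TransformBlock(buffer, 0, read, null, 0);

                // Last block, at an offset rounded down to a multiple of the block size.
                // (Files smaller than one block simply get hashed twice here.)
                long lastOffset = Math.Max(0, (fs.Length - 1) / BlockSize * BlockSize);
                fs.Seek(lastOffset, SeekOrigin.Begin);
                read = fs.Read(buffer, 0, buffer.Length);

                md5.TransformFinalBlock(buffer, 0, read);
                return md5.Hash;
            }
        }
    }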

Pasi Savolainen
First and last block are an interesting optimisation (the idea of optimizing for a particular format is really appealing - for example VOBs are problematic in that way). Regarding reading divisible blocks, I guess this helps provided the FS is not fragmented. Yeah, the deep scan idea may be a good trick to ensure this really never fails.
Sam Saffron