tags:
views: 60
answers: 1

I'm working on a solution where I need to associate metadata with files. To match the right file with the right metadata even if the file is moved, for instance, I need to be able to create a "fingerprint" of sorts that identifies the file.

The obvious solution would be to calculate a hash of the file contents. However, hashing an entire file seems quite time-consuming, so I was thinking it might be better to calculate the checksum over just a chunk of the file, say the first x bytes.
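A minimal sketch of that idea in Python, hashing only the first chunk of a file (the function name, the choice of SHA-1, and the 64 KB default are illustrative assumptions, not a recommendation from the question):

```python
import hashlib

def partial_fingerprint(path, chunk_size=64 * 1024):
    """Hash only the first chunk_size bytes of a file.

    chunk_size is an illustrative default; tune it to your data.
    Reading a fixed-size prefix keeps the cost constant regardless
    of how large the file is.
    """
    h = hashlib.sha1()
    with open(path, "rb") as f:
        h.update(f.read(chunk_size))
    return h.hexdigest()
```

Two files that share the same opening bytes would collide under this scheme, which is exactly the weakness the rest of the question discusses.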

Another problem is that some files contain metadata headers that might change, mp3s for instance, so the fingerprinting method would have to adapt to the kind of file and choose which "chunk" to calculate the checksum on...
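For the mp3 case specifically, one way to adapt is to skip past an ID3v2 tag before hashing. A simplified sketch (the function names are assumptions; ID3v2 tags start with `ID3` followed by a 10-byte header whose last four bytes encode the tag size as a synchsafe integer, and this sketch ignores footers and trailing ID3v1 tags):

```python
import hashlib

def mp3_content_offset(path):
    """Return the offset where audio data starts, skipping an
    ID3v2 tag if one is present at the front of the file."""
    with open(path, "rb") as f:
        header = f.read(10)
    if len(header) == 10 and header[:3] == b"ID3":
        # The tag size is a 28-bit synchsafe integer: 7 bits per byte.
        size = 0
        for b in header[6:10]:
            size = (size << 7) | (b & 0x7F)
        return 10 + size
    return 0

def fingerprint_skipping_id3(path, chunk_size=64 * 1024):
    """Hash a chunk of the file starting after any ID3v2 header,
    so that retagging the file does not change the fingerprint."""
    offset = mp3_content_offset(path)
    h = hashlib.sha1()
    with open(path, "rb") as f:
        f.seek(offset)
        h.update(f.read(chunk_size))
    return h.hexdigest()
```

A per-format table of "skip this header, then hash" rules could dispatch on file extension or magic bytes in the same way.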

So my questions are: Is this a good way to do it, and has anyone else done something similar? How many bytes do you think are needed to calculate the checksum?

Thanks everyone for your input

+1  A: 

This very much depends on what exact type of files you are handling.

I wouldn't completely give up on hashing the entire file. Is this a real bottleneck in your app?

If you must hash only parts of the file, you should evaluate which files you are dealing with and which parts of each file to hash, in order to get as few false hash matches as possible.
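One common way to cut down on false matches without reading whole files is to mix the file size into the fingerprint and hash both the head and the tail. A sketch along those lines (the function name and the 16 KB chunk size are illustrative assumptions):

```python
import hashlib
import os

def cheap_fingerprint(path, chunk_size=16 * 1024):
    """Combine file size with hashes of the first and last chunks.

    Including the size and the tail catches many files whose opening
    bytes happen to match, e.g. files sharing a common format header.
    The chunk size here is illustrative, not a recommendation.
    """
    size = os.path.getsize(path)
    h = hashlib.sha1()
    h.update(str(size).encode())          # mix in the file size
    with open(path, "rb") as f:
        h.update(f.read(chunk_size))      # head of the file
        if size > chunk_size:
            f.seek(max(chunk_size, size - chunk_size))
            h.update(f.read(chunk_size))  # tail of the file
    return h.hexdigest()
```

The cost stays bounded at two chunk reads per file, while an attacker-free workload (deduplicating your own files) rarely produces collisions across size, head, and tail simultaneously.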

Yuval A
Yeah, potentially every file on a computer needs to be fingerprinted, so it would be a bottleneck, and there would be some rules for handling different files, like mp3s, pictures, etc., that might have a metadata header. I'm curious, though, how many bytes you'd need for a checksum in general; it doesn't seem like it would have to be that large...
MattiasK