But is it worth it? If you have a hash for each file, then you essentially have a fixed overhead for each file. Let's say that each file must take up at least 512 bytes (a typical disk sector) and that you're storing these hashes compactly enough that each one takes up not much more than the hash size itself.
So, even if all your files are 512 bytes, the smallest possible, you're talking about an overhead of either 16 / 512 = 3.1% (with a 16-byte hash) or 32 / 512 = 6.3% (with a 32-byte hash). In reality, I'd bet your average file size is higher (unless all your files are 1 sector...), so that overhead would be less.
Now, the amount of space you need for hashes scales linearly with the number of files you have. Is that extra space worth that much? Even if you had the trillion files you mentioned, that's 1 000 000 000 000 * 16 = ~14.6 TiB of hashes (or ~29 TiB with 32-byte hashes), which is a lot of space, but keep in mind: your data would be 1 000 000 000 000 * 512 = ~466 TiB. The absolute numbers don't matter much, really, since it's still 3% or 6% overhead. But at this level, where you have half a petabyte of storage, does ~15 TiB matter? At any level, does a 3% savings mean anything? And remember, if the files are larger, you save less. (Which they probably are: good luck getting a 512-byte sector size at that hard disk size.)
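To make the arithmetic easy to re-run with your own numbers, here is a minimal sketch. The function name `hash_overhead` and its defaults (512-byte files, 16-byte hashes) are just illustrative choices taken from the figures above, and it assumes each hash is stored with no extra bookkeeping.

```python
TIB = 2**40  # bytes per tebibyte


def hash_overhead(num_files, file_size=512, hash_size=16):
    """Return (hash_bytes, data_bytes, overhead_fraction).

    Assumes every file is exactly file_size bytes and every hash
    occupies exactly hash_size bytes (no index or padding overhead).
    """
    hash_bytes = num_files * hash_size
    data_bytes = num_files * file_size
    return hash_bytes, data_bytes, hash_bytes / data_bytes


# Compare 16-byte and 32-byte hashes for a trillion minimum-size files.
for size in (16, 32):
    h, d, frac = hash_overhead(10**12, hash_size=size)
    print(f"{size}-byte hash: {h / TIB:.2f} TiB of hashes, "
          f"{d / TIB:.2f} TiB of data, overhead {frac * 100:.3f}%")
```

Raising `file_size` shows how quickly the overhead fraction shrinks once files are bigger than one sector.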
So, is this 3% or less disk savings worth the potential risk in security? (Which I'll leave unanswered, as it's waaay not my cup of tea.)
Alternatively, could you, say, group files together in some logical fashion, so that you have fewer files? (I mean, if you have trillions of 512-byte files, do you really want to hash every byte on disk?)
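As a rough illustration of what grouping buys you, the sketch below computes the hash storage when one hash covers a whole group of files. The name `grouped_hash_bytes` and the group size of 1000 are hypothetical, picked only for the example; how you would actually group files depends on your data.

```python
def grouped_hash_bytes(num_files, files_per_group=1, hash_size=16):
    """Bytes of hash storage when one hash covers files_per_group files."""
    # Integer ceiling division: a partially filled last group still needs a hash.
    groups = -(-num_files // files_per_group)
    return groups * hash_size


per_file = grouped_hash_bytes(10**12)                       # one hash per file
per_group = grouped_hash_bytes(10**12, files_per_group=1000)  # one hash per 1000 files
print(per_file // per_group)  # factor by which hash storage shrinks
```

The trade-off is that checking a single file then means rehashing its whole group, so this only pays off if files in a group are usually read or verified together.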