I am looking for an efficient means to partially check the integrity of "large" data sets over a slow transfer medium. This seems like a common problem as file sizes grow out of proportion to transfer rates.
For concrete numbers: a terabyte of data over USB 2.0. Checking that the data is still valid by reading every byte into a hash or checksum takes roughly a day, and that full read itself increases the risk of drive failure.
Instead, the check would need to verify randomly chosen pieces of the data and report a probability of validity based on the time available. Given enough time, every block would eventually be verified (the base case being a read of the entire data set).
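Roughly the shape I have in mind, as a sketch only: it assumes a sidecar manifest holding one SHA-256 digest per fixed-size block of the container, and the 1 MiB block size and file layout are placeholders of my own, not a decided format.

```python
import hashlib
import os
import random
import time

BLOCK_SIZE = 1 << 20   # 1 MiB blocks -- placeholder choice
DIGEST_SIZE = 32       # SHA-256

def spot_check(container_path, manifest_path, time_budget_s):
    """Verify randomly chosen blocks until the time budget runs out.

    The manifest is assumed to hold one 32-byte SHA-256 digest per block,
    in block order, written when the container was last known good.
    """
    n_blocks = -(-os.path.getsize(container_path) // BLOCK_SIZE)
    order = list(range(n_blocks))
    random.shuffle(order)          # visit blocks in random order, no repeats

    checked, bad = 0, []
    deadline = time.monotonic() + time_budget_s
    with open(container_path, "rb") as data, open(manifest_path, "rb") as manifest:
        for block_no in order:
            if time.monotonic() >= deadline:
                break
            data.seek(block_no * BLOCK_SIZE)
            manifest.seek(block_no * DIGEST_SIZE)
            expected = manifest.read(DIGEST_SIZE)
            actual = hashlib.sha256(data.read(BLOCK_SIZE)).digest()
            if actual != expected:
                bad.append(block_no)
            checked += 1
    return checked, n_blocks, bad   # coverage so far plus any blocks to re-fetch
```

Returning the bad block numbers is deliberate, since those are exactly the pieces that would be re-fetched from one of the redundant copies.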
Usage "story":
-- Data stored in large encrypted containers (sizes from 1 GB up to 1 TB).
-- Each container redundantly backed up on multiple sets of drives at different locations.
-- Verification must be possible without knowledge of the underlying data or keys.
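Because the containers are encrypted, the digests can be taken over the raw container bytes, so verification never needs the keys or any knowledge of the plaintext. A sketch of building the sidecar manifest that the spot check above reads (same placeholder block size):

```python
import hashlib

BLOCK_SIZE = 1 << 20   # must match the block size used by the spot check

def build_manifest(container_path, manifest_path):
    """Write one SHA-256 digest per fixed-size block of the (encrypted) container."""
    with open(container_path, "rb") as data, open(manifest_path, "wb") as manifest:
        while True:
            block = data.read(BLOCK_SIZE)
            if not block:
                break
            manifest.write(hashlib.sha256(block).digest())
```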
Failure modes the approach needs to DETECT:
- Storage transport failures (e.g., the controller drops portions of the physical address space)
- Sector errors (no data returned for a particular block)
- Single-bit errors (non-ECC memory or caches)
When errors are detected, the data is recovered from redundant storage. The verification data will probably have to be stored separately.
Since the goal is data integrity rather than defense against an active attacker, the techniques from file-sharing networks don't seem applicable -- a "hash tree" requires storing a hash at every node, which looks like more storage than is necessary here.
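For a sense of scale (my own numbers: 1 MiB blocks and 32-byte digests), the flat per-block hash list for a 1 TB container is only tens of megabytes, and even storing a full binary tree of hashes roughly doubles that:

```python
def hash_storage_bytes(container_bytes, block_size=1 << 20, digest_size=32):
    """Rough storage for a flat hash list vs. a full binary hash tree."""
    n_blocks = -(-container_bytes // block_size)        # ceiling division
    flat_list = n_blocks * digest_size                  # one digest per block
    full_tree = (2 * n_blocks - 1) * digest_size        # leaves plus internal nodes
    return flat_list, full_tree

flat, tree = hash_storage_bytes(10**12)                 # a 1 TB container
print(f"flat list: {flat / 2**20:.0f} MiB, full tree: {tree / 2**20:.0f} MiB")
```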
- How can I determine the tradeoff between storage space and the time needed to read the relevant blocks of the file? (My attempt at framing this is sketched after this list.)
- If a hash-tree/hash-list is the best way, how safe is it to store partial values of the hashes?
- Would a checksum or error-correcting code be a better choice than hashes for equivalent protection?
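For the first question, this is how far I have gotten on my own (back-of-the-envelope only; the block size and the ~30 MB/s USB2 figure are assumptions): the block size fixes both the manifest size and how many blocks fit in a time budget, and sampling k of n blocks misses all of c corrupted blocks with probability C(n-c, k) / C(n, k).

```python
def miss_probability(n_blocks, corrupt_blocks, sampled_blocks):
    """Chance that sampling k distinct blocks out of n hits none of the c bad ones.

    Exactly C(n - c, k) / C(n, k), computed as a running product so it stays
    cheap even when n and k are large.
    """
    p = 1.0
    for i in range(corrupt_blocks):
        p *= (n_blocks - sampled_blocks - i) / (n_blocks - i)
    return max(p, 0.0)

# Example: 1 TB in 1 MiB blocks, ~30 MB/s over USB2, a one-hour budget.
n = 10**12 // 2**20                     # ~950,000 blocks total
k = 30 * 10**6 * 3600 // 2**20          # ~100,000 blocks readable in an hour
for c in (1, 10, 100):
    print(f"{c} bad blocks -> {miss_probability(n, c, k):.1%} chance the check misses all of them")
```

This ignores seek overhead for scattered reads, which would start to dominate if the blocks were much smaller.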