views: 174

answers: 3
I am looking for an efficient means to partially check the integrity of "large" data sets over a slow transfer medium. This seems like a common problem as file sizes grow out of proportion to transfer rates.

For example, for concrete numbers, a terabyte of data over USB2. Checking that this data is still valid by reading every byte into a hash or checksum requires a day, and increases the risk of drive failure.

Instead, this code would need to verify random pieces of data, and provide a probability of validity based on the time available. If allowed to run sufficiently long, all blocks would be verified (base case of reading entire data set).
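
As an illustration of the "probability of validity" idea (the corruption rate and sampling model below are assumptions, not part of the question): if corruption affects some fraction of fixed-size blocks and blocks are sampled uniformly at random, the chance of catching it grows quickly with the number of blocks read.

    # Sketch: confidence gained from checking k randomly chosen blocks,
    # assuming corruption independently affects a fraction `bad_fraction`
    # of fixed-size blocks (all numbers are illustrative).

    def detection_probability(bad_fraction, blocks_checked):
        # Probability that at least one corrupted block is in the sample
        # (sampling with replacement; a close approximation for small samples).
        return 1.0 - (1.0 - bad_fraction) ** blocks_checked

    bad_fraction = 0.001  # e.g. 0.1% of blocks corrupted
    for checked in (100, 1000, 10000):
        p = detection_probability(bad_fraction, checked)
        print(f"{checked:6d} blocks checked -> detection probability {p:.4f}")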

Usage "story":
-- Data stored in large encrypted containers (sizes 1TB .. 1GB).
-- Each container redundantly backed up on multiple sets of drives at different locations.
-- Verification check must be done without knowledge of underlying data or keys.

Failure modes the approach needs to DETECT:
- Storage transport failures (e.g., controller drops portions of the physical address space)
- Sector errors (no data returned for a particular block)
- Single bit errors (non-ECC memory or caches)

When errors are detected, the data is recovered from redundant storage. The verification data will probably have to be stored separately.

Since the goal is data integrity, the techniques from file-sharing networks don't seem applicable -- a "hash tree" would require full storage of the hashes at each node, which seems to be more storage than is necessary for scenarios without active attackers.

  • How can I determine the tradeoff between storage space and the time to read the relevant blocks of the file? (A back-of-envelope sketch follows this list.)
  • If a hash-tree/hash-list is the best way, how safe is it to store partial values of the hashes?
  • Would some checksum or error correcting code be a better choice than hashes for equivalent protection?
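
As a back-of-envelope sketch of the first bullet's storage-versus-read-time tradeoff (the block sizes, the 32-byte SHA-256 digest per block, and ~30 MB/s sustained USB2 throughput are all assumptions):

    # Back-of-envelope: digest storage vs. time to verify a 1000-block sample
    # of a 1 TiB container. A 32-byte digest per block and ~30 MB/s sustained
    # USB2 throughput are illustrative assumptions.

    TOTAL_BYTES = 1 << 40
    HASH_BYTES = 32
    USB2_BYTES_PER_SEC = 30 * 10**6

    for block_mib in (1, 4, 16, 64):
        block = block_mib << 20
        n_blocks = TOTAL_BYTES // block
        digest_storage_mib = n_blocks * HASH_BYTES / (1 << 20)
        sample_minutes = 1000 * block / USB2_BYTES_PER_SEC / 60
        print(f"{block_mib:3d} MiB blocks: {digest_storage_mib:6.1f} MiB of digests, "
              f"~{sample_minutes:5.1f} min to read a 1000-block sample")

Smaller blocks mean more digest storage but cheaper per-block reads; larger blocks shrink the digest table but make each sampled read slower.
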
A: 

How about storing hash or checksum values for runs of data within the file? You would then only have to read in a restricted portion of the data for a limited validation of the file content.
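
A minimal sketch of that suggestion, assuming fixed 1 MiB chunks and SHA-256 (the chunk size, digest choice, and function names are illustrative, not part of the answer):

    import hashlib
    import random

    CHUNK = 1 << 20  # assumed 1 MiB chunks

    def build_chunk_hashes(path):
        # One (offset, digest) pair per chunk; keep this list separate
        # from the container itself.
        hashes = []
        with open(path, "rb") as f:
            offset = 0
            while True:
                data = f.read(CHUNK)
                if not data:
                    break
                hashes.append((offset, hashlib.sha256(data).hexdigest()))
                offset += len(data)
        return hashes

    def verify_random_chunks(path, hashes, samples=100):
        # Re-read a random sample of chunks and compare digests; confidence
        # in the whole file grows with the number of chunks checked.
        with open(path, "rb") as f:
            for offset, expected in random.sample(hashes, min(samples, len(hashes))):
                f.seek(offset)
                if hashlib.sha256(f.read(CHUNK)).hexdigest() != expected:
                    return False  # mismatch: recover this region from a replica
        return True

Storing the (offset, digest) list apart from the container also matches the requirement in the question that verification data live separately from the data itself.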

1800 INFORMATION
+2  A: 

Transfers take place over USB2, right? Therefore you should know that:

  • USB communication takes place in packets, with a payload of up to 1024 bytes for high-speed transfers and a 16-bit CRC.
  • Each packet is acknowledged and potentially retransmitted.

You have to take this information into account when designing an algorithm that adds guarantees beyond those provided by the CRC, otherwise it would be futile. If I remember correctly, a 16-bit CRC detects any single error burst no longer than 16 bits, and only a fraction of longer ones.

You can start with Wikipedia: http://en.wikipedia.org/wiki/USB2 and http://en.wikipedia.org/wiki/Cyclic_redundancy_check.
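
To illustrate the end-to-end point (a sketch, not part of this answer): a per-block CRC computed over the data at rest, using a different polynomial than the USB link CRC, catches errors the transport layer never sees, such as corruption on the storage medium. Python's binascii.crc_hqx implements CRC-CCITT (polynomial 0x1021), which differs from USB2's CRC-16 (polynomial 0x8005); the block size and seed below are assumptions.

    import binascii

    BLOCK = 1 << 20  # assumed 1 MiB blocks

    def block_crcs(path):
        # Yield a 16-bit CRC-CCITT value per block, computed end to end
        # over the stored data rather than per USB packet.
        with open(path, "rb") as f:
            while True:
                data = f.read(BLOCK)
                if not data:
                    break
                yield binascii.crc_hqx(data, 0xFFFF)

A 16-bit check still leaves roughly a 2^-16 chance of missing random corruption in a block, so a 32-bit CRC (zlib.crc32) or a short cryptographic digest buys much better detection for a few extra bytes per block.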

Nicola Bonelli
I added some potential failure modes in the description. Even if I stayed with a 16-bit CRC, if I used a different polynomial than USB2, wouldn't that provide more detection capability (end-to-end)?
It would certainly protect against different failure scenarios. USB2 would prevent (or decrease the likelihood of) errors in transmission, but you can still have errors introduced in the storage medium.
Nick Johnson
A: 

You might want to try using something like PAR2 to create redundancy data. This will allow you both to check and correct data, and could probably be adapted for random access.
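
For intuition only (a much-simplified stand-in for PAR2's Reed-Solomon coding, not its actual format): even a single XOR parity block lets you detect and rebuild one lost data block, but the redundancy costs whole blocks of storage rather than small digests.

    # Toy illustration of detect+correct redundancy: one XOR parity block
    # can rebuild any single missing data block. PAR2 uses Reed-Solomon,
    # which generalizes this idea.

    def xor_blocks(blocks):
        out = bytearray(len(blocks[0]))
        for b in blocks:
            for i, byte in enumerate(b):
                out[i] ^= byte
        return bytes(out)

    data = [b"alpha---", b"bravo---", b"charlie-"]  # equal-length toy blocks
    parity = xor_blocks(data)

    # Simulate losing block 1, then rebuild it from the survivors plus parity.
    rebuilt = xor_blocks([data[0], data[2], parity])
    assert rebuilt == data[1]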

Hasturkun
Since PAR2 is DETECT+CORRECT, it adds significantly more storage overhead (compared to a hash tree) - the goal is just to detect, and to use pure redundancy for correction.