tags:

views:

136

answers:

5

My Java program saves its data to a binary file, and (very) occasionally the file becomes corrupt due to a hardware fault. Usually only a few bytes are affected in a file that is several megabytes in size. To cope with this problem, I could write the data twice, but this seems overkill - I would prefer to limit the file size increase to about 20%.

This seems to me to be similar to the problem of sending information over a 'noisy' data stream. Is there a Java library or algorithm that can write redundant information to an output stream so the receiver can recover when noise is introduced?

+1  A: 

The problem of noisy communications has already has a great solution: Send a hash/CRC of the data (with the data) which is (re)evaluated by the receiver and re-requested if there was corruption en route.

In other words: use a hash algorithm to check for corruption and retransmit when necessary instead of sending data redundantly.

Paul Sasik
If the error is in the storage, as on the harddrive, he will have lost that information if he only has the hash/CRC of the file. By using an ECC, with enough redundant information, he can potentially correct the error (depending on the extent of the loss).
Mic
I dont think this is talking about data transfer, but data storage. The OP is drawing parallels to how data transfer is verified.
GrayWizardx
Ah yes. i misunderstood the question. Still, hashing the data will work for communications as well as persistence.
Paul Sasik
Mic -- It's a trade-off between the amount of data that Mark is willing to send, and the frequency of errors. If the errors are rare and localized, then a CRC check might take no more than one extra byte per block, while an error-correcting code may take a lot more bytes to send.
Chip Uni
+9  A: 

What you want is Error Correction Codes. Check this code out: http://freshmeat.net/projects/javafec/

As well, the wikipedia article might give you more information: http://en.wikipedia.org/wiki/Forward%5Ferror%5Fcorrection

Your two possiblities are Forward Error Correction, where you send redundant data, or a system for error detection, where you check a hash value and re-request any data that has become corrupted. If corruption is an expected thing, error correction is the approach to take.

Without knowing the nature of your environment, giving more specific advice isn't really possible, but this should get you started on knowing how to approach this problem.

McPherrinM
If you're data comes in in sections, you can use sectional CRC to accomplish this. Depending on the size of the dat you can have a CRC between 4-8 bytes for the data that should be reasonably resistant to corruption. Write the CRC at the end of each section and verify it on retrieval of the section. If the CRC or the data is damaged the section will be invalidated.
GrayWizardx
A: 

Sounds antiquated, but funny, I just had a similar conversation with someone who wrote "mobile" apps (not PDA/phone but Oil&Gas drilling rig-style field applications). Due to the environment they actually wrote to disk in a modified XMODEM CRC transfer. I think it is easy to say however nothing special there other than to:

Use a RandomAccessFile in "rw" write a block of data (512-4096 bytes), re-read for CRC check, re-write if non-match, or iterate to next block. With OS file caching I'm curious how effective this is?

Xepoch
+2  A: 

Error correcting codes. If I recall correctly the number of additional bits goes as log n for the block size, so the larger blocks the fewer correction bits.

You should choose a mechanism that interleaves the checkbits (probably most convenient as extra characters) in between the normal text. This allows for having repairable holes in your data stream while still being readable.

Thorbjørn Ravn Andersen
+1  A: 

CRCs and ECCs are the stand answer to detecting and (for ECCs) recovering from data corruption due to noise. However, any scheme can only cope with a certain level of noise. Beyond that level you will get undetected and/or uncorrectable errors. The second problem is that these schemes will only work if you can add the ECCs / CRCs before the noise is injected.

But I'm a bit suspicious that you may be trying to address the wrong problem:

  • If the corruption is occurring when you transmit the file over a comms line, then you should be using comms hardware with built-in ECC etc support.

  • If the corruption is occurring when you write the file to disc, then you should replace the disc.

  • You should also consider the possibility that it is your application that is corrupting the data; e.g. due to some synchronization bug in your code.

Stephen C