views:

359

answers:

6

Is it mathematically feasible to encode and initial 4 byte message into 8 bytes and if one of the 8 bytes is completely dropped and another is wrong to reconstruct the initial 4 byte message? There would be no way to retransmit nor would the location of the dropped byte be known.

If one uses Reed Solomon error correction with 4 "parity" bytes tacked on to the end of the 4 "data" bytes, such as DDDDPPPP, and you end up with DDDEPPP (where E is an error) and a parity byte has been dropped, I don't believe there's a way to reconstruct the initial message (although correct me if I am wrong)...

What about multiplying (or performing another mathematical operation) the initial 4 byte message by a constant, then utilizing properties of an inverse mathematical operation to determine what byte was dropped. Or, impose some constraints on the structure of the message so every other byte needs to be odd and the others need to be even.

Alternatively, instead of bytes, it could also be 4 decimal digits encoded in some fashion into 8 decimal digits where errors could be detected & corrected under the same circumstances mentioned above - no retransmission and the location of the dropped byte is not known.

I'm looking for any crazy ideas anyone might have... Any ideas out there?

EDIT:

It may be a bit contrived, but the situation that I'm trying to solve is one where you have, let's say, a faulty printer that prints out important numbers onto a form, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop, but they will always come out in the correct order, there won't be any digits swapped.

The form could be altered so that it always prints a space between the initial four numbers and the error correction numbers, ie 1234 5678, so that one would know whether a 1234 initial digit was dropped or a 5678 error correction digit was dropped, if that makes the problem easier to solve. I'm thinking somewhat similar to how they verify credit card numbers via algorithm, but in four digit chunks.

Hopefully, that provides some clarification as to what I'm looking for...

+3  A: 

I think you would need to study what erasure codes might offer you. I don't know any bounds myself, but maybe some kind of MDS code might achieve this.

EDIT: After a quick search I found RSCode library and in the example it says that

In general, with E errors, and K erasures, you will need
* 2E + K bytes of parity to be able to correct the codeword
* back to recover the original message data.

So looks like Reed-Solomon code is indeed the answer and you may actually get recovery from one erasure and one error in 8,4 code.

Krystian
I believe with Reed Solomon you need to know the location of the erasure. I suppose that you could try each error position, assuming it's in the initial data position, but if you lose a parity byte, you won't be able to check anything, at least as I understand it...
"If the locations of the errored symbols are not known in advance, then a Reed–Solomon code can correct up to (n − k) / 2 erroneous symbols, i.e., it can correct half as many errors as there are redundant symbols added to the block." - from the wikipedia page (http://en.wikipedia.org/wiki/Reed-Solomon_code#Properties_of_Reed.E2.80.93Solomon_codes)
jimktrains
+1  A: 

Parity codes work as long as two different data bytes aren't affected by error or loss and as long as error isn't equal to any data byte while a parity byte is lost, imho.

Gabriel Ščerbák
...in other cases you can identify parity bytes(two or more equal bytes) and encode 1 bit in their position (odd or even).
Gabriel Ščerbák
A: 

In the case of decimal digits, assuming one goes with first digit odd, second digit even, third digit odd, etc - with two digits, you get 00-99, which can be represented in 3 odd/even/odd digits (125 total combinations) - 00 = 101, 01 = 103, 20 = 181, 99 = 789, etc. So one encodes two sets of decimal digits into 6 total digits, then the last two digits signify things about the first sets of 2 digits or a checksum of some sort... The next to last digit, I suppose, could be some sort of odd/even indicator on each of the initial 2 digit initial messages (1 = even first 2 digits, 3 = odd first two digits) and follow the pattern of being odd. Then, the last digit could be the one's place of a sum of the individual digits, that way if a digit was missing, it would be immediately apparent and could be corrected assuming the last digit was correct. Although, it would throw things off if one of the last two digits were dropped...

+4  A: 

In the absence of "nice" algebraic structure, I suspect that it's going to be hard to find a concise scheme that gets you all the way to 10**4 codewords, since information-theoretically, there isn't a lot of slack. (The one below can use GF(5) for 5**5 = 3125.) Fortunately, the problem is small enough that you could try Shannon's greedy code-construction method (find a codeword that doesn't conflict with one already chosen, add it to the set).


Encode up to 35 bits as a quartic polynomial f over GF(128). Evaluate the polynomial at eight predetermined points x0,...,x7 and encode as 0f(x0) 1f(x1) 0f(x2) 1f(x3) 0f(x4) 1f(x5) 0f(x6) 1f(x7), where the alternating zeros and ones are stored in the MSB.

When decoding, first look at the MSBs. If the MSB doesn't match the index mod 2, then that byte is corrupt and/or it's been shifted left by a deletion. Assume it's good and shift it back to the right (possibly accumulating multiple different possible values at a point). Now we have at least seven evaluations of a quartic polynomial f at known points, of which at most one is corrupt. We can now try all possibilities for the corruption.

EDIT: bmm6o has advanced the claim that the second part of my solution is incorrect. I disagree.

Let's review the possibilities for the case where the MSBs are 0101101. Suppose X is the array of bytes sent and Y is the array of bytes received. On one hand, Y[0], Y[1], Y[2], Y[3] have correct MSBs and are presumed to be X[0], X[1], X[2], X[3]. On the other hand, Y[4], Y[5], Y[6] have incorrect MSBs and are presumed to be X[5], X[6], X[7].

If X[4] is dropped, then we have seven correct evaluations of f.

If X[3] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 3, and six correct evaluations.

If X[5] is dropped and X[4] is corrupted, then we have an incorrect evaluation at 5, and six correct evaluations.

There are more possibilities besides these, but we never have fewer than six correct evaluations, which suffices to recover f.

The first part of your solution is good, but your description is very unclear. I had no idea what you were talking about until I came up with the same solution myself.
Brian
Nice idea, but I have problems to understand how you would treat the case where the MSBs are 0000101. It seems to me that both0c0m0101 and 0m0c0101 (where c indicates a corrupt byte and m indicates a missing byte) are possible. I.e. either the second or the third byte can be f(x2) and the other byte is corrupt. Maybe I'm missing something.
Accipitridae
As you noticed, we receive one correct value and one incorrect value for byte 2. In this case recovery is very easy, since we know that bytes 0, 4, 5, 6, 7 are correct.
Yup, sounds good. Thanks.
Accipitridae
How do you handle one dropped byte and an error at an unknown location? Or all bytes delivered but one error? It's true that "you never have fewer than six correct evaluations", but if you don't know which ones they are, you can't use them. At the top, you talk about Shannon's code - is decoding by brute force (I'm not familiar with the construct and a link would help the OP)?
Brian
A: 

It looks to be theoretically possible if we assume 1 bit error in wrong byte. We need 3 bits to identify dropped byte and 3 bits to identify wrong byte and 3 bits to identify wrong bit. We have 3 times that many extra bits.

But if we need to identify any number of bits error in wrong byte, it comes to 30 bits. Even that looks to be possible with 32 bits, although 32 is a bit too close for my comfort.

But I don't know hot to encode to get that. Try turbocode?

Fakrudeen
+1  A: 

Error correcting codes can in general handle erasures, but in the literature the position of the erasure is assumed known. In most cases, the erasure will be introduced by the demodulator when there is low confidence that the correct data can be retrieved from the channel. For instance, if the signal is not clearly 0 or 1, the device can indicate that the data was lost, rather than risking the introduction of an error. Since an erasure is essentially an error with a known position, they are much easier to fix.

I'm not sure what your situation is where you can lose a single value and you can still be confident that the remaining values are delivered in the correct order, but it's not a situation classical coding theory addresses.

What algorithmist is suggesting above is this: If you can restrict yourself to just 7 bits of information, you can fill the 8th bit of each byte with alternating 0 and 1, which will allow you to know the placement of the missing byte. That is, put a 0 in the high bit of bytes 0, 2, 4, 6 and a 1 in the high bits of the others. On the receiving end, if you only receive 7 bytes, the missing one will have been dropped from between bytes whose high bits match. Unfortunately, that's not quite right: if the erasure and the error are adjacent, you can't know immediately which byte was dropped. E.g., high bits 0101101 could result from dropping the 4th byte, or from an error in the 4th byte and dropping the 3rd, or from an error in the 4th byte and dropping the 5th.

You could use the linear code:

1 0 0 0  0 1 1 1
0 1 0 0  1 0 1 1
0 0 1 0  1 1 0 1
0 0 0 1  1 1 1 0

(i.e. you'll send data like (a, b, c, d, b+c+d, a+c+d, a+b+d, a+b+c) (where addition is implemented with XOR, since a,b,c,d are elements of GF(128))). It's a linear code with distance 4, so it can correct a single-byte error. You can decode with syndrome decoding, and since the code is self-dual, the matrix H will be the same as above.

In the case where there's a dropped byte, you can use the technique above to determine which one it is. Once you've determined that, you're essentially decoding a different code - the "punctured" code created by dropping that given byte. Since the punctured code is still linear, you can use syndrome decoding to determine the error. You would have to calculate the parity-check matrix for each of the shortened codes, but you can do this ahead of time. The shortened code has distance 3, so it can correct any single-byte errors.

Brian
It may be a bit contrived, but the situation that I'm thinking about is one where you have, let's say, a faulty printer that prints out important numbers, let's say credit card numbers, which are then mailed off to a processing firm which uses OCR to read the forms. The OCR isn't going to be perfect, but it should get close with only digits to read. The faulty printer could be a bigger problem, where it may drop a whole number, but there's no way of knowing which one it'll drop. Running a little long here - will edit the initial question with the scenario I am looking to solve...