Hi all,
I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream, I occasionally get stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.
I've tracked this down to the following:
- It only happens when reading a Unicode string (encoded using Encoding.BigEndianUnicode)
- It only happens when the string in question is split across two TCP packets (confirmed using Wireshark)
I think what is happening is the following (in the context of the example below):
- BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
- First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
- Network stream only has 3 bytes available
- 3 bytes read into local buffer
- Buffer handed to Decoder
- Decoder decodes 1 char, and keeps the other byte in its own internal buffer
- Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
- Network stream has all 4 bytes available
- 4 bytes read into local buffer
- Buffer handed to Decoder
- Decoder decodes 2 chars, and keeps the remaining 4th byte internally
- String decode is complete
Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.
char[] buffer = new char[3];
int charIndex = 0;

Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();

// pretend 3 of the 6 bytes arrive in one packet
byte[] b1 = new byte[] { 0, 83, 0 };
int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex);
charIndex += charsRead;

// pretend the remaining 3 bytes, plus a final byte for something unrelated,
// arrive next
byte[] b2 = new byte[] { 71, 0, 114, 3 };
charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex);
charIndex += charsRead;
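To make the knock-on effect visible, here's a hypothetical continuation of that snippet (the byte values are made up for illustration): the decoder is still holding the stray 0x03, so it silently welds it onto whatever bytes arrive next and everything downstream is shifted by one byte.

// Hypothetical continuation: pretend the next two bytes off the wire are
// 0 and 65, which the protocol intended for the *next* field.
char[] next = new char[2];
byte[] b3 = new byte[] { 0, 65 };
int n = decoder.GetChars(b3, 0, 2, next, 0);
// n == 1 and next[0] == '\u0300': the buffered 0x03 was paired with the
// 0x00, the 0x41 is now buffered in turn, and the stream is out of sync.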
I think the root cause is a bug in the .NET code, which uses charsRemaining * bytes-per-char on each loop iteration to calculate the number of bytes still required. Because of the extra byte hidden inside the Decoder, this calculation can be off by one, causing an extra byte to be consumed from the input stream.
Here's the .NET Framework code in question:
while (charsRemaining > 0) {
    // We really want to know what the minimum number of bytes per char
    // is for our encoding. Otherwise for UnicodeEncoding we'd have to
    // do ~1+log(n) reads to read n characters.
    numBytes = charsRemaining;
    if (m_2BytesPerChar)
        numBytes <<= 1;

    numBytes = m_stream.Read(m_charBytes, 0, numBytes);
    if (numBytes == 0) {
        return (count - charsRemaining);
    }

    charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);

    charsRemaining -= charsRead;
    index += charsRead;
}
I'm not entirely sure if this is a bug or just a misuse of the API. To work around the issue I'm calculating the required byte count myself, reading exactly that many bytes, and then running the byte[] through the relevant Encoding.GetString() (see the sketch below). However, this wouldn't work for a variable-width encoding such as UTF-8.
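For what it's worth, here's roughly what that workaround looks like. This is only a sketch, not my exact code: ReadUnicodeString and its parameters are illustrative names, and it assumes a fixed 2-bytes-per-char encoding like Encoding.BigEndianUnicode.

using System.IO;
using System.Text;

static string ReadUnicodeString(Stream stream, int charCount)
{
    // Work out the exact number of bytes this string occupies up front,
    // which is only possible because UTF-16 is fixed-width.
    byte[] bytes = new byte[charCount * 2];

    // Loop until every byte has arrived; Stream.Read may return fewer bytes
    // than requested when the data is split across TCP packets.
    int read = 0;
    while (read < bytes.Length)
    {
        int n = stream.Read(bytes, read, bytes.Length - read);
        if (n == 0)
            throw new EndOfStreamException();
        read += n;
    }

    // Decode exactly these bytes and nothing else, so no decoder state can
    // leak into (or swallow bytes from) the next message on the stream.
    return Encoding.BigEndianUnicode.GetString(bytes, 0, bytes.Length);
}

The important part is that the byte count is computed before any decoding happens, so nothing is ever left sitting inside a Decoder between messages.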
I'd be interested to hear people's thoughts on this, and whether I'm doing something wrong or not. Maybe it will also save the next person a few hours or days of tedious debugging.
EDIT: posted to Microsoft Connect: Connect tracking item