Hi all,

I've run into what I believe is an issue with the BinaryReader.ReadChars() method. When I wrap a BinaryReader around a raw socket NetworkStream occasionally I get a stream corruption where the stream being read gets out of sync. The stream in question contains messages in a binary serialisation protocol.

I've tracked this down to the following:

  • It only happens when reading a Unicode string (encoded using Encoding.BigEndianUnicode)
  • It only happens when the string in question is split across two TCP packets (confirmed using Wireshark)

I think what is happening is the following (in the context of the example below):

  • BinaryReader.ReadChars() is called asking it to read 3 characters (string lengths are encoded before the string itself)
  • First loop internally requests a read of 6 bytes (3 remaining characters * 2 bytes/char) off the network stream
  • Network stream only has 3 bytes available
  • 3 bytes read into local buffer
  • Buffer handed to Decoder
  • Decoder decodes 1 char, and keeps the other byte in its own internal buffer
  • Second loop internally requests a read of 4 bytes! (2 remaining characters * 2 bytes/char)
  • Network stream has all 4 bytes available
  • 4 bytes read into local buffer
  • Buffer handed to Decoder
  • Decoder decodes 2 chars, and keeps the remaining 4th byte internally
  • String decode is complete
  • Serialisation code attempts to unmarshal the next item and croaks because of stream corruption.

    char[] buffer = new char[3];
    int charIndex = 0;

    Decoder decoder = Encoding.BigEndianUnicode.GetDecoder();

    // pretend 3 of the 6 bytes arrive in one packet
    byte[] b1 = new byte[] { 0, 83, 0 };
    int charsRead = decoder.GetChars(b1, 0, 3, buffer, charIndex); // decodes 'S', buffers the odd third byte
    charIndex += charsRead;

    // pretend the remaining 3 bytes, plus a final byte belonging to the
    // next (unrelated) item, arrive next
    byte[] b2 = new byte[] { 71, 0, 114, 3 };
    charsRead = decoder.GetChars(b2, 0, 4, buffer, charIndex); // decodes 'G' and 'r'; byte 3 is swallowed
    charIndex += charsRead;
    

I think the root cause is a bug in the .NET code, which uses charsRemaining * bytes/char on each loop iteration to calculate the number of bytes still required. Because of the extra byte hidden inside the Decoder this calculation can be off by one, causing an extra byte to be consumed from the input stream.

Here's the .NET Framework code in question:

    while (charsRemaining>0) { 
        // We really want to know what the minimum number of bytes per char 
        // is for our encoding.  Otherwise for UnicodeEncoding we'd have to
        // do ~1+log(n) reads to read n characters. 
        numBytes = charsRemaining;
        if (m_2BytesPerChar)
            numBytes <<= 1;

        numBytes = m_stream.Read(m_charBytes, 0, numBytes);
        if (numBytes==0) { 
            return (count - charsRemaining); 
        } 
        charsRead = m_decoder.GetChars(m_charBytes, 0, numBytes, buffer, index);

        charsRemaining -= charsRead;
        index+=charsRead;
    }
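
The off-by-one disappears if the loop counts bytes instead of characters: for a fixed-width encoding the total byte length is known up front, so each read can request only the bytes not yet consumed. A sketch of a safer loop (method and variable names are mine, not the framework's, and this only covers 2-bytes-per-char encodings):

```csharp
using System;
using System.IO;
using System.Text;

// Sketch: track bytes consumed rather than recomputing from charsRemaining,
// so a byte held back inside the Decoder can never inflate the next request.
// Only valid for fixed-width (2 bytes/char) encodings like BigEndianUnicode.
static int ReadCharsByByteCount(Stream stream, Decoder decoder, char[] buffer, int count)
{
    int totalBytes = count * 2;            // known up front for UTF-16
    byte[] byteBuffer = new byte[totalBytes];
    int bytesConsumed = 0, index = 0;
    while (bytesConsumed < totalBytes)
    {
        int n = stream.Read(byteBuffer, 0, totalBytes - bytesConsumed);
        if (n == 0) break;                 // end of stream
        index += decoder.GetChars(byteBuffer, 0, n, buffer, index);
        bytesConsumed += n;
    }
    return index;                          // chars actually decoded
}
```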

I'm not entirely sure if this is a bug or just a misuse of the API. To work around the issue I'm calculating the bytes required myself, reading them, and then running the byte[] through the relevant Encoding.GetString(). However, this wouldn't work for a variable-width encoding like UTF-8.
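
For reference, the workaround looks roughly like this (the method name is mine; it assumes a fixed-width encoding and that Stream.Read may legally return fewer bytes than requested):

```csharp
using System.IO;
using System.Text;

// Sketch of the workaround: compute the byte length yourself, loop until
// exactly that many bytes have been read, then decode in one shot. Works
// only for fixed-width encodings such as BigEndianUnicode.
static string ReadStringExact(Stream stream, int charCount)
{
    int byteCount = charCount * 2; // 2 bytes per char for UTF-16
    byte[] bytes = new byte[byteCount];
    int offset = 0;
    while (offset < byteCount)
    {
        int n = stream.Read(bytes, offset, byteCount - offset);
        if (n == 0) throw new EndOfStreamException();
        offset += n;
    }
    return Encoding.BigEndianUnicode.GetString(bytes);
}
```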

I'd be interested to hear people's thoughts on this and whether I'm doing something wrong or not. And maybe it will save the next person a few hours/days of tedious debugging.

EDIT: posted to Connect: Connect tracking item

A: 

Interesting; you could report this on Connect. As a stop-gap, you could also try wrapping with a BufferedStream, but I expect this is papering over a crack (it may still happen, but less frequently).

The other approach, of course, is to pre-buffer an entire message (but not the entire stream) and then read from something like a MemoryStream, assuming your network protocol has logical (and ideally length-prefixed, and not too big) messages. Then, when it is decoding, all the data is available.
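
A sketch of that idea (method names are mine; it assumes a little-endian 4-byte length prefix on each message):

```csharp
using System;
using System.IO;

// Read one length-prefixed message fully into memory, then hand
// BinaryReader a MemoryStream in which every byte is already available,
// so partial network reads can no longer confuse the decoder.
static BinaryReader ReadMessage(Stream network)
{
    int length = BitConverter.ToInt32(ReadExactly(network, 4), 0);
    return new BinaryReader(new MemoryStream(ReadExactly(network, length)));
}

// Loop until exactly count bytes have arrived; a single Stream.Read is
// allowed to return fewer bytes than requested.
static byte[] ReadExactly(Stream stream, int count)
{
    byte[] buffer = new byte[count];
    int offset = 0;
    while (offset < count)
    {
        int n = stream.Read(buffer, offset, count - offset);
        if (n == 0) throw new EndOfStreamException();
        offset += n;
    }
    return buffer;
}
```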

Marc Gravell
+2  A: 

I have reproduced the problem you mentioned with BinaryReader.ReadChars.

Although the developer always needs to account for lookahead when composing things like streams and decoders, this seems like a fairly significant bug in BinaryReader because that class is intended for reading data structures composed of various types of data. In this case, I agree that ReadChars should have been more conservative in what it read to avoid losing that byte.

There is nothing wrong with your workaround of using the Decoder directly; after all, that is what ReadChars does behind the scenes.

Unicode is a simple case. If you think about an arbitrary encoding, there really is no general-purpose way to ensure that the correct number of bytes are consumed when you pass in a character count instead of a byte count (think about variable-length characters and cases involving malformed input). For this reason, avoiding BinaryReader.ReadChars in favor of reading a specific number of bytes provides a more robust, general solution.

I would suggest that you bring this to Microsoft's attention via http://connect.microsoft.com/visualstudio.

binarycoder
Thanks for confirming. Posted it to Connect; they are looking into it.
Mike Q
A: 

This reminds me of one of my own questions (http://stackoverflow.com/questions/1264952/reading-from-a-httpresponsestream-fails), where I had an issue that when reading from an HTTP response stream the StreamReader would think it had hit the end of the stream prematurely, so my parsers would bomb out unexpectedly.

Like Marc suggested for your problem, I first tried pre-buffering in a MemoryStream, which works well but means you may have to wait a long time if you have a large file to read (especially from the network/web) before you can do anything useful with it. I eventually settled on creating my own extension of TextReader which overrides the Read methods and defines them using the ReadBlock method (which does a blocking read, i.e. it waits until it can get exactly the number of characters you ask for).
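
Roughly, the wrapper looks like this (the class name is mine; the key point is routing the array-based Read through ReadBlock, which loops internally until it has the requested number of characters or hits end of stream):

```csharp
using System.IO;

// Sketch of a TextReader whose Read(char[], int, int) always blocks
// until the requested number of characters is available (or EOF).
class BlockingReader : TextReader
{
    private readonly TextReader inner;
    public BlockingReader(TextReader inner) { this.inner = inner; }

    public override int Read(char[] buffer, int index, int count)
    {
        // ReadBlock repeatedly calls Read until count chars arrive or EOF.
        return inner.ReadBlock(buffer, index, count);
    }

    public override int Peek() { return inner.Peek(); }
    public override int Read() { return inner.Read(); }
}
```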

Your problem, like mine, is probably due to the fact that Read methods aren't guaranteed to return the number of characters you ask for. For example, if you look at the documentation for the BinaryReader.Read (http://msdn.microsoft.com/en-us/library/ms143295.aspx) method, you'll see that it states:

Return Value
Type: System.Int32
The number of characters read into buffer. This might be less than the number of bytes requested if that many bytes are not available, or it might be zero if the end of the stream is reached.

Since BinaryReader has no ReadBlock method like a TextReader, all you can do is take your own approach of monitoring the position yourself, or Marc's approach of pre-caching.

RobV