tags:

views:

832

answers:

5

I am using a Socket to receive data via TCP, and TextReader.ReadLine to read lines from the connection. There is a problem where a full line has not been received -- TextReader.ReadLine returns an incomplete string. I want it to return null, indicating that a full line could not be read. How can I do this?

Basically, I have this data incoming:

"hello\nworld\nthis is a test\n"

When I run ReadLine I get these in return:

"hello"
"world"
"this is a te"
<null>
<socket gets more data>
"st"
<null>

I do not want "this is a te" returned. Rather, I want "this is a test" to wait until the entire line has been received.

Code:

var endPoint = ...;
var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.IP);
socket.Connect(endPoint);

var stream = new NetworkStream(socket, true);

var messageBuffer = new StringBuilder();

// Data received async callback (called several times).
int bytesRead = stream.EndRead(result);
string data = Encoding.UTF8.GetString(readBuffer.Take(bytesRead).ToArray());
messageBuffer.Append(data);

using(var reader = new StringReader(messageBuffer.ToString()))
{
    // This loop does not know that Message.Read reads lines.  For all it knows, it could read bytes or words or the whole stream.

    while((Message msg = Message.Read(reader)) != null)  // See below.
    {
        Console.WriteLine(msg.ToString());    // See example input/echo above.
    }

    messageBuffer = new StringBuilder(reader.ReadToEnd());
}

// Method of Message.
public static Message Read(TextReader reader)
{
    string line = reader.ReadLine();

    if(line == null)
        return null;

    return Message.FromRawString(line);
}

Thanks.

+3  A: 

It sounds like the data is being sent with some extra delimiters. Assuming you're using a StreamReader over a network stream, it should behave exactly as you expect. I suggest you use Wireshark to look at the exact data your socket is receiving.

I also doubt that it's returning null and then another line - are you sure you don't mean it returns an empty string and then another line?

EDIT: Now you've posted the code, the reason is a lot clearer - you're decoding just a single buffer at a time. That really won't work, and could break in much more serious ways. The buffer might not even break at a character boundary.

To be honest, it'll be a lot easier to read synchronously and use a StreamReader. Doing it asynchronously, you should use a System.Text.Decoder which can store any previous state (from the end of the previous buffer) if it needs to. You'll also have to store however much of the previous line was read - and I suspect you won't be able to use TextReader at all, or at least you'll have to have special handling for the case where the final character is '\r' or '\n'. Bear in mind that one buffer could end with '\r' and the next buffer start with '\n', representing a single line break between them. See how difficult it can get?

Do you definitely, definitely need to handle this asynchronously?

EDIT: It sounds like you could do with something which you can basically dump data into, and attach a "LineCompleted" event handler. You could make attach the event handler to start with and then just keep dumping data into it until there's no more data (at which point you'd need to tell it that the data has finished). If that sounds appropriate, I might try to work on such a class for MiscUtil - but I'd be unlikely to finish it within the next week (I'm really busy at the moment).

Jon Skeet
I don't definitely, definitely need asynchronous access, no. But it definitely simplifies my code (at least for the structure I have at the moment). I am fine with \r and \n and \r\n and \n\r all being treated as a newline (in fact I desire that).
strager
I will look into using System.Text.Decoder.
strager
@strager: The point about treating "\r\n" as a single line break is that it becomes trickier if the \r was at the end of the previous buffer and the \n is at the start of the next one. There's a lot of state to worry about. You might like to look at my ReverseLineReader which had to deal (cont)
Jon Skeet
with similar issues - in fact, a lot of the code may well translate relatively easily. It's still going to be a pain compared with just using a StreamReader synchronously though. Could you read the whole lot asynchronously and *then* process it? That could work for short connections.
Jon Skeet
The code for ReverseLineReader is at http://stackoverflow.com/questions/452902/how-to-read-a-text-file-reversely-with-iterator-in-c by the way.
Jon Skeet
@Jon Skeet, Empty messages are ignored, so that's not much of an issue. I don't understand what you mean by the comment "Could you read the whole lot asynchronously and *then* process it?" -- could you elaborate more? I will look into your ReverseLineReader class regardless.
strager
I was wondering if you could read asynchronously to the end of the stream, buffering all the data in a MemoryStream, and then put a StreamReader on top and read all the strings synchronously at that point. That would be easy, but will only work well if there isn't much data.
Jon Skeet
@Jon Skeet, Network streams are unreliable and I may have an incomplete message for some time. Unless I'm not understanding you fully, the method you suggest with MemoryStream would not work.
strager
@Jon Skeet, On your LineCompleted event handler and such, read my updated code. I do not know if Message.Read reads a line or a byte or a word. Is it possible to peek to the end of the TextReader, looking for \r or \n?
strager
Message.Read appears to read a line, according to your code. You can't really peek ahead in a TextReader. I don't think TextReader is really going to work for you here - you'll need to use Encoding/Encoder/Decoder directly, I suspect.
Jon Skeet
@Jon Skeet, I have looked at Decoder and do not see how it could solve my problem. To note, I am fine with changing Message.Read to accept some other arguments, but don't know how I would do that cleanly so the position in the buffer is moved ahead to the next command. I'll hack at it more. Thanks!
strager
@strager: Decoder isn't a solution to your problem - it's a very small *part* of a solution. It stores state between calls so that you don't have to worry as much about the "split character" issue.
Jon Skeet
Downvoters: explanations for downvotes are always welcome...
Jon Skeet
A: 

Have a buffer (starts empty), and each time you read

  • if there is an \n in the buffer, remove everything everything upto and including it and return it
  • read what you can, and append what you read to the buffer
  • if the read fails due to eof, return and clear the contents unless the buffer is empty, in which case propogate the eof.
  • if there is an \n in what you read, try again from the top, else return null

Note that this will do what you want but, with any such scheme, you now have to worry about what to do with lines that are too long for your buffer.

-- MarkusQ

MarkusQ
I already use a buffered approach (see example code). You're saying I should check for my end-of-line markers manually before calling ReadLine?
strager
A: 

See my answer to a previous very similar question. It relates to asynchronous socket I/O and reading lines in a stream-like fashion. Hope that helps.

Noldorin
Aha. I saw your code before, and stole most of the ideas. I didn't, however, steal the essential part, which is why I am having my problem. Will implement and report. =]
strager
Okay, there's a problem. The actual ReadLine is done in some other function. My read loop doesn't know about the ReadLine. I will update my post to reflect this issue.
strager
That answer uses an ASCII encoding, which makes life a lot easier but won't cope with any non-ASCII text (of course). It's a bit harder with UTF-8, where the "read" might finish half way through a character.
Jon Skeet
@Jon Skeet, I need UTF-8 support, so that's another issue with this answer.
strager
@strager: Ah sorry, I forgot you were the guy who commented on my post...
Noldorin
Yeah, Jon Skeet makes a good point too. You could possibly work around that by using the Decoder class as he points out (or doing some ugly hack by reflecting on StreamReader). Either way, it won't be simple I'm afraid...
Noldorin
A: 

Several issues can be seen here:

  1. A single Unicode code point could be split across packets, so you need to keep you own instance of Utf8Encoding around. Alternatively buffer up the complete message as byte[] and convert in one go when you know it is complete.

  2. You need a way of determining when you have received a complete message. You need to keep reading until it is complete (and handle the case where you start receiving the next packet in the same Read call.

Richard
Number two is a problem because only Message.Read should know when a message is complete (and return null if it isn't). How would I do this? I already have my data buffered, by the way.
strager
The division into messages has to be part of the buffering. With networking the physical reality of networks cannot be completely abstracted away.
Richard
A: 

I decided to write my own ReadLine parser-ish kinda thing. Here's the code:

// Async callback.
Message message;

while((message = Message.ReadBytes(messageBuffer)) != null)
{
    OnMessageReceived(new MessageEventArgs(message));
}

// Message class.
public static Message ReadBytes(List<byte> data)
{
    int end = data.FindIndex(b => b == '\n' || b == '\r');

    if(end == -1)
     return null;

    string line = Encoding.UTF8.GetString(data.Take(end).ToArray());

    data.RemoveRange(0, end + 1);

    if(line == "")
     return ReadBytes(data);

    if(line == null)
     return null;

    return Message.FromRawString(line);
}

Many thanks to @Jon Skeet, @Noldorin, and @Richard for their very helpful suggestions. Your combined efforts led me to my final solution. =]

strager
Why are you passing "data" by reference? You're never assigning to it within ReadBytes. If you're going to use this approach, I'm sure there are more efficient ways of doing it.
Jon Skeet
@Jon Skeet, Because I am modifying its data. Maybe I don't need this? I come from a C++ background, where references would be used. (I'm using data.RemoveRange to modify the data.)
strager
@Jon Skeet, Ah, apparently I don't need this. (I feel unsafe now that random methods can modify my reference types!) Thanks for mentioning so.
strager