ansaurus

Question

What is a good method to handle line based network I/O streams?

Answer 1

A:

What you're explaining in your question reminds me very much of ASCIZ strings. That may be a helpful start.

I had to write something similar to this in college for a project I was working on. Admittedly, I had control over the sending socket, so I inserted a message-length field as part of the protocol. However, I think that a similar approach may benefit you.

My approach was to send something like 5HELLO: first I'd see the 5 and know the message length was 5, and therefore that the message I needed was 5 characters. However, if my async read returned only 5HE, I would see that the message length was 5 but that I had only been able to read 3 bytes off the wire (let's assume ASCII characters). Because of this, I knew I was missing some bytes, and stored what I had in a fragment buffer. I had one fragment buffer per socket, thereby avoiding any synchronization problems. For newline-delimited messages, the rough process is:

1. Read from the socket into a byte array and record how many bytes were read.
2. Scan through byte by byte until you find a newline character. (This becomes very complex if you're receiving not ASCII but characters that can span multiple bytes; you're on your own for that.)
3. Turn your fragment buffer into a string and append your read buffer up to the newline to it. Drop this string as a completed message onto a queue, or hand it to its own delegate to be processed. (You can optimize these buffers by having your socket read write into the same byte array as your fragment, but that's harder to explain.)
4. Continue looping through: every time you find a newline, create a string from the byte range between the recorded start and end positions and drop it on the queue / delegate for processing.
5. Once you hit the end of the read buffer, copy anything that's left into the fragment buffer.
6. Call BeginRead on the socket, which will jump back to step 1 when data is available on the socket.
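The steps above can be sketched roughly as follows. This is not the original implementation (which isn't shown here), only a minimal illustration under stated assumptions: ASCII data, '\n' as the line terminator, and a hypothetical class name LineSplitter with a hypothetical Feed method. It favours clarity over speed by reallocating the fragment array on each call.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper (not from the original answer): splits incoming bytes
// into '\n'-terminated lines, keeping any trailing partial line as a fragment.
public class LineSplitter
{
    private byte[] _fragment = new byte[0]; // bytes of the current incomplete line

    // Feed the first 'count' bytes of 'buffer' (the value returned by EndRead).
    // Returns every line completed by this read, in arrival order.
    public List<string> Feed(byte[] buffer, int count)
    {
        var lines = new List<string>();
        int start = 0; // index where the line currently being scanned begins

        for (int i = 0; i < count; i++)
        {
            if (buffer[i] != (byte)'\n')
                continue;

            // Join the stored fragment with the bytes up to (not including) '\n'.
            var lineBytes = new byte[_fragment.Length + (i - start)];
            Buffer.BlockCopy(_fragment, 0, lineBytes, 0, _fragment.Length);
            Buffer.BlockCopy(buffer, start, lineBytes, _fragment.Length, i - start);
            _fragment = new byte[0];

            // Assumes ASCII; strips any trailing '\r' left over from "\r\n".
            lines.Add(Encoding.ASCII.GetString(lineBytes).TrimEnd('\r'));
            start = i + 1;
        }

        // Append the leftover bytes to the fragment without overwriting
        // anything already stored from earlier reads.
        var rest = new byte[_fragment.Length + (count - start)];
        Buffer.BlockCopy(_fragment, 0, rest, 0, _fragment.Length);
        Buffer.BlockCopy(buffer, start, rest, _fragment.Length, count - start);
        _fragment = rest;

        return lines;
    }
}
```

One LineSplitter would be kept per socket, and Feed would be called from the read callback with the read buffer and the byte count before the next BeginRead is issued.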

Then you use another thread to read your queue of incoming messages, or just let the thread pool handle it using delegates, and do whatever data processing you have to do. Someone will correct me if I'm wrong, but there are very few thread synchronization issues with this, since you can only be reading or waiting to read from the socket at any one time, so there is no need to worry about locks (except if you're populating a queue; I used delegates in my implementation). There are a few details you will need to work out on your own, like how big a fragment buffer to allocate, and the fact that if a read contains no newlines at all, the entire read must be appended to the fragment buffer without overwriting anything. I think it ran me about 700 - 800 lines of code in the end, but that included the connection setup, negotiation for encryption, and a few other things.

This setup performed very well for me; I was able to reach up to 80 Mbps on a 100 Mbps Ethernet LAN using this implementation on a 1.8 GHz Opteron, including encryption processing. And since the work is tied to the socket, the server will scale, since multiple sockets can be worked on at the same time. If you need items processed in order, you'll need to use a queue, but if order doesn't matter, then delegates will give you very scalable performance out of the thread pool.
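As an illustration of the ordered-queue option, here is a minimal sketch using BlockingCollection<T> from System.Collections.Concurrent (which postdates this answer, so it is not what was used at the time). The class name OrderedMessagePump and its members are hypothetical. A single consumer thread drains the queue, so messages are handled in the order they were enqueued.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical sketch: one consumer thread drains a FIFO queue, so messages
// are processed in arrival order.
public class OrderedMessagePump
{
    private readonly BlockingCollection<string> _queue = new BlockingCollection<string>();
    private readonly Thread _worker;

    public OrderedMessagePump(Action<string> handler)
    {
        _worker = new Thread(() =>
        {
            // Blocks while the queue is empty; ends after CompleteAdding()
            // has been called and the queue has been drained.
            foreach (string message in _queue.GetConsumingEnumerable())
                handler(message);
        });
        _worker.IsBackground = true;
        _worker.Start();
    }

    // Called from the socket read callback for each completed line.
    public void Enqueue(string message)
    {
        _queue.Add(message);
    }

    // Stop accepting messages and wait until the queue has been drained.
    public void Shutdown()
    {
        _queue.CompleteAdding();
        _worker.Join();
    }
}
```

When order does not matter, the same handler could instead be dispatched per message with ThreadPool.QueueUserWorkItem, at the cost of losing the ordering guarantee.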

Hope this helps; it's not meant to be a complete solution, but a direction in which to start looking.

*Just a note: my implementation was done purely at the byte level and supported encryption; I used characters for my example to make it easier to visualize.

Kevin Nisbet 2009-02-08 01:13:51

Yes, I've implemented an approach similar to this already, but I don't like it. It's too messy and complex for my tastes; that's why I'm asking for suggestions here. I like Noldorin's approach: it has the elegance and reuse of existing framework code I desire.

Mystere Man 2009-02-08 02:29:26

Answer 2

+3 A:

That's quite an interesting question. The solution for me in the past has been to use a separate thread with synchronous operations, as you propose. (I managed to get around most of the problems with blocking sockets using locks and lots of exception handlers.) Still, using the in-built asynchronous operations is typically advisable as it allows for true OS-level async I/O, so I understand your point.

Well I've gone and written a class for accomplishing what I believe you need (in a relatively clean manner I would say). Let me know what you think.

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

public class AsyncStreamProcessor : IDisposable
{
    protected StringBuilder _buffer;  // Buffer for unprocessed data.

    private bool _isDisposed = false; // True if object has been disposed

    public AsyncStreamProcessor()
    {
        _buffer = null;
    }

    public IEnumerable<string> Process(byte[] newData)
    {
        // Note: replace the following encoding method with whatever you are reading.
        // The trick here is to add an extra line break to the new data so that the algorithm recognises
        // a single line break at the end of the new data.
        using(var newDataReader = new StringReader(Encoding.ASCII.GetString(newData) + Environment.NewLine))
        {
            // Read all lines from new data, returning all but the last.
            // The last line is guaranteed to be incomplete (or possibly complete except for the line break,
            // which will be processed with the next packet of data).
            string line, prevLine = null;
            while ((line = newDataReader.ReadLine()) != null)
            {
                if (prevLine != null)
                {
                    yield return (_buffer == null ? string.Empty : _buffer.ToString()) + prevLine;
                    _buffer = null;
                }
                prevLine = line;
            }

            // Store last incomplete line in buffer.
            if (_buffer == null)
                // Note: the (* 2) is a guess at the final length of the incomplete line,
                // so that the buffer does not have to be expanded in most/all situations.
                // Change it to whatever seems appropriate.
                _buffer = new StringBuilder(prevLine, prevLine.Length * 2);
            else
                _buffer.Append(prevLine);
        }
    }

    public void Dispose()
    {
        Dispose(true);
        GC.SuppressFinalize(this);
    }

    private void Dispose(bool disposing)
    {
        if (!_isDisposed)
        {
            if (disposing)
            {
                // Dispose managed resources.
                _buffer = null;
                GC.Collect();
            }

            // Dispose native resources.

            // Remember that object has been disposed.
            _isDisposed = true;
        }
    }
}

An instance of this class should be created for each NetworkStream and the Process function should be called whenever new data is received (in the callback method for BeginRead, before you call the next BeginRead I would imagine).

Note: I have only verified this code with test data, not actual data transmitted over the network. However, I wouldn't anticipate any differences...

Also, a warning that the class is of course not thread-safe, but as long as BeginRead isn't executed again until after the current data has been processed (as I presume you are doing), there shouldn't be any problems.

Hope this works for you. Let me know if there are remaining issues and I will try to modify the solution to deal with them. (There could well be some subtlety of the question I missed, despite reading it carefully!)

Noldorin 2009-02-08 01:33:22

This is an interesting solution. I too have found Iterators to be useful, but this was not a solution my mind would have come up with. I like it.

Mystere Man 2009-02-08 02:24:36

Can you explain why you need to implement IDisposable? I've been told that calling GC.Collect() is bad practice and can result in poor performance. Are you concerned about rapid allocations within a short time exhausting the heap?

Mystere Man 2009-02-08 02:34:37

Yeah, iterators are handy things. In this case you could just as well do it with a generic List, though it may not look so nice of course. If you want to deal with the result as a List/Array, it's trivial to convert to those types anyway, and the implementation is still simpler.

Noldorin 2009-02-08 02:55:57

Regarding the use of IDisposable, it is possibly not *necessary*. Setting the buffer to null and then calling GC.Collect is meant to ensure that the memory for the buffer is freed up immediately. Depending on how long lines can get, this may or may not be much of an issue.

Noldorin 2009-02-08 02:58:12

(contd) The Dispose method (and thus GC.Collect) should only be called when the associated connection/NetworkStream is closed, which shouldn't be too often, so I wouldn't worry about performance. (There seems to be an alternative however: http://dotnettipoftheday.org/tips/dispose_stringbuilder.aspx)

Noldorin 2009-02-08 03:03:01

Also, your comments indicate that the last line is guaranteed to be incomplete. This is not true, since it is possible for it to be complete (including final newline). For instance, after a batch of data is sent, the last line will be complete. It's also possible it just might accidentally be.

Mystere Man 2009-02-08 03:38:40

The code *seems* to work even if the last line is newline terminated, but maybe I'm missing something. Is there a potential problem if the last line is newline terminated?

Mystere Man 2009-02-08 03:39:54

That's the "trick". Because a new line is appended to all new data before processing, it would treat new data ending in a line break by reading the last complete line, then a zero-length (string.Empty) incomplete line, which is stored into the buffer, all finally followed by the null line.

Noldorin 2009-02-08 11:57:05

In fact, to truly see what's going on, I recommend you create some test data (one set ending with a line break, one ending with an incomplete line), and step through the code of the method in the two cases. Anyway, this approach has got me thinking - I'll have to test it against my thread-with-sync-reads at some point.

Noldorin 2009-02-08 12:12:22

Thanks for such a great solution. Answer is yours.

Mystere Man 2009-02-08 21:32:14

You're welcome. Glad it did the job for you.

Noldorin 2009-02-08 22:09:21

I am interested in using this technique, but am not familiar with iterators/IEnumerable. Can you please provide an example usage? Thanks.

strager 2009-02-15 21:52:21

I point you to the MSDN docs on C# iterators: http://msdn.microsoft.com/en-us/library/65zzykke(VS.80).aspx. Hopefully that should make some sense.

Noldorin 2009-02-15 23:50:51

@Noldorin, I read up on iterators, and understand how they work now. How would I connect my Stream instance to your AsyncStreamProcessor class, though?

strager 2009-02-16 00:03:00

You call BeginRead/EndRead in a loop, calling the Process method on each callback. Read up on using asynchronous methods of TcpClient - it's not too complicated.

Noldorin 2009-02-16 12:32:25
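The BeginRead/EndRead loop described in the comment above could look roughly like this. It is only a sketch: StreamLinePump and its members are hypothetical names, the 4096-byte buffer size is arbitrary, and error handling (EndRead can throw when the connection drops) is left out. The line processor is passed in as a delegate, so a method such as AsyncStreamProcessor.Process can be plugged in.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical glue code: pumps a Stream through a line processor
// (for example AsyncStreamProcessor.Process) using BeginRead/EndRead.
public class StreamLinePump
{
    private readonly Stream _stream;
    private readonly Func<byte[], IEnumerable<string>> _process;
    private readonly Action<string> _onLine;
    private readonly byte[] _readBuffer = new byte[4096]; // arbitrary size

    public StreamLinePump(Stream stream,
                          Func<byte[], IEnumerable<string>> process,
                          Action<string> onLine)
    {
        _stream = stream;
        _process = process;
        _onLine = onLine;
    }

    // Issue one asynchronous read; OnRead runs when data is available.
    public void Start()
    {
        _stream.BeginRead(_readBuffer, 0, _readBuffer.Length, OnRead, null);
    }

    private void OnRead(IAsyncResult result)
    {
        int count = _stream.EndRead(result);
        if (count == 0)
            return; // end of stream: the remote side closed the connection

        // Pass on only the bytes actually read, not the whole buffer.
        var chunk = new byte[count];
        Buffer.BlockCopy(_readBuffer, 0, chunk, 0, count);

        // An iterator method is lazy: nothing runs until it is enumerated.
        foreach (string line in _process(chunk))
            _onLine(line);

        // Issue the next read only now, so the processor is never re-entered.
        Start();
    }
}
```

With the class from the answer, this would be wired up along the lines of new StreamLinePump(client.GetStream(), processor.Process, Console.WriteLine).Start(), where client is a connected TcpClient and processor is an AsyncStreamProcessor instance created for that stream.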

Ah, thanks Noldorin. Works. =]

strager 2009-02-16 23:16:12

Noldorin, nice idea and good work on the class but you should really remove the IDisposable (unnecessary since it only manages a StringBuilder) and otherwise definitely remove the GC.Collect.

Henk Holterman 2009-03-01 17:50:45

Yeah, so you may have a point there. I just thought it would be advisable to deallocate such a potentially large chunk of memory. Perhaps it would be better to use the solution in http://dotnettipoftheday.org/tips/dispose_stringbuilder.aspx or simply not bother even. Thanks for the comment.

Noldorin 2009-03-01 18:14:22

Yes, instead of _buffer = null; you could set _buffer.Length = 0; this would recycle your (1) StringBuilder and be more efficient in all cases.

Henk Holterman 2009-03-01 21:31:48
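The suggestion above (one recycled StringBuilder, no IDisposable, no GC.Collect) might look roughly like this. It is a sketch of a variant, not the code from the answer: the class name RecyclingLineProcessor is hypothetical, and ASCII is assumed as in the original.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical variant of AsyncStreamProcessor: a single StringBuilder is
// reused for the lifetime of the connection instead of being reallocated.
public class RecyclingLineProcessor
{
    private readonly StringBuilder _buffer = new StringBuilder();

    public IEnumerable<string> Process(byte[] newData)
    {
        // Same trick as the original: the appended line break makes the reader
        // report the final (possibly empty) partial line as a separate line.
        string text = Encoding.ASCII.GetString(newData) + Environment.NewLine;
        using (var reader = new StringReader(text))
        {
            string line, prevLine = null;
            while ((line = reader.ReadLine()) != null)
            {
                if (prevLine != null)
                {
                    // prevLine was followed by a line break, so it completes
                    // whatever was buffered from earlier calls.
                    _buffer.Append(prevLine);
                    yield return _buffer.ToString();
                    _buffer.Length = 0; // recycle instead of reallocating
                }
                prevLine = line;
            }

            // Keep the trailing partial line for the next call.
            _buffer.Append(prevLine);
        }
    }
}
```

Feeding it the chunks "HEL", "LO\nWOR" and "LD\n" in turn (each fully enumerated) yields nothing, then "HELLO", then "WORLD", which also covers the case of data that ends exactly on a line break.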
