
Hi

There is a question I have been wondering about for ages, and I was hoping someone could give me an answer to put my mind at rest.

Let's assume that I have an input stream (like a file/socket/pipe) and want to parse the incoming data. Let's assume that each block of incoming data is terminated by a newline, as in most common internet protocols. The application could just as well be parsing HTML, XML, or any other structured format; the point is that the data is split into logical blocks by a delimiter rather than having a fixed length. How can I buffer the data while waiting for the delimiter to appear?

The answer seems simple enough: just have a large enough byte/char array to fit the entire thing.

But what if the delimiter comes after the buffer is full? This is really a question about how to fit a dynamically sized block of data into a fixed-size one. I can only think of a few alternatives:

  1. Increase the buffer size when needed. This may require heavy memory reallocation, and it opens the door to resource exhaustion from specially crafted streams (or even denial of service in the case of sockets: if we protect ourselves by dropping connections that try to exhaust resources, an attacker can send fake, oversized packets just to trigger that protection).

  2. Start overwriting old data by using a circular buffer. Perhaps not an ideal method, since the logical block would arrive incomplete.

  3. Discard new data once the buffer is full. This way the delimiter will never be found, so this choice is obviously not a good option.

  4. Just make the fixed-size buffer damn large and assume all incoming logical data blocks are within its bounds...and if it ever fills up, just interpret the full buffer as a logical block...

In any case, it seems we must assume that the logical blocks will never exceed a certain size...

Any thoughts on this topic? Obviously there must be a way, since higher-level languages offer some sort of buffering mechanism in their readLine() stream methods.

Is there any "best way" to solve this, or is there always a tradeoff? I really appreciate all thoughts and ideas on this topic, since this question has haunted me every time I have needed to write a parser of some sort.

A: 

Options (2) and (3) are out, as you are losing data in both cases. Option (4), a huge fixed-size buffer, would not solve the problem either, because it is just not possible to know what size is large enough. Is it all the physical memory + swap space + the free space available on every disk in the known universe?

Resizing the buffer looks like the best solution. Say, realloc to twice the size and continue writing. There is always a chance of a specially constructed stream, like a DoS attack, that tries to bring down the system. My first thought was to set an arbitrarily large size as the max_size for the buffer; but if we could do that, we could just set that as the size of a large fixed buffer in the first place. So, resizing the buffer looks like the best option to me.
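
Something like this in C (an untested sketch: the byte-at-a-time reading, the doubling policy, the 64-byte initial size, the 1 MB cap, and the names buf_append/read_block are all my own choices for illustration, not anything prescribed above):

    #include <stdio.h>
    #include <stdlib.h>

    #define INITIAL_SIZE 64
    #define MAX_SIZE (1024 * 1024)   /* arbitrary cap to bound memory use */

    /* Append one byte to a growable buffer, doubling capacity as needed.
     * Returns 0 on success, -1 if the cap would be exceeded or
     * allocation fails. */
    static int buf_append(char **buf, size_t *len, size_t *cap, char c)
    {
        if (*len == *cap) {
            size_t new_cap = *cap ? *cap * 2 : INITIAL_SIZE;
            if (new_cap > MAX_SIZE)
                return -1;                 /* refuse to grow further */
            char *p = realloc(*buf, new_cap);
            if (p == NULL)
                return -1;
            *buf = p;
            *cap = new_cap;
        }
        (*buf)[(*len)++] = c;
        return 0;
    }

    /* Read one '\n'-delimited block into a freshly allocated buffer.
     * The caller frees *out.  Returns the block length, or -1 on
     * error or if the block exceeds MAX_SIZE. */
    static long read_block(FILE *fp, char **out)
    {
        char *buf = NULL;
        size_t len = 0, cap = 0;
        int c;

        while ((c = fgetc(fp)) != EOF && c != '\n') {
            if (buf_append(&buf, &len, &cap, (char)c) != 0) {
                free(buf);
                return -1;
            }
        }
        *out = buf;
        return (long)len;
    }

The MAX_SIZE check is what turns an exhaustion attack into a recoverable parse error rather than an out-of-memory condition.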

Arvind
A: 
  1. If neither the protocol nor you define an upper bound for the length of each block, then I don't see how you can prevent memory-exhaustion edge cases.

  2. Assuming that there is an upper bound, using a fixed-size block seems like a good approach for reasonably sized limits.

  3. If the limits are high enough that a single fixed buffer would be inefficient, then I would suggest a data structure that is implemented internally as a linked list of fixed-size buffers (see the sketch below).
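
One hypothetical shape for such a structure in C; the 4 KB chunk size and the chunk/chunks_append names are invented for illustration:

    #include <stdlib.h>
    #include <string.h>

    #define CHUNK_SIZE 4096   /* arbitrary per-node capacity */

    struct chunk {
        char data[CHUNK_SIZE];
        size_t used;              /* bytes filled in this node */
        struct chunk *next;
    };

    /* Append n bytes, allocating a new fixed-size node whenever the
     * tail fills.  No big realloc-and-copy ever happens: growth costs
     * one malloc per CHUNK_SIZE bytes, and old data never moves. */
    static int chunks_append(struct chunk **head, struct chunk **tail,
                             const char *src, size_t n)
    {
        while (n > 0) {
            if (*tail == NULL || (*tail)->used == CHUNK_SIZE) {
                struct chunk *c = calloc(1, sizeof *c);
                if (c == NULL)
                    return -1;
                if (*tail)
                    (*tail)->next = c;
                else
                    *head = c;
                *tail = c;
            }
            size_t room = CHUNK_SIZE - (*tail)->used;
            size_t take = n < room ? n : room;
            memcpy((*tail)->data + (*tail)->used, src, take);
            (*tail)->used += take;
            src += take;
            n -= take;
        }
        return 0;
    }

The tradeoff versus a single growable buffer is that the block is no longer contiguous, so the parser has to either walk the list or flatten it once the delimiter is found.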

Sijin
A: 

Why do you have to wait to start processing?

Generally, alternative 4 is sound. It doesn't, however, require an "assumption", but rather a definition. You simply declare that blocks are smaller than 8K and are done with it. It's not difficult to do.

Further, there's alternative 5: Start processing partial buffers. This works unless you have designed a truly pathological protocol that sends critical data at the very end of the block.

HTML, XML, JSON/YAML, etc., can all be parsed incrementally. You don't need a delimiter to do useful processing.
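
A toy illustration of the idea in C, where counting and measuring lines stands in for real per-record work; CHUNK, struct state, and feed() are invented names, and nothing beyond the current fixed-size read is ever buffered:

    #include <stdio.h>

    #define CHUNK 512   /* fixed read size; never grows */

    /* Parser state survives across reads, so each partial buffer can
     * be consumed immediately and then forgotten. */
    struct state {
        unsigned long line_len;   /* bytes seen in the current line */
        unsigned long lines;      /* completed lines so far */
    };

    static void feed(struct state *st, const char *buf, size_t n)
    {
        size_t i;
        for (i = 0; i < n; i++) {
            if (buf[i] == '\n') {
                st->lines++;          /* record complete: handle it */
                st->line_len = 0;
            } else {
                st->line_len++;       /* process the byte right away */
            }
        }
    }

    int main(void)
    {
        char buf[CHUNK];
        struct state st = {0, 0};
        size_t n;

        while ((n = fread(buf, 1, sizeof buf, stdin)) > 0)
            feed(&st, buf, n);        /* no waiting for a delimiter */

        printf("%lu complete lines\n", st.lines);
        return 0;
    }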

S.Lott
+1  A: 

There are normally two techniques for this:

1) What I think readline uses - if the buffer fills, return the data without the delimiter on the end.

2) When the buffer fills, remember that it filled, keep reading until you get the delimiter, and then report an error (or truncate the record at the buffer size).
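
A sketch of technique (2) in C; read_record, the truncated flag, and the byte-at-a-time loop are illustrative assumptions, not a known library interface:

    #include <stdio.h>

    /* Read one '\n'-delimited record into a fixed buffer of size cap.
     * Returns the number of bytes stored.  If the record did not fit,
     * *truncated is set and the excess bytes up to the delimiter are
     * read and discarded, so the stream stays in sync. */
    static size_t read_record(FILE *fp, char *buf, size_t cap, int *truncated)
    {
        size_t len = 0;
        int c;

        *truncated = 0;
        while ((c = fgetc(fp)) != EOF && c != '\n') {
            if (len < cap)
                buf[len++] = (char)c;   /* still room: keep the byte */
            else
                *truncated = 1;         /* full: skip until delimiter */
        }
        return len;
    }

Returning as soon as the buffer fills, and leaving the rest of the line in the stream for the next call, would give technique (1) instead.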