Hello,

I asked the other day whether data integrity (of flushed data) is preserved when more than one pipe streams into localhost's STDIN. The answer is NO if the flushed data is large. http://stackoverflow.com/questions/3445047/data-integrity-question-when-collecting-stdouts-from-multiple-remote-hosts-over-s

But I would like to guarantee that every line flushed on each end reaches the single STDIN intact and is never mixed with data from the other pipes. Is there any way to do that, and if so, how?

(Note that this can be done if I create multiple STDINs locally, one per remote stream. But it is more convenient to process the line streams through a single STDIN, so my question focuses on the case where there is only one STDIN at localhost with multiple (STDOUT) pipes feeding into it.)
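For reference, here is a minimal sketch of the multiple-local-pipes alternative, assuming Python, ssh, and hypothetical host names and commands (none of which come from the original posts). Each remote stream gets its own pipe, so lines from different hosts can never be interleaved mid-line:

    import subprocess
    import threading

    HOSTS = ["host1", "host2"]            # hypothetical remote hosts
    REMOTE_CMD = "tail -f /var/log/app"   # hypothetical remote command

    def handle(host, line):
        # Whatever per-line processing is wanted; here we just tag and print.
        print("[%s] %s" % (host, line))

    def read_lines(host):
        # One pipe per host: within a single pipe, line boundaries are safe.
        proc = subprocess.Popen(["ssh", host, REMOTE_CMD],
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        for line in proc.stdout:
            handle(host, line.rstrip("\n"))

    threads = [threading.Thread(target=read_lines, args=(h,)) for h in HOSTS]
    for t in threads:
        t.start()
    for t in threads:
        t.join()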

+1  A: 

This can be done via a congestion-backoff system like that used in Ethernet.

First, assign each pipe a unique delimiter. This delimiter cannot appear unescaped in the contents of any pipe. Now, use the following pseudocode:

  • Check for the other processes' delimiters; while any other process has an odd number of its delimiter on the stream (i.e., an unmatched delimiter), wait.
  • Write your delimiter character.
  • Check whether another process has also written an unmatched delimiter. If so, back off for a random (and increasing) amount of time and return to the first step.
  • Write your data.
  • Write your delimiter character again.

This will ensure that, although some junk will appear in the stream, every whole message eventually gets through: the reader keeps only what lies between matched pairs of each writer's delimiter and discards the rest.
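For illustration, here is a minimal single-process simulation of the scheme in Python (a sketch, not part of the original answer): two writer threads share an in-memory "bus" list standing in for the single STDIN, each with its own delimiter byte. Collisions that slip through the check-then-write window show up as the junk mentioned above; on a collision a writer closes its own frame before backing off so that the others are not left waiting on it.

    import random
    import threading
    import time

    bus = []                     # shared stream, stand-in for the single STDIN
    bus_lock = threading.Lock()  # guards individual appends/reads of the list

    def unmatched(delim):
        # True while `delim` appears an odd number of times on the bus,
        # i.e. that writer has opened a frame it has not yet closed.
        with bus_lock:
            return bus.count(delim) % 2 == 1

    def send(delim, message, other_delims):
        attempt = 0
        while True:
            # Step 1: wait while any other writer has an unmatched delimiter.
            while any(unmatched(d) for d in other_delims):
                time.sleep(0.001)
            # Step 2: write our opening delimiter.
            with bus_lock:
                bus.append(delim)
            # Step 3: collision check. If another writer also opened a frame,
            # close our frame (so nobody waits on us forever), back off a
            # random, increasing amount, and return to step 1.
            if any(unmatched(d) for d in other_delims):
                with bus_lock:
                    bus.append(delim)
                attempt += 1
                time.sleep(random.uniform(0, 0.001 * (2 ** attempt)))
                continue
            # Steps 4-5: write the data (not necessarily atomically), then
            # the closing delimiter.
            for ch in message:
                with bus_lock:
                    bus.append(ch)
            with bus_lock:
                bus.append(delim)
            return

    delims = {"writer-A": "\x01", "writer-B": "\x02"}
    threads = []
    for name, d in delims.items():
        others = [x for x in delims.values() if x != d]
        t = threading.Thread(target=send,
                             args=(d, "a whole line from %s\n" % name, others))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

    # Reader side: for each writer, the text between matched pairs of its
    # delimiter is that writer's (possibly empty) frames; anything outside
    # the pairs is junk left over from collisions.
    stream = "".join(bus)
    for name, d in delims.items():
        frames = stream.split(d)[1::2]
        print(name, "->", repr("".join(frames)))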

Borealid
Thanks. I'm not familiar with the algorithm, but wouldn't it become much slower to process? Such an algorithm would have to check for the presence of delimiters byte by byte and wait if necessary. It also loses the convenience of processing the streams line by line, because it has to keep track of previous input. (By the way, is the algorithm you described the TCP congestion avoidance algorithm (http://en.wikipedia.org/wiki/TCP_congestion_avoidance_algorithm)?)
OTZ
@otz: Yes, it can be slower, but if you have a hub you have to choose either a token-passing method or a backoff algorithm. To answer your other question: no, it's not the TCP algorithm, but rather closer to (as I mentioned) Ethernet's exponential backoff algorithm. See http://www.drusepth.net/content/ethernets-binary-exponential-backoff-algorithm .
Borealid
Another note, about token-passing: one way to do this is to create a deterministic ordering of the processes, and have each one only start after the previous one has finished.
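For concreteness, a minimal sketch of this strict-ordering approach, assuming Python and hypothetical ssh commands (none of these names come from the thread); because each producer runs only after the previous one has exited, their output can never interleave:

    import subprocess

    # Hypothetical producers; replace with the real remote commands.
    commands = [["ssh", "host1", "cat", "/var/log/app.log"],
                ["ssh", "host2", "cat", "/var/log/app.log"]]

    with open("merged.txt", "w") as out:
        for cmd in commands:
            # subprocess.call blocks until the command exits; that is the
            # "token hand-off" -- the next producer starts only afterwards.
            subprocess.call(cmd, stdout=out)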
Borealid
@Borealid Hmm. Very useful to know about this algorithm, it seems. For this particular case, though, I wonder if there is another approach that is more straightforward to implement. Perhaps the minimum buffer size in the network, if known, could be used to guarantee line-content integrity.
OTZ
@otz: I don't think there is a minimum write size on a pipe... The straightforward-to-implement approach is what I said above about ordering the processes' output.
Borealid
@Borealid ... which pretty much suggests your answer is THE answer here :) As I noted at the end of the question, piping into separate stdins handled by local threads (process1.stdin, process2.stdin, ...) is another option. I haven't decided which way to go, but the Ethernet exponential backoff algorithm seems worth trying at least once.
OTZ