You can split a pipe using the tee command under Linux as follows:

printf "line1\nline2\nline3\n" | tee >(wc -l ) | (awk '{print "this is awk: "$0}')

which yields the output

this is awk: line1
this is awk: line2
this is awk: line3
this is awk: 3

My question is: is that order of printing guaranteed? Will the tee branch that counts the number of lines always print at the end? Is there a way to always print it at the start? Or is the order of printing with tee never guaranteed?

+1  A: 

I suspect that in this case, wc is waiting for EOF, and so it will not return (or print output) until the first command is done sending input, whereas awk acts line by line and so will always print first. I don't know if it's defined when sending to other processes.

Why not just have awk count the lines before printing the lines themselves?
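
A minimal sketch of that idea, assuming the input is small enough for awk to hold in memory: awk stores each line in an array and prints the count ahead of the lines in its END block. The function name count_first_awk and the array name lines are made up for illustration.

```shell
# Hypothetical helper: print the line count first, then the lines themselves.
count_first_awk() {
    awk '{ lines[NR] = $0 }                  # buffer every input line
         END {
             print "this is awk: " NR        # NR holds the total line count here
             for (i = 1; i <= NR; i++)
                 print "this is awk: " lines[i]
         }'
}

printf 'line1\nline2\nline3\n' | count_first_awk
```

Because a single process produces all of the output, the order no longer depends on scheduling; the price is that nothing is printed until the input ends.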

Daenyth
+2  A: 

The order is not defined by tee, but as Daenyth says, wc won't be finished until tee has finished passing it data, so usually tee will have passed the data on to awk by then too. In this instance it might be better to have awk do the counting.

echo -ne {one,two,three,four}\\n | \
awk '{print "awk processing line " NR ": "$0} END {print "Awk saw " NR " lines"}'

The downside is that awk won't know the number until it finishes; printing the count first would require buffering the data. In your example, both tee and wc have stdout connected to the same pipe (stdin for awk), but the order in which their output arrives is undefined. cat (and most other piping tools) can be used to assemble files in a known order.

There are more advanced piping techniques that could be used, such as bash coprocesses (coproc) or named pipes (mkfifo or mknod p). The latter gets you names in the filesystem, which can be passed to other processes, but you'll have to clean them up and avoid collisions; tempfile or $$ may be useful for that. Pipes are not for buffering data, as they often have limited size and will simply block writes once they are full.
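
As a rough sketch of the named pipe approach for small inputs: the FIFOs live in a private directory from mktemp -d so the names cannot collide, and the directory is removed afterwards. The function name count_first_fifo is hypothetical, and the whole thing assumes the input fits in the pipe buffer between tee and cat.

```shell
# Hypothetical helper: emit the wc count ahead of the data using two FIFOs.
# Assumption: the whole input fits in the pipe buffer between tee and cat;
# larger inputs will block.
count_first_fifo() (
    dir=$(mktemp -d) || exit 1               # private directory avoids name collisions
    mkfifo "$dir/wcin" "$dir/wcout"
    wc -l < "$dir/wcin" > "$dir/wcout" &     # the count appears only after EOF on wcin
    tee "$dir/wcin" | cat "$dir/wcout" -     # cat drains wcout first, then the data
    wait                                     # reap the background wc
    rm -rf "$dir"                            # remove the FIFOs again
)

printf 'line1\nline2\nline3\n' | count_first_fifo
```

Here cat reads wcout to the end before touching its stdin, so the count comes out first, followed by the three lines. Some wc implementations pad the number with leading spaces.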

An example of where pipes are the wrong solution:

mkfifo wcin wcout
wc -l < wcin > wcout &
yes | dd count=1 bs=8M | tee wcin | cat -n wcout - | head

The problem here is that tee will get stuck trying to write things to cat, which wants to finish with wcout first. There's simply too much data for the pipe from tee to cat.
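
If the data may be larger than a pipe can hold, one way around the blocking is to buffer it in a regular file instead. The following is only a sketch under that assumption; count_first_tmpfile is a made-up name, and mktemp stands in for tempfile.

```shell
# Hypothetical helper: buffer the input in a regular file, which has no pipe-size limit.
count_first_tmpfile() (
    tmp=$(mktemp) || exit 1
    cat > "$tmp"          # store all input before printing anything
    wc -l < "$tmp"        # the count goes out first
    cat "$tmp"            # then the data, in its original order
    rm -f "$tmp"          # clean up the temporary file
)

seq 100000 | count_first_tmpfile | head -n 3
```

Nothing runs concurrently here, so the output order is fixed by the order of the commands rather than by scheduling or buffer sizes.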

Edit regarding dmckee's answer: Yes, the order may be repeatable, but it is not guaranteed. It is a matter of scale, scheduling and buffer sizes. On this GNU/Linux box, the example starts breaking up after a few thousand lines:

seq -f line%g 20000 | tee >(awk '{print "*" $0 "*"}' ) | \
(awk '{print "this is awk: "$0}') | less
this is awk: line2397
this is awk: line2398
this is awk: line2*line1*
this is awk: *line2*
this is awk: *line3*
Yann Vernier
lol I'd ask you for a tutorial on making named pipes cause you seem to know your stuff but that may be asking too much :)
ldog
Yann Vernier
thanks so much, even though as you point out the named pipe solution may be a bad way to go it serves my purposes provided I stay below the limits !
ldog
A: 

I don't think that you can count on it. The wc here runs in a separate process, so there is no synchronization. My trial runs suggest that the order might be consistent in practice (at least in bash). As Daenyth explains, this particular case is special, but try it with grep -o line instead of wc and see what you get.

That said, on my MacBook I get:

$ printf "line1\nline2\nline3\nline4\nline5\n" | tee >(grep -o line ) | (awk '{print "this is awk: "$0}')
this is awk: line1
this is awk: line2
this is awk: line3
this is awk: line4
this is awk: line5
this is awk: line
this is awk: line
this is awk: line
this is awk: line
this is awk: line

very consistently. I'd have to read the bash man page very closely to be sure.

Similarly:

$ printf "line1\nline2\nline3\nline4\nline5\n" | tee >(awk '{print "*" $0 "*"}' ) | (awk '{print "this is awk: "$0}')
this is awk: line1
this is awk: line2
this is awk: line3
this is awk: line4
this is awk: line5
this is awk: *line1*
this is awk: *line2*
this is awk: *line3*
this is awk: *line4*
this is awk: *line5*

every time, and

$ printf "line1\nline2\nline3\nline4\nline5\n" | tee >(awk '{print "*" $0 "*"}' ) | (grep line)
line1
line2
line3
line4
line5
*line1*
*line2*
*line3*
*line4*
*line5*
dmckee