tags:
views: 110
answers: 6

I often have shell programming tasks where I run into this pattern:

cat file | some_script > file

This is unsafe - cat may not have read in the entire file before some_script starts writing to it. I don't really want to write the result to a temporary file (it's slow, and I don't want the added complication of thinking up a unique new name).

Perhaps there is a standard shell command that will buffer a whole stream until EOF is reached? Something like:

cat file | bufferUntilEOF | script > file

Ideas?

A: 

Using a temporary file is IMO better than attempting to buffer the data in the pipeline.

It almost defeats the purpose of pipelines to buffer them.

John Weldon
Well, maybe. Sounds like a religious argument though. I know all of the files easily fit within a tiny portion of the main memory (my shell script will operate over each source file in a very large SVN repository). The temporary file will make it run twice as slow as necessary (at least within Cygwin).
That may be. If your code is always going to be used in the way you expect, then it makes sense to make judicious trade-offs...
John Weldon
@stuartreynolds: Using a temporary file will NOT make it run slower, except perhaps for some negligible constant time for renaming the file back to its original name.
Juliano
That's a good point... If you're doing copies instead of renames, then you may be taking an unnecessary performance hit.
John Weldon
@Juliano. I've found renaming files to be VERY slow in Windows (Cygwin). (Many, many times slower than on Linux on the same machine.) As a rule of thumb, I try to avoid it when operating over very many files and the user is waiting while the script runs.
+1  A: 

You're looking for sponge.

chazomaticus
That looks like a good solution, except that I don't want to require all the users of my scripts to install additional dependencies (or compile any code). Isn't there an alternative using standard utilities or built-in shell features?
I don't recommend sponge. If any command in your pipeline (other than sponge) fails (for example, due to a syntax error, invalid arguments, etc.), it erases the file, and you end up with neither the original nor the destination file.
Juliano
/tmp may be mounted in memory (at least under Linux). In this case I would hope that this could be really fast. Not sure about /tmp in Cygwin though. Does Cygwin hold that in memory?
@Juliano - The problem is not sponge, it's the shell: i) sponge < file > file causes file to be truncated. Similarly, cat file | b | c | sponge > file also gets truncated - Bash truncates the file before sponge gets to see the input. ii) cat file | sponge file works fine.
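
For what it's worth, a minimal sketch of the working form from (ii), assuming the sponge from moreutils is installed (some_script is a stand-in for whatever filter you run):

some_script < file | sponge file

sponge soaks up all of its standard input before it opens file for writing, so the file is only replaced once it has been read completely.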
+4  A: 

Using a temporary file is the correct solution here. When you use a redirection like '>', it is handled by the shell, and no matter how many commands are in your pipeline, the shell truncates the output file while it sets up the pipeline, before any command is executed.

Juliano
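
A quick way to see that truncation in action (the exact outcome is a race between the processes, but the file typically ends up empty):

echo hello > file
cat file | tr a-z A-Z > file   # the shell truncates 'file' before cat can read it
cat file                       # usually prints nothing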
+1  A: 

Using mktemp(1) or tempfile(1) saves you the expense of having to think up a unique filename.

Dennis Williamson
vote up, excellent tool(s).
Anders
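
A rough sketch of how mktemp might slot into the original pattern (some_script is a placeholder; mktemp prints the name of a freshly created temporary file):

tmp=$(mktemp) &&
some_script < file > "$tmp" &&
mv "$tmp" file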
A: 

I think that the best way is to use a temp file, but if you want another approach, you can use something like awk to buffer the input in memory before your application starts receiving it. The following script buffers all of the input into the lines array before it starts to output it to the next consumer in the pipeline.

# buffer every input line in memory
{ lines[NR] = $0; }
END {
    # once end of input is reached, emit the buffered lines
    for (line_no=1; line_no<=NR; ++line_no) {
        print lines[line_no];
    }
}

You can collapse it into a one-liner if you want:

cat file | awk '{lines[NR]=$0;} END {for(i=1;i<=NR;++i) print lines[i];}' > file

With all of that, I would still recommend using a temporary file for the output and then overwriting the original file with it.

D.Shawley
+3  A: 

Like many others, I like to use temporary files. I use the shell process ID as part of the temporary name so that if multiple copies of the script are running at the same time, they won't conflict. Finally, I then only overwrite the original file if the script succeeds (using boolean operator short-circuiting - it's a little dense but very nice for simple command lines). Putting that all together, it would look like:

some_script < file > smscrpt.$$ && mv smscrpt.$$ file

This will leave the temporary file if the command fails. If you want to clean up on error, you can change that to:

some_script < file > smscrpt.$$ && mv smscrpt.$$ file || rm smscrpt.$$

BTW, I got rid of the poor use of cat and replaced it with input redirection.

R Samuel Klatchko
@stuartreynolds - someone else posted about sponge and you rejected that because it's not standard. There is nothing standard that does what you want.
R Samuel Klatchko
@klatchko - I think something like sponge *is* the answer I'm looking for (with the caveats I mentioned -- it's not really easy for me to use it widely). IMO, if there's really nothing that does what sponge does, *and* the functionality of sponge is fundamental to shell scripting (buffering to avoid file corruption sounds pretty fundamental to me), then probably it should be part of bash or the standard GNU toolset (in which case I'm hoping someone will point out why we don't need sponge at all... anyone?). Do I *really* have to make a temporary file to do this?
@stuartreynolds - if you want something standard as of today, then yes, you need temporary files. I disagree that buffering is fundamental because you get your necessary behavior with temporary files (and given how command lines work, temporary files are better because you can preserve your original file if there's an error). Finally, if Cygwin is so broken that file rename is too slow, that is the issue that should be fixed.
R Samuel Klatchko