I'm scraping data from the web, and I have several processes of my scraper running in parallel.

I want the output of each of these processes to end up in the same file. As long as lines of text remain intact, the order of the lines does not matter. In UNIX, can I just redirect the output of each process to the same file using the >> operator?

A: 

Yep, sounds fine.

If you are worried, pipe them into separate files, then cat them all together at the end.

Rich Bradshaw
It's really not fine, but your suggestion of separate files has promise.
wcm
I was (incorrectly) assuming that the processes only wrote output on completion, and that they were unlikely to collide within the few milliseconds needed. Looks like I was wrong!
Rich Bradshaw
+13  A: 

No. It is not guaranteed that lines will remain intact. They can become intermingled.

Searching based on liori's answer, I found this:

Write requests of {PIPE_BUF} bytes or less shall not be interleaved with data from other processes doing writes on the same pipe. Writes of greater than {PIPE_BUF} bytes may have data interleaved, on arbitrary boundaries, with writes by other processes, whether or not the O_NONBLOCK flag of the file status flags is set.

So lines longer than {PIPE_BUF} bytes are not guaranteed to remain intact.
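
For illustration, a minimal Python sketch of the conservative approach, one write call per complete line, kept under PIPE_BUF (the file name scraper.log and the emit helper are just examples):

    import os
    import select

    # Open with O_APPEND so every write lands at the current end of file.
    fd = os.open("scraper.log", os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)

    def emit(line):
        data = (line + "\n").encode()
        # select.PIPE_BUF is the POSIX atomicity limit for pipes/FIFOs;
        # staying under it with a single write() is the conservative bet.
        if len(data) > select.PIPE_BUF:
            raise ValueError("line too long to write atomically")
        os.write(fd, data)  # one write() per complete line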

Mark Byers
+1 Ha, I couldn't find a proper quote :-)
liori
+1  A: 

Briefly, no. >> does nothing to coordinate writes between multiple processes.

Brian Agnew
+2  A: 

You'll need to ensure that you're writing whole lines in single write operations (so if you're using some form of stdio, set it to line buffering with a buffer at least as long as the longest line you can output). Since the shell opens the file with O_APPEND for the >> redirection, all your writes will automatically be appended to the file with no further action on your part.
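
As a minimal sketch (Python here, with a hypothetical emit helper; the same idea applies to C stdio via setvbuf), bypass buffering entirely and issue one write per complete line on the stdout that the shell redirected with >>:

    import os
    import sys

    def emit(line):
        # One write(2) per complete line, bypassing stdio buffering entirely,
        # so a line can never be split; the shell's >> redirection opened
        # stdout with O_APPEND, so the kernel appends each write at end of file.
        os.write(sys.stdout.fileno(), (line + "\n").encode())

Each scraper process would then be started along the lines of python scraper.py >> output.log.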

Neil
+1  A: 

Use temporary files and concatenate them together. It's the only safe way to do what you want, and the performance loss will probably be negligible. If performance is really a problem, make sure your /tmp directory is a RAM-based filesystem and put your temporary files there. That way the temporary files are stored in RAM instead of on a hard drive, so reading and writing them is near-instant.
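
A sketch of the final concatenation step in Python (the /tmp/scraper.*.out naming scheme is hypothetical; each worker would write its own file there):

    import glob
    import os
    import shutil

    with open("combined.out", "wb") as out:
        for part in sorted(glob.glob("/tmp/scraper.*.out")):
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # stream one part into the result
            os.remove(part)  # clean up the temporary piece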

Max E.
Temp files are a good idea.
Ink-Jet
+5  A: 

Generally, no.

On Linux this might be possible, as long as two conditions are met: each line is written in one operation, and the line is no longer than PIPE_BUF bytes (on Linux usually the same as PAGE_SIZE, i.e. 4096). But I wouldn't count on that; this behaviour might change.

It is better to use some kind of real logging mechanism, like syslog.
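
For example, a minimal sketch using Python's standard-library syslog handler (/dev/log is the usual local syslog socket on Linux; adjust the address for your system):

    import logging
    import logging.handlers

    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    # /dev/log is the local syslog socket on most Linux systems.
    logger.addHandler(logging.handlers.SysLogHandler(address="/dev/log"))

    logger.info("fetched %s", "http://example.com/page")

The syslog daemon then deals with serializing messages from all your processes for you.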

liori
A: 

In addition to the idea of using temporary files, you could also use some kind of aggregating process, although you would still need to make sure your writes are atomic.

Think Apache2 with piped logging (with something like spread on the other end of the pipe if you're feeling ambitious). That's the approach it takes, with multiple threads/processes sharing a single logging process.
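
A minimal sketch of that pattern in Python, with a queue standing in for the pipe and a single writer process owning the file (all names here are illustrative):

    import multiprocessing as mp

    def writer(queue, path):
        # The only process that touches the file; no interleaving possible.
        with open(path, "a") as f:
            for line in iter(queue.get, None):  # None is the shutdown sentinel
                f.write(line + "\n")

    def worker(queue, worker_id):
        for i in range(100):
            queue.put("worker %d, line %d" % (worker_id, i))

    if __name__ == "__main__":
        q = mp.Queue()
        w = mp.Process(target=writer, args=(q, "scraper.log"))
        w.start()
        workers = [mp.Process(target=worker, args=(q, n)) for n in range(4)]
        for p in workers:
            p.start()
        for p in workers:
            p.join()
        q.put(None)  # tell the writer to finish
        w.join()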

Duncan
A: 

Definitely not. I had a log-management script where I assumed this worked, and it did work, until I moved it to a production server under load. Not a good day... You basically end up with lines that are sometimes completely mixed up.

If I'm trying to capture from multiple sources, it is much simpler (and easier to debug) to keep a multiple-file 'paper trail', and if I need an overall log file, to concatenate based on timestamp (you are using timestamps, right?) or, as liori said, to use syslog.
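
If each per-source file is already in timestamp order, the merge is a few lines with heapq in Python (file names are examples, and lines are assumed to start with a lexically sortable timestamp such as ISO 8601):

    import heapq

    sources = [open(name) for name in ("proc1.log", "proc2.log")]
    with open("merged.log", "w") as out:
        # Lazy k-way merge of already-sorted inputs, line by line.
        out.writelines(heapq.merge(*sources))
    for f in sources:
        f.close()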

Andrew Bolster