I saw the following interesting usage of tar in a co-worker's Bash scripts:

`tar cf - * | (cd <dest> ; tar xf - )`

Apparently it works much like rsync -av does, but faster. The question arises, how?

-m


EDIT: Can anyone explain why this solution should be preferable to the following?

cp -rfp * dest

Is the former faster?

+10  A: 

The first tar writes the archive to standard output, which is piped to a subshell (the part wrapped in parentheses) that changes to a different directory and extracts the archive from standard input. That's what the dash after the f option means: use standard output (or input) instead of a named file. It's basically copying all the visible files and subdirectories of the current directory to another directory.
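A minimal sketch of that pipeline on throwaway directories, assuming bash and a tar that accepts `-` for stdin/stdout. The scratch directory and the file names are made up for illustration, and `&&` is used in place of `;` so that nothing is extracted if the cd fails.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical scratch area; removed when the shell exits.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/sub" "$work/dest"
echo hello > "$work/src/a.txt"
echo world > "$work/src/sub/b.txt"

start_dir=$PWD
cd "$work/src"

# Left side: archive the visible entries to stdout ("f -").
# Right side: the parentheses start a subshell, so the cd happens only there;
# the second tar then unpacks the stream it reads from stdin.
tar cf - * | (cd "$work/dest" && tar xf -)

# The parent shell has not moved; only the subshell changed directory.
after_dir=$PWD
copied=$(cat "$work/dest/sub/b.txt")
echo "still in: $after_dir"
echo "copied:   $copied"
cd "$start_dir"
```

Because the cd runs inside the subshell, the script does not need to remember and restore the working directory just for the extraction step.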

tvanfosson
+1 on the interpretation - The reason to do it: to copy a collection of files/directories from one location to another, sometimes across file systems or devices.
Ken Gentle
Approximately equivalent to: find . -print | cpio -pvdumB <dest>
Jonathan Leffler
wtf. what do those cpio parameters mean oO
Johannes Schaub - litb
i still can't see how it could be faster. i mean, the same stuff will be written to the FS. whether it was piped before or not doesn't matter does it? it would slow it down if anything. could someone shed some light on this?
Johannes Schaub - litb
maybe i should open a new question about this :)
Johannes Schaub - litb
Multiple processes vs. one process -- allows one process to continue when the other process is blocked awaiting I/O. Granted there is not a lot of non-I/O in this task but if you're going between different devices it would really help.
tvanfosson
+2  A: 

This is a handy use of pipes. Normally the first tar would write to a file, but here it writes to stdout (the -), which is piped to the second tar, which reads from stdin rather than from a file. It is the same thing as tarring to a file and untarring later, except without the file in between.
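To make the "file in between" comparison concrete, here is a small sketch that does the copy both ways and compares the results. It assumes bash, tar and `diff -r`; the names `tmp.tar`, `via_file` and `via_pipe` are invented for the example.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/sub" "$work/via_file" "$work/via_pipe"
echo one > "$work/src/a.txt"
echo two > "$work/src/sub/b.txt"

# Two steps, with an archive file in between.
(cd "$work/src" && tar cf "$work/tmp.tar" *)
(cd "$work/via_file" && tar xf "$work/tmp.tar")

# One step: the pipe takes the place of tmp.tar.
(cd "$work/src" && tar cf - *) | (cd "$work/via_pipe" && tar xf -)

# diff -r exits 0 and prints nothing when the two trees have the same content.
if diff -r "$work/via_file" "$work/via_pipe"; then
    trees_match=yes
else
    trees_match=no
fi
echo "trees match: $trees_match"
```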

Stefan Mai
Also, technically speaking 'stdout' _is_ a file, so there's really no trickery here at all... just another day at the office :)
thenduks
In Linux, everything is a file.
Kibbee
+1  A: 
tar cf - * | (cd <dest> ; tar xf - )

tars all non-hidden files and directories in the current directory to stdout, then pipes that into a new subshell's stdin. That subshell first changes its current working directory to <dest>, and then untars the stream into that directory.
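The "not hidden" part comes from the shell's expansion of `*`, not from tar. A rough sketch, assuming bash with its default globbing (no `dotglob`) and made-up file names, that contrasts `*` with archiving `.`:

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src" "$work/star" "$work/dot"
echo visible > "$work/src/shown.txt"
echo hidden > "$work/src/.hidden.txt"

# "*" is expanded by the shell and, by default, skips names starting with ".".
(cd "$work/src" && tar cf - *) | (cd "$work/star" && tar xf -)

# Archiving "." hands tar the directory itself, so dotfiles come along.
(cd "$work/src" && tar cf - .) | (cd "$work/dot" && tar xf -)

star_has_hidden=no
if [ -e "$work/star/.hidden.txt" ]; then star_has_hidden=yes; fi
dot_has_hidden=no
if [ -e "$work/dot/.hidden.txt" ]; then dot_has_hidden=yes; fi
echo "with *: $star_has_hidden, with .: $dot_has_hidden"
```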

Johannes Schaub - litb
A: 
tar cf - *

This uses tar to send * to stdout

|

This does the obvious redirect of stdout to...

(cd <dest> ; tar xf - )

This, which changes PWD to the appropriate location and then extracts from stdin

I do not know why this would be faster than rsync, as there is no compression involved.

Sparr
It would be faster because it is on a single machine (no network) and there is no compression, so the machine does less work.
Jonathan Leffler
Also because two processes are involved - one reading and one writing?
Alastair
+2  A: 

For a directory with 25,000 empty files:

$ time { tar -cf - * | (cd ../bar; tar -xf - ); }
real    0m4.209s
user    0m0.724s
sys     0m3.380s

$ time { cp * ../baz/; }
real    0m18.727s
user    0m0.644s
sys     0m7.127s

For a directory with 4 files of 1073741824 bytes (1 GiB) each:

$ time { tar -cf - * | (cd ../bar; tar -xf - ); }
real    3m44.007s
user    0m3.390s
sys     0m25.644s

$ time { cp * ../baz/; }
real    3m11.197s
user    0m0.023s
sys     0m9.576s

My guess is this phenomenon is highly filesystem-dependent. If I'm right you will see a drastic difference between a filesystem that specializes in numerous small files, such as reiserfs 3.6, and a filesystem that is better at handling large files.

(I ran the above tests on HFS+.)
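For anyone who wants to repeat the small-file test on their own filesystem, something along these lines should do. It is only a sketch: it assumes bash (for the `time` keyword and arrays) and `seq`, the default file count is an arbitrary smaller number than the 25,000 used above, and the timings will differ by machine and filesystem.

```shell
#!/usr/bin/env bash
set -eu

n=${1:-2000}   # hypothetical default; the run above used 25,000 files
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/foo" "$work/bar" "$work/baz"

# Create n empty files named f1..fn.
(cd "$work/foo" && for i in $(seq 1 "$n"); do : > "f$i"; done)

start_dir=$PWD
cd "$work/foo"
echo "tar pipeline:"
time { tar -cf - * | (cd ../bar && tar -xf -); }
echo "cp:"
time { cp * ../baz/; }
cd "$start_dir"

# Sanity check: both destinations should hold the same number of files.
bar_files=("$work/bar"/*)
baz_files=("$work/baz"/*)
tar_count=${#bar_files[@]}
cp_count=${#baz_files[@]}
echo "copied by tar: $tar_count, copied by cp: $cp_count"
```

Run it a few times and in both orders, since caching from the first copy can flatter whichever command runs second.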

Good call, I hadn't considered that aspect. You are likely correct and my original statement is not to be taken as a generality.
fogus
I see a potential problem with the first test: Are you sure that globs aren't expanded in advance of "`time`" starting its timer when there is no pipeline? You should rerun the "`cp`" test with a useless "`cat`" just to be sure.
Teddy
+1  A: 

Some old versions of cp didn't have -f / -p (and similar) options for preserving permissions, so this tar trick did the job.

Andrew Medico
+5  A: 

To see the difference between cp and tar when copying directory hierarchies, you can run a simple experiment:

alastair box:~/hack/cptest [1134]% mkdir src
alastair box:~/hack/cptest [1135]% cd src
alastair box:~/hack/cptest/src [1136]% touch foo
alastair box:~/hack/cptest/src [1137]% ln -s foo foo-s
alastair box:~/hack/cptest/src [1138]% ln foo foo-h
alastair box:~/hack/cptest/src [1139]% ls -l
total 0
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo-h
lrwxrwxrwx  1 alastair alastair    3 Nov 25 14:59 foo-s -> foo
alastair box:~/hack/cptest/src [1142]% mkdir ../cpdest
alastair box:~/hack/cptest/src [1143]% cp -rfp * ../cpdest
alastair box:~/hack/cptest/src [1144]% mkdir ../tardest
alastair box:~/hack/cptest/src [1145]% tar cf - * | (cd ../tardest ; tar xf - )
alastair box:~/hack/cptest/src [1146]% cd ..
alastair box:~/hack/cptest [1147]% ls -l cpdest
total 0
-rw-r--r--  1 alastair alastair    0 Nov 25 14:59 foo
-rw-r--r--  1 alastair alastair    0 Nov 25 14:59 foo-h
lrwxrwxrwx  1 alastair alastair    3 Nov 25 15:00 foo-s -> foo
alastair box:~/hack/cptest [1148]% ls -l tardest
total 0
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo
-rw-r--r--  2 alastair alastair    0 Nov 25 14:59 foo-h
lrwxrwxrwx  1 alastair alastair    3 Nov 25 15:00 foo-s -> foo

The difference is in the hard-linked files. With cp, foo and foo-h are copied as two separate files (link count 1 each); with tar, they are still hard links to a single file (link count 2). To make the difference more obvious, have a look at the inodes for each:

alastair box:~/hack/cptest [1149]% ls -i cpdest
24690722 foo  24690723 foo-h  24690724 foo-s
alastair box:~/hack/cptest [1150]% ls -i tardest
24690801 foo  24690801 foo-h  24690802 foo-s

There are probably other reasons to prefer tar, but this is one big one, at least if you have extensively hard-linked files.
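The same experiment can be scripted so it does not depend on reading inode numbers by eye. This sketch assumes bash, whose `[ a -ef b ]` test is true when two names refer to the same inode; the helper `same_inode` is a name invented here. What cp reports may vary between cp implementations, so the script just prints both results.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src" "$work/cpdest" "$work/tardest"
touch "$work/src/foo"
ln "$work/src/foo" "$work/src/foo-h"   # second name for the same inode

(cd "$work/src" && cp -rfp * "$work/cpdest")
(cd "$work/src" && tar cf - *) | (cd "$work/tardest" && tar xf -)

# Hypothetical helper: report whether two paths share one inode.
same_inode() {
    if [ "$1" -ef "$2" ]; then echo linked; else echo separate; fi
}
cp_result=$(same_inode "$work/cpdest/foo" "$work/cpdest/foo-h")
tar_result=$(same_inode "$work/tardest/foo" "$work/tardest/foo-h")
echo "cp -rfp: $cp_result"
echo "tar:     $tar_result"
```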

Alastair
nice, dude. would give you +10 :p
Johannes Schaub - litb
What you want to do is pass "`--archive`" to `cp`, that'll fix it. Assuming GNU `cp`, of course.
Teddy
A: 

The tar solution will preserve symbolic links, whereas cp will just make copies and destroy the links.

tar has been a standard Unix utility a lot longer than rsync, so you're more likely to find it available when a directory hierarchy needs to be copied to another location (even another computer). rsync is probably easier to use these days, but is slower because it compares the source and destination and syncs them. tar just copies in one direction.

Barry Brown
cp won't "destroy the [symbolic] links", see my answer above...
Alastair
A: 

If you have GNU cp (which all Linux-based systems will), then cp --archive will work, even on hard-linked files, and tar is not needed.
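A quick way to check this on your own system, assuming GNU cp (the long option `--archive` may not exist in other implementations) and bash for the `-ef` test. File names mirror the experiment in the earlier answer.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src" "$work/dest"
touch "$work/src/foo"
ln "$work/src/foo" "$work/src/foo-h"   # hard link
ln -s foo "$work/src/foo-s"            # symbolic link

# Assumes GNU cp; --archive is its long form of -a.
(cd "$work/src" && cp --archive * "$work/dest")

hard_link_kept=no
if [ "$work/dest/foo" -ef "$work/dest/foo-h" ]; then hard_link_kept=yes; fi
symlink_kept=no
if [ -L "$work/dest/foo-s" ]; then symlink_kept=yes; fi
echo "hard link kept: $hard_link_kept, symlink kept: $symlink_kept"
```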

Teddy
A: 

As it happens, a co-worker wrote a nearly identical command into one of our scripts. After I spent some time puzzling over it, I asked why he had used that rather than cp. His answer, as I recall it, was that cp is slow when making a copy from one file system to another.

Whether or not this is true would require more testing than I care to spend on the question, but it makes a certain amount of sense. The first tar process reads from the source device as quickly as possible only waiting for that device to read. Meanwhile, the second tar process reads from its input pipe and writes as quickly as possible. It might have to wait for input, but if writes on the destination device are slower than reads on the source device it will only wait on the destination device. A single cp command will have to wait on both the source and the destination devices.

On the other hand, modern operating systems do a pretty good job of pre-caching IO operations. It's entirely possible cp will spend most of its time waiting on writes and getting reads from memory rather than the device itself. It seems like one would need really solid data to choose two tar commands over the more straightforward cp command.

Jon Ericson
A: 

I believe tar will do a Windows-style 'merge' operation with deeply nested directories, whereas cp will overwrite subdirectories.

For example if you have the layout:

dir/subdir/file1

and you copy it to a destination that contains:

dir/subdir/file2

Then with cp you will be left with:

dir/subdir/file1

But with the tar command, your destination will contain:

dir/subdir/file1
dir/subdir/file2
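The tar half of this is easy to try out. The sketch below, assuming bash and tar, recreates the layout above in a scratch directory and lists what the destination holds after the copy. It does not test what cp does in the same situation.

```shell
#!/usr/bin/env bash
set -eu

work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir -p "$work/src/dir/subdir" "$work/dest/dir/subdir"
echo new > "$work/src/dir/subdir/file1"
echo old > "$work/dest/dir/subdir/file2"   # already present in the destination

(cd "$work/src" && tar cf - *) | (cd "$work/dest" && tar xf -)

# List what ended up in the destination subdirectory.
merged=$(cd "$work/dest/dir/subdir" && ls)
echo "$merged"
```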
Singletoned