views: 411
answers: 6

I've got a job that has been running at the command-line prompt on my server for two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} \;

It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...

find data/ -name filepattern-*2009* -print > filesOfInterest.txt

...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?

A secondary question is: why is my current approach so slow?

+6  A: 

There is xargs for this:

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar

Guessing why it is slow is hard, as there is not much information: what is the structure of the directory, which filesystem do you use, and how was it configured at creation time? Having millions of files in a single directory is quite a hard situation for most filesystems.

Michal Čihař
The directory is hashed nicely. ext3, btw. As I mentioned, the find command alone runs quickly, so I believe the file system, the directory structure, etc. are not the issue.
Stu Thompson
I think you'll have to add `--max-args=n` (short `-n n`) where `n` is the maximum number of arguments tar (or any other program) can take. `getconf ARG_MAX` should show how high this limit is (131,072 on my machine). It's possible though that xargs takes care of this itself.
sfussenegger
Wow! So I started another run with `xargs`, as you suggested, 15 minutes ago, and the resulting tar file is already 25% of the size of my original command's. Thanks.
Stu Thompson
@Stu no "arg list too long" error? If not, I was being overcautious ... once again :)
sfussenegger
@Stu Hey, it's me again :) You could simply replace the `;` in your original command with a `+` to get the exact same effect. Just see the corresponding man page entry on -exec
sfussenegger
@sfussenegger No errors (yet). 273,838 files `tar`'d and counting. RHEL4 64-bit. `getconf ARG_MAX` reports 131k like yours.
Stu Thompson
@Stu it would have failed immediately anyway. For instance, this happens when you do `tar -uf 2009.tar filepattern-*2009*` in a directory with 132k+ files.
sfussenegger
xargs knows the maximum number of arguments it can pass; that is its purpose. You only need `--max-args` if you want to pass fewer.
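This batching behavior is easy to see with throwaway input: xargs packs as many arguments as fit into each invocation, and `-n` (`--max-args`) lowers the batch size explicitly. A minimal sketch:

```shell
# xargs splits five arguments into batches of at most two,
# so echo is invoked three times -> three lines of output.
printf '%s\n' a b c d e | xargs -n 2 echo
# prints:
# a b
# c d
# e
```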
Michal Čihař
Turns out that, while `xargs` is faster than my first approach, running tar with an input list via `-T` is much, much faster than both.
Stu Thompson
+2  A: 

The way you currently have things, you are invoking the tar command every single time find locates a file, which is unsurprisingly slow. Instead of paying the two hours of search time plus, once, the time it takes to open the tar archive, check whether the files are out of date, and append them, you are effectively multiplying those costs together. You would have better success invoking tar once, after you have batched together all the names, possibly using xargs. By the way, I hope you are using 'filepattern-*2009*' and not filepattern-*2009*, as the stars will be expanded by the shell without the quotes.
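The per-invocation overhead is easy to demonstrate on a scratch directory (hypothetical paths, a handful of files instead of millions): with `\;` find runs the command once per file, while `{} +` batches many names into a single invocation, much like xargs does.

```shell
# Scratch data: five tiny files matching the question's pattern
mkdir -p /tmp/execdemo/data
for i in 1 2 3 4 5; do echo x > "/tmp/execdemo/data/filepattern-$i-2009"; done

# With ';' the command is invoked once per file -> 5 lines of output
find /tmp/execdemo/data -name 'filepattern-*2009*' -exec echo batch {} \; | wc -l

# With '+' the command is invoked once with all names -> 1 line of output
find /tmp/execdemo/data -name 'filepattern-*2009*' -exec echo batch {} + | wc -l
```

Replacing `echo` with `tar uf 2009.tar` turns the second form into a single-invocation version of the original job.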

Michael Aaron Safyan
+6  A: 

One option is to use cpio to generate a tar-format archive:

$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar

cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.

Matthew Mott
+1  A: 

Here's a find-tar combination that can do what you want without the use of xargs or exec (which should result in a noticeable speed-up):

tar --version    # tar (GNU tar) 1.14 

# FreeBSD find (on Mac OS X)
find -x data -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# for GNU find use -xdev instead of -x
gfind data -xdev -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# added: set permissions via tar
find -x data -name "filepattern-*2009*" -print0 | \
    tar --null --no-recursion --owner=... --group=... --mode=... -uf 2009.tar --files-from -
bashfu
+1  A: 

If you already ran the second command that created the file list, just use the `-T` option to tell tar to read the file names from that saved list. Running 1 tar command vs. N tar commands will be a lot better.
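A self-contained sketch of this approach on scratch data (hypothetical paths; `-c` is used here so the example starts from nothing, and `-T` combines with `-u` the same way for incremental updates):

```shell
# Scratch data: two tiny files matching the question's pattern
mkdir -p /tmp/tdemo/data
echo x > /tmp/tdemo/data/filepattern-1-2009
echo x > /tmp/tdemo/data/filepattern-2-2009
cd /tmp/tdemo

# Build the list once, then hand it to a single tar invocation via -T
find data/ -name 'filepattern-*2009*' -print > filesOfInterest.txt
tar -cf 2009.tar -T filesOfInterest.txt

tar -tf 2009.tar | wc -l    # 2 members
```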

frankc
After running with `xargs` for a while, I tried this approach...and it was **much** faster!
Stu Thompson
A: 

To correctly handle file names with weird (but legal) characters such as newlines, you should write your file list to filesOfInterest.txt using find's `-print0`:

find -x data -name "filepattern-*2009*" -print0 > filesOfInterest.txt
tar --null --no-recursion -uf 2009.tar --files-from filesOfInterest.txt 
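A small reproduction of why this matters (hypothetical paths): a file name containing an embedded newline survives the round trip only with the NUL-delimited variant, since a plain line-based list would split it into two bogus names.

```shell
mkdir -p /tmp/nuldemo/data /tmp/nuldemo/extract
# A legal but awkward file name containing an embedded newline
touch '/tmp/nuldemo/data/filepattern-weird
name-2009'
cd /tmp/nuldemo

# NUL-delimited list: the newline inside the name is preserved intact
find data -name 'filepattern-*2009*' -print0 > filesOfInterest.txt
tar --null --no-recursion -cf 2009.tar --files-from filesOfInterest.txt

# The awkward name comes back unharmed on extraction
tar -C extract -xf 2009.tar
test -f 'extract/data/filepattern-weird
name-2009' && echo ok
```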
bashfu