I often have to work with directories containing hundreds of thousands of files, doing text matching, replacement, and so on. If I go the standard route of, say

grep foo *

I get the "Argument list too long" error message, so I end up doing

for i in *; do grep foo "$i"; done

or

find ../path/ | xargs -I{} grep foo "{}"

But these are less than optimal: they create a new grep process for each file.

This looks more like a limitation on the size of the argument list a program can receive, because the * in the for loop works fine. But, in any case, what's the proper way to handle this?

PS: Don't tell me to do grep -r instead; I know about that. I'm thinking about tools that do not have a recursive option.

+2  A: 

xargs does not start a new process for each file. It bunches together the arguments. Have a look at the -n option to xargs - it controls the number of arguments passed to each execution of the sub-command.
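For instance (the batch size here is an arbitrary illustration):

find ../path/ | xargs -n 100 grep foo

This runs one grep per batch of up to 100 files rather than one per file; without -n, xargs simply packs as many arguments into each invocation as the system limit allows.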

camh
As Ry4an mentioned, using xargs -I turns batching off.
ephemient
A: 

I can't see that

for i in *; do
    grep foo "$i"
done

would work, since I thought "argument list too long" was a shell limitation, hence it would fail for the for loop as well.

Having said that, I always let xargs do the grunt-work of splitting the argument list into manageable bits thus:

find ../path/ | xargs grep foo

It won't start one process per file, but one per group of files.
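You can see the batching for yourself with a quick experiment (seq is from GNU coreutils):

seq 100000 | xargs echo | wc -l

Each output line corresponds to one invocation of echo, so the count shows how many batches xargs used, far fewer than 100000.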

paxdiablo
No, the "argument list too long" limitation exists because the length of the argument list passed to the program being executed is restricted in size. "for i in *" never leaves the current shell to execute another program, so it can't hit this limitation.
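You can check the limit on your system with:

getconf ARG_MAX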
ephemient
+5  A: 

If there is a risk of filenames containing spaces, you should remember to use the -print0 flag to find together with the -0 flag to xargs:

find . -print0 | xargs -0 grep -H foo
JesperE
I usually use `xargs -d '\n'` using newlines as the separators, since find outputs paths separated by newlines by default.
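e.g., with GNU xargs:

find . | xargs -d '\n' grep -H foo

Note this still breaks on the (rare) filenames that contain embedded newlines, which only -print0/-0 handles correctly.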
ephemient
+5  A: 

In newer versions of findutils, find can do the work of xargs (including the glomming behavior, such that only as many grep processes as needed are used):

find ../path -exec grep foo '{}' '+'

The use of + rather than ; as the last argument triggers this behavior.
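For comparison, the traditional form

find ../path -exec grep foo '{}' ';'

runs one grep per file, which is exactly the overhead the question is trying to avoid.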

Charles Duffy
Great tip! I never knew of the "+" option.
mhawke
A: 

Well, I had the same problems, but it seems that everything I came up with has already been mentioned. Mostly, I had two problems: doing a plain ls in a directory with a million files takes forever (20+ minutes on one of my servers), and since globbing is expensive, ls * in such a directory takes forever and then fails with an "argument list too long" error.

find /some -type f -exec some command {} \;

seems to help with both problems. Also, if you need to do more complex operations on these files, you might consider scripting the work across multiple threads. Here is a Python primer for scripting CLI stuff: http://www.ibm.com/developerworks/aix/library/au-pythocli/?ca=dgr-lnxw06pythonunixtool&S_TACT=105AGX59&S_CMP=GR
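If you'd rather stay in the shell, a rough sketch of the same idea (assuming GNU xargs, which provides -P; the batch size and process count are arbitrary) is to let xargs run several greps in parallel:

find /some -type f -print0 | xargs -0 -n 100 -P 4 grep foo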

Using find -exec grep foo ';' has the same problem as the original solution in that it execs an individual instance of grep for each file.
Charles Duffy