I'm trying to search for a certain string in a lot of gzipped CSV files. The string is located in the first row, so my thought was to get the first row of each file by combining find, zcat and head, but I can't get them to work together.

$ find . -name "*.gz" -print | xargs zcat -f | head -1
20051114083300,1070074.00,0.00000000
xargs: zcat: terminated by signal 13

example file:
$ zcat 113.gz | head
20050629171845,1069335.50,-1.00000000
20050629171930,1069315.00,-1.00000000
20050629172015,1069382.50,-1.00000000
 ... and 2 million rows like these ...

Though I solved the problem by writing a bash script, iterating over the files and writing to a temp file, it would be great to know what I did wrong, how to do it, and if there might be other ways to go about it.

A: 
zcat -r * 2>/dev/null | awk -vRS= -vFS="\n" '{print $1}'
ghostdog74
+1  A: 

It worked as you asked it to.

head did its job, printed one line, and exited. zcat then running under the auspices of xargs tried to write to a closed pipe and received a fatal SIGPIPE for its efforts. Having its child die, xargs reported the whyfor.
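The effect described above can be reproduced without any gzip files. This is a minimal sketch, not part of the original answer: writer_status is a hypothetical helper name, and yes stands in for zcat as a writer that keeps producing output after head has exited.

```shell
# writer_status (hypothetical name): pipe `yes` into `head -n 1` and print
# the exit status that `yes` ended with.
writer_status() {
    # The subshell reports the status on stderr, because its stdout is the
    # pipe that head closes. The outer group folds stderr back into stdout.
    { (yes; echo "$?" >&2) | head -n 1 >/dev/null; } 2>&1
}

writer_status
```

With bash or dash on a typical Linux system this is expected to print 141, which is 128 + 13, the shell's way of saying "killed by signal 13" (SIGPIPE). Other shells may encode a signal death differently.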

To get the desired behaviour, you'd need a find -exec ... construction or a custom zhead to give to xargs.
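One possible shape for the find -exec construction is sketched below. It is an illustration rather than the answerer's own command, and first_lines is a hypothetical function name. The idea is to start one small shell per file, so that each file gets its own zcat | head pipeline.

```shell
# first_lines DIR (hypothetical name): print the first line of every .gz
# file found under DIR.
first_lines() {
    # -exec runs one sh per file. When head exits early, only that file's
    # zcat is stopped, and find simply moves on to the next file.
    find "$1" -name '*.gz' -exec sh -c 'zcat -f "$1" | head -n 1' sh {} \;
}
```

Called as first_lines . it covers the current directory, like the original find command. Spawning a shell per file is slower than a single xargs invocation, which is the price of giving each file its own head.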

added junk code I found behind the fridge:

#!/usr/bin/env python3

"""zhead - poor man's zcat file... | head -n
   no argument error checking, prefers to continue in the face of
   IO errors, with diagnostic to stderr

   sample usage: find ... | xargs zhead.py -1"""

import gzip
import sys

# An optional leading "-N" argument sets the line count, as with head.
if sys.argv[1].startswith('-'):
    nlines = int(sys.argv[1][1:])
    start = 2
else:
    nlines = 10
    start = 1

for zfile in sys.argv[start:]:
    try:
        # The with block closes the file even if reading fails part-way.
        with gzip.open(zfile) as zin:
            for i in range(nlines):
                line = zin.readline()
                if not line:
                    break
                # Lines are bytes; write them through unchanged.
                sys.stdout.buffer.write(line)
    except Exception as err:
        # Report the failure and carry on with the next file.
        print(zfile, err, file=sys.stderr)

It processed 10k files in /usr/share/man in about a minute.

msw
Good explanation. I wish I could upvote you, and I'll be back when I have reached 15 rep.
furedde
Glad to be of help. Don't worry about the vote, that's not why I do it (and Dennis Williamson got my vote because it was better).
msw
+1  A: 

You should find that this will work:

find . -name "*.gz" | while read -r file; do zcat -f "$file" | head -n 1; done
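
Since the original goal was to search the first row for a string, the same loop can be extended with grep. This is a sketch that goes beyond the answer as posted: search_first_lines is a hypothetical name, the pattern is treated as a fixed string, and file names are assumed not to contain newlines.

```shell
# search_first_lines PATTERN DIR (hypothetical name): list the .gz files
# under DIR whose first line contains PATTERN.
search_first_lines() {
    pattern=$1
    find "$2" -name '*.gz' | while IFS= read -r file; do
        # head lets only the first line through, so grep never sees the rest.
        if zcat -f "$file" | head -n 1 | grep -q -F -- "$pattern"; then
            printf '%s\n' "$file"
        fi
    done
}
```

For example, search_first_lines 20051114083300 . would list every file under the current directory whose first row contains that timestamp.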
Dennis Williamson
Worked flawlessly, thank you. I didn't know you could use while and read like that; I'll remember it.
furedde
A: 

If you have GNU Parallel http://www.gnu.org/software/parallel/ installed:

find . -name '*.gz' | parallel 'zcat {} | head -n1'

Watch the intro video to GNU Parallel at http://www.youtube.com/watch?v=OpaiGYxkSuQ

Ole Tange