I have a recurring task of splitting a set of large (about 1-2 GiB each) gzipped Apache logfiles into several parts (say chunks of 500K lines). The final files should be gzipped again to limit the disk usage.

On Linux I would normally do:

zcat biglogfile.gz | split -l500000

The resulting files will be named xaa, xab, xac, etc. So I then do:

gzip x*

The effect of this method is that as an intermediate result these huge files are temporarily stored on disk. Is there a way to avoid this intermediate disk usage?

Can I (in a way similar to what xargs does) have split pipe the output through a command (like gzip) and recompress the output on the fly? Or am I looking in the wrong direction and is there a much better way to do this?
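(For reference, on systems with GNU coreutils 8.13 or later, split supports exactly this via its --filter option: the given command is run once per chunk with the output name exported in $FILE, so each chunk can be gzipped on the fly. A sketch, using a small generated file in place of the real log:)

```shell
# Create a small gzipped input as a stand-in for biglogfile.gz
seq 1 10 | gzip > biglogfile.gz

# Split into 4-line chunks, compressing each chunk as it is written;
# no uncompressed intermediate files ever touch the disk.
# '-' tells split to read from stdin; 'part_' is the output prefix.
zcat biglogfile.gz | split -l4 --filter='gzip > $FILE.gz' - part_

# Result: part_aa.gz, part_ab.gz, part_ac.gz
ls part_*.gz
```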

Thanks.

+1  A: 

There's zipsplit, but that uses the zip algorithm as opposed to the gzip algorithm.

Tony Miller
+3  A: 

A script like the following might suffice.

#!/usr/bin/perl
use strict;
use warnings;
use PerlIO::gzip;    # provides the :gzip I/O layer for open()

my $filename = 'out';
my $limit    = 500000;

my $fileno = 1;
my $line   = 0;
my $fh;

while (<>) {
    # Start a new gzipped output file when none is open yet,
    # or when the current one has reached the line limit.
    # Reopening $fh closes (and finalizes) the previous file.
    if (!$fh || $line >= $limit) {
        open $fh, '>:gzip', "${filename}_$fileno"
            or die "Cannot open ${filename}_$fileno: $!";
        $fileno++;
        $line = 0;
    }
    print $fh $_;
    $line++;
}
ar
Thanks, your quick example helped me a lot. With two minor fixes (the first line must start with #!/ and after the $fileno++ an additional $line = 0 is needed) it worked well enough for my purposes.
Niels Basjes
Thanks. I've added those to the script for completeness.
ar