My situation is the following: I have a big (10 GB) compressed file containing ~60 files with a total uncompressed size of 150 GB.

I would like to be able to slice big compressed log files into parts that have a certain number of lines in them (i.e. 1 million).

I don't want to use split since it involves totally decompressing the original file, and I don't have that much disk space available.

What I am doing so far is this:

#!/bin/bash
# Restrict IFS to newline so that file names containing spaces survive the loop.
SAVED_IFS=$IFS
IFS=$(echo -en "\n\b")
for file in *.rar
do
    echo "Reading file: $file"
    : > "$file.chunk.uncompressed"
    COUNTER=0
    CHUNK_COUNTER=0
    # Process substitution (rather than a plain pipe) keeps the counters
    # visible after the loop, so the last partial chunk can be handled too.
    while read -r line
    do
        echo "$line" >> "$file.chunk.uncompressed"
        let COUNTER+=1
        if [ $COUNTER -eq 1000000 ]; then
            # Zero-pad the counter only for the file name, so the
            # arithmetic below keeps counting in base 10 past 008.
            PADDED=$(printf "%03d" $CHUNK_COUNTER)
            echo "Enough lines ($COUNTER) to create a compressed chunk ($file.chunk.compressed.$PADDED.bz2)"
            pbzip2 -9 -c "$file.chunk.uncompressed" > "$file.chunk.compressed.$PADDED.bz2"
            # Truncate the scratch file so the next chunk starts out empty.
            : > "$file.chunk.uncompressed"
            let CHUNK_COUNTER+=1
            let COUNTER=0
        fi
    done < <(unrar p "$file")
    # Compress whatever is left over in the last, partial chunk.
    if [ $COUNTER -gt 0 ]; then
        PADDED=$(printf "%03d" $CHUNK_COUNTER)
        pbzip2 -9 -c "$file.chunk.uncompressed" > "$file.chunk.compressed.$PADDED.bz2"
    fi
    rm -f "$file.chunk.uncompressed"
done
IFS=$SAVED_IFS

What I don't like about it is that I am limited by the speed of writing and then reading the uncompressed chunks (~15 MB/s). The speed of reading the uncompressed stream directly from the compressed file is ~80 MB/s.

How can I adapt this script to stream a limited number of lines per chunk and write each chunk directly to a compressed file?
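(One possible direction, if GNU coreutils 8.13 or later is available: split has a --filter option that pipes every chunk through a command, so nothing uncompressed would ever hit the disk. A minimal, untested sketch using pbzip2 as above; split sets $FILE to each chunk's output name:)

$ unrar p $file | split -l 1000000 --filter='pbzip2 -9 > "$FILE".bz2' - "$file.chunk."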

A: 

If you don't mind wrapping the file in a tar file, then you can use tar to split and compress the file for you.

You can use tar -M --tape-length 1024 to create 1 megabyte volumes (the length is given in units of 1024 bytes). Do note that each time a volume fills up, tar will ask you to press enter before it starts writing to the file again. So you will have to wrap it with your own script and move the finished volume away before doing so.
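With GNU tar you should be able to automate that prompt away using -F (--info-script), which runs a script at the end of every volume instead of asking. A minimal, untested sketch; the script name rotate-volume.sh is just an example:

$ tar -c -M --tape-length=1024 -F ./rotate-volume.sh -f archive.tar big.log

where rotate-volume.sh compresses the finished volume and moves it aside (tar exports TAR_ARCHIVE, and TAR_VOLUME as the number of the volume about to be started, to the script):

#!/bin/sh
# Compress the volume that was just finished and move it out of the way;
# tar then reuses the same file name ($TAR_ARCHIVE) for the next volume.
gzip -c "$TAR_ARCHIVE" > "$TAR_ARCHIVE.$((TAR_VOLUME - 1)).gz"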

WoLpH
What I don't like about this is that it forces the process to be interactive.
elhoim
+1  A: 

You can pipe the output into a loop that uses head to chop the stream into files.

$ unrar p $file | ( while :; do i=$((i+1)); head -n 10000 | gzip > split.$i.gz; done )

The only thing you still have to work out is how to terminate the loop, since this will go on generating empty files. This is left as an exercise to the reader.

Gzipping an empty input still produces some output (for gz, it's 26 bytes), so you could test for that:

$ unrar p $file |
       ( while :; do
           i=$((i+1));
           head -n 10000 | gzip > split.$i.gz;
           if [ `stat -c %s split.$i.gz` -lt 30 ]; then rm split.$i.gz; break; fi;
       done )
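An alternative, untested sketch that never creates the empty trailing file in the first place: read one line before starting each chunk, and stop as soon as that read fails.

$ unrar p $file | ( i=0; while IFS= read -r first; do
       i=$((i+1));
       { printf '%s\n' "$first"; head -n 9999; } | gzip > split.$i.gz;
   done )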
mvds