My situation is the following: a big (10 GB) compressed archive containing about 60 files with a total uncompressed size of 150 GB.
I would like to be able to slice big compressed log files into parts that contain a certain number of lines (e.g. 1 million).
I don't want to use split, since that would involve fully decompressing the original file first, and I don't have that much disk space available.
What I am doing so far is this:
#!/bin/bash
SAVED_IFS=$IFS
IFS=$(echo -en "\n\b")
for file in `ls *.rar`
do
    echo "Reading file: $file"
    touch "$file.chunk.uncompressed"
    COUNTER=0
    CHUNK_COUNTER=$((10#000))
    unrar p "$file" | while read line;
    do
        echo "$line" >> "$file.chunk.uncompressed"
        let COUNTER+=1
        if [ $COUNTER -eq 1000000 ]; then
            CHUNK_COUNTER=`printf "%03d" $CHUNK_COUNTER;`
            echo "Enough lines ($COUNTER) to create a compressed chunk ($file.chunk.compressed.$CHUNK_COUNTER.bz2)"
            pbzip2 -9 -c "$file.chunk.uncompressed" > "$file.chunk.compressed.$CHUNK_COUNTER.bz2"
            # truncate the temp file so the next chunk starts empty
            : > "$file.chunk.uncompressed"
            # 10# is to force bash to count in base 10, so that 008+ are valid
            let CHUNK_COUNTER=$((10#$CHUNK_COUNTER+1))
            let COUNTER=0
        fi
    done
    #TODO need to compress lines in the last chunk too
done
IFS=$SAVED_IFS
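For the TODO, I guess the leftover lines could be flushed with something along these lines; the catch is that the while loop sits on the right-hand side of a pipe and therefore runs in a subshell, so the flush would have to be grouped with the loop (e.g. unrar p "$file" | { while ...; done; flush; }) to still see the updated CHUNK_COUNTER. Untested sketch:
# untested: compress whatever is left over as a final, smaller chunk;
# this must run in the same subshell as the while loop, otherwise
# CHUNK_COUNTER still has its initial value here
if [ -s "$file.chunk.uncompressed" ]; then
    CHUNK_COUNTER=`printf "%03d" $((10#$CHUNK_COUNTER))`
    pbzip2 -9 -c "$file.chunk.uncompressed" > "$file.chunk.compressed.$CHUNK_COUNTER.bz2"
fi
rm -f "$file.chunk.uncompressed"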
What I don't like about this is that I am limited by the speed of writing and then re-reading the uncompressed chunks (~15 MB/s), while reading the uncompressed stream directly from the compressed file runs at ~80 MB/s.
How can I adapt this script so that it streams a limited number of lines per chunk while writing directly to a compressed file?
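One direction I've been wondering about (untested, and assuming a GNU coreutils split new enough to have --filter, i.e. 8.13 or later): let split cut the unrar stream into line-based pieces and hand each piece straight to pbzip2, so the uncompressed data never touches the disk. Inside the for loop this would replace the whole while/read block:
# untested: split reads the stream from stdin ("-"), cuts it every
# 1,000,000 lines, and pipes each piece to pbzip2; split itself sets
# $FILE to the output name for each invocation of the filter command
unrar p "$file" | split -l 1000000 -d -a 3 \
    --filter='pbzip2 -9 -c > "$FILE.bz2"' \
    - "$file.chunk.compressed."
That would also take care of the last partial chunk, but I don't know whether split can keep up with the ~80 MB/s the stream can deliver.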