Greetings,

I've taken over from a prior team and am writing ETL jobs that process CSV files. I use a combination of shell scripts and Perl on Ubuntu. The CSV files are huge; they arrive as zipped archives. Unzipped, many are more than 30 GB - yes, that's a G.

The legacy process is a batch job, run from cron, that unzips each file entirely, copies its first line into a config file, then re-zips the whole thing. Some days this takes many hours of processing time, for no benefit.

Can you suggest a method to only extract the first line (or first few lines) from each file inside a zipped archive, without fully unpacking the archives?

+1  A: 

Python's zipfile.ZipFile allows you to access archived files as streams via ZipFile.open(). From there you can process them as necessary.
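A minimal sketch of that approach (the function name `first_lines` and the UTF-8 encoding are my own assumptions, not part of the zipfile API):

```python
import io
import zipfile

def first_lines(archive_path):
    """Return (member_name, first_line) for each file in the archive,
    reading only as much of each member as needed."""
    results = []
    with zipfile.ZipFile(archive_path) as zf:
        for name in zf.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # ZipFile.open() returns a stream; nothing is extracted to disk
            with zf.open(name) as member:
                # wrap the binary stream so readline() decodes text lazily
                text = io.TextIOWrapper(member, encoding="utf-8")
                results.append((name, text.readline().rstrip("\r\n")))
    return results
```

Because the member is read as a stream, only the first compressed blocks are ever decompressed, so a 30 GB member costs almost nothing to peek at.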

Ignacio Vazquez-Abrams
+5  A: 

The unzip command-line utility has a -p option which dumps a file to standard output. Pipe that into head and the whole file never has to be extracted to disk.

Alternatively, from perldoc Archive::Zip (the example below is Archive::Zip's API, not IO::Compress::Zip's):

use Archive::Zip qw( :ERROR_CODES :CONSTANTS );

my $zip = Archive::Zip->new( 'archive.zip' )
    or die 'unable to read zip file';

my ($status, $bufferRef);
my $member = $zip->memberNamed( 'xyz.txt' );
$member->desiredCompressionMethod( COMPRESSION_STORED );
$status = $member->rewindData();
die "error $status" unless $status == AZ_OK;
while ( ! $member->readIsDone() )
{
   ( $bufferRef, $status ) = $member->readChunk();
   die "error $status" if $status != AZ_OK && $status != AZ_STREAM_END;
   # do something with $bufferRef, e.g. print the chunk:
   print $$bufferRef;
}
$member->endRead();

Modify to suit, e.g. by iterating over the file list from $zip->memberNames() and stopping once you have read the first few lines of each member.

Alnitak
`unzip -p filename.zip | head -1 >> headers.txt` works FLAWLESSLY thank you so much
iconridge