Let's say file.txt.gz is 2GB, and I want to see the last 100 lines or so. zcat <file.txt.gz | tail -n 100 would go through all of it.

I understand that compressed files cannot be randomly accessed, and if I cut let's say the last 5MB of it, then data just after the cut will be garbage - but can gzip resync and decode rest of the stream?

If I understand it correctly, a gzip stream is a straightforward stream of commands describing what to output - it should be possible to sync with that. Then there's the 32kB sliding window of the most recent uncompressed data - which starts as garbage of course if we start in the middle, but I'd guess it would normally get filled with real data quickly, and from that point decompression is trivial. (It's possible that something gets recopied over and over again from the start of the file to the end, so the sliding window never clears. It would surprise me if that were all that common, and if it happens we just process the whole file.)

I'm not terribly eager to do this kind of gzip hackery myself - hasn't anybody done it before, for dealing with corrupted files if nothing else?

Alternatively - if gzip really cannot do that, are there perhaps any other stream compression programs that work pretty much like it, except they allow resyncing mid-stream?

EDIT: I found a pure Ruby reimplementation of zlib and hacked it to print the ages of bytes within the sliding window. It turns out that things do get copied over and over again a lot, and even after 5MB+ the sliding window still contains stuff from the first 100 bytes, and from random places throughout the file.

We cannot even get around that by reading the first few blocks and the last few blocks, as those first bytes are not referenced directly, it's just a very long chain of copies, and the only way to find out what it's referring to is by processing it all.

Essentially, with default options what I wanted is probably impossible.

On the other hand, zlib has a Z_FULL_FLUSH option that clears the sliding window for the purpose of syncing. So the question still stands. Assuming that the compressor does a full flush every now and then, are there any tools for reading just the end of the file without processing it all?

+1  A: 

Z_FULL_FLUSH emits a known byte sequence (00 00 FF FF) that you can use to synchronize. This link may be useful.
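
To see the marker for yourself, here is a minimal sketch using Python's zlib module. The helper name compress_with_sync_points and the every parameter are made up for illustration; the only assumption about the writer is that it can call flush with Z_FULL_FLUSH at points of its choosing.

```python
import zlib

MARKER = b"\x00\x00\xff\xff"  # empty stored block written by a flush

def compress_with_sync_points(lines, every=1000):
    """Return a gzip stream with a Z_FULL_FLUSH after every `every` lines.

    Hypothetical helper for illustration, not part of zlib.
    """
    comp = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    out = bytearray()
    for i, line in enumerate(lines, 1):
        out += comp.compress(line)
        if i % every == 0:
            # Ends the current block, byte-aligns the output and resets the
            # sliding window, so later data never refers back past this point.
            out += comp.flush(zlib.Z_FULL_FLUSH)
    out += comp.flush()  # Z_FINISH: final block plus gzip trailer
    return bytes(out)

lines = [b"line %d\n" % i for i in range(5000)]
data = compress_with_sync_points(lines)

# At least one marker per flush; the same four bytes can also occur by
# chance inside compressed data, so the count is a lower bound.
print(data.count(MARKER))
```

The result is still an ordinary gzip file that any decompressor can read; the flushes only cost some compression ratio.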

brool
A: 

This is analogous to the difference between block and stream ciphers. gzip is compression rather than encryption, but like a stream cipher it carries state forward, so you might need the whole file up to a certain point to decode the bytes at that point.

As you mention, when the window is cleared, you're golden. But there's no guarantee that zlib actually does this often enough for you... I suggest you seek backwards from the end of the file and find the marker for a full flush.
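
A rough sketch of that backwards search in Python, assuming the file was written with occasional full flushes. The names tail_lines and make_sample are made up for illustration, and the whole stream is held in memory here; for a real 2GB file you would seek near the end and pass in only the last few megabytes.

```python
import zlib

MARKER = b"\x00\x00\xff\xff"  # empty stored block left behind by a flush

def tail_lines(chunk, n):
    """Return the last n lines decodable from `chunk`, or None.

    `chunk` is the trailing bytes of a gzip file (or the whole file).
    Hypothetical helper: it walks backwards over candidate markers until
    one of them yields more than n lines.
    """
    pos = chunk.rfind(MARKER)
    while pos != -1:
        inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate, no header
        try:
            out = inflater.decompress(chunk[pos + len(MARKER):])
        except zlib.error:
            # Either a chance occurrence of the bytes, or a sync flush that
            # did not reset the window; try an earlier candidate.
            out = b""
        # Require more than n newlines so the first (possibly partial)
        # line of the decoded segment is never returned.
        if out.count(b"\n") > n:
            return out.splitlines(keepends=True)[-n:]
        pos = chunk.rfind(MARKER, 0, pos)
    return None  # no usable sync point: fall back to a full decompress

def make_sample(lines, every):
    """Build a gzip stream with a full flush after every `every` lines."""
    comp = zlib.compressobj(wbits=31)
    out = bytearray()
    for i, line in enumerate(lines, 1):
        out += comp.compress(line)
        if i % every == 0:
            out += comp.flush(zlib.Z_FULL_FLUSH)
    return bytes(out + comp.flush())

lines = [b"line %d\n" % i for i in range(5500)]
sample = make_sample(lines, every=1000)
print(tail_lines(sample, 3))
```

A zlib.error catches most false matches, but a chance occurrence of the four bytes could in principle decode to garbage without raising, and the gzip CRC cannot be checked on a partial read, so treat the output as best effort.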

Borealid