ansaurus

Question

How to concat two or more gzip files/streams

Answer 1

+2 A:

If tarring them is not out of the question (since the linked cat solution isn't viable for you):

tar cf A_B.gz.tar A.gz B.gz

Then, to get them back:

tar xf A_B.gz.tar

Mark Jones 2009-07-17 13:41:06

No, I'm not talking about tar at all

Artyom 2009-07-17 14:26:19

Mark Jones 2009-07-17 15:44:26

What's wrong with tarring them? It achieves everything you want to do.

Martin York 2009-07-17 16:35:26

Apparently his code can't handle two gzip'ed files. He wants to put the two files together and have one gzip'ed file, without decompressing the two original files.

Mark Jones 2009-07-17 16:47:33

Answer 2

+6 A:

Look at RFC 1951 and RFC 1952.

The format is simply a sequence of members, each composed of three parts: a header, data, and a trailer. The data part is itself a set of chunks, each chunk having a header and a data part.
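As a rough illustration of that layout, the following Python sketch splits a single gzip member produced by `gzip.compress` into its fixed 10-byte header, its deflate data, and its 8-byte trailer. It assumes the header carries no optional fields (FLG is 0), which holds for `gzip.compress` output but not for every gzip file. The name `split_member` is made up for this example.

```python
import gzip
import struct
import zlib

def split_member(blob):
    """Split one gzip member that has no optional header fields (FLG == 0)."""
    magic, method, flags = blob[:2], blob[2], blob[3]
    if magic != b"\x1f\x8b" or method != 8:
        raise ValueError("not a gzip/deflate member")
    if flags != 0:
        raise ValueError("optional header fields are not handled in this sketch")
    header = blob[:10]            # ID1 ID2 CM FLG MTIME(4) XFL OS
    deflate_data = blob[10:-8]    # raw deflate blocks (RFC 1951)
    # Trailer: CRC32 of the uncompressed data, then its length modulo 2**32.
    crc, isize = struct.unpack("<II", blob[-8:])
    return header, deflate_data, crc, isize

payload = b"hello gzip " * 100
header, body, crc, isize = split_member(gzip.compress(payload))
# The deflate data inflates on its own when wbits is negative (no wrapper).
restored = zlib.decompress(body, -15)
```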

To simulate the effect of gzipping the concatenation of two (or more) files, you simply have to adjust the headers (there is a last-chunk flag, for instance) and the trailer correctly, and copy the data parts.

There is one problem: the trailer holds a CRC32 of the uncompressed data, and I'm not sure whether it is easy to compute when you only know the CRCs of the parts.

Edit: the comments in the gzjoin.c file you found imply that, while it is possible to compute the CRC32 without decompressing the data, there are other things which need the decompression.

AProgrammer 2009-07-17 13:57:57

If you have CRCs for the pieces, you could use them to compute the final CRC. If I am not mistaken, if you have Msg1 with Crc1 and Msg2 with Crc2, then to compute the CRC of [Msg1, Msg2] you may instead compute the CRC of [Crc1, 0, 0, 0, 0, ... (as many zeroes as Msg2 is long)] and XOR it with Crc2. A one's complement may be required somewhere, but that is the idea.
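The idea can be checked with a small Python sketch. It relies on `zlib.crc32` accepting a starting value, and it feeds as many zero bytes as Msg2 is long through the CRC, so its cost grows with the length of the second message; zlib's C function `crc32_combine` gets the same result much faster. The extra XOR with the CRC of the zeroes is where the one's-complement steps end up. The helper name `crc32_combine_slow` is made up here.

```python
import zlib

def crc32_combine_slow(crc1, crc2, len2):
    """CRC32 of Msg1 + Msg2, given only crc(Msg1), crc(Msg2) and len(Msg2)."""
    zeros = bytes(len2)
    # Push crc1 through len2 zero bytes, then cancel the pre/post inversion
    # that zlib.crc32 applies by XORing with the CRC of the zeros alone.
    return zlib.crc32(zeros, crc1) ^ crc2 ^ zlib.crc32(zeros)

msg1 = b"first message, "
msg2 = b"second message"
combined = crc32_combine_slow(zlib.crc32(msg1), zlib.crc32(msg2), len(msg2))
```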

eugensk00 2009-07-29 12:50:05

Answer 3

+2 A:

It seems that the original compression of the individual files is done by you. It also seems that the desired result (concatenation of several pieces) is small enough to be sent to a web browser in one page. In that case your efficiency concerns seem to be unwarranted.

Please note that (1) the gzjoin.c approach is highly likely to be the best answer that you could get to your question as stated, and (2) it is complicated microsurgery performed by one of the gzip originators and may not have been subject to extensive stress testing.

Please consider a boring, understandable, reliable approach: store the original pieces UNcompressed, then select the required pieces, concatenate them, and compress the result. Note that the compression ratio may be better than that obtained by gluing together small compressed pieces.
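A minimal Python sketch of the comparison, on artificial input chosen only for illustration: the pieces are kept uncompressed, then compressed once after concatenation, and the result is compared with the pieces compressed one by one and glued together. Real ratios depend entirely on the data.

```python
import gzip

# Hypothetical page fragments; real pieces would come from the cache.
pieces = [(b"<li>item %d: some repeated markup</li>\n" % i) * 20 for i in range(8)]

# Approach A: compress every piece on its own, then glue the members together.
separately = b"".join(gzip.compress(p) for p in pieces)

# Approach B: concatenate the uncompressed pieces, then compress once.
together = gzip.compress(b"".join(pieces))

sizes = (len(separately), len(together))
```

Both byte strings decompress to the same data; the second one avoids a header and trailer per piece and lets later pieces reuse matches from earlier ones.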

John Machin 2009-07-26 04:35:18

Yes, I'm the originator of the two chunks, so I can even save some metadata with them or make some assumptions. I understand that gzjoin is the simplest and least error-prone option, but it is still only 4 times faster than a simple "gzip -1". I need a speedup close to that of memcpy. The idea: I cache some ready chunks and combine them per user request.
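When the pieces are produced by the same party that later joins them, one possible approach (a different technique from gzjoin, and only a sketch) is to compress each piece as a raw deflate stream ended with `Z_FULL_FLUSH`, so that it finishes on a byte boundary and never sets the last-block flag. Joining is then mostly byte copying: one gzip header, the cached pieces, an empty final block, and a trailer built from the stored CRCs and lengths. The names `make_piece`, `join_pieces` and `crc32_combine_slow` are hypothetical, and the CRC combination used here is linear in the piece length; zlib's C function `crc32_combine` would avoid that cost.

```python
import gzip
import struct
import zlib

def make_piece(data):
    """Compress once, ahead of time; keep CRC and length as metadata."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # -15: raw deflate, no wrapper
    # Z_FULL_FLUSH ends on a byte boundary and leaves the stream unfinished.
    body = comp.compress(data) + comp.flush(zlib.Z_FULL_FLUSH)
    return body, zlib.crc32(data), len(data)

def crc32_combine_slow(crc1, crc2, len2):
    # Simple stand-in for zlib's C crc32_combine; cost is linear in len2.
    zeros = bytes(len2)
    return zlib.crc32(zeros, crc1) ^ crc2 ^ zlib.crc32(zeros)

def join_pieces(pieces):
    """Build a single gzip member from cached pieces without recompressing."""
    # Magic, CM=8 (deflate), FLG=0, MTIME=0, XFL=0, OS=255 (unknown).
    header = b"\x1f\x8b\x08\x00" + b"\x00" * 4 + b"\x00\xff"
    crc, total = 0, 0
    for _, piece_crc, piece_len in pieces:
        crc = crc32_combine_slow(crc, piece_crc, piece_len)
        total += piece_len
    bodies = b"".join(body for body, _, _ in pieces)
    final_block = b"\x03\x00"  # empty fixed-Huffman block with the last-block bit set
    trailer = struct.pack("<II", crc, total & 0xFFFFFFFF)
    return header + bodies + final_block + trailer

cached = [make_piece(b"header chunk\n"), make_piece(b"body chunk\n" * 50)]
page = join_pieces(cached)
```

The cost of this design is a few extra bytes per piece for the flush marker, and no matches across piece boundaries.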

Artyom 2009-07-26 07:06:40

You haven't explained why you need "memcpy near speedup" on what seems to be a smallish amount of data. Perhaps you could tell us how many of these pages you need to serve per second and how big they are.

John Machin 2009-07-26 14:30:55

Let's assume the pages and chunks are big and the load is extremely high.

Artyom 2009-07-28 07:03:30

Answer 4

+2 A:

The gzip manual says that two gzip files can be concatenated as you attempted.

http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage
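A quick way to see the behavior the manual describes, sketched in Python: the standard `gzip` module reads every member of a concatenated stream, as the gzip tool does. Whether a particular browser or library does the same is a separate matter.

```python
import gzip

a = gzip.compress(b"first file\n")
b = gzip.compress(b"second file\n")

# Plain byte concatenation, the same effect as: cat A.gz B.gz > AB.gz
joined = a + b

# A reader that follows the format decodes both members in order.
text = gzip.decompress(joined)
```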

So it appears that the other tools may be broken, as seen in this bug report: http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=97263

Apart from filing a bug report with each one of the browser makers, and hoping they comply, perhaps your program can cache the most common concatenations of the required data.

As others have mentioned, you may be able to perform surgery: http://www.gzip.org/zlib/rfc-gzip.html

This requires a CRC-32 of the final uncompressed file. The required size of the uncompressed file can easily be calculated by adding the lengths of the individual sub-files.

At the bottom of the last link, there is code for calculating a running CRC-32, named update_crc.

Calculating the CRC on the uncompressed files each time your process runs is probably cheaper than the gzip algorithm itself.
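A minimal Python sketch of that trailer calculation, using `zlib.crc32` with a running value in place of the RFC's update_crc and assuming the uncompressed sub-files are at hand. The helper name `gzip_trailer` is made up for this example.

```python
import struct
import zlib

def gzip_trailer(pieces):
    """Return the 8-byte gzip trailer for the concatenation of the pieces."""
    crc, size = 0, 0
    for piece in pieces:
        crc = zlib.crc32(piece, crc)   # running CRC over the uncompressed data
        size += len(piece)             # total length is the sum of the parts
    # Both fields are little-endian 32-bit; the size is stored modulo 2**32.
    return struct.pack("<II", crc, size & 0xFFFFFFFF)

parts = [b"alpha ", b"beta ", b"gamma"]
trailer = gzip_trailer(parts)
```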

Juan 2009-07-28 18:01:08
