views:

1769

answers:

3

Okay so I have some data streams compressed by python's (2.6) zlib.compress() function. When I try to decompress them, some of them won't decompress (zlib error -5, which seems to be a "buffer error", no idea what to make of that). At first, I thought I was done, but I realized that all the ones I couldn't decompress started with 0x78DA (the working ones were 0x789C), and I looked around and it seems to be a different kind of zlib compression -- the magic number changes depending on the compression used. What can I use to decompress the files? Am I hosed?

+6  A: 

According to RFC 1950 , the difference between the "OK" 0x789C and the "bad" 0x78DA is in the FLEVEL bit-field:

  FLEVEL (Compression level)
     These flags are available for use by specific compression
     methods.  The "deflate" method (CM = 8) sets these flags as
     follows:

        0 - compressor used fastest algorithm
        1 - compressor used fast algorithm
        2 - compressor used default algorithm
        3 - compressor used maximum compression, slowest algorithm

     The information in FLEVEL is not needed for decompression; it
     is there to indicate if recompression might be worthwhile.

"OK" uses 2, "bad" uses 3. So that difference in itself is not a problem.

To get any further, you might consider supplying the following information for each of compressing and (attempted) decompressing: what platform, what version of Python, what version of the zlib library, what was the actual code used to call the zlib module. Also supply the full traceback and error message from the failing decompression attempts. Have you tried to decompress the failing files with any other zlib-reading software? With what results? Please clarify what you have to work with: Does "Am I hosed?" mean that you don't have access to the original data? How did it get from a stream to a file? What guarantee do you have that the data was not mangled in transmission?

UPDATE Some observations based on partial clarifications published in your self-answer:

You are using Windows. Windows distinguishes between binary mode and text mode when reading and writing files. When reading in text mode, Python 2.x changes '\r\n' to '\n', and changes '\n' to '\r\n' when writing. This is not a good idea when dealing with non-text data. Worse, when reading in text mode, '\x1a' aka Ctrl-Z is treated as end-of-file.

To compress a file:

# imports and other superstructure left as a exercise
str_object1 = open('my_log_file', 'rb').read()
str_object2 = zlib.compress(str_object1, 9)
f = open('compressed_file', 'wb')
f.write(str_object2)
f.close()

To decompress a file:

str_object1 = open('compressed_file', 'rb').read()
str_object2 = zlib.decompress(str_object1)
f = open('my_recovered_log_file', 'wb')
f.write(str_object2)
f.close()

Aside: Better to use the gzip module which saves you having to think about nasssties like text mode, at the cost of a few bytes for the extra header info.

If you have been using 'rb' and 'wb' in your compression code but not in your decompression code [unlikely?], you are not hosed, you just need to flesh out the above decompression code and go for it.

Note carefully the use of "may", "should", etc in the following untested ideas.

If you have not been using 'rb' and 'wb' in your compression code, the probability that you have hosed yourself is rather high.

If there were any instances of '\x1a' in your original file, any data after the first such is lost -- but in that case it shouldn't fail on decompression (IOW this scenario doesn't match your symptoms).

If a Ctrl-Z was generated by zlib itself, this should cause an early EOF upon attempted decompression, which should of course cause an exception. In this case you may be able to gingerly reverse the process by reading the compressed file in binary mode and then substitute '\r\n' with '\n' [i.e. simulate text mode without the Ctrl-Z -> EOF gimmick]. Decompress the result. Edit Write the result out in TEXT mode. End edit

UPDATE 2 I can reproduce your symptoms -- with ANY level 1 to 9 -- with the following script:

import zlib, sys
fn = sys.argv[1]
level = int(sys.argv[2])
s1 = open(fn).read() # TEXT mode
s2 = zlib.compress(s1, level)
f = open(fn + '-ct', 'w') # TEXT mode
f.write(s2)
f.close()
# try to decompress in text mode
s1 = open(fn + '-ct').read() # TEXT mode
s2 = zlib.decompress(s1) # error -5
f = open(fn + '-dtt', 'w')
f.write(s2)
f.close()

Note: you will need a use a reasonably large text file (I used an 80kb source file) to ensure that the decompression result will contain a '\x1a'.

I can recover with this script:

import zlib, sys
fn = sys.argv[1]
# (1) reverse the text-mode write
# can't use text-mode read as it will stop at Ctrl-Z
s1 = open(fn, 'rb').read() # BINARY mode
s1 = s1.replace('\r\n', '\n')
# (2) reverse the compression
s2 = zlib.decompress(s1)
# (3) reverse the text mode read
f = open(fn + '-fixed', 'w') # TEXT mode
f.write(s2)
f.close()

NOTE: If there is a '\x1a' aka Ctrl-Z byte in the original file, and the file is read in text mode, that byte and all following bytes will NOT be included in the compressed file, and thus can NOT be recovered. For a text file (e.g. source code), this is no loss at all. For a binary file, you are most likely hosed.

Update 3 [following late revelation that there's an encryption/decryption layer involved in the problem]:

The "Error -5" message indicates that the data that you are trying to decompress has been mangled since it was compressed. If it's not caused by using text mode on the files, suspicion obviously(?) falls on your decryption and encryption wrappers. If you want help, you need to divulge the source of those wrappers. In fact what you should try to do is (like I did) put together a small script that reproduces the problem on more than one input file. Secondly (like I did) see whether you can reverse the process under what conditions. If you want help with the second stage, you need to divulge the problem-reproduction script.

John Machin
+1 Very thorough answer. I'm having the same problem, though the buffer is not being read from a file, so binary mode is not an issue. Even so, this answer was a great help in understanding the meaning of the error.
DNS
A: 

Okay sorry I wasn't clear enough. This is win32, python 2.6.2. I'm afraid I can't find the zlib file, but its whatever is included in the win32 binary release. And I don't have access to the original data -- I've been compressing my log files, and I'd like to get them back. As far as other software, I've naievely tried 7zip, but of course it failed, because it's zlib, not gzip (I couldn't any software to decompress zlib streams directly). I can't give a carbon copy of the traceback now, but it was (traced back to zlib.decompress(data)) zlib.error: Error: -3. Also, to be clear, these are static files, not streams as I made it sound earlier (so no transmission errors). And I'm afraid again I don't have the code, but I know I used zlib.compress(data, 9) (i.e. at the highest compression level -- although, interestingly it seems that not all the zlib output is 78da as you might expect since I put it on the highest level) and just zlib.decompress().

PLEASE edit your question to incorporate these clarifications. Some points that still need clarification: (1) question says error -FIVE, pseudo-answer says -THREE (2) Why can't you give a "carbon copy" of the of the traceback now? Have you deleted the compressed files also?? (3) try to recall what your vanished compression code did with the str object returned by zlib.compress() (4) try to recall how your vanished decompression code obtained a str object to feed to zlib.decompress()
John Machin
A: 

Ok sorry about my last post, I didn't have everything. And I can't edit my post because I didn't use OpenID. Anyways, here's some data:

1) Decompression traceback:

Traceback (most recent call last):
  File "<my file>", line 5, in <module>
    zlib.decompress(data)
zlib.error: Error -5 while decompressing data

2) Compression code:

#here you can assume the data is the data to be compressed/stored
data = encrypt(zlib.compress(data,9)) #a short wrapper around PyCrypto AES encryption
f = open("somefile", 'wb')
f.write(data)
f.close()

3) Decompression code:

f = open("somefile", 'rb')
data = f.read()
f.close()

zlib.decompress(decrypt(data)) #this yeilds the error in (1)
See my update of my answer.
John Machin