So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.

I'm quite a Python newbie, so the answer is probably quite obvious, please help me.

In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress that? Is that right? I've tried all sorts of ways to do this, and I usually get a "ValueError: couldn't find end of stream" error on the last line in this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.

        openZip = open(zipFile, "r")

        s = ''

        while True:
            newLine = openZip.readline()
            if(len(newLine)==0):
                break
            s+=newLine

        print s

        uncompressedData = bz2.decompress(s)

Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.

METHOD A:

              print 'decompressing ' + filename

              fileHandle = open(zipFile)
              uncompressedData = ''

              while True:            
                    s = fileHandle.read(1024)
                    if not s:
                      break
                    print('RAW "%s"', s)
                    uncompressedData += bz2.decompress(s)

              uncompressedData += bz2.flush()

              newFile = open(steamTF2mapdir + filename.split(".bz2")[0],"w")
              newFile.write(uncompressedData)
              newFile.close()

I get the error :
uncompressedData += bz2.decompress(s) ValueError: couldn't find end of stream

METHOD B:

        zipFile = steamTF2mapdir + filename
        print 'decompressing ' + filename
        fileHandle = open(zipFile)

        s = fileHandle.read()

        uncompressedData = bz2.decompress(s)

Same error : uncompressedData = bz2.decompress(s) ValueError: couldn't find end of stream

Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.

By the by, used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.

+6  A: 

You're opening and reading the compressed file as if it was a textfile made up of lines. DON'T! It's NOT.

uncompressedData = bz2.BZ2File(zipFile).read()

seems to be closer to what you're angling for.

Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:

opening ... the compressed file as if it was a textfile ... It's NOT.

open(filename) and even the more explicit open(filename, 'r') open, for reading, a text file -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). (My recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more.)

In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).

In Windows (and before then in DOS) it's always been indispensable to distinguish, as Windows' text files, for historical reasons, are peculiar (they use two bytes rather than one to end lines, and, at least in some cases, take a byte worth '\x1A' as meaning a logical end of file) and so the low-level reading and writing code must compensate.

So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (though bz2.BZ2File is still simpler, whatever platform you're using!-).
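To make the two correct ways of reading concrete, here is a minimal sketch. It is written for Python 3 (the snippets in the question are Python 2), and the payload, the temporary directory and the file name example.bsp.bz2 are made up for illustration. It builds a small .bz2 file whose contents include CR/LF pairs and a 0x1A byte, then reads it back once through bz2.BZ2File and once through open(..., 'rb') plus bz2.decompress.

```python
import bz2
import os
import tempfile

# Hypothetical binary payload: it contains a CR/LF pair and a 0x1A byte,
# the values that a text-mode read on Windows may translate or treat as EOF.
payload = b"BSP\x00\r\n\x1a" + bytes(range(256)) * 16

# Write a small .bz2 file into a temporary directory for the demonstration.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "example.bsp.bz2")
with open(zip_path, "wb") as f:
    f.write(bz2.compress(payload))

# Option 1: let BZ2File handle the compressed file directly.
with bz2.BZ2File(zip_path) as f:
    via_bz2file = f.read()

# Option 2: read the raw bytes in binary mode, then decompress in one call.
with open(zip_path, "rb") as f:
    via_decompress = bz2.decompress(f.read())

print(via_bz2file == payload, via_decompress == payload)
```

Both options should hand back the original bytes unchanged; the difference is only in who opens the file.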

Alex Martelli
Hi Alex, thanks for your quick reply, please see the edits in my question that address your comment.
So the problem with this method is that, although it seems to decompress fine, the resulting file is corrupted somehow when I try to run it. When I open the .bz2 file with, say, 7zip, it runs fine. Moreover, the size of a properly extracted file (with 7zip) is 948 KB, whereas the file extracted by my script is 952 KB. I'm completely baffled.
7zip accepts many other formats besides bz2, maybe your file is in one of the other formats. If law and privacy don't impede that, put it up on a public URL, give me the URL, and I'll let you know what format it's in and how to decode it -- without having the file at hand, that's basically impossible for me to do.
Alex Martelli
Thanks for all your help, Alex, and sorry for being a bit thick here. That makes sense, but then why would the Python bz2 library uncompress the file? Wouldn't it throw some sort of exception? I'll see if I can find a place to put the file (it's nothing illegal, just a .bz2 file of a Team Fortress 2 map file, a .bsp file), but I don't want to put the URL up publicly, as it's my friend's server that's hosting it. How can I check whether the .bz2 file is actually in bz2 format?
But you said it DID raise an exception -- a ValueError! I believe you can download a bzip2-only executable at http://www.bzip.org/downloads.html -- if that exe decodes the file correctly, this should, I believe, prove it's a .bz2, and vice versa.
Alex Martelli
Sorry for my lack of clarity. uncompressedData = bz2.BZ2File(zipFile).read() does unzip everything and doesn't raise an exception. However, the .bsp file is still corrupted somehow, as specified. AH! I'll check out that exe and see if the file is indeed a bz2 file. Thanks again.
eurasian, make sure you are saving the file with the mode set to 'wb'. I played around and when I didn't write the uncompressed data using binary mode, my file had extra characters from the newlines.
Philip T.
Thanks Philip, that was exactly it, thanks for the extra bit of info.
Aha, so it's AGAIN the same issue my answer already mentions for the reading part, applying to the writing part just as much -- I had no idea that the bsp file is a binary one, too, but you need the b in the options on every binary file (and on NO text file).
Alex Martelli
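For anyone landing here later, the fix from these comments can be sketched roughly as below, in Python 3. The helper name decompress_to_file, the payload and the file names are hypothetical, not from the original script. The last lines only simulate what newline translation does to the data (each "\n" byte written out as "\r\n"), which is one plausible explanation for an output a few KB larger than the 7zip result.

```python
import bz2
import os
import tempfile


def decompress_to_file(src_path, dst_path):
    """Decompress the bzip2 file at src_path into dst_path, binary mode throughout."""
    with bz2.BZ2File(src_path) as src, open(dst_path, "wb") as dst:
        dst.write(src.read())


# Hypothetical payload with newline bytes embedded in binary data.
payload = b"\x00\n\x01\n\x02\r\n" * 1000

tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "map.bsp.bz2")
dst_path = os.path.join(tmp_dir, "map.bsp")
with open(src_path, "wb") as f:
    f.write(bz2.compress(payload))

decompress_to_file(src_path, dst_path)

# With "wb", the size on disk matches the decompressed data exactly.
size_on_disk = os.path.getsize(dst_path)

# Simulation only: text-mode newline translation on Windows writes every
# "\n" byte as "\r\n", so the output grows by one byte per "\n".
translated_size = len(payload.replace(b"\n", b"\r\n"))

print(len(payload), size_on_disk, translated_size)
```

If the sizes in the first two columns agree and the third is larger, the extra bytes in a text-mode write are accounted for by the newline count.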
+3  A: 

openZip = open(zipFile, "r")

If you're running on Windows, you may want to use openZip = open(zipFile, "rb") here, since the file is likely to contain CR/LF combinations, and you don't want them to be translated.

newLine = openZip.readline()

As Alex pointed out, this is very wrong, as the concept of "lines" is foreign to a compressed stream.

s = fileHandle.read(1024) [...] uncompressedData += bz2.decompress(s)

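This is wrong too. bz2.decompress() expects a complete compressed stream in a single call, so an arbitrary 1024-byte chunk means nothing to it on its own; the decompressor works with its own block size. Reading in pieces needs an incremental decompressor object such as bz2.BZ2Decompressor.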
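If you do want to read in chunks, a rough Python 3 sketch with bz2.BZ2Decompressor looks like this. The function name decompress_in_chunks and the test payload are made up for illustration, and the sketch assumes the input holds a single bzip2 stream with no trailing data.

```python
import bz2
import io


def decompress_in_chunks(fileobj, chunk_size=1024):
    """Feed a binary file object to one BZ2Decompressor, chunk by chunk."""
    decompressor = bz2.BZ2Decompressor()  # keeps state between calls
    parts = []
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        # May return b"" until the decompressor has seen enough input.
        parts.append(decompressor.decompress(chunk))
    return b"".join(parts)


# Hypothetical payload, compressed in memory so the sketch is self-contained.
payload = b"some binary data \x00\x01\x02" * 5000
compressed = io.BytesIO(bz2.compress(payload))

# A tiny chunk size forces many partial reads of the compressed stream.
result = decompress_in_chunks(compressed, chunk_size=16)
print(result == payload)
```

The point is that the state lives in the decompressor object, so the chunk size can be anything; with a real file, the file object would come from open(zipFile, "rb").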

s = fileHandle.read()
uncompressedData = bz2.decompress(s)

If that doesn't work, I'd say it's the new-line translation problem I mentioned above.

Martin
Thanks, this was very helpful.
Alex Reynolds
A: 

This was very helpful. 44 of 2300 files gave an end of file missing error, on Windows open. Adding the b(inary) flag to open fixed the problem.

for line in bz2.BZ2File(filename, 'rb', 10000000):

works well. (the 10M is the buffering size that works well with the large files involved)

Thanks!

Jon L ehto