Okay, this is really strange. I have a script that downloads a bunch of archive files and extracts them. Usually those files are .zip files. Today I sat down to make it work with .rar files and got stuck. At first I thought the problem was in my unrar code, but it wasn't. So I did:

import urllib2
from StringIO import StringIO
f = urllib2.urlopen(file_location)
data = StringIO(f.read())
print data.getvalue()

Heck, I even did:

f = urllib2.urlopen(file_location)
print f.read()

because I just wanted to see the first chunk, and the result is the same: the first line of the .rar file is missing from the output.

If I download the very same file with a web browser, everything is fine and the file is not corrupt.

Can anyone please explain to me what the hell is going on here? And what does it have to do with the file type?

+2  A: 

Does the data perhaps contain a carriage return character ("\r"), so that the first chunk is overwritten by subsequent data when you display it? That would explain why you don't see the first chunk in your output, but not why you can't decode it later on.
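To illustrate the overwrite effect, here is a rough sketch that mimics how a terminal handles "\r" on a single line. The function name simulate_terminal and the sample strings are made up for this example, and real terminals handle many more control characters than this does.

```python
def simulate_terminal(text):
    """Roughly mimic how a terminal displays one line containing '\\r'.

    A carriage return moves the cursor back to column 0, so later
    characters overwrite whatever was already shown on that line.
    """
    shown = []
    col = 0
    for ch in text:
        if ch == "\r":
            col = 0  # cursor jumps back to the start of the line
        elif col < len(shown):
            shown[col] = ch  # overwrite an already displayed character
            col += 1
        else:
            shown.append(ch)  # extend the line
            col += 1
    return "".join(shown)


# Hypothetical data: the text before "\r" is still in the string,
# it just gets covered up when displayed.
raw = "first chunk\rsecond part"
print(simulate_terminal(raw))
print(repr(raw))
```

The data itself is untouched. Only the display hides the first chunk, which is why printing the repr() of the string still shows everything.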

sth
Thanks, I didn't even think about that. I dumped the output to a text file and yep, there was a CR. At least I'm not going in the wrong direction any more.
Maiku Mori
+3  A: 

When trying to inspect the contents of a binary data string, use repr() or hex() so that control characters are shown as escapes instead of being interpreted by the terminal. For example,

>>> print repr(data)
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t'
>>> print [hex(ord(c)) for c in data]
['0x0', '0x1', '0x2', '0x3', '0x4', '0x5', '0x6', '0x7', '0x8', '0x9']
>>>
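The same idea can be applied to the downloaded archive: look at the first few bytes in escaped or hex form and compare them with the RAR signature. This is only a sketch. The helper names looks_like_rar and hex_preview are invented here, and the sample bytes are a hypothetical stand-in for what f.read() would return.

```python
import binascii

# RAR archives (both 4.x and 5.x) begin with "Rar!" followed by 0x1A 0x07.
RAR_MAGIC = b"Rar!\x1a\x07"


def looks_like_rar(data):
    """Return True if the byte string starts with the RAR signature."""
    return data.startswith(RAR_MAGIC)


def hex_preview(data, count=8):
    """Return the first `count` bytes as a hex string."""
    return binascii.hexlify(data[:count]).decode("ascii")


# Hypothetical sample standing in for the downloaded data; note the "\r"
# further in, which would confuse a plain print on a terminal.
sample = b"Rar!\x1a\x07\x00\xcf\x90s\x00\x00\r\x00"

preview = hex_preview(sample)
print(repr(sample[:8]))
print(preview)
print(looks_like_rar(sample))
```

If the signature check passes on the downloaded bytes, the download is probably fine and only the printed view was misleading.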
gimel