ansaurus

Question

Python line file iteration and strange characters

Answer 1

+1 A:

I asked (in a comment) """Show us the output from print repr(weird_special_characters). When you open the file in vim, WHAT is correct? Please be more precise than "formatted weirdly".""" But nothing :-(

What file are you looking at with od? file.gz?? If you can see anything recognisable in there, it's not a gzip file! You're not seeing newlines, you're seeing binary bytes that contain 0x0A.

If the original file was utf-8 encoded, what was the point of trying it with other codecs?

Does "works OK with zcat" mean that you got recognisable data without a utf8 decode step??

I suggest that you simplify your code, and do it a step at a time ... see for example the accepted answer to this question. Try it again and please show the exact code that you ran, and use repr() when describing the results.

Update It looks like DS has guessed what you were trying to explain about the \x1c and \x1d.

Here are some notes on WHY it happens like that:

For byte strings (Python 2 str), only \r and \n are considered when line-breaking:

>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 'A\x0bA\x0cA\r', # line break
 'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>

However, in Unicode the characters \x1C (FILE SEPARATOR), \x1D (GROUP SEPARATOR), and \x1E (RECORD SEPARATOR) also qualify as line endings:

>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 u'A\x0bA\x0cA\r', # line break
 u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
 u'A\x1d', # line break
 u'A\x1e', # line break
 u'A\x1fBBB']
>>>

This will happen whatever codec you use. You still need to work out what (if any) codec you need to use. You also need to work out whether the original file was really a text file and not a binary file. If it's a text file, you need to consider the meaning of the \x1c and \x1d in the file.
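As a rough illustration of the difference described above, here is a small sketch written for Python 3 (the sessions above are Python 2). The sample string and the helper name split_on_newline are made up for this example. It shows one possible way to break decoded text on \n alone, which may or may not be what the data in question needs.

```python
# Python 3 sketch; the sample text is hypothetical.
text = 'foo\x1d\x1cbar\nbaz'

# str.splitlines() treats \x1c, \x1d and \x1e as line boundaries.
unicode_lines = text.splitlines(True)

# bytes.splitlines() only recognises \n, \r and \r\n.
byte_lines = text.encode('utf-8').splitlines(True)


def split_on_newline(s):
    """Split s on the newline character only, keeping the terminator."""
    parts = s.split('\n')
    lines = [part + '\n' for part in parts[:-1]]
    if parts[-1]:
        # The last piece had no trailing newline.
        lines.append(parts[-1])
    return lines


print(unicode_lines)
print(byte_lines)
print(split_on_newline(text))
```

If the separators are meaningful field or record delimiters, splitting on \n yourself keeps them inside each line so they can be handled in a later step.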

John Machin 2010-04-30 02:28:17

I'm looking at the uncompressed file (specifically the offending portion) with od. That's where the '\n' appears. I believe it is actually a 2-byte character which, when represented as 2 individual bytes, has the '\n'. Unfortunately I don't know what the original encoding is, but for some reason vim and zcat represent it correctly, meaning they are detecting it.

muckabout 2010-05-01 09:33:03

"The uncompressed file"??? Your code doesn't show you creating a file. Please show the code that you actually ran that created the file that you are examining with `od`. Please show a hex dump from od of, say, the first 100 bytes of the file. Please show a hex dump from od of, say, 100 bytes centred on the offending `\n`. Please show all of that by editing your question.

John Machin 2010-05-01 09:59:16

The compressed file is 10GB. The problem isn't in the first 100 lines, but at close to the 1 millionth line. I printed the od portion, which was apparently enough to figure out the problem. My code doesn't uncompress the file; I did that by hand diagnostically. My code doesn't create the file, so that's not relevant either. Thanks for offering to help, but you could stand to be less combative.

muckabout 2010-05-03 08:17:59

Answer 2

+2 A:

Try again with no codec. The following reproduces your problem when using a codec, and shows that it goes away without one:

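import gzip
import os
import codecs

# Write a small gzip file containing the two separator characters.
data = gzip.open("file.gz", "wb")
data.write('foo\x1d\x1cbar\nbaz')
data.close()

# Decoding reader: breaks lines at \x1d and \x1c as well as \n.
print list(codecs.getreader('utf-8')(gzip.open('file.gz')))
# zcat and plain gzip iteration break lines at \n only.
print list(os.popen('zcat file.gz'))
print list(gzip.open('file.gz'))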

Outputs:

[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
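If decoded text is still wanted, one option is to iterate over the gzip file in binary mode and decode each line afterwards. The following is a minimal Python 3 sketch of that idea (the code above is Python 2). The helper name iter_decoded_lines, the temporary sample file, and the choice of utf-8 are all assumptions made for illustration, since the real encoding of the file is not known.

```python
import gzip
import os
import tempfile


def iter_decoded_lines(path, encoding='utf-8'):
    """Yield text lines from a gzip file, breaking on newline bytes only."""
    with gzip.open(path, 'rb') as raw:
        for raw_line in raw:
            # Binary iteration splits on b'\n', so \x1c and \x1d stay
            # inside the line; decoding happens after the split.
            yield raw_line.decode(encoding)


# Build a throwaway sample file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, 'sample.gz')
with gzip.open(sample_path, 'wb') as out:
    out.write(b'foo\x1d\x1cbar\nbaz')

lines = list(iter_decoded_lines(sample_path))
print(lines)
```

Decoding line by line is safe for utf-8 because the byte 0x0A never occurs inside a multi-byte utf-8 sequence. That does not hold for encodings such as utf-16, so this approach would need rethinking there.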

DS 2010-05-02 04:20:28
