I have a huge gzipped text file which I need to read, line by line. I go with the following:

import codecs
import gzip
for i, line in enumerate(codecs.getreader('utf-8')(gzip.open('file.gz'))):
  print i, line

Late in the file, the Python output diverges from the file's contents: lines get broken at unusual special characters that Python treats as newlines. When I open the file in 'vim', the lines are correct, but the suspect characters are displayed oddly. Is there something I can do to fix this?

I've tried other codecs including utf-16, latin-1. I've also tried with no codec.

I looked at the file using 'od'. Sure enough, there are \n characters where they shouldn't be. But, the "wrong" ones are prepended by a weird character. I think there's some encoding here with some characters being 2-bytes, but the trailing byte being a \n if not viewed properly.

According to 'od -h file' the offending character is '1d1c'.

If I replace:

gzip.open('file.gz')

With:

os.popen('zcat file.gz')

It works fine (and is actually quite a bit faster). But I'd like to know where I'm going wrong.

+1  A: 

I asked (in a comment) """Show us the output from print repr(weird_special_characters). When you open the file in vim, WHAT are correct? Please be more precise than "formatted weirdly".""" But nothing :-(

What file are you looking at with od? file.gz?? If you can see anything recognisable in there, it's not a gzip file! You're not seeing newlines, you're seeing binary bytes that contain 0x0A.

If the original file was utf-8 encoded, what was the point of trying it with other codecs?

Does "works OK with zcat" mean that you got recognisable data without a utf8 decode step??

I suggest that you simplify your code, and do it a step at a time ... see for example the accepted answer to this question. Try it again and please show the exact code that you ran, and use repr() when describing the results.
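As a rough illustration of "a step at a time", here is a minimal sketch that only decompresses (no decoding) and reports the raw bytes of lines containing the control bytes \x1c, \x1d or \x1e. It assumes Python 3; the helper name find_suspect_lines and its limit parameter are made up for this example.

```python
import gzip

# FS, GS and RS control bytes (0x1c, 0x1d, 0x1e)
SUSPECTS = (b"\x1c", b"\x1d", b"\x1e")

def find_suspect_lines(path, limit=10):
    """Return (line_number, raw_bytes) for the first `limit` lines
    that contain one of the SUSPECTS bytes."""
    hits = []
    # Step 1: decompress only. Binary mode means no codec is involved.
    with gzip.open(path, "rb") as fh:
        # Step 2: iterate lines. In binary mode only b"\n" ends a line.
        for number, raw in enumerate(fh, 1):
            if any(byte in raw for byte in SUSPECTS):
                hits.append((number, raw))
                if len(hits) >= limit:
                    break
    return hits

# Usage (hypothetical file name):
#   for number, raw in find_suspect_lines("file.gz"):
#       print(number, repr(raw))
```

Printing repr(raw) shows the exact bytes around the problem, which is usually enough to decide whether a decode step is needed at all.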

Update: It looks like DS has guessed what you were trying to explain about the \x1c and \x1d.

Here are some notes on WHY it happens like that:

For byte strings (Python 2 `str`), only \r and \n are considered when line-breaking:

>>> import pprint
>>> text = ''.join('A' + chr(i) for i in range(32)) + 'BBB'
>>> print repr(text)
'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
['A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 'A\x0bA\x0cA\r', # line break
 'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB']
>>>

However, in Unicode, the characters \x1C (FILE SEPARATOR), \x1D (GROUP SEPARATOR), and \x1E (RECORD SEPARATOR) also qualify as line endings:

>>> text = u''.join('A' + unichr(i) for i in range(32)) + u'BBB'
>>> print repr(text)
u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\nA\x0bA\x0cA\rA\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1cA\x1dA\x1eA\x1fBBB'
>>> pprint.pprint(text.splitlines(True))
[u'A\x00A\x01A\x02A\x03A\x04A\x05A\x06A\x07A\x08A\tA\n', # line break
 u'A\x0bA\x0cA\r', # line break
 u'A\x0eA\x0fA\x10A\x11A\x12A\x13A\x14A\x15A\x16A\x17A\x18A\x19A\x1aA\x1bA\x1c', # line break
 u'A\x1d', # line break
 u'A\x1e', # line break
 u'A\x1fBBB']
>>>

This will happen whatever codec you use. You still need to work out what (if any) codec you need to use. You also need to work out whether the original file was really a text file and not a binary file. If it's a text file, you need to consider the meaning of the \x1c and \x1d in the file.
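For anyone trying this on Python 3 (an assumption; the sessions above are Python 2), the same distinction can be checked in a few lines: splitlines() on decoded text breaks at \x1c, \x1d and \x1e, while splitlines() on bytes and an explicit split on "\n" do not. The sample string is only an illustration.

```python
sample = "foo\x1d\x1cbar\nbaz"

# str.splitlines() follows the Unicode line-boundary list, which
# includes \x1c, \x1d and \x1e in addition to \n and \r.
text_lines = sample.splitlines(True)

# bytes.splitlines() only recognises \n, \r and \r\n.
byte_lines = sample.encode("utf-8").splitlines(True)

# Splitting explicitly on "\n" also leaves the separators alone.
newline_only = sample.split("\n")

print(text_lines)    # ['foo\x1d', '\x1c', 'bar\n', 'baz']
print(byte_lines)    # [b'foo\x1d\x1cbar\n', b'baz']
print(newline_only)  # ['foo\x1d\x1cbar', 'baz']
```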

John Machin
I'm looking at the uncompressed file (specifically the offending portion) with od. That's where the '\n' appears. I believe it is actually a 2-byte character which, when represented as 2 individual bytes, has the '\n'. Unfortunately I don't know what the original encoding is, but for some reason vim and zcat represent it correctly, meaning they are detecting it.
muckabout
"The uncompressed file"??? Your code doesn't show you creating a file. Please show the code that you actually ran that created the file that you are examining with `od`. Please show a hex dump from od of say the first 100 bytes of the file. Please show a hex dump from od of say 100 bytes centred on the offending `\n`. Please show all of that by editing your question.
John Machin
The compressed file is 10GB. The problem isn't in the first 100 lines, but close to the millionth line. I printed the od portion which was apparently enough to figure out the problem. My code doesn't uncompress the file, I did that by hand diagnostically. My code doesn't create the file, so that's not relevant either. Thanks for offering to help, but you could stand to be less combative.
muckabout
+2  A: 

Try again with no codec. The following reproduces your problem when the codec is used, and shows that it goes away without one:

import gzip 
import os 
import codecs 

data = gzip.open("file.gz", "wb") 
data.write('foo\x1d\x1cbar\nbaz') 
data.close() 

print list(codecs.getreader('utf-8')(gzip.open('file.gz'))) 
print list(os.popen('zcat file.gz')) 
print list(gzip.open('file.gz')) 

Outputs:

[u'foo\x1d', u'\x1c', u'bar\n', u'baz']
['foo\x1d\x1cbar\n', 'baz']
['foo\x1d\x1cbar\n', 'baz']
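If you still need unicode text rather than raw bytes, one option is to split on "\n" at the byte level first and decode each line afterwards. This is a minimal sketch assuming Python 3 and a UTF-8 (or other ASCII-compatible) file; the name iter_decoded_lines is made up for this example.

```python
import gzip

def iter_decoded_lines(path, encoding="utf-8"):
    """Yield decoded lines, breaking only where the raw data has b"\\n"."""
    with gzip.open(path, "rb") as fh:
        # Binary iteration splits on b"\n" only, so \x1c, \x1d and \x1e
        # stay inside the line they belong to.
        for raw in fh:
            # Decode after splitting; the line terminator is kept.
            yield raw.decode(encoding)

# Usage (hypothetical file name):
#   for i, line in enumerate(iter_decoded_lines("file.gz")):
#       print(i, repr(line))
```

This works for UTF-8 because the byte 0x0A never occurs inside a multi-byte sequence. It would not be safe for UTF-16, where 0x0A can be half of a two-byte code unit.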
DS