I have the following file:

abcde
kwakwa
<0x1A>
line3
linllll

Where <0x1A> represents a byte with the hex value of 0x1A. When attempting to read this file in Python as:

for line in open('t.txt'):
    print line,

It only reads the first two lines, and exits the loop.

The solution seems to be to open the file in binary mode ('rb') or universal newline mode ('rU'). Can you explain this behavior?

+21  A: 

0x1A is Ctrl-Z, and DOS historically used that as an end-of-file marker. For example, try running "type" on your file at a command prompt. It will only display the content up to the Ctrl-Z.

Python uses the Windows CRT function _wfopen, which implements the "Ctrl-Z is EOF" semantics.
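As a minimal sketch of the binary-mode workaround mentioned in the question: the snippet below recreates the sample file in a temporary location (the temporary path and the variable names are illustrative, not from the question) and reads it back with 'rb', so the 0x1A byte comes through as ordinary data. Text-mode behavior depends on the platform and Python version, so only the binary read is shown.

```python
import os
import tempfile

# Recreate the file from the question: 0x1A sits on its own line.
data = b"abcde\nkwakwa\n\x1a\nline3\nlinllll\n"

fd, path = tempfile.mkstemp(suffix=".txt")  # hypothetical stand-in for t.txt
with os.fdopen(fd, "wb") as f:
    f.write(data)

# Binary mode skips the C runtime's text-mode handling,
# so the Ctrl-Z byte is returned like any other byte.
with open(path, "rb") as f:
    lines = f.read().splitlines()

os.remove(path)

for line in lines:
    print(repr(line))
```

All five lines are read, including the one that holds only the 0x1A byte. The trade-off is that binary mode also skips newline translation, so on Windows any "\r\n" endings have to be handled by the caller.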

Ned Batchelder
If I hadn't checked my facts, I could have been first! I shake my fist in helpless fury!
S.Lott
Under Linux, however, it works fine.
Federico Ramponi
Reminds me that I once had to build a PostScript document with LaTeX that included PostScript images created on Windows. I wondered why the printer stopped printing after the first picture ... Well, the last byte in the PostScript picture files was 0x1A.
unbeknown
+6  A: 

Ned is of course correct.

If your curiosity runs a little deeper, the root cause is backwards compatibility taken to an extreme. Windows is compatible with DOS, which used Ctrl-Z as an optional end-of-file marker for text files. What you might not know is that DOS was compatible with CP/M, which was popular on small computers before the PC. CP/M's file system didn't keep track of file sizes down to the byte level; it only counted 128-byte records. If your file wasn't an exact multiple of 128 bytes, you needed a way to mark the end of the text. This Wikipedia article implies that the selection of Ctrl-Z was based on an even older convention used by DEC.
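To make the record-padding idea concrete, here is a rough sketch of the convention, not actual CP/M code. The helper names pad_to_records and strip_at_eof are hypothetical, and filling the whole remainder of the record with Ctrl-Z is an assumption made for illustration.

```python
RECORD_SIZE = 128    # size of one record, per the answer above
EOF_MARK = b"\x1a"   # Ctrl-Z


def pad_to_records(text):
    """Pad text with Ctrl-Z bytes up to the next multiple of RECORD_SIZE."""
    remainder = len(text) % RECORD_SIZE
    if remainder == 0:
        return text
    return text + EOF_MARK * (RECORD_SIZE - remainder)


def strip_at_eof(stored):
    """Return everything before the first Ctrl-Z, if there is one."""
    return stored.split(EOF_MARK, 1)[0]


text = b"abcde\r\nkwakwa\r\n"
stored = pad_to_records(text)      # what would land on disk
recovered = strip_at_eof(stored)   # what a text-mode reader hands back

# The same rule truncates the file from the question at its third line.
truncated = strip_at_eof(b"abcde\nkwakwa\n\x1a\nline3\nlinllll\n")
```

Here stored is exactly one 128-byte record, recovered equals the original text, and truncated contains only the first two lines, which matches what the question observed.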

Mark Ransom