views: 364
answers: 4

I have a very long text file that I'm trying to process using Python.

However, the following code:

for line in open('textbase.txt', 'r'):
    print 'hello world'

produces only the following output:

hello world

It's as though Python thinks the file is only one line long, even though it's many thousands of lines long when viewed in a text editor. Examining it on the command line with the file command gives:

$ file textbase.txt
textbase.txt: Big-endian UTF-16 Unicode English text, with CR line terminators

Is something wrong? Do I need to change the line terminators?

+6  A: 

You'll probably find it's the "with CR line terminators" that gives the game away. If you're working on a platform that uses LF as the line terminator, Python will see your file as one big honkin' line.

Change your input file so that it uses the correct line terminators. Your editor is probably more forgiving than your Python implementation.

The CR line endings are a (classic) Mac thing as far as I'm aware, and you can pass the 'U' mode modifier to open() to get universal newlines, which accepts CR, LF and CRLF alike and translates them all to '\n'.
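
A minimal sketch of that route (Python 2; decoding the UTF-16 properly is covered in the other answers):

# 'U' turns on universal newlines: CR, LF and CRLF are all recognised as
# line breaks, so the CR-terminated file is no longer seen as one huge line.
for line in open('textbase.txt', 'rU'):
    print 'hello world'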

paxdiablo
`Nail+head` combo me thinks. +1.
Wim Hollebrandse
Thanks. Any idea what I need to change them to?
AP257
I would say `\n`.
Wim Hollebrandse
could either be CR+LF (Windows) or LF (but this would be on older macs).
Adriano Varoli Piazza
@Adriano: CR is the line terminator for older macs. It's LF for all *nix systems.
ΤΖΩΤΖΙΟΥ
@TZOOTZIOY: I shouldn't have made that mistake. Brown paper bag time.
Adriano Varoli Piazza
A: 

open() returns a file object. You need to use:

for line in open('textbase.txt', 'r').readlines():
    print line
Paul
This isn't necessary, as the open file object behaves like an iterator.
Ben James
Makes no difference, sorry...
AP257
Ah... good point. Hadn't appreciated that.
Paul
Yeah, sorry I upvoted this before realising my mistake.
Kragen
+20  A: 

According to the documentation for open(), you should add a U to the mode:

open('textbase.txt', 'Ur')

This enables "universal newlines", which recognizes CR, LF and CRLF alike and normalizes them all to \n in the strings it gives you.

However, the correct thing to do is to decode the UTF-16BE into Unicode objects first, before translating the newlines. Otherwise, a 0x0d byte that is really part of a two-byte character can get caught up in the newline translation (and a chance 0x0d 0x0a byte pair collapsed to a single byte), resulting in

UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 12: truncated data.
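
To see why: in UTF-16BE every code point is at least two bytes, so a byte-level newline translation can land in the middle of a character. A rough sketch (Python 2; the character is chosen purely for illustration):

# u'\u0d0a' is a single character, but its UTF-16BE encoding is the byte
# pair 0x0d 0x0a, exactly what a byte-level CR/LF translation looks for.
raw = u'\u0d0a'.encode('utf-16be')    # -> '\x0d\x0a'
mangled = raw.replace('\r\n', '\n')   # the collapse universal newlines would do
mangled.decode('utf-16be')            # raises UnicodeDecodeError: ... truncated data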

Python's codecs module supplies an open function that can decode Unicode and handle newlines at the same time:

import codecs
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    print line  # or whatever processing you need per decoded line

If the file has a byte order mark (BOM) and you specify 'utf-16', then the codec detects the endianness and hides the BOM for you. If it does not have one (the BOM is optional), the decoder will just go ahead and use your system's endianness, which probably won't be good.
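
If your file does begin with a BOM, a minimal sketch of that simpler route (the same codecs.open call, just with 'utf-16'):

import codecs
# 'utf-16' (no explicit endianness) reads the BOM, picks the byte order
# from it, and strips the BOM from the decoded text.
for line in codecs.open('textbase.txt', 'Ur', 'utf-16'):
    print line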

Specifying the endianness yourself (with 'utf-16be') will not hide the BOM, so you might wish to use this hack:

import codecs
firstline = True
for line in codecs.open('textbase.txt', 'Ur', 'utf-16be'):
    if firstline:
        firstline = False
        # strip the BOM (U+FEFF), which the 'utf-16be' codec does not remove
        line = line.lstrip(u'\ufeff')

See also: Python Unicode HOWTO

jleedev
+1 for the solution rather than just the analysis (as in my answer) - you were too fast for me :-)
paxdiablo
Solves the problem, python now sees all the lines. Thank you so much: I love this site :)
AP257
@AP257: do they also decode properly? If it's really UTF-16BE, there'll be a zero byte in front of every line, since Python's file object is encoding-unaware and just splits on newline characters. IMHO, you'll have to decode the file properly (using the codecs module) before splitting it into lines is possible.
Torsten Marek
@Torsten Since we're using big endian, the nulls come before the newlines, so a code point will not get chopped in half. That's a good point however. http://bugs.python.org/issue691291
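A quick Python 2 sketch of that byte layout:

# In UTF-16BE the newline is the byte pair 00 0A, so the 0x0a byte sits on a
# code point boundary and a raw split on '\n' doesn't cut a character in half;
# in UTF-16LE the pair is 0A 00 and it would.
print repr(u'ab\ncd'.encode('utf-16be'))   # '\x00a\x00b\x00\n\x00c\x00d'
print repr(u'ab\ncd'.encode('utf-16le'))   # 'a\x00b\x00\n\x00c\x00d\x00'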
jleedev
@jleedev: Right, my bad. I confused the test files in my experiments.
Torsten Marek
@Torsten Thanks for the tip anyway; I've updated this to use the codecs module.
jleedev
@jleedev: Welcome. The source of my confusion was that recode (the tool) by default creates UTF-16BE (of course including the BOM) when I specify UTF-16 (without a byte order) as the target. I thought it'd create whatever the platform endianness is (LE, since it's x86-64).
Torsten Marek
+1  A: 

It looks like your file has lines terminated only by CR, and Python is probably expecting LF or CRLF. Try using 'universal newline' mode:

for line in open('textbase.txt', 'rU'):
    print 'hello world'

http://docs.python.org/library/functions.html?highlight=open#open

Miron Brezuleanu