ansaurus

Question

How to exclude U+2028 from line separators in Python when reading file?

Answer 1

A:

If you use Python 3.0 (note that I don't, so I can't test this), according to the documentation you can pass an optional newline parameter to open to specify which line separator to use. However, the documentation doesn't mention U+2028 at all (it only mentions \r, \n, and \r\n as line separators), so it's actually a surprise to me that this even occurs (although I can confirm it with Python 2.6).
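A minimal sketch of that suggestion, assuming Python 3 (the answer above was not tested there). The file name and sample text are made up for illustration. With newline="\n", the built-in open should treat only \n as a line terminator, so a U+2028 stays inside its line:

```python
import os
import tempfile

# Hypothetical sample data: one U+2028 inside the first line.
sample = "first\u2028still first\nsecond\n"

# Write the sample to a temporary file as UTF-8, without newline translation.
path = os.path.join(tempfile.mkdtemp(), "in.txt")
with open(path, "w", encoding="utf-8", newline="\n") as fout:
    fout.write(sample)

# newline="\n" makes "\n" the only line terminator when reading.
with open(path, encoding="utf-8", newline="\n") as fin:
    lines = list(fin)

print(len(lines))
```

The first element of lines still contains the U+2028, because the text layer of open never looks for it when splitting.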

balpha 2009-07-09 17:03:54

Answer 2

+2 A:

I couldn't reproduce that behavior but here's a naive solution that just merges readline results until they don't end with U+2028.

#!/usr/bin/env python

from __future__ import with_statement

def my_readlines(f):
  buf = u""
  for line in f.readlines():
    uline = line.decode('utf8')
    buf += uline
    if uline[-1] != u'\u2028':
      yield buf
      buf = u""
  if buf:
    yield buf

with open("in.txt", "rb") as fin:
  for l in my_readlines(fin):
    print l

Alexander Ljungberg 2009-07-09 18:04:17

Answer 3

+1 A:

I can't duplicate this behaviour in Python 2.5, 2.6 or 3.0 on Mac OS X; U+2028 is always treated as a non-line-ending character. Could you go into more detail about where you see this error?

That said, here is a subclass of the "file" class that might do what you want:

#!/usr/bin/python
# -*- coding: utf-8 -*-
class MyFile (file):
    def __init__(self, *arg, **kwarg):
        file.__init__(self, *arg, **kwarg)
        self.EOF = False
    def next(self, catchEOF = False):
        if self.EOF:
            raise StopIteration("End of file")
        try:
            nextLine= file.next(self)
        except StopIteration:
            self.EOF = True
            if not catchEOF:
                raise
            return ""
        if nextLine.decode("utf8")[-1] == u'\u2028':
            return nextLine+self.next(catchEOF = True)
        else:
            return nextLine

A = MyFile("someUnicode.txt")
for line in A:
    print line.strip("\n").decode("utf8")

Markus 2009-07-09 21:04:52

Someone with better Python Unicode knowledge: is this line correct? `if nextLine.decode("utf8")[-1] == u'\u2028':` I was getting a warning without the decode call and don't quite follow why.

Markus 2009-07-09 21:28:52

I don't know what kind of error message you are getting, but typically, if the line contains non-ASCII characters, it has to be decoded into a Unicode string first, before any other operation is applied to it. So when handling UTF-encoded files the usual sequence is: 1. decode, 2. do stuff to the string, 3. encode back before writing to the file.
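The three steps could look roughly like this. It is only a sketch, written for Python 3 (where byte strings and text strings are separate types), and the sample bytes are invented for illustration:

```python
# Hypothetical raw line as read from a file opened in binary mode:
# "abc", then U+2028 (UTF-8 bytes E2 80 A8), then a newline.
raw = b"abc\xe2\x80\xa8\n"

# 1. Decode the bytes into a text string.
text = raw.decode("utf-8")

# 2. Work on the text: check for a trailing U+2028 and drop it.
body = text.rstrip("\n")
ends_with_ls = body.endswith("\u2028")
cleaned = body.rstrip("\u2028") + "\n"

# 3. Encode back to bytes before writing to a binary file.
out = cleaned.encode("utf-8")
```

Comparing the undecoded bytes with a Unicode character directly cannot work as intended, since U+2028 is three bytes in UTF-8 and the last byte alone is not that character.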

2009-07-09 22:09:23

Answer 4

A:

Thanks to everyone for answering. I think I know why you might not have been able to replicate this. I just realized that it happens if I decode the file when opening it, as in:

import codecs

f = codecs.open(filename, encoding='utf-8')
for line in f:
    print line

The lines are not split on U+2028 if I open the file first and then decode the individual lines:

f = open(filename)
for line in f:
    print line.decode("utf8")

(I'm using Python 2.6 on Windows. The file was originally UTF-16LE and was then converted to UTF-8.)

This is very interesting, I guess I won't be using codecs.open much from now on :-).
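The difference between the two snippets can be reproduced without a file. This is a sketch assuming Python 3 and made-up sample text: str.splitlines treats U+2028 as a line boundary, whereas iterating over a text stream breaks only on the usual newline characters. That codecs.open follows the splitlines rules would be consistent with the behaviour described above, but that is an assumption here, not something this snippet checks.

```python
import io

# Hypothetical sample text with one U+2028 and two ordinary newlines.
sample = "one\u2028two\nthree\n"

# str.splitlines() breaks on U+2028 as well as on "\n".
by_splitlines = sample.splitlines()

# Iterating a text stream breaks only on "\n" here
# (newline="\n" turns off newline translation).
stream = io.StringIO(sample, newline="\n")
by_stream = [line.rstrip("\n") for line in stream]

print(len(by_splitlines), len(by_stream))
```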

2009-07-09 22:24:58

Answer 5

A:

The codecs module is doing the RIGHT thing. U+2028 is named "LINE SEPARATOR" with the comment "may be used to represent this semantic unambiguously". So treating it as a line separator is sensible.
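The character's name and category can be checked from Python itself with the standard unicodedata module (sketch written for Python 3):

```python
import unicodedata

ls = "\u2028"

# Official Unicode character name.
name = unicodedata.name(ls)

# General category; "Zl" is the "Separator, line" category.
category = unicodedata.category(ls)

print(name, category)
```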

Presumably the creator would not have put the U+2028 characters there without good reason ... does the file have u"\n" as well? Why do you want lines not to be split on U+2028?

John Machin 2009-07-10 01:15:31
