I have a file in UTF-8, where some lines contain the U+2028 Line Separator character (http://www.fileformat.info/info/unicode/char/2028/index.htm). I don't want it to be treated as a line break when I read lines from the file. Is there a way to exclude it from separators when I iterate over the file or use readlines()? (Besides reading the entire file into a string and then splitting by \n.) Thank you!

A: 

If you use Python 3.0 (note that I don't, so I can't test), according to the documentation you can pass an optional newline parameter to open to specify which line separator to use. However, the documentation doesn't mention U+2028 at all (it only mentions \r, \n, and \r\n as line separators), so it's actually a surprise to me that this even occurs (although I can confirm it even with Python 2.6).
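
A minimal Python 3 sketch of the newline parameter described above. The temporary file and the sample text are made up for illustration, and this only shows the behaviour of the built-in open in text mode, not of other readers:

```python
import os
import tempfile

# Sample text (made up for this sketch): U+2028 sits inside the first line.
text = "first\u2028still first\nsecond\n"

# Write the sample to a temporary UTF-8 file without newline translation.
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w", encoding="utf-8", newline="\n") as f:
    f.write(text)

# newline="\n" makes "\n" the only line terminator when reading.
with open(path, encoding="utf-8", newline="\n") as f:
    lines = list(f)

os.remove(path)
print(len(lines))  # two lines; U+2028 stays inside the first one
```

With the default newline=None the documented terminators are still only \r, \n, and \r\n, so U+2028 should not split a line there either.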

balpha
+2  A: 

I couldn't reproduce that behavior but here's a naive solution that just merges readline results until they don't end with U+2028.

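#!/usr/bin/env python
# Python 2 code.

from __future__ import with_statement

def my_readlines(f):
  # Accumulate pieces until one does not end with U+2028,
  # then yield everything collected so far as a single line.
  buf = u""
  for line in f.readlines():
    uline = line.decode('utf8')
    buf += uline
    # endswith() is also safe if a piece happens to be empty
    if not uline.endswith(u'\u2028'):
      yield buf
      buf = u""
  # Flush the remainder if the last piece ended with U+2028
  if buf:
    yield buf

with open("in.txt", "rb") as fin:
  for l in my_readlines(fin):
    print l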
Alexander Ljungberg
+1  A: 

I can't duplicate this behaviour in Python 2.5, 2.6 or 3.0 on Mac OS X: U+2028 is never treated as a line ending. Could you go into more detail about where you see this?

That said, here is a subclass of the "file" class that might do what you want:

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 only: relies on the built-in "file" type.
class MyFile(file):
    def __init__(self, *arg, **kwarg):
        file.__init__(self, *arg, **kwarg)
        self.EOF = False

    def next(self, catchEOF=False):
        if self.EOF:
            raise StopIteration("End of file")
        try:
            nextLine = file.next(self)
        except StopIteration:
            self.EOF = True
            if not catchEOF:
                raise
            # EOF reached while joining lines: return an empty tail
            return ""
        # If the line ends with U+2028, join it with the following line
        if nextLine.decode("utf8")[-1] == u'\u2028':
            return nextLine + self.next(catchEOF=True)
        else:
            return nextLine

A = MyFile("someUnicode.txt")
for line in A:
    print line.strip("\n").decode("utf8")
Markus
Could someone with better Python Unicode knowledge say whether this line is correct: `if nextLine.decode("utf8")[-1] == u'\u2028':` I was getting a warning without the decode call and don't quite follow why.
Markus
I don't know what kind of error message you are getting, but typically, if the line contains non-ASCII characters, it has to be decoded into a unicode string before any other operation handles it. So when handling UTF-8 files it's usually: 1. decode, 2. do stuff to the string, 3. encode back before writing to the file.
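
The three steps from the comment above, as a small Python 3 sketch (the sample bytes and the replace step are made up for illustration; the thread itself uses Python 2, where the same methods exist on str and unicode):

```python
# Bytes as they would come from a file opened in binary mode (made-up sample).
raw = "caf\u00e9\u2028au lait\n".encode("utf-8")

# 1. Decode the bytes into a text string.
text = raw.decode("utf-8")

# 2. Work on the text: U+2028 is now a single character,
#    whereas in the raw bytes it is a three-byte sequence.
has_separator = "\u2028" in text
cleaned = text.replace("\u2028", " ")

# 3. Encode back to bytes before writing to a binary file.
out = cleaned.encode("utf-8")
print(has_separator, len("\u2028".encode("utf-8")))
```

Comparing an undecoded byte string against a unicode character mixes the two types, which is probably where the warning came from.
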
A: 

Thanks to everyone for answering. I think I know why you might not have been able to replicate this. I just realized that it happens if I decode the file when opening it, as in:

import codecs

f = codecs.open(filename, encoding='utf-8')
for line in f:
    print line

The lines are not split on U+2028 if I open the file without decoding and then decode the individual lines:

f = open(filename)
for line in f:
    print line.decode("utf8")

(I'm using Python 2.6 on Windows. The file was originally UTF-16LE and was then converted to UTF-8.)

This is very interesting, I guess I won't be using codecs.open much from now on :-).
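
The difference between the two snippets above can be reproduced without a file. The following is a Python 3 sketch with made-up sample data. It uses codecs.getreader on an in-memory stream as a stand-in for codecs.open, on the assumption that both go through the same codecs StreamReader line splitting:

```python
import codecs
import io

# Made-up sample: one U+2028 inside the first "\n"-terminated line.
data = "one\u2028still one\ntwo\n".encode("utf-8")

# A codecs StreamReader decodes first and then splits the decoded text,
# so U+2028 ends a line.
reader = codecs.getreader("utf-8")(io.BytesIO(data))
decoded_first = list(reader)

# Iterating over the bytes splits on b"\n" only; decoding happens afterwards.
split_first = [line.decode("utf-8") for line in io.BytesIO(data)]

print(len(decoded_first), len(split_first))
```

Which result is the wanted one depends on whether U+2028 is meant to be a line break in the data.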

A: 

The codecs module is doing the RIGHT thing. U+2028 is named "LINE SEPARATOR" with the comment "may be used to represent this semantic unambiguously". So treating it as a line separator is sensible.
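
As a quick Python 3 illustration of that point (sample string made up): the Unicode database reports the character's name and category, str.splitlines treats it as a line boundary, and an explicit split on "\n" does not.

```python
import unicodedata

text = "alpha\u2028beta\ngamma"

# Name and general category of U+2028 from the Unicode database.
name = unicodedata.name("\u2028")
category = unicodedata.category("\u2028")

# str.splitlines() treats U+2028 as a line boundary.
unicode_split = text.splitlines()

# Splitting on "\n" explicitly leaves U+2028 inside the first piece.
newline_split = text.split("\n")

print(name, category, len(unicode_split), len(newline_split))
```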

Presumably the creator would not have put the U+2028 characters there without good reason ... does the file have u"\n" as well? Why do you want lines not to be split on U+2028?

John Machin