views: 146

answers: 4
I have an application that generates some large log files > 500MB.

I have written some utilities in Python that allow me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load entirely into memory.

I thus want to scan the document once, build an index, and then load into memory only the section of the document that I want to look at at any given time.

This works for me when I open a plain file, read it one line at a time, and store the offset from file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).
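A minimal sketch of that offset-index approach, using a byte-mode file; the file name and the test for interesting lines are purely illustrative:

# Build an index of byte offsets while scanning the file once.
index = []
with open('app.log', 'rb') as f:        # binary mode: tell()/seek() are byte offsets
    while True:
        offset = f.tell()               # byte position of the line about to be read
        line = f.readline()
        if not line:
            break
        if b'ERROR' in line:            # illustrative "data of interest" test
            index.append(offset)

# Later, jump straight to an indexed line without re-reading the whole file.
with open('app.log', 'rb') as f:
    f.seek(index[0], 0)
    print(f.readline())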

My problem, however, is that I may have UTF-8 in the log files, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.

I assume that codecs needs to do some buffering, or maybe it returns character counts instead of byte counts from tell?

Is there a way around this?

+1  A: 

If true, this sounds like a bug or limitation of the codecs module, as it's probably confusing byte and character offsets.

I would use the regular open() function for opening the file; then seek()/tell() will give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').

Beware, though, that using the f.read() function can land you in the middle of a multi-byte character, thus producing a UTF-8 decode error. readline() will always work.

This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.
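A minimal sketch of this approach; the file name is illustrative:

f = open('app.log', 'rb')
offset = f.tell()                        # stable byte offset of the next line
line = f.readline().decode('utf-8')      # decode one complete line at a time
# ... later, come back to exactly the same line:
f.seek(offset, 0)
same_line = f.readline().decode('utf-8')
f.close()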

intgr
Strictly speaking, UTF-8 has only one possible byte order and therefore has no use for a byte-order mark, so the byte-order mark is invalid for UTF-8. Nevertheless, some UTF-8 encoders incorrectly prepend the byte-order mark and some UTF-8 decoders accept input encoded with a byte-order mark.
Justice
It's true that there is no byte order in UTF-8. But the byte-order mark is often used for the purpose of indicating that a file is encoded in UTF-8; I would not call this usage "incorrect".
intgr
+1  A: 

For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (invoking the .decode method on the string). Breaking the file at line boundaries is safe; the only unsafe way to split it would be in the middle of a multi-byte character, which you can recognize because all of its bytes have values of 128 (0x80) or above.
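A minimal sketch along these lines; read_section and align_to_char are illustrative names, and the continuation-byte check assumes UTF-8's 10xxxxxx pattern (bytes 0x80-0xBF):

def read_section(path, start, length):
    # Read a byte range and decode it; safe when start and start+length
    # fall on line (or at least character) boundaries.
    f = open(path, 'rb')
    f.seek(start, 0)
    data = f.read(length)
    f.close()
    return data.decode('utf-8')

def align_to_char(f):
    # If a seek landed inside a multi-byte character, skip forward over the
    # continuation bytes (10xxxxxx, i.e. 0x80-0xBF) to the next character start.
    while True:
        byte = f.read(1)
        if not byte:                        # end of file
            return
        if not 0x80 <= ord(byte) <= 0xBF:   # reached a character-start byte
            f.seek(-1, 1)                   # step back so the caller reads it
            return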

Martin v. Löwis
A: 

Update: You can't do seek/tell on the object returned by codecs.open(). You need to use a normal file, and decode the strings to unicode after reading.

I do not know why it doesn't work, but I couldn't make it work. seek, for example, seems to work only once; after that you need to close and reopen the file, which is of course not useful.

tell does not return character positions, but it also doesn't show you where your position in the stream is (it probably reflects where the underlying file object is in reading from disk).

So, probably because of some sort of underlying buffering, you can't do it. But decoding after reading works just fine, so go for that.

Lennart Regebro
A: 

Much of what goes on with UTF-8 in Python makes sense if you look at how it was done in Python 3. In your case, it'll make quite a bit more sense if you read the Files chapter of Dive into Python 3: http://diveintopython3.org/files.html

The short of it, though, is that file.seek and file.tell work with byte positions, whereas Unicode characters can take up multiple bytes. Thus, if you do:

f.seek(10)    # seek to byte offset 10
f.read(1)     # read one character, which may span several bytes
f.tell()      # byte offset after the read

You can easily get something other than 11, depending on how many bytes the one character you read took up.
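A small, self-contained demonstration of why the tell() values in the question don't line up with character counts; the temporary file and its contents are illustrative:

import codecs
import os
import tempfile

# Write a tiny UTF-8 file whose first character ('é') needs two bytes.
path = os.path.join(tempfile.mkdtemp(), 'demo.log')
with open(path, 'wb') as out:
    out.write(u'\u00e9 first line\nsecond line\n'.encode('utf-8'))

f = codecs.open(path, 'r', 'utf-8')
f.read(1)          # reads one character
print(f.tell())    # an underlying byte position, not the character count of 1
f.close()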

pavpanchekha