views: 146

answers: 4
I have an application that generates some large log files > 500MB.

I have written some utilities in Python that allow me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load entirely into memory.

I thus want to scan the document once, build an index, and then load into memory only the section of the document that I want to look at at any given time.

This works for me when I open a plain file, read it one line at a time, and store the offset from file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).
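A minimal sketch of that offset-index approach, using a byte-mode file; the file name and the test for interesting lines are purely illustrative:

# Build an index of byte offsets while scanning the file once.
index = []
with open('app.log', 'rb') as f:        # binary mode: tell()/seek() are byte offsets
    while True:
        offset = f.tell()               # byte position of the line about to be read
        line = f.readline()
        if not line:
            break
        if b'ERROR' in line:            # illustrative "data of interest" test
            index.append(offset)

# Later, jump straight to an indexed line without re-reading the whole file.
with open('app.log', 'rb') as f:
    f.seek(index[0], 0)
    print(f.readline())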

My problem, however, is that I may have UTF-8 in the log files, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek and tell, but they do not match up.

I assume that codecs needs to do some buffering, or maybe it returns character counts instead of byte counts from tell?

Is there a way around this?

+1  A: 

If true, this sounds like a bug or limitation of the codecs module, as it's probably confusing byte and character offsets.

I would use the regular open() function for opening the file; then seek()/tell() will give you byte offsets that are always consistent. Whenever you want to read, use f.readline().decode('utf-8').

Beware, though, that using the f.read() function can land you in the middle of a multi-byte character, thus producing a UTF-8 decode error. readline() will always work.

This doesn't transparently handle the byte-order mark for you, but chances are your log files do not have BOMs anyway.
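A minimal sketch of this approach; the file name is illustrative:

f = open('app.log', 'rb')
offset = f.tell()                        # stable byte offset of the next line
line = f.readline().decode('utf-8')      # decode one complete line at a time
# ... later, come back to exactly the same line:
f.seek(offset, 0)
same_line = f.readline().decode('utf-8')
f.close()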

intgr
Strictly speaking, UTF-8 has only one possible byte order and therefore has no use for a byte-order mark, so the byte-order mark is invalid for UTF-8. Nevertheless, some UTF-8 encoders incorrectly prepend the byte-order mark and some UTF-8 decoders accept input encoded with a byte-order mark.
Justice
It's true that there is no byte order in UTF-8. But the byte-order mark is often used for the purpose of indicating that a file is encoded in UTF-8; I would not call this usage "incorrect".
intgr
+1  A: 

For UTF-8, you don't actually need to open the file with codecs.open. Instead, it is reliable to read the file as a byte string first, and only then decode an individual section (invoking the .decode method on the string). Breaking the file at line boundaries is safe; the only unsafe way to split it would be in the middle of a multi-byte character, which you can recognize because all of its bytes have values of 128 (0x80) or above.
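A minimal sketch along these lines; read_section and align_to_char are illustrative names, and the continuation-byte check assumes UTF-8's 10xxxxxx pattern (bytes 0x80-0xBF):

def read_section(path, start, length):
    # Read a byte range and decode it; safe when start and start+length
    # fall on line (or at least character) boundaries.
    f = open(path, 'rb')
    f.seek(start, 0)
    data = f.read(length)
    f.close()
    return data.decode('utf-8')

def align_to_char(f):
    # If a seek landed inside a multi-byte character, skip forward over the
    # continuation bytes (10xxxxxx, i.e. 0x80-0xBF) to the next character start.
    while True:
        byte = f.read(1)
        if not byte:                        # end of file
            return
        if not 0x80 <= ord(byte) <= 0xBF:   # reached a character-start byte
            f.seek(-1, 1)                   # step back so the caller reads it
            return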

Martin v. Löwis
A: 

Update: You can't do seek/tell on the object returned by codecs.open(). You need to use a normal file, and decode the strings to unicode after reading.

I do not know why it doesn't work, but I couldn't make it work. seek, for example, seems to work only once; after that you need to close and reopen the file, which is of course not useful.

tell does not return character positions, but it also doesn't show you where your position in the stream is (it probably reflects where the underlying file object is in reading from disk).

So, probably because of some sort of underlying buffering, you can't do it. But decoding after reading works just fine, so go for that.

Lennart Regebro
A: 

Much of what goes on with UTF-8 in Python makes sense if you look at how it was done in Python 3. In your case, it'll make quite a bit more sense if you read the Files chapter of Dive into Python 3: http://diveintopython3.org/files.html

The short of it, though, is that file.seek and file.tell work with byte positions, whereas Unicode characters can take up multiple bytes. Thus, if you do:

f.seek(10)    # seek to byte offset 10
f.read(1)     # read one character, which may span several bytes
f.tell()      # byte offset after the read

You can easily get something other than 11, depending on how many bytes the one character you read took up.
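A small, self-contained demonstration of why the tell() values in the question don't line up with character counts; the temporary file and its contents are illustrative:

import codecs
import os
import tempfile

# Write a tiny UTF-8 file whose first character ('é') needs two bytes.
path = os.path.join(tempfile.mkdtemp(), 'demo.log')
with open(path, 'wb') as out:
    out.write(u'\u00e9 first line\nsecond line\n'.encode('utf-8'))

f = codecs.open(path, 'r', 'utf-8')
f.read(1)          # reads one character
print(f.tell())    # an underlying byte position, not the character count of 1
f.close()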

pavpanchekha