I have an application that generates some large log files (> 500 MB).
I have written some utilities in Python that allow me to quickly browse the log file and find data of interest. But I now get some datasets where the file is too big to load entirely into memory.
I thus want to scan the file once, build an index, and then load into memory only the section of the file I want to look at at any given time.
This works for me when I open a file, read it one line at a time, and store the offset from file.tell(). I can then come back to that section of the file later with file.seek(offset, 0).
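Roughly, the indexing pattern looks like this (the file name and the 'ERROR' filter are placeholders for my real utilities):

    offsets = []
    with open('app.log', 'r') as f:
        while True:
            offset = f.tell()        # position of the line about to be read
            line = f.readline()
            if not line:
                break
            if 'ERROR' in line:      # stand-in for my "data of interest" check
                offsets.append(offset)

    # Later, jump straight back to an indexed line without rescanning:
    with open('app.log', 'r') as f:
        f.seek(offsets[0], 0)
        print(f.readline())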
My problem, however, is that the log files may contain UTF-8, so I need to open them with the codecs module (codecs.open(<filename>, 'r', 'utf-8')). With the resulting object I can call seek() and tell(), but the values they report do not match up.
I assume that codecs does some buffering, or perhaps tell() returns character counts instead of byte offsets?
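A quick test along these lines (the file contents are invented purely to illustrate) shows tell() reporting a position well past the line that was actually returned:

    # -*- coding: utf-8 -*-
    import codecs

    # Write a tiny UTF-8 file just for the demonstration.
    with open('demo.log', 'wb') as f:
        f.write(u'first line\nsecond line\n'.encode('utf-8'))

    f = codecs.open('demo.log', 'r', 'utf-8')
    line = f.readline()
    # tell() is already past the end of 'first line\n', presumably because
    # the reader has buffered ahead, so the offset is useless for seeking
    # back to the start of the next line.
    print('read %r, tell() reports %d' % (line, f.tell()))
    f.close()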
Is there a way around this?