If the files aren't too huge, the linecache module of the standard library is pretty good -- it lets you very directly ask for the Nth line of such-and-such file.
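For example (a minimal sketch -- the filename is just a placeholder; note that linecache.getline counts lines from 1 and returns an empty string, rather than raising, if the line isn't there):
import linecache
# getline(filename, lineno): lineno is 1-based; a missing line yields ''
fourth_line = linecache.getline('somefile.txt', 4)
print fourth_line.rstrip('\n')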
If the files are huge, I recommend something like (warning, untested code):
import cStringIO

def readlinenum(filepath, n, BUFSIZ=65536):
    # Return the n-th line (0-based) of the file at filepath, scanning it
    # in binary mode one block of BUFSIZ bytes at a time.
    bufs = ['', '']   # bufs[0]: current block, bufs[1]: previous block
    previous_lines = lines_so_far = 0
    with open(filepath, 'rb') as f:
        while True:
            bufs[0] = f.read(BUFSIZ)
            if not bufs[0]:
                raise ValueError('File %s has only %d lines, not %d' %
                                 (filepath, lines_so_far, n))
            lines_this_block = bufs[0].count('\n')
            updated_lines_count = lines_so_far + lines_this_block
            if n < updated_lines_count:
                break
            previous_lines = lines_so_far
            lines_so_far = updated_lines_count
            bufs[1] = bufs[0]
    if n == lines_so_far:
        # line may be split between blocks: prepend the previous block
        buf = bufs[1] + bufs[0]
        delta = n - previous_lines
    else:  # normal case: the line lies entirely within the current block
        buf = bufs[0]
        delta = n - lines_so_far
    f = cStringIO.StringIO(buf)
    for i, line in enumerate(f):
        if i == delta: break
    return line.rstrip()
The general idea is to read the file in binary mode, in large blocks (at least as large as the longest possible line) -- the binary-to-"text" processing (on Windows) is costly on huge files -- and to use the fast .count method of strings on most blocks. At the end we do the actual line parsing on a single block (two at most, in the anomalous case where the line being sought spans a block boundary).
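With the sketch above (where n is 0-based as written), a quick sanity check might look like this, using the same kjv.txt sample file as in the timings below:
print readlinenum('kjv.txt', 0)      # first line
print readlinenum('kjv.txt', 99999)  # line 100,000, without slurping the whole file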
This kind of code requires careful testing and checking (which I haven't performed in this case), being prone to off-by-one and other boundary errors, so I'd recommend it only for truly huge files -- ones that would essentially bust memory if you used linecache (which just sucks the whole file into memory instead of working by blocks). On a typical modern machine with 4 GB of RAM, for example, I'd start thinking about such techniques for text files that are over a GB or two.
Edit: a commenter does not believe that reading a file in binary mode is much faster than the processing required by text mode (on Windows only). To show how wrong this is, let's use the 'U' ("universal newlines") option, which forces the line-end processing to happen on Unix machines too (as I don't have a Windows machine to run this on;-). Using the usual kjv.txt file:
$ wc kjv.txt
114150 821108 4834378 kjv.txt
(4.8 MB, 114 Klines) -- about 1/1000th of the kind of file sizes I was mentioning earlier:
$ python -mtimeit -s'f=open("kjv.txt", "rb")' 'f.seek(0); f.read()'
100 loops, best of 3: 13.9 msec per loop
$ python -mtimeit -s'f=open("kjv.txt", "rU")' 'f.seek(0); f.read()'
10 loops, best of 3: 39.2 msec per loop
i.e., roughly a factor-of-3 cost for the line-end processing (this is on an old-ish laptop, but the ratio should be pretty repeatable elsewhere, too).
Reading the file with a loop over its lines, of course, is even slower:
$ python -mtimeit -s'f=open("kjv.txt", "rU")' 'f.seek(0)' 'for x in f: pass'
10 loops, best of 3: 54.6 msec per loop
and using readline, as the commenter mentioned (with less efficient buffering than directly looping on the file), is the worst:
$ python -mtimeit -s'f=open("kjv.txt", "rU")' 'f.seek(0); x=1' 'while x: x=f.readline()'
10 loops, best of 3: 81.1 msec per loop
If, as the question mentions, there are 20,000 files to read (say they're all small-ish, on the order of this kjv.txt), the fastest approach (reading each file in binary mode in a single gulp) should take about 280 seconds, i.e. 4-5 minutes, while the slowest one (based on readline) should take about 1600 seconds, almost half an hour -- a pretty significant difference for many, I'd say most, actual applications.
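Spelling out that last bit of arithmetic (per-file times taken from the timeit runs above, file count from the question):
files = 20000
print files * 0.0139   # single-gulp binary read: ~278 seconds
print files * 0.0811   # readline loop: ~1622 seconds, close to half an hour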