If the files aren't too huge, the linecache module of the standard library is pretty good -- it lets you very directly ask for the Nth line of such-and-such file.
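For example (a minimal sketch -- the filename is just a placeholder; note that linecache.getline counts lines from 1 and returns an empty string, rather than raising, if the line isn't there):
import linecache
# getline(filename, lineno): lineno is 1-based; a missing line yields ''
fourth_line = linecache.getline('somefile.txt', 4)
print fourth_line.rstrip('\n')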
If the files are huge, I recommend something like (warning, untested code):
import cStringIO

def readlinenum(filepath, n, BUFSIZ=65536):
    # Return the n-th line (0-based) of the file at filepath, scanning it
    # in binary mode one block of BUFSIZ bytes at a time.
    bufs = ['', '']   # bufs[0]: current block, bufs[1]: previous block
    previous_lines = lines_so_far = 0
    with open(filepath, 'rb') as f:
        while True:
            bufs[0] = f.read(BUFSIZ)
            if not bufs[0]:
                raise ValueError('File %s has only %d lines, not %d' %
                                 (filepath, lines_so_far, n))
            lines_this_block = bufs[0].count('\n')
            updated_lines_count = lines_so_far + lines_this_block
            if n < updated_lines_count:
                break
            previous_lines = lines_so_far
            lines_so_far = updated_lines_count
            bufs[1] = bufs[0]
    if n == lines_so_far:
        # line may be split between blocks: prepend the previous block
        buf = bufs[1] + bufs[0]
        delta = n - previous_lines
    else:  # normal case: the line lies entirely within the current block
        buf = bufs[0]
        delta = n - lines_so_far
    f = cStringIO.StringIO(buf)
    for i, line in enumerate(f):
        if i == delta: break
    return line.rstrip()
The general idea is to read the file in binary mode, in large blocks (at least as large as the longest possible line) -- the binary-to-"text" processing (on Windows) is costly on huge files -- and to use the fast .count method of strings on most blocks. At the end we do the actual line parsing on a single block (two at most, in the anomalous case where the line being sought spans a block boundary).
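With the sketch above (where n is 0-based as written), a quick sanity check might look like this, using the same kjv.txt sample file as in the timings below:
print readlinenum('kjv.txt', 0)      # first line
print readlinenum('kjv.txt', 99999)  # line 100,000, without slurping the whole file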
This kind of code requires careful testing and checking (which I haven't performed in this case), being prone to off-by-one and other boundary errors, so I'd recommend it only for truly huge files -- ones that would essentially bust memory if you used linecache (which just sucks the whole file into memory instead of working by blocks). On a typical modern machine with 4 GB of RAM, for example, I'd start thinking about such techniques for text files that are over a GB or two.
Edit: a commenter does not believe that reading a file in binary mode is much faster than the processing required by text mode (on Windows only). To show how wrong this is, let's use the 'U' ("universal newlines") option, which forces the line-end processing to happen on Unix machines too (as I don't have a Windows machine to run this on;-). Using the usual kjv.txt file:
$ wc kjv.txt
114150 821108 4834378 kjv.txt
(4.8 MB, 114 Klines) -- about 1/1000th of the kind of file sizes I was mentioning earlier:
$ python -mtimeit -s'f=open("kjv.txt", "rb")' 'f.seek(0); f.read()'
100 loops, best of 3: 13.9 msec per loop
$ python -mtimeit -s'f=open("kjv.txt", "rU")' 'f.seek(0); f.read()'
10 loops, best of 3: 39.2 msec per loop
i.e., roughly a factor-of-3 cost for the line-end processing (this is on an old-ish laptop, but the ratio should be pretty repeatable elsewhere, too).
Reading the file with a loop over its lines, of course, is even slower:
$ python -mtimeit -s'f=open("kjv.txt", "rU")' 'f.seek(0)' 'for x in f: pass'
10 loops, best of 3: 54.6 msec per loop
and using readline, as the commenter mentioned (with less efficient buffering than directly looping on the file), is the worst:
$ python -mtimeit -s'f=open("kjv.txt", "rU")' 'f.seek(0); x=1' 'while x: x=f.readline()'
10 loops, best of 3: 81.1 msec per loop
If, as the question mentions, there are 20,000 files to read (say they're all small-ish, on the order of this kjv.txt), the fastest approach (reading each file in binary mode in a single gulp) should take about 280 seconds, i.e. 4-5 minutes, while the slowest one (based on readline) should take about 1600 seconds, almost half an hour -- a pretty significant difference for many, I'd say most, actual applications.
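Spelling out that last bit of arithmetic (per-file times taken from the timeit runs above, file count from the question):
files = 20000
print files * 0.0139   # single-gulp binary read: ~278 seconds
print files * 0.0811   # readline loop: ~1622 seconds, close to half an hour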