views: 116

answers: 3

I have to read a file starting from a particular line number, and I know the line number, say "n". I have been thinking of two choices:

1)

for i in range(n):
    fname.readline()
k = fname.readline()
print k

2)

dictionary = {}
i = 0
for line in fname:
    dictionary[i] = line
    i = i + 1
print dictionary[n]

But I want to know a faster alternative, as I might have to perform this on different files 20,000 times. Are there any better alternatives?

I would also like to know whether there are other performance-enhancement methods for simple looping, as my code has nested loops.

Thank you.

+4  A: 

If the files aren't too huge, the linecache module of the standard library is pretty good -- it lets you very directly ask for the Nth line of such-and-such file.
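For instance, a minimal sketch (note that linecache.getline takes a 1-based line number and returns an empty string, rather than raising, if the file or line doesn't exist; the filename and line number below are just placeholders):

import linecache

def get_nth_line(filepath, n):
    # getline is 1-based and returns '' (no exception) on any miss
    return linecache.getline(filepath, n).rstrip('\n')

print get_nth_line('kjv.txt', 12345)
linecache.clearcache()  # drop the cached file data when processing many files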

If the files are huge, I recommend something like (warning, untested code):

import cStringIO

def readlinenum(filepath, n, BUFSIZ=65536):
  # n is the 0-based number of the line to fetch
  bufs = ['', '']
  previous_lines = lines_so_far = 0
  with open(filepath, 'rb') as f:
    while True:
      bufs[0] = f.read(BUFSIZ)
      if not bufs[0]:
        raise ValueError('File %s has only %d lines, not %d' %
                         (filepath, lines_so_far, n))
      lines_this_block = bufs[0].count('\n')
      updated_lines_count = lines_so_far + lines_this_block
      if n < updated_lines_count:
        break
      previous_lines = lines_so_far
      lines_so_far = updated_lines_count
      bufs[1] = bufs[0]
    if n == lines_so_far:
      # the sought line may be split across the previous and current blocks
      buf = bufs[1] + bufs[0]
      delta = n - previous_lines
    else:  # normal case: the sought line lies entirely within the current block
      buf = bufs[0]
      delta = n - lines_so_far
    sio = cStringIO.StringIO(buf)
    for i, line in enumerate(sio):
      if i == delta:
        break
    return line.rstrip()

The general idea is to read in the file as binary, in large blocks (at least as large as the longest possible line) -- the processing (on Windows) from binary to "text" is costly on huge files -- and use the fast .count method of strings on most blocks. At the end we can do the line parsing on a single block (two at most in the anomalous case where the line being sought spans block boundaries).

This kind of code requires careful testing and checking (which I haven't performed in this case), being prone to off-by-one and other boundary errors, so I'd recommend it only for truly huge files -- ones that would essentially bust memory if using linecache (which just sucks up the whole file into memory instead of working by blocks). On a typical modern machine with 4GB of RAM, for example, I'd start thinking about such techniques for text files that are over a GB or two.

Edit: a commenter doubts that reading a file in binary mode is much faster than the line-end processing required by text mode (which happens on Windows only). To show how wrong this is, let's use the 'U' ("universal newlines") option, which forces the line-end processing to happen on Unix machines too (as I don't have a Windows machine to run this on;-). Using the usual kjv.txt file:

$ wc kjv.txt
  114150  821108 4834378 kjv.txt

(4.8 MB, 114 Klines) -- about 1/1000th of the kind of file sizes I was mentioning earlier:

$ python -mtimeit 'f=open("kjv.txt", "rb")' 'f.seek(0); f.read()'
100 loops, best of 3: 13.9 msec per loop
$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0); f.read()'
10 loops, best of 3: 39.2 msec per loop

i.e., just about exactly a factor of 3 cost for the line-end processing (this is on an old-ish laptop, but the ratio should be pretty repeatable elsewhere, too).

Reading by a loop on lines, of course, is even slower:

$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0)' 'for x in f: pass'
10 loops, best of 3: 54.6 msec per loop

and using readline, as the commenter mentioned (with less efficient buffering than directly looping on the file), is the worst:

$ python -mtimeit 'f=open("kjv.txt", "rU")' 'f.seek(0); x=1' 'while x: x=f.readline()'
10 loops, best of 3: 81.1 msec per loop

If, as the question mentions, there are 20,000 files to read (say they're all small-ish, on the order of this kjv.txt), the fastest approach (reading each file in binary mode in a single gulp) should take about 280 seconds, 4-5 minutes, while the slowest one (based on readline) should take about 1600 seconds, almost half an hour -- a pretty significant difference for many, I'd say most, actual applications.

Alex Martelli
linecache helped me, thank you
kaushik
Hmmm, this claim is weird to me - why would text files be slow on Windows? I never had such problems in C or Pascal or Python...
Nas Banov
@EnTerr, how much experience do you have processing `rt` vs `rb` multi-GB files? Reading the file as `t` means the runtime must turn each `\r\n` into a `\n` (and "shift" all subsequent text down 1 byte in the buffer) -- over the gigabytes, it sure mounts up (plus, locating 2-character sequences like `\r\n` is not as fast, at machine level, as locating the single character `\n`). I have no Windows machine at hand to confirm, but, try yourself: make up a 4GB text file with 1000 concatenated copies of the King James' Bible text (each 4MB), read it as `rb` vs `rt` many times, measure.
Alex Martelli
@Alex, what's 'rt' here? I almost always just open files with 'r' or 'rb' mode, and I can't find 'rt' in the right place in Python's library manual.
sunqiang
@sunqiang, `'rt'` is "read, text-mode", and `'rb'` is "read, binary-mode". `'r'` defaults to the same as `'rt'`, but "explicit is better than implicit" (especially when one is deliberately contrasting binary vs text modes as I was here;-). I did just check the online docs and you're right that they mistakenly omit mentioning the `'t'` part (!) but see Python in a Nutshell p. 188 "Binary and text modes" for all details (link to Google Books' online scan follows in my next comment) -- BTW, patches and corrections to the Python online docs are always **very** welcome!
Alex Martelli
@Alex Martelli: don't snub me just because of my rating, I am new here :). I have experience working with text files and I know about the convention Windows \r\n vs Unix \n vs Mac \r - but this should not slow down the runtime noticeably. For one, there is no need to shift text when you are using readline{s}. But as you suggest, let me check the performance of \r\n vs \n files and I will let you know.
Nas Banov
@Alex Martelli: I don't think the online docs `mistakenly omit mentioning the 't' part (!)` - Python's file open modes are based on C's, and there, for fopen, **rb** is for binary files and **r** is for text mode. See http://en.wikipedia.org/wiki/C_file_input/output and http://www.mrx.net/c/openmodes.html: *Note that the second argument to fopen() is just an "r" (without the "b" that specifies a binary open). Text mode is the default file open mode.*
Nas Banov
Per http://www.cplusplus.com/reference/clibrary/cstdio/fopen/: *Additional characters may follow the sequence, although they should have no effect. For example, "t" is sometimes appended to make explicit the file is a text file.* I just tried `f=open(fname,'rxyzzy')` in Python and that went just fine, with no complaints (but there are complaints if the mode does not start with [rwaU]). So adding 't' works only as documentation and is functionally as useful as blessing your code - it does not hurt but does not help either :-)
Nas Banov
@EnTerr, `read` (in a single gulp or very large chunks) is faster than `readline` of course -- and there, the shifting *is* required (plus, some checking and processing is required in every case). Just thought of a way to show it w/o needing a Windows machine (the universal-newline option), editing my A accordingly.
Alex Martelli
On Python 2.6 "rU" variant is 10 times slower than "rb": `16.1` msec vs. `1.33` msec; `readline()` is 1.5 times slower than `for x in f`: 35.5 msec vs. 21.6 msec. `wc kjv.txt` - 92870 969905 5371778 kjv.txt from http://ebible.org/kjv/
J.F. Sebastian
@Alex Martelli: can we just frame the issue? You are stating that `the processing (on Windows) from binary to "text" is costly on huge files` because `reading the file as t means the runtime must turn each \r\n into a \n (and "shift" all subsequent text down 1 byte in the buffer)` - correct? Therefore, if I take a big file once with \n and once with \r\n line ends and run the same readline() loop on it, there should be a drastic difference in speed between the two, because in the Windows-style text file there is extra processing, from what I understand you are saying.
Nas Banov
@EnTerr, on Windows, on the file with only `\n`, the runtime, reading in text mode, would still be looking for line-ends (`\r\n`) throughout the buffer, but find none (thereby considering it all one huge "line", and performing no shifting) -- not quite as fast as `b` mode, where the looking would also be skipped, but between that and text mode when the huge file _does_ have many `\r\n` pairs. You've seen my comparison (in the A's edit -- and @J.F.Sebastian's in his comment) whereby `rU` is 3-10 times slower than `rb`: on Windows, on a normal-for-windows textfile, `rb` vs `r` shd be similar.
Alex Martelli
@Alex Martelli: your theory makes sense - but in practice Python reads \n-only files as text on Windows just fine. Here is what I get (testing on a 300MB text file, looking for line #2500000): LF-only file = MarteliBin (your code above) 2.2sec, Readline() 'r' 10.5s, Readlines(n) 'r' 6.5sec. On the same file with CRLF ends: MarteliBin 2.3s, Readline() 'r' 9.9s, Readlines(n) 'r' 6.3s. It's counter-intuitive, but reading a _longer_ file (with CRLF) in text mode on Windows takes _less_ time than reading the same file with LF only.
Nas Banov
@Alex Martelli: PS. Oh yeah, and just like you said, your code has some off-by-1 or other boundary issue, because it returns a line that is a few positions off from what was requested - and also sometimes returns a partial line, probably cut at the buffer edge. (This doesn't bother me at all - I know it's fixable)
Nas Banov
Re 'rU' - I had never used it before, but it is **humongously** slow when I try it now doing readline(): it takes ~43sec on my test case (both LF and CRLF), which is 6x slower. Amusingly enough, when doing readlines(n) to read blocks of lines, the time is unchanged from 'r' mode. What the heck!
Nas Banov
+2  A: 

Unless you know, or can figure out, the offset of line n in your file (for example, if every line were of a fixed width), you will have to read lines until you get to the nth one.

Regarding your examples:

  • xrange is faster than range, since range has to build the whole list in memory, whereas xrange produces its values lazily
  • if you only need line n, why are you storing all of the lines in a dictionary?
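
In other words, a rough sketch of reading only line n (assuming a 0-based n; itertools.islice skips the first n lines lazily, so nothing is stored):

from itertools import islice

def nth_line(filepath, n):
    # Returns line n (0-based) of filepath, or None if the file is shorter.
    with open(filepath) as f:
        return next(islice(f, n, n + 1), None)

print nth_line('kjv.txt', 12345)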
danben
A: 

Caching a list of offsets of every end-of-line character in the file would cost a lot of memory, but caching roughly one per memory page (generally 4KB) gives mostly the same reduction in I/O, and the cost of scanning a couple KB from a known offset is negligible. So, if your average line length is 40 characters, you only have to cache a list of every 100th end-of-line in the file. Exactly where you draw the line depends on how much memory you have and how fast your I/O is. You might even be able to get away with caching a list of the offsets of every 1000th end-of-line character without a noticeable difference in performance from indexing every single one.
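
A rough sketch of that idea (Python 2 style; the helper names and filenames are made up for illustration, the start offset of every 100th line is cached, and line numbers are assumed 0-based):

def build_sparse_index(filepath, every=100):
    # index[k] is the byte offset at which line k*every begins
    index = [0]
    with open(filepath, 'rb') as f:
        lineno = 0
        while True:
            line = f.readline()
            if not line:
                break
            lineno += 1
            if lineno % every == 0:
                index.append(f.tell())
    return index

def line_with_index(filepath, n, index, every=100):
    # Seek to the nearest cached offset at or before line n,
    # then scan forward at most every-1 lines.
    with open(filepath, 'rb') as f:
        f.seek(index[n // every])
        for _ in xrange(n % every):
            f.readline()
        return f.readline().rstrip()

# usage (build the index once, then do many cheap lookups):
# index = build_sparse_index('hugefile.txt')
# print line_with_index('hugefile.txt', 2500000, index)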

Chris