Let's say that I routinely have to work with files with an unknown, but large, number of lines. Each line contains a set of integers (space, comma, semicolon, or some other non-numeric character is the delimiter) in the closed interval [0, R], where R can be arbitrarily large. The number of integers on each line can vary. Oftentimes I get the same number of integers on each line, but occasionally I have lines with unequal sets of numbers.

Suppose I want to go to the Nth line in the file and retrieve the Kth number on that line (and assume that the inputs N and K are valid --- that is, I am not worried about bad inputs). How do I go about doing this efficiently in Python 3.1.2 for Windows?

I do not want to traverse the file line by line.

I tried using mmap, but while poking around here on SO, I learned that that's probably not the best solution on a 32-bit build because of the 4GB limit. And in truth, I couldn't really figure out how to simply move N lines away from my current position. If I can at least just "jump" to the Nth line then I can use .split() and grab the Kth integer that way.

The nuance here is that I don't just need to grab one line from the file. I will need to grab several lines: they are not necessarily all near each other, the order in which I get them matters, and the order is not always based on some deterministic function.

Any ideas? I hope this is enough information.

Thanks!

+3  A: 

The problem is that since your lines are not of fixed length, you have to pay attention to line end markers to do your seeking, and that effectively becomes "traversing the file line by line". Thus, any viable approach is still going to be traversing the file, it's merely a matter of what can traverse it fastest.
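
For example, if the line numbers you need are known before you start, a single pass can at least collect all of them at once. A rough sketch (the function name, file name, and delimiter handling are made up for illustration):

import re

def grab(path, wanted):
  # wanted maps a 1-based line number to the 1-based position K of the
  # integer to grab from that line; one pass over the file collects them all
  results = {}
  with open(path) as f:
    for lineno, line in enumerate(f, 1):
      if lineno in wanted:
        numbers = [int(tok) for tok in re.split(r'[^0-9]+', line) if tok]
        results[lineno] = numbers[wanted[lineno] - 1]
  return results

# e.g. the 3rd number on line 7 and the 1st number on line 42:
# grab('data.txt', {7: 3, 42: 1})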

Amber
Suppose I loosen the requirement about lines of unequal length. By length, do you mean the actual number of characters or the number of integers? The most structured files I will get are ones with the same number of integers separated by some delimiter on each line, but they won't necessarily have the same number of characters. Does that make it possible to do what I need?
B Rivera
@BRivera, the issue is number of bytes per line -- how those bytes in a line are divvied up among integers or anything else is irrelevant.
Alex Martelli
Assuming they're text files (and assuming the file is encoded in a fixed-number-of-bytes character encoding, such as plain ASCII), "equal length" means equal number of characters. The reason why it makes a difference is because if you know that the lines are of fixed *character* length, you also then know that the lines are of fixed *byte* length, and thus you can use `file.seek(linelength*(linenum-1))` to move to the byte that starts a given line. But this only works if `linelength` is the same for all lines. Otherwise, that calculation is impossible.
Amber
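
To illustrate the fixed-length case in the comment above, a minimal sketch (assuming every line is exactly LINELENGTH bytes including its line terminator; the names and values are placeholders):

LINELENGTH = 80   # hypothetical fixed byte length of every line, terminator included

def read_fixed_line(f, linenum):
  # jump straight to the start of the (1-based) linenum-th line
  f.seek(LINELENGTH * (linenum - 1))
  return f.read(LINELENGTH).decode('ascii')

with open('data.txt', 'rb') as f:
  line = read_fixed_line(f, 42)
  kth = line.split()[2]   # e.g. the 3rd number on that line, assuming space-delimited
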
+1, fair enough. Maybe what I am about to ask should be moved to another question: what I receive are text files with ASCII encoding. I'll further restrict my numbers to be in the range [0, 2**32 - 1]. Is there a file format that will treat each integer as the same size? Or is there a way to convert my file to such a format? In this way, I could use the .seek command in the manner you suggested. I understand that this will increase my file size, but I can deal with that.
B Rivera
A binary file format that represents the numbers as actual sets of 4 (32-bit) bytes each in the file would be fixed-width (assuming the same number of values on each line). The reason why numbers in ASCII format aren't fixed width is that ASCII represents numbers in base 10, whereas they are stored in a binary (base-2) format. Since the representation differs from the storage, the width isn't fixed. Of course, one option would be to force it to be fixed even in ASCII, by left-padding the numbers with zeros so that every number has the same digit count.
Amber
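
A rough sketch of the zero-padding conversion described above (assumptions: the same count of numbers on every line, values that fit in 10 digits, and made-up file names):

import re

DIGITS = 10   # wide enough for anything up to 2**32 - 1

def make_fixed_width(src, dst):
  # rewrite a variable-width text file so every number is zero-padded to
  # DIGITS characters; lines then have equal byte length as long as each
  # line holds the same count of numbers
  with open(src) as fin, open(dst, 'w') as fout:
    for line in fin:
      nums = [tok for tok in re.split(r'[^0-9]+', line) if tok]
      fout.write(' '.join(n.zfill(DIGITS) for n in nums) + '\n')
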
Yep, thanks a lot for your suggestions! Unfortunately, I have to pick one solution as the best answer. I've +1'd all your responses!
B Rivera
+10  A: 

Python's seek goes to a byte offset in a file, not to a line offset, simply because that's the way modern operating systems and their filesystems work -- the OS/FS just don't record or remember "line offsets" in any way whatsoever, and there's no way for Python (or any other language) to just magically guess them. Any operation purporting to "go to a line" will inevitably need to "walk through the file" (under the covers) to make the association between line numbers and byte offsets.

If you're OK with that and just want it hidden from your sight, then the solution is the standard library module linecache -- but performance won't be any better than that of code you could write yourself.
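
For reference, linecache usage amounts to the following (it still reads and caches the whole file internally; the file name is a placeholder):

import linecache

line = linecache.getline('data.txt', 12345)   # 1-based line number, '' if out of range
kth = line.split()[2]                         # then split on your delimiter and grab the Kth item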

If you need to read from the same large file multiple times, a big optimization would be to run a script once over that large file that builds and saves to disk the line-number-to-byte-offset correspondence (technically an auxiliary "index" file); then, all your subsequent runs (until the large file changes) could use the index file to navigate through the large file very quickly. Is this your use case...?

Edit: since apparently this may apply -- here's the general idea (net of careful testing, error checking, or optimization;-). To make the index, use makeindex.py, as follows:

import array
import sys

BLOCKSIZE = 1024 * 1024

def reader(f):
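  # yield the byte offset (from the start of the file) of every b'\n' in f,
  # reading in BLOCKSIZE chunks so the whole file never needs to fit in memory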
  blockstart = 0
  while True:
    block = f.read(BLOCKSIZE)
    if not block: break
    inblock = 0
    while True:
      nextnl = block.find(b'\n', inblock)
      if nextnl < 0:
        blockstart += len(block)
        break
      yield nextnl + blockstart
      inblock = nextnl + 1

def doindex(fn):
  with open(fn, 'rb') as f:
    # result format: x[0] is tot # of lines,
    # x[N] is byte offset of END of line N (1+)
    result = array.array('L', [0])
    result.extend(reader(f))
    result[0] = len(result) - 1
    return result

def main():
  for fn in sys.argv[1:]:
    index = doindex(fn)
    with open(fn + '.indx', 'wb') as p:
      print('File', fn, 'has', index[0], 'lines')
      index.tofile(p)

main()

and then to use it, for example, the following useindex.py:

import array
import sys

def readline(n, f, findex):
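  # n is a 0-based line number here: findex[n] is the offset of the newline
  # ending the previous line (findex[0] is patched to -1 so line 0 starts at byte 0)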
  f.seek(findex[n] + 1)
  bytes = f.read(findex[n+1] - findex[n])
  return bytes.decode('utf8')

def main():
  fn = sys.argv[1]
  with open(fn + '.indx', 'rb') as f:
    findex = array.array('l')
    findex.fromfile(f, 1)
    findex.fromfile(f, findex[0])
    findex[0] = -1
  with open(fn, 'rb') as f:
    for n in sys.argv[2:]:
      print(n, repr(readline(int(n), f, findex)))

main()

Here's an example (on my slow laptop):

$ time py3 makeindex.py kjv10.txt 
File kjv10.txt has 100117 lines

real    0m0.235s
user    0m0.184s
sys 0m0.035s
$ time py3 useindex.py kjv10.txt 12345 98765 33448
12345 '\r\n'
98765 '2:6 But this thou hast, that thou hatest the deeds of the\r\n'
33448 'the priest appointed officers over the house of the LORD.\r\n'

real    0m0.049s
user    0m0.028s
sys 0m0.020s
$ 

The sample file is a plain text file of King James' Bible:

$ wc kjv10.txt
100117  823156 4445260 kjv10.txt

100K lines, 4.4 MB, as you can see; this takes about a quarter second to index and 50 milliseconds to read and print out three random-y lines (no doubt this can be vastly accelerated with more careful optimization and a better machine). The index in memory (and on disk too) takes 4 bytes per line of the textfile being indexed, and performance should scale in a perfectly linear way, so if you had about 100 million lines, 4.4 GB, I would expect about 4-5 minutes to build the index, a minute to extract and print out three arbitrary lines (and the 400 MB of RAM taken for the index should not inconvenience even a small machine -- even my tiny slow laptop has 2GB after all;-).

You can also see that (for speed and convenience) I treat the file as binary (and assume utf8 encoding -- works with any subset like ASCII too of course, eg that KJ text file is ASCII) and don't bother collapsing \r\n into a single character if that's what the file has as line terminator (it's pretty trivial to do that after reading each line if you want).

Alex Martelli
Your optimization solution is nice. I like it. +1. I still have to traverse the index auxiliary file, though, right? But of course, that's better than having to traverse my original file. linecache seems to load the whole file into RAM and I won't always have that luxury.
B Rivera
@B Rivera, the "index auxiliary file" should be small enough to keep in RAM even for a textfile of several million lines. Let me sketch a simplistic solution to show what I have in mind (I'll be editing my answer soon to show that).
Alex Martelli
You know what, I understand what you are saying: use linecache on the index auxiliary file. That's certainly reasonable.
B Rivera
Better yet, don't use a text format for your index; use a binary format with fixed field sizes. Then you can use `seek()` on that as well.
Amber
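
To illustrate that comment, a minimal sketch assuming the index file stores one unsigned 64-bit little-endian end-of-line offset per line, in line order (the function name is made up):

import struct

def offset_of_line_end(indexfile, n):
  # a fixed-width binary index can itself be seek()'d instead of being
  # loaded into RAM; each entry is 8 bytes, so entry n is at byte 8*(n-1)
  indexfile.seek(8 * (n - 1))   # n is a 1-based line number
  return struct.unpack('<Q', indexfile.read(8))[0]
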
A list of 64-bit offsets is sufficient for the index file to allow O(1) lookups.
gnibbler
I really appreciate your time and your response. This is a viable general solution.
B Rivera
@Amber, yep -- that's what I ended up using with my `array`-based solution (though the array item size might be scarce, esp. since when reading I turn it into _signed_ numbers for mere convenience purposes).
Alex Martelli
@gnibbler, yes, and on a 32-bit build of Python my solution would only use 32-bit offsets, insufficient for really large files (esp. since I turned them into signed ones for convenience, AKA "cutting corners";-). There's no really handy way to have an array of 64-bit integers in a 32-bit build -- I might turn to `'d'` (double precision numbers offering 53-bit fractions), or take the trouble to save and reload each offset as _two_ 32-bit halves.
Alex Martelli
@B Rivera, thanks!, but for generality do read my last 2 comments on how this may prove limiting in a 32-bit build (since the array's items would then be just 32 bits too) and musings on possible solutions.
Alex Martelli
A: 

Another solution, if the file is potentially going to change a lot, is to go all the way to a proper database. The database engine will create and maintain the indexes for you, so you can do very fast searches/queries.

This may be overkill, though.
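
For illustration, a minimal sketch of that route with the standard-library sqlite3 module (the file names, table name, and the N and K values are made up):

import re
import sqlite3

conn = sqlite3.connect('numbers.db')
conn.execute('CREATE TABLE IF NOT EXISTS nums (line INTEGER, pos INTEGER, value INTEGER)')
conn.execute('CREATE INDEX IF NOT EXISTS nums_idx ON nums (line, pos)')

# load the text file once; the database then handles fast lookups
with open('data.txt') as f:
  for lineno, line in enumerate(f, 1):
    toks = [t for t in re.split(r'[^0-9]+', line) if t]
    conn.executemany('INSERT INTO nums VALUES (?, ?, ?)',
                     ((lineno, k, int(t)) for k, t in enumerate(toks, 1)))
conn.commit()

# the Kth number (here K=3) on the Nth line (here N=12345):
row = conn.execute('SELECT value FROM nums WHERE line=? AND pos=?', (12345, 3)).fetchone()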

Lie Ryan