views: 4070
answers: 11
Are there any alternatives to the code below:

startFromLine = 141978 # or whatever line I need to jump to

urlsfile = open(filename, "rb", 0)

linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1

if I'm processing a huge text file (~15MB) with lines of unknown but varying length, and need to jump to a particular line whose number I know in advance? I feel bad processing them one by one when I know I could ignore at least the first half of the file. I'm looking for a more elegant solution, if there is one.

+13  A: 

linecache
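
A minimal sketch of what this answer suggests (line numbers passed to linecache are 1-based; the_line is just an illustrative name):

import linecache
the_line = linecache.getline(filename, 141978)  # caches the whole file in memory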

John Ellinwood
I just checked the source code of this module: the whole file is read into memory! So I would definitely rule this answer out for the purpose of quickly accessing a given line in a file.
MiniQuark
MiniQuark, I tried it, and it actually works, really quickly. I'll need to see what happens if I work on a dozen files at the same time this way, and find out at what point my system dies.
+1  A: 

If you know in advance the position in the file (rather than the line number), you can use file.seek() to go to that position.
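
For example (a sketch; byte_offset stands for a position you already know):

f = open(filename, "rb")
f.seek(byte_offset)     # jump straight to the known byte position
line = f.readline()     # read the line that starts there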

Edit: you can use the linecache.getline(filename, lineno) function, which will return the contents of the line lineno, but only after reading the entire file into memory. Good if you're randomly accessing lines from within the file (as python itself might want to do to print a traceback) but not good for a 15MB file.

Noah
I would definitely not use linecache for this purpose, because it reads the whole file in memory before returning the requested line.
MiniQuark
Yeah, it sounded too good to be true. I still wish there were a module to do this efficiently, but I tend to use the file.seek() method instead.
Noah
+8  A: 

I'm probably spoiled by abundant RAM, but 15 MB is not huge. Reading into memory with readlines() is what I usually do with files of this size. Accessing a line after that is trivial.
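
For example (a rough sketch, assuming the whole file fits comfortably in memory):

lines = open(filename).readlines()   # the entire ~15MB file as a list of lines
the_line = lines[141977]             # 0-based index, i.e. line 141978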

SilentGhost
The reason I was slightly hesitant to read the entire file: I might have several of those processes running, and if a dozen of them each read 12 files of 15MB apiece, it could get bad. But I need to test it to find out whether it'll work. Thank you.
Hrm, and what if it's a 1GB file?
Noah
@photographer: even "several" processes reading in 15MB files shouldn't matter on a typical modern machine (depending, of course, on exactly what you're doing with them).
Jacob Gabrielson
Jacob, yes, I should just try it. The process(es) run on a virtual machine for weeks, if the VM doesn't crash. Unfortunately, last time it crashed after 6 days. I need to continue from where it suddenly stopped; I still need to figure out how to find where it left off.
@Noah: but it is not! Why don't you go further? What if the file is 128TB? Then many OSes wouldn't even be able to support it. Why not solve problems as they come?
SilentGhost
@SilentGhost: I was hoping to get an answer that might be useful to me, as well. I've cobbled together an indexing scheme for my files, which range from 100MB to nearly 1GB, but an easier and less error-prone solution would be nice.
Noah
+10  A: 

You can't jump ahead without reading in the file at least once, since you don't know where the line breaks are. You could do something like:

# Read in the file once and build a list of line offsets
line_offset = []
offset = 0
for line in file:
    line_offset.append(offset)
    offset += len(line)
file.seek(0)

# Now, to skip to line n (with the first line being line 0), just do
file.seek(line_offset[n])
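
Once positioned, reads simply continue from that offset, so a possible continuation (reusing DoSomethingWithThisLine from the question) would be:

file.seek(line_offset[n])
for line in file:                    # iteration resumes at line n
    DoSomethingWithThisLine(line)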
Adam Rosenfield
+1 I like this! I might try to wrap this in a nice helper.
MiniQuark
+1, but beware that this is only useful if he's going to jump to several random lines! If he's only jumping to one line, then this is wasteful.
hasen j
+1: Also, if the file doesn't change, the line number index can be pickled and reused, further amortizing the initial cost of scanning the file.
S.Lott
OK, after I've jumped there, how would I then process line-by-line starting from that position?
One thing to note (particularly on Windows): be careful to open the file in binary mode, or alternatively use offset = file.tell(). In text mode on Windows, the line will be a byte shorter than its raw length on disk (\r\n replaced by \n).
Brian
@photographer: Use read() or readline(), they start from the current position as set by seek.
S.Lott
@S.Lott: thank you
+10  A: 

You don't really have that many options if the lines are of different length... you sadly need to process the line ending characters to know when you've progressed to the next line.

You can, however, dramatically speed this up AND reduce memory usage by changing the last parameter to "open" to something not 0.

0 means the file-reading operation is unbuffered, which is very slow and disk-intensive. 1 means the file is line-buffered, which would be an improvement. Anything above 1 (say 8k, i.e. 8096, or higher) reads chunks of the file into memory. You still access it through for line in open(etc):, but Python only goes a bit at a time, discarding each buffered chunk after it's processed.
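
As a sketch, here is the question's loop with an 8 KB buffer instead of unbuffered I/O (filename, startFromLine and DoSomethingWithThisLine are taken from the question):

urlsfile = open(filename, "rb", 8192)
linesCounter = 1
for line in urlsfile:
    if linesCounter > startFromLine:
        DoSomethingWithThisLine(line)
    linesCounter += 1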

Jarret Hardie
8K is 8192, perhaps better to write 8 << 10 to be on the safe side. :)
unwind
Do you by any chance know whether the buffer size is specified in bytes? What is the appropriate format? Could I write '8k'? Or should it be '8096'?
HAHAHA... must be Friday... I clearly can't do math. The buffer size is indeed an integer expressing bytes, so write 8192 (not 8096 :-) ), rather than 8.
Jarret Hardie
Thank you, Jarret.
My pleasure - hope it works out. On a modern system, you can probably increase the buffer size quite a bit. 8k is just a holdover in my memory for some reason I can't identify.
Jarret Hardie
A: 

If you don't want to read the entire file into memory, you may need to come up with some format other than plain text.

Of course, it all depends on what you're trying to do, and how often you will jump across the file.

For instance, if you're going to be jumping to lines many times in the same file, and you know that the file does not change while you're working with it, you can do this:
First, pass through the whole file and record the "seek-location" of some key line numbers (say, every 1,000 lines).
Then, if you want line 12005, jump to the position recorded for line 12000, read 5 lines, and you'll know you're at line 12005, and so on.
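
A rough sketch of that idea, with illustrative names (checkpoints, get_line), a 1,000-line spacing and 0-based line numbers:

checkpoints = {}                      # every 1,000th line number -> byte offset
f = open(filename, "rb")
offset = 0
for lineno, line in enumerate(f):
    if lineno % 1000 == 0:
        checkpoints[lineno] = offset
    offset += len(line)

def get_line(lineno):
    # Jump to the nearest checkpoint at or before lineno, then step forward.
    base = (lineno // 1000) * 1000
    f.seek(checkpoints[base])
    for _ in range(lineno - base):
        f.readline()
    return f.readline()

# e.g. get_line(12005) seeks to the offset recorded for line 12000 and skips 5 lines.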

hasen j
+1  A: 

Since there is no way to determine the length of all lines without reading them, you have no choice but to iterate over all lines before your starting line. All you can do is make it look nice. If the file is really huge, then you might want to use a generator-based approach:

from itertools import dropwhile

def iterate_from_line(f, start_from_line):
    return (l for i, l in dropwhile(lambda x: x[0] < start_from_line, enumerate(f)))

for line in iterate_from_line(open(filename, "r", 0), 141978):
    DoSomethingWithThisLine(line)

Note: the index is zero-based in this approach.
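
An equivalent (and arguably simpler) way to express the same skip is itertools.islice, which likewise consumes and discards the leading lines:

from itertools import islice

for line in islice(open(filename, "r", 0), 141978, None):
    DoSomethingWithThisLine(line)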

unbeknown
+1  A: 

Do the lines themselves contain any index information? If the content of each line was something like "<line index>:Data", then the seek() approach could be used to do a binary search through the file, even if the amount of Data is variable. You'd seek to the midpoint of the file, read a line, check whether its index is higher or lower than the one you want, etc.
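
A rough sketch of that binary search, assuming each line begins with a strictly increasing decimal index followed by ':' (the function name and parsing details are illustrative):

def find_line_by_index(path, target):
    with open(path, "rb") as f:
        f.seek(0, 2)                          # learn the file size
        lo, hi = 0, f.tell()                  # target line, if present, starts in [lo, hi)
        while lo < hi:
            mid = (lo + hi) // 2
            # Find the first complete line starting at or after mid.
            if mid == 0:
                f.seek(0)
            else:
                f.seek(mid - 1)
                f.readline()                  # skip the partial line containing mid-1
            start = f.tell()
            line = f.readline()
            if not line:                      # no complete line after mid: look earlier
                hi = mid
                continue
            idx = int(line.split(b":", 1)[0].decode())
            if idx == target:
                return line
            elif idx < target:
                lo = start + len(line)        # target must start after this line
            else:
                hi = mid                      # target must start before mid
    return None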

Otherwise, the best you can do is just readlines(). If you don't want to read all 15MB, you can use the sizehint argument to at least replace a lot of readline()s with a smaller number of calls to readlines().

DNS
A: 

Here's an example using 'readlines(sizehint)' to read a chunk of lines at a time. DNS pointed out that solution. I wrote this example because the other examples here are single-line oriented.

def getlineno(filename, lineno):
    if lineno < 1:
        raise TypeError("First line is line 1")
    f = open(filename)
    lines_read = 0
    while 1:
        lines = f.readlines(100000)
        if not lines:
            return None
        if lines_read + len(lines) >= lineno:
            return lines[lineno-lines_read-1]
        lines_read += len(lines)

print getlineno("nci_09425001_09450000.smi", 12000)
Andrew Dalke
A: 

awk 'NR==141978'

Jiayao Yu
+1  A: 

What generates the file you want to process? If it is something under your control, you could generate an index (recording which line is at which position) at the time the file is appended to. The index file can have a fixed line size (space-padded or zero-padded numbers) and will definitely be smaller, so it can be read and processed quickly. In outline (a code sketch follows the list):

  • Decide which line you want.
  • Calculate the byte offset of the corresponding line number in the index file (possible because the line size of the index file is constant).
  • Use seek (or whatever) to jump directly to that line of the index file.
  • Parse it to get the byte offset of the corresponding line in the actual file.
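
A minimal sketch of that scheme, assuming 0-based line numbers, a 10-digit zero-padded offset per index line, and illustrative names (build_index, lookup):

def build_index(data_path, index_path):
    # One fixed-width byte offset per line of the data file.
    with open(data_path, "rb") as data, open(index_path, "wb") as index:
        offset = 0
        for line in data:
            index.write(b"%010d\n" % offset)
            offset += len(line)

def lookup(data_path, index_path, lineno):
    # Each index record is 11 bytes (10 digits + newline), so the record
    # for line `lineno` starts at byte lineno * 11 in the index file.
    with open(index_path, "rb") as index:
        index.seek(lineno * 11)
        offset = int(index.read(10).decode())
    with open(data_path, "rb") as data:
        data.seek(offset)
        return data.readline()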