I have two simple questions about Python:

1. How do I get the number of lines in a file?

2. How do I easily move a file object's position to the start of the last line?

+8  A: 

Lines are just data delimited by the newline character '\n'.

1) Since lines are variable length, you have to read the entire file to know where the newline chars are, so you can count how many lines:

count = 0
for line in open('myfile'):
    count += 1
print count, line  # after the loop, line holds the last line

2) Reading chunks backwards from the end of the file is the fastest way to find the last newline char.

import os

def seek_newline_backwards(file_obj, eol_char='\n', buffer_size=200):
    if not file_obj.tell(): return # already at the beginning of the file
    # All lines end with \n, including the last one, so we assume we are
    # just after an end-of-line char and step back over it
    file_obj.seek(-1, os.SEEK_CUR)
    while file_obj.tell():
        amount = min(buffer_size, file_obj.tell())
        file_obj.seek(-amount, os.SEEK_CUR)
        data = file_obj.read(amount)
        eol_pos = data.rfind(eol_char)
        if eol_pos != -1:
            # position the file just after the newline that was found
            file_obj.seek(eol_pos - len(data) + 1, os.SEEK_CUR)
            break
        # no newline in this chunk: go back to its start and keep searching
        file_obj.seek(-len(data), os.SEEK_CUR)

You can use that like this:

f = open('some_file.txt')
f.seek(0, os.SEEK_END)
seek_newline_backwards(f)
print f.tell(), repr(f.readline())
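
For anyone on Python 3, where text-mode files reject relative seeks, here is a minimal sketch of the same backwards-chunk idea using binary mode. The name read_last_line and the chunk_size parameter are made up for illustration, and it assumes '\n' line endings (with '\r\n' files a trailing b'\r' stays on the result). Memory use grows with the length of the last line only, not with the file size.

```python
import os

def read_last_line(path, chunk_size=4096):
    """Return the last line of the file as bytes, without its trailing b'\\n'."""
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        # Ignore one trailing newline so the search finds the newline
        # *before* the last line rather than the one that ends it.
        if end > 0:
            f.seek(end - 1)
            if f.read(1) == b'\n':
                end -= 1
        pos = end
        tail = b''  # bytes of the last line collected so far
        while pos > 0:
            start = max(0, pos - chunk_size)
            f.seek(start)
            chunk = f.read(pos - start)
            idx = chunk.rfind(b'\n')
            if idx != -1:
                # Everything after this newline belongs to the last line.
                return chunk[idx + 1:] + tail
            tail = chunk + tail
            pos = start
        # No newline found: the whole file is a single line.
        return tail
```

Calling read_last_line('some_file.txt') returns bytes; decode them with the file's encoding if you need text.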
nosklo
uh... but what if the last line is more than 200 chars from EOF?
Triptych
sometimes, lines are instead delimited by \r; you might want to take that into account.
Michael Borgwardt
@Michael Borgwardt: Good point, modified the code to take that into account, now the char used is a parameter to the function.
nosklo
What if the file is 4GB and consists of a single line?
Ayman Hourieh
Obviously Ayman is half-joking, but if one cares about file- and available-RAM- size, then the next step is to worry about corner cases like the one Ayman described.
ΤΖΩΤΖΙΟΥ
@ΤΖΩΤΖΙΟΥ - It was an honest question. I wanted to point out that this solution is also vulnerable to memory exhaustion. If you are concerned about large files, you should also be concerned about files with very long lines.
Ayman Hourieh
+1  A: 

The only way to count lines [that I know of] is to read all lines, like this:

count = 0
for line in open("file.txt"): count = count + 1

After the loop, count will have the number of lines read.
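
If very long lines are a concern (a huge file that is a single line, say), a variant is to read fixed-size binary chunks and count newline bytes instead of iterating line by line. This is only a sketch: the name count_lines and the chunk_size default are assumptions, and it only recognises '\n' as a line ending.

```python
def count_lines(path, chunk_size=1024 * 1024):
    """Count lines by scanning the file in fixed-size binary chunks."""
    count = 0
    last_byte = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b'\n')
            last_byte = chunk[-1:]  # remember how the data read so far ends
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b'\n':
        count += 1
    return count
```

Memory use is bounded by chunk_size no matter how long the individual lines are.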

grawity
A: 

Answer to the first question (beware of poor performance on large files when using this method):

f = open("myfile.txt").readlines()
print len(f)

Answer to the second question:

f = open("myfile.txt").read()
print f.rfind("\n")

P.S. Yes, I do understand that this is only suitable for small files and simple programs. I will not delete this answer, however useless it may seem for real use cases.

David Parunakian
that reads the entire file to the memory at once.
nosklo
I know, I have specifically edited the answer to mention that.
David Parunakian
that also reads the entire file into a string, and then creates a list of split strings, spending at least 2 times the file size in memory. I'm not sure why one would use this method.
nosklo
you should at least use .readlines()
nosklo
+2  A: 

For small files that fit memory, how about using str.count() for getting the number of lines of a file:

line_count = open("myfile.txt").read().count('\n')
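
One caveat with counting '\n': if the last line has no trailing newline, the result is one short. A small sketch of a correction, working on a string already in memory (count_lines_in_text is just an illustrative name):

```python
def count_lines_in_text(text):
    """Count lines in a string, including a final line with no newline."""
    n = text.count('\n')
    # A non-empty text that does not end in '\n' has one more, unterminated line.
    if text and not text.endswith('\n'):
        n += 1
    return n
```

With this, 'a\nb\n' and 'a\nb' both count as two lines, and the empty string counts as zero.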
gimel
that will read the entire file to memory at once, so I guess a for loop is better.
nosklo
Man, it's 2009. Don't be tied up by old-fashioned limits.
Charlie Martin
@Charlie Martin: I have to deal with text files easily up to 4GB. And it is not tying me up, it is just better to read each line at a time instead of the entire file, even if it *fits* on memory. The OP is a beginner and should learn good practices that work regardless of the file size.
nosklo
The answer clearly says "for small files that fit in memory" -- and besides, when is the last time you've had a myfile.txt that couldn't fit in memory? :-)
Martin Geisler
The answer specifies that it's for "small files that fit memory", so I think that the answer is acceptable.
ΤΖΩΤΖΙΟΥ
+7  A: 

Let's not forget

f = open("myfile.txt")
lines = f.readlines()

numlines = len(lines)
lastline = lines[-1]

NOTE: this reads the whole file in memory as a list. Keep that in mind in the case that the file is very large.

Charlie Martin
that also reads the entire file to the memory at once.
nosklo
Yes, and? Back when I was writing business apps in 8K of memory, I might have cared.
Charlie Martin
@Charlie Martin: 1) What if the file is 4GB? 2) What if I am already running another app that's using my memory, and I have only a few MB available? Should I hit virtual memory (swap)? Really?
nosklo
@nosklo: Then you would change your algorithm. What is your point? There is no 'one size fits all' best solution for every problem on the planet. +1 for simplicity and explicitness.
Nick Presta
@Charlie - I have to agree with nosklo here. Assuming the file will always fit in memory is the sort of lazy programming that can easily lead to vulnerabilities and instability.
Triptych
@Nick Presta: Well, in the Zen of Python we have: "There should be one-- and preferably only one --obvious way to do it". In this case, that fits, since doing a straightforward loop is *simpler*.
nosklo
@nosklo, "premature optimization is the root of all evil." I mean, what if the file is encrypted? What if it's binary data? As to whether then I'd hit virtual memory, well, yeah, that's what it's for. Oddly, in general the swapper is more efficient than file I/O.
Charlie Martin
nosklo preaches caution and I agree. How big is the file? How much RAM does the OP —and any other viewer of this question— have available? We can't know the answer to these questions, so why risk it? In any case, this answer should make clear that this reads the whole file into memory, like nosklo suggested.
ΤΖΩΤΖΙΟΥ
+5  A: 

The easiest way is simply to read the file into memory, e.g.:

f = open('filename.txt')
lines = f.readlines()
num_lines = len(lines)
last_line = lines[-1]

However, for big files this may use up a lot of memory, as the whole file is loaded into RAM. An alternative is to iterate through the file line by line, e.g.:

f = open('filename.txt')
num_lines = sum(1 for line in f)

This is more efficient, since it won't load the entire file into memory, but only look at a line at a time. If you want the last line as well, you can keep track of the lines as you iterate and get both answers by:

f = open('filename.txt')
num_lines = 0
last_line = None
for line in f:
    num_lines += 1
    last_line = line
print "There were %d lines.  The last was: %s" % (num_lines, last_line)

One final possible improvement, if you need only the last line, is to start at the end of the file and seek backwards until you find a newline character. Here's a question which has some code doing this. If you need the line count as well, though, there's no alternative to iterating through all the lines in the file.

Brian
how is reading the entire file easiest? your second solution looks much easier
nosklo
easy does not mean fast or efficient :-p
fortran
+2  A: 

I'd like to add to the other solutions that some of them (those that look for \n) will not work on files with OS 9-style line endings (\r only). Also, the file may end with an extra blank line, because many text editors append one, so you might or might not want to add a check for it.
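
As a rough sketch of handling both points: io.open with newline=None enables universal newlines, so '\n', '\r' and '\r\n' all end a line when iterating. The function name and the skip_trailing_blank flag are made up for illustration, and the file is assumed to be decodable with the default encoding.

```python
import io

def count_lines_any_newline(path, skip_trailing_blank=False):
    """Count lines, treating '\\n', '\\r' and '\\r\\n' all as line endings."""
    count = 0
    last = None
    # newline=None turns on universal-newline translation when reading.
    with io.open(path, 'r', newline=None) as f:
        for last in f:
            count += 1
    # Optionally ignore a final line that holds nothing but whitespace.
    if skip_trailing_blank and last is not None and last.strip() == '':
        count -= 1
    return count
```

Whether a trailing blank line should count depends on what you are using the number for, which is why it is left as a flag here.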

Etienne Perot
right. using a for won't have this problem since python's readline() already deals with that.
nosklo
FYI - OS-X uses a single '\n' http://en.wikipedia.org/wiki/Newline
JimB
Right, um, OS 9 and lower then. I never knew Apple had changed its mind, good thing they did~
Etienne Perot
A: 

For the first question there are already a few good answers; I'd suggest @Brian's as the best (most Pythonic, line-ending-proof and memory efficient):

f = open('filename.txt')
num_lines = sum(1 for line in f)

For the second one, I like @nosklo's answer, but modified to be more general it would be:

import os
f = open('myfile')
f.seek(0, os.SEEK_END)
to = f.tell()  # file size; in Python 2, seek() itself returns None
found = -1
while found == -1 and to > 0:
  fro = max(0, to-1024)
  f.seek(fro)
  chunk = f.read(to-fro)
  found = chunk.rfind("\n")
  to -= 1024

if found != -1:
  found += fro

It searches in chunks of 1 KB from the end of the file until it finds a newline character or reaches the beginning of the file. At the end of the code, found is the index of the last newline character, or -1 if there is none.

fortran