views: 8847

answers: 11

I need to get a line count of a large file (hundreds of thousands of lines) in python. What is the most efficient way both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?

+9  A: 

You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
Ólafur Waage
What would be the Windows version of this?
SilentGhost
http://gnuwin32.sourceforge.net/packages/coreutils.htm
cartman
You can refer to this SO question regarding that. http://stackoverflow.com/questions/247234/do-you-know-a-similar-program-for-wc-unix-word-count-command-on-windows
Ólafur Waage
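For what it's worth, here is a sketch of one possible Windows-only counterpart that avoids installing coreutils, using the classic find /c /v "" idiom; it is untested, and the parsing assumes find prints something like "---------- FILE: 1234":

import subprocess

def file_len_windows(fname):
    # find /c /v "" prints the number of lines in the file (untested sketch)
    p = subprocess.Popen(['find', '/c', '/v', '', fname],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    # take the count after the last colon of "---------- FNAME: 1234"
    return int(result.strip().split(':')[-1])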
Indeed, in my case (Mac OS X) this takes 0.13s, versus 0.5s for counting the lines that "for x in file(...)" produces, and 1.0s for counting with repeated calls to str.find or mmap.find. (The file I used to test this has 1.3 million lines.)
bendin
Exactly what I thought.
Ólafur Waage
No need to involve the shell for that. I edited the answer and added example code.
nosklo
Nice : )
Ólafur Waage
Run directly on the command line (without the overhead of creating another shell), this is just as fast as the clearer, more portable Python-only solution. See also: http://stackoverflow.com/questions/849058/is-it-possible-to-speed-up-python-io
Davide
+4  A: 
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines
pkit
This is just syntactic sugar for the solution the OP already has.
Yuval A
Do you have any timing data to show this is faster?
Kiv
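In the absence of numbers, here is a sketch of how one could measure it, in the same spirit as the benchmark answer further down; 'big_file.txt' is a placeholder and the function names are mine:

import time

def count_enumerate(fname):
    f = open(fname)
    i = -1
    for i, line in enumerate(f):
        pass
    f.close()
    return i + 1

def count_sum(fname):
    f = open(fname)
    n = sum(1 for line in f)
    f.close()
    return n

for func in (count_enumerate, count_sum):
    start = time.time()
    func('big_file.txt')
    print func.__name__, ':', time.time() - start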
+12  A: 

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.

Yuval A
Exactly. Even wc reads through the file, but it's written in C and is probably pretty well optimized.
Ólafur Waage
As far as I understand, Python's file I/O is done through C as well. http://docs.python.org/library/stdtypes.html#file-objects
Tomalak
+1  A: 

In my opinion, this variant will be the fastest:


#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()


Reasons: buffered reading is faster than reading line by line, and string.count is also very fast.

Mykola Kharechko
But is it? At least on OSX/python2.5 the OP's version is still about 10% faster according to timeit.py.
dF
Maybe; I haven't tested it.
Mykola Kharechko
What if the last line does not end in '\n'?
ΤΖΩΤΖΙΟΥ
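One way to handle that case in the buffer-counting approach is to remember the last character read and add one if the file does not end with a newline; this is only a sketch, and the function name is made up:

def buf_count_newline_safe(fname, buf_size=1024 * 1024):
    lines = 0
    last_char = '\n'              # an empty file has zero lines
    f = open(fname)
    read_f = f.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        last_char = buf[-1]
        buf = read_f(buf_size)
    f.close()
    if last_char != '\n':
        lines += 1                # count the final, unterminated line
    return lines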
I don't know how you tested it, dF, but on my machine it's ~2.5 times slower than any other option.
SilentGhost
You state that it will be the fastest and then state that you haven't tested it. Not very scientific eh? :)
Ólafur Waage
A: 

The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

This is more concise than your explicit loop and avoids the enumerate.

Andrew Jaffe
Which means that a 100 MB file will need to be read into memory.
SilentGhost
yep, good point, although I wonder about the speed (as opposed to memory) difference. It's probably possible to create an iterator that does this, but I think it would be equivalent to your solution.
Andrew Jaffe
this is nasty in terms of memory...
Yuval A
Nice idea. I was about to suggest something similar.
Tony
-1, it's not just the memory, but having to construct the list in memory.
orip
+1  A: 

What about this?

import itertools

def file_len(fname):
  # count() yields 0, 1, 2, ...; after one next() per line,
  # the following next() returns the total number of lines
  counts = itertools.count()
  with open(fname) as f:
    for _ in f: counts.next()
  return counts.next()
odwl
A: 

Why not read the first 100 and the last 100 lines, estimate the average line length, and then divide the total file size by that number? If you don't need an exact value, this could work.

Georg
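A rough sketch of Georg's idea, simplified to sample only the first 100 lines rather than the first and last 100; the function name and sample size are arbitrary:

import os

def estimate_lines(fname, sample=100):
    f = open(fname)
    lengths = []
    for i, line in enumerate(f):
        if i >= sample:
            break
        lengths.append(len(line))
    f.close()
    if not lengths:
        return 0
    avg_len = float(sum(lengths)) / len(lengths)
    # estimated line count = file size / average line length
    return int(os.path.getsize(fname) / avg_len)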
I need an exact value, and the problem is that in the general case line lengths can differ widely. I'm afraid your approach won't be the most efficient one.
SilentGhost
+28  A: 

I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
Ryan Ginstrom
It's interesting, because I'm seeing different numbers here. What is the actual size of your file in bytes?
SilentGhost
The file size is 53,064,630 bytes.
Ryan Ginstrom
As I've said before, bufcount is incredibly slow on my machine (up to 6 times slower). mapcount is indeed the fastest, second only to the wc -l solution (http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/845069#845069). The only drawback I see is the extra 100 MB of memory consumed, which, depending on one's setup, may or may not be acceptable. I think your answer well deserves an upvote :)
SilentGhost
The entire memory-mapped file isn't loaded into memory. You get a virtual memory space, which the OS swaps into and out of RAM as needed. Here's how they're handled on Windows: http://msdn.microsoft.com/en-us/library/ms810613.aspx
Ryan Ginstrom
Sorry, here's a more general reference on memory-mapped files: http://en.wikipedia.org/wiki/Memory-mapped_file And thanks for the vote. :)
Ryan Ginstrom
Even though it's just virtual memory, it is precisely what limits this approach, so it won't work for huge files. I tried it with a ~1.2 GB file of over 10 million lines (as counted with wc -l) and got a WindowsError: [Error 8] Not enough storage is available to process this command. Of course, this is an edge case.
SilentGhost
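For files too large to map in one go on a 32-bit build, one could map the file in fixed-size windows instead; this sketch relies on the offset argument that mmap gained in Python 2.6, and the 64 MB window size is arbitrary:

import mmap
import os

def mapcount_chunked(filename, window=64 * 1024 * 1024):
    # keep the window a multiple of the allocation granularity so offsets stay valid
    window -= window % mmap.ALLOCATIONGRANULARITY
    size = os.path.getsize(filename)
    lines = 0
    f = open(filename, 'rb')
    offset = 0
    while offset < size:
        length = min(window, size - offset)
        chunk = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                          offset=offset)
        lines += chunk.read(length).count('\n')
        chunk.close()
        offset += length
    f.close()
    return lines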
+1 for real timing data. Do we know if the buffer size of 1024*1024 is optimal, or is there a better one?
Kiv
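Probably the only honest answer to the buffer-size question is to measure on your own disk and OS; here is a quick, unscientific probe one could run, with the candidate sizes and file name picked arbitrarily:

import time

def count_with_buffer(fname, buf_size):
    lines = 0
    f = open(fname)
    read_f = f.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    f.close()
    return lines

for size in (64 * 1024, 256 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    start = time.time()
    count_with_buffer('big_file.txt', size)
    print size, ':', time.time() - start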
+3  A: 

One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))
Kyle
A: 

Just to complete the above methods, I tried a variant with the fileinput module:

import fileinput as fi

def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()

And passed a 60-million-line file to all the methods stated above:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It's a bit of a surprise to me that fileinput is that bad and scales so much worse than all the other methods...

BandGap
A: 

What about this?

>>> import sys
>>> sys.stdin=open('fname','r')
>>> data=sys.stdin.readlines()
>>> print "counted",len(data),"lines"
S.C
I don't think it addresses the fact that the large file is being read into memory.
SilentGhost