views: 8847

answers: 11

I need to get a line count of a large file (hundreds of thousands of lines) in python. What is the most efficient way both memory- and time-wise?

At the moment I do:

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

Is it possible to do any better?

+9  A: 

You could execute a subprocess and run wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])
Ólafur Waage
What would be the Windows version of this?
SilentGhost
http://gnuwin32.sourceforge.net/packages/coreutils.htm
cartman
You can refer to this SO question regarding that. http://stackoverflow.com/questions/247234/do-you-know-a-similar-program-for-wc-unix-word-count-command-on-windows
Ólafur Waage
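For what it's worth, here is a sketch of one possible Windows-only counterpart that avoids installing coreutils, using the classic find /c /v "" idiom; it is untested, and the parsing assumes find prints something like "---------- FILE: 1234":

import subprocess

def file_len_windows(fname):
    # find /c /v "" prints the number of lines in the file (untested sketch)
    p = subprocess.Popen(['find', '/c', '/v', '', fname],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    # take the count after the last colon of "---------- FNAME: 1234"
    return int(result.strip().split(':')[-1])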
Indeed, in my case (Mac OS X) this takes 0.13s, versus 0.5s for counting the lines that "for x in file(...)" produces, and 1.0s for counting with repeated calls to str.find or mmap.find. (The file I used to test this has 1.3 million lines.)
bendin
Exactly what I thought.
Ólafur Waage
No need to involve the shell for that. I edited the answer and added example code.
nosklo
Nice : )
Ólafur Waage
Run directly on the command line (without the overhead of creating another shell), this is just as fast as the clearer, more portable Python-only solution. See also: http://stackoverflow.com/questions/849058/is-it-possible-to-speed-up-python-io
Davide
+4  A: 
def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines
pkit
This is just syntactic sugar for the solution the OP already has.
Yuval A
Do you have any timing data to show this is faster?
Kiv
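In the absence of numbers, here is a sketch of how one could measure it, in the same spirit as the benchmark answer further down; 'big_file.txt' is a placeholder and the function names are mine:

import time

def count_enumerate(fname):
    f = open(fname)
    i = -1
    for i, line in enumerate(f):
        pass
    f.close()
    return i + 1

def count_sum(fname):
    f = open(fname)
    n = sum(1 for line in f)
    f.close()
    return n

for func in (count_enumerate, count_sum):
    start = time.time()
    func('big_file.txt')
    print func.__name__, ':', time.time() - start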
+12  A: 

You can't get any better than that.

After all, any solution will have to read the entire file, figure out how many \n you have, and return that result.

Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you have that covered.

Yuval A
Exactly. Even wc reads through the file, but it's written in C and is probably pretty well optimized.
Ólafur Waage
As far as I understand, Python's file I/O is done through C as well. http://docs.python.org/library/stdtypes.html#file-objects
Tomalak
+1  A: 

In my opinion, this variant will be the fastest:


#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()


Reasons: buffered reading is faster than reading line by line, and string.count is also very fast.

Mykola Kharechko
But is it? At least on OSX/python2.5 the OP's version is still about 10% faster according to timeit.py.
dF
Maybe; I haven't tested it.
Mykola Kharechko
What if the last line does not end in '\n'?
ΤΖΩΤΖΙΟΥ
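One way to handle that case in the buffer-counting approach is to remember the last character read and add one if the file does not end with a newline; this is only a sketch, and the function name is made up:

def buf_count_newline_safe(fname, buf_size=1024 * 1024):
    lines = 0
    last_char = '\n'              # an empty file has zero lines
    f = open(fname)
    read_f = f.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        last_char = buf[-1]
        buf = read_f(buf_size)
    f.close()
    if last_char != '\n':
        lines += 1                # count the final, unterminated line
    return lines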
I don't know how you tested it, dF, but on my machine it's ~2.5 times slower than any other option.
SilentGhost
You state that it will be the fastest and then state that you haven't tested it. Not very scientific eh? :)
Ólafur Waage
A: 

The result of opening a file is an iterator, which can be converted to a sequence, which has a length:

with open(filename) as f:
   return len(list(f))

This is more concise than your explicit loop and avoids the enumerate.

Andrew Jaffe
Which means that a 100 MB file will need to be read into memory.
SilentGhost
yep, good point, although I wonder about the speed (as opposed to memory) difference. It's probably possible to create an iterator that does this, but I think it would be equivalent to your solution.
Andrew Jaffe
this is nasty in terms of memory...
Yuval A
Nice idea. I was about to suggest something similar.
Tony
-1, it's not just the memory, but having to construct the list in memory.
orip
+1  A: 

What about this?

import itertools

def file_len(fname):
  # count() yields 0, 1, 2, ...; after one next() per line,
  # the following next() returns the total number of lines
  counts = itertools.count()
  with open(fname) as f:
    for _ in f: counts.next()
  return counts.next()
odwl
A: 

Why not read the first 100 and the last 100 lines, estimate the average line length, and then divide the total file size by that number? If you don't need an exact value, this could work.

Georg
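A rough sketch of Georg's idea, simplified to sample only the first 100 lines rather than the first and last 100; the function name and sample size are arbitrary:

import os

def estimate_lines(fname, sample=100):
    f = open(fname)
    lengths = []
    for i, line in enumerate(f):
        if i >= sample:
            break
        lengths.append(len(line))
    f.close()
    if not lengths:
        return 0
    avg_len = float(sum(lengths)) / len(lengths)
    # estimated line count = file size / average line length
    return int(os.path.getsize(fname) / avg_len)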
I need an exact value, and the problem is that in the general case line lengths can differ widely. I'm afraid your approach won't be the most efficient one.
SilentGhost
+28  A: 

I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline with a memory-mapped file (mapcount); and the buffer read solution offered by Mykola Kharechko (bufcount).

I ran each function five times, and calculated the average run-time for a 1.2 million-line text file.

Windows XP, Python 2.5, 2GB RAM, 2 GHz AMD processor

Here are my results:

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Edit: numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffer read strategy seems to be the fastest for Windows/Python 2.6

Here is the code:

from __future__ import with_statement
import time
import mmap
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))
Ryan Ginstrom
It's interesting, because I'm seeing different numbers here. What is the actual size of your file in bytes?
SilentGhost
The file size is 53,064,630 bytes.
Ryan Ginstrom
As I've said before, bufcount is incredibly slow on my machine (up to 6 times slower). mapcount is indeed the fastest, second only to the wc -l solution (http://stackoverflow.com/questions/845058/how-to-get-line-count-cheaply-in-python/845069#845069). The only drawback I see is the extra 100 MB of memory consumed, which, depending on one's setup, may or may not be acceptable. I think your answer well deserves an upvote :)
SilentGhost
The entire memory-mapped file isn't loaded into memory. You get a virtual memory space, which the OS swaps into and out of RAM as needed. Here's how they're handled on Windows: http://msdn.microsoft.com/en-us/library/ms810613.aspx
Ryan Ginstrom
Sorry, here's a more general reference on memory-mapped files: http://en.wikipedia.org/wiki/Memory-mapped_file And thanks for the vote. :)
Ryan Ginstrom
Even though it's just virtual memory, it is precisely what limits this approach, so it won't work for huge files. I tried it with a ~1.2 GB file of over 10 million lines (as counted with wc -l) and got a WindowsError: [Error 8] Not enough storage is available to process this command. Of course, this is an edge case.
SilentGhost
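For files too large to map in one go on a 32-bit build, one could map the file in fixed-size windows instead; this sketch relies on the offset argument that mmap gained in Python 2.6, and the 64 MB window size is arbitrary:

import mmap
import os

def mapcount_chunked(filename, window=64 * 1024 * 1024):
    # keep the window a multiple of the allocation granularity so offsets stay valid
    window -= window % mmap.ALLOCATIONGRANULARITY
    size = os.path.getsize(filename)
    lines = 0
    f = open(filename, 'rb')
    offset = 0
    while offset < size:
        length = min(window, size - offset)
        chunk = mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                          offset=offset)
        lines += chunk.read(length).count('\n')
        chunk.close()
        offset += length
    f.close()
    return lines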
+1 for real timing data. Do we know if the buffer size of 1024*1024 is optimal, or is there a better one?
Kiv
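Probably the only honest answer to the buffer-size question is to measure on your own disk and OS; here is a quick, unscientific probe one could run, with the candidate sizes and file name picked arbitrarily:

import time

def count_with_buffer(fname, buf_size):
    lines = 0
    f = open(fname)
    read_f = f.read
    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)
    f.close()
    return lines

for size in (64 * 1024, 256 * 1024, 1024 * 1024, 8 * 1024 * 1024):
    start = time.time()
    count_with_buffer('big_file.txt', size)
    print size, ':', time.time() - start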
+3  A: 

One line, probably pretty fast:

num_lines = sum(1 for line in open('myfile.txt'))
Kyle
A: 

Just to complete the above methods, I tried a variant with the fileinput module:

import fileinput as fi

def filecount(fname):
    for line in fi.input(fname):
        pass
    return fi.lineno()

And passed a 60-million-line file to all the methods stated above:

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

It's a bit of a surprise to me that fileinput is that bad and scales so much worse than all the other methods...

BandGap
A: 

What about this?

>>> import sys
>>> sys.stdin=open('fname','r')
>>> data=sys.stdin.readlines()
>>> print "counted",len(data),"lines"
S.C
I don't think it addresses the fact that the large file is being read into memory.
SilentGhost