Hi all,

I'm having a problem with processing a largeish file in Python. All I'm doing is

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
        counter = counter + 1
        if (counter % 1000000 == 0):
                print counter
f.close()

This takes around 10m25s just to open the file, read the lines and increment this counter.

In Perl, dealing with the same file and doing quite a bit more (some regular expression stuff), the whole process takes around 1m17s.

Perl Code:

open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
        if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
                $_ =~ s/some regex here/$1,$2,$3,$4/;
                push @an_array, $_
        }
}
close LOG;

Can anyone advise what I can do to make the Python solution run at a similar speed to the Perl solution?

EDIT I've tried just uncompressing the file and dealing with it using open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.

I also removed the modulo and print statements and replaced them with pass, so all that is being done now is iterating over the file line by line (sketched below).
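
For reference, the stripped-down loop being timed looks like this (a sketch of what the edit above describes; pathToLog is the same log path as in the original snippet):

import gzip

f = gzip.open(pathToLog, 'r')
for line in f:
        pass
f.close()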

+8  A: 

In Python (at least <= 2.6.x), gzip format parsing is implemented in pure Python (on top of zlib). What's more, it appears to be doing some strange things, namely decompressing to the end of the file into memory and then discarding everything beyond the requested read size (then doing it again for the next read). DISCLAIMER: I've just looked at gzip.read() for 3 minutes, so I could be wrong here. Regardless of whether my understanding of gzip.read() is correct or not, the gzip module does not appear to be optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (e.g. see the subprocess module), as in the sketch below.
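
A minimal sketch of that approach (assuming zcat is on the PATH; the helper name is just illustrative):

from subprocess import Popen, PIPE

def count_lines_zcat(path):
    # let the external zcat do the decompression, and read its stdout line by line
    p = Popen(["zcat", path], stdout=PIPE)
    counter = 0
    for line in p.stdout:
        counter = counter + 1
    p.stdout.close()
    p.wait()
    return counter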


EDIT Actually, I missed the OP's remark about plain file I/O being just as slow as compressed (thanks to ire_and_curses for pointing it out). This struck me as unlikely, so I did some measurements...

import gzip
from timeit import Timer

def w(n):
    L = "*"*80+"\n"
    with open("ttt", "w") as f:
        for i in xrange(n) :
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n,line in enumerate(f) :
            if n % 1000000 == 0 :
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n,line in enumerate(f) :
        if n % 1000000 == 0 :
            print n

Now, running it...

>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)

...and after having a tea break and discovering that it's still running, I've killed it, sorry. Then I tried 100'000 lines instead of 10'000'000:

>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895

The gzip module's read time is O(file_size**2), so with line counts on the order of millions, gzip read time simply cannot match plain-file read time (as the experiment above confirms). Anonymouslemming, please check again.
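
If you want to check the scaling yourself, a quick sketch (not part of the measurements above) is to repeat the g() timing at a few sizes; with quadratic behaviour the times should roughly quadruple each time the file doubles. It assumes the w() and g() helpers above and a gzip command on the PATH:

import os
from timeit import Timer

for n in (100000, 200000, 400000):
    w(n)                                  # rewrite ttt with n lines
    os.system("gzip -c ttt > ttt.gz")     # recompress it for g()
    print n, Timer("g()", "from __main__ import g").timeit(1)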

atzz
This is interesting, but isn't it beside the point? Whether this is true or not, the OP points out that looping over the uncompressed file (and therefore not using the `gzip` module at all) in Python is also slow.
ire_and_curses
@ire_and_curses -- thanks for pointing this out, I'll update my answer.
atzz
That's really bizarre about `gzip.open`. Is this expected behaviour by design?
Santa
@Santa -- good question... I guess the module was not intended for processing large volumes of data (and on small files it's fast enough), but I don't know.
atzz
@Santa: If you have to ask, then it's never expected behaviour.
BlueRaja - Danny Pflughoeft
Apologies, I was wrong about the uncompressed time... I've re-run my uncompressed test, and the time is 4m14.972s. This is still longer than the equivalent Perl run time to process the gzipped file, and the uncompress process takes quite a long time on this hardware. For reference, the file is about 25,000,000 lines (around 2.5GB).
Anonymouslemming
A: 

It took your computer 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:

def write():
    fout = open('log.txt', 'w')
    for i in range(5000000):
        fout.write(str(i/3.0) + "\n")
    fout.close()

Then I read it with a program much like yours:

def read():
    fin = open('log.txt', 'r')
    counter = 0
    for line in fin:
        counter += 1
        if counter % 1000000 == 0:
            print counter
    fin.close()

It took my computer about 3 seconds to read all 5 million lines.

dln385
But how long did it take to read with Perl on your hardware? It's the relative, not absolute, performance the OP was asking about.
Corey Porter
We develop on virtual machines and deploy to physical machines. Like @Cory said, I'm not worried about standalone performance; I'm worried about the difference between Python and Perl here.
Anonymouslemming
@Anonymouslemming You should take any performance number that touches the disk within a VM with a grain of salt. Disk IO is by far the weakest part of VMs. I've seen VMs lock up for hours on an fsync.
mikelikespie
+4  A: 

If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl, which is wicked fast. Your (first) function takes me about 4.19s with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files. If I uncompress the log files (Apache logs) and run the two functions on them with a simple Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
Reagle
Quick question ... What version of Python ships with subprocess? On my box I am getting:

Traceback (most recent call last):
  File "./fileperformance.py", line 4, in ?
    from subprocess import PIPE, Popen
ImportError: No module named subprocess
Anonymouslemming
Ah - that's very unhelpful... Ships with 2.4 and RHEL4 only has 2.3.4 :(
Anonymouslemming
Well, you could just use backticks or os.system() then.
Reagle
Or rather, os.popen()
janneb
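
A sketch of that os.popen() fallback for older Pythons without the subprocess module (zcat is assumed to be available; pathToLog is the OP's log path):

import os

counter = 0
f = os.popen("zcat " + pathToLog)
for line in f:
    counter = counter + 1
f.close()
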
+1  A: 

I spent a while on this. Hopefully this code will do the trick. It uses zlib and no external calls.

The gunzipchunks function reads the compressed gzip file in chunks and yields the decompressed data (it's a generator).

The gunziplines function consumes those decompressed chunks and yields one line at a time (another generator).

Finally, the gunziplinescounter function gives you what you're looking for.

Cheers!

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name): 
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending,1)
            yield line
            chunk = leftovers
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

This should run a whole lot faster than using the built-in gzip module on extremely large files.
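
For example, a usage sketch echoing the original Perl filter (file_name is the same variable as above):

for line in gunziplines(file_name):
    if 'login result: Successful' in line:
        print line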

Unblogworthy