Hi all,

I'm having a problem with processing a largeish file in Python. All I'm doing is

import gzip

counter = 0
f = gzip.open(pathToLog, 'r')
for line in f:
        counter = counter + 1
        if (counter % 1000000 == 0):
                print counter
f.close()

This takes around 10m25s just to open the file, read the lines and increment this counter.

In Perl, dealing with the same file and doing quite a bit more (some regular expression stuff), the whole process takes around 1m17s.

Perl Code:

open(LOG, "/bin/zcat $logfile |") or die "Cannot read $logfile: $!\n";
while (<LOG>) {
        if (m/.*\[svc-\w+\].*login result: Successful\.$/) {
                $_ =~ s/some regex here/$1,$2,$3,$4/;
                push @an_array, $_
        }
}
close LOG;

Can anyone advise what I can do to make the Python solution run at a similar speed to the Perl solution?

EDIT I've tried just uncompressing the file and dealing with it using open instead of gzip.open, but that only changes the total time to around 4m14.972s, which is still too slow.

I also removed the modulo and print statements and replaced them with pass, so all that is being done now is iterating over the file line by line (sketched below).
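
For reference, the stripped-down loop being timed looks like this (a sketch of what the edit above describes; pathToLog is the same log path as in the original snippet):

import gzip

f = gzip.open(pathToLog, 'r')
for line in f:
        pass
f.close()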

+8  A: 

In Python (at least <= 2.6.x), gzip format parsing is implemented in pure Python (on top of zlib). What's more, it appears to be doing some strange things, namely decompressing to the end of the file into memory and then discarding everything beyond the requested read size (then doing it again for the next read). DISCLAIMER: I've just looked at gzip.read() for 3 minutes, so I could be wrong here. Regardless of whether my understanding of gzip.read() is correct or not, the gzip module does not appear to be optimized for large data volumes. Try doing the same thing as in Perl, i.e. launching an external process (e.g. see the subprocess module), as in the sketch below.
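
A minimal sketch of that approach (assuming zcat is on the PATH; the helper name is just illustrative):

from subprocess import Popen, PIPE

def count_lines_zcat(path):
    # let the external zcat do the decompression, and read its stdout line by line
    p = Popen(["zcat", path], stdout=PIPE)
    counter = 0
    for line in p.stdout:
        counter = counter + 1
    p.stdout.close()
    p.wait()
    return counter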


EDIT Actually, I missed the OP's remark about plain file I/O being just as slow as compressed (thanks to ire_and_curses for pointing it out). This struck me as unlikely, so I did some measurements...

import gzip
from timeit import Timer

def w(n):
    L = "*"*80+"\n"
    with open("ttt", "w") as f:
        for i in xrange(n) :
            f.write(L)

def r():
    with open("ttt", "r") as f:
        for n,line in enumerate(f) :
            if n % 1000000 == 0 :
                print n

def g():
    f = gzip.open("ttt.gz", "r")
    for n,line in enumerate(f) :
        if n % 1000000 == 0 :
            print n

Now, running it...

>>> Timer("w(10000000)", "from __main__ import w").timeit(1)
14.153118133544922
>>> Timer("r()", "from __main__ import r").timeit(1)
1.6482770442962646
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)

...and after having a tea break and discovering that it's still running, I've killed it, sorry. Then I tried 100'000 lines instead of 10'000'000:

>>> Timer("w(100000)", "from __main__ import w").timeit(1)
0.05810999870300293
>>> Timer("r()", "from __main__ import r").timeit(1)
0.09662318229675293
# here i switched to a terminal and made ttt.gz from ttt
>>> Timer("g()", "from __main__ import g").timeit(1)
11.939290046691895

The gzip module's read time is O(file_size**2), so with line counts on the order of millions, gzip read time simply cannot match plain-file read time (as the experiment above confirms). Anonymouslemming, please check again.
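
If you want to check the scaling yourself, a quick sketch (not part of the measurements above) is to repeat the g() timing at a few sizes; with quadratic behaviour the times should roughly quadruple each time the file doubles. It assumes the w() and g() helpers above and a gzip command on the PATH:

import os
from timeit import Timer

for n in (100000, 200000, 400000):
    w(n)                                  # rewrite ttt with n lines
    os.system("gzip -c ttt > ttt.gz")     # recompress it for g()
    print n, Timer("g()", "from __main__ import g").timeit(1)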

atzz
This is interesting, but isn't it beside the point? Whether this is true or not, the OP points out that looping over the uncompressed file (and therefore not using the `gzip` module at all) in Python is also slow.
ire_and_curses
@ire_and_curses -- thanks for pointing this out, I'll update my answer.
atzz
That's really bizarre about `gzip.open`. Is this expected behaviour by design?
Santa
@Santa -- good question... I guess the module was not intended for processing large volumes of data (and on small files it's fast enough), but I don't know.
atzz
@Santa: If you have to ask, then it's never expected behaviour.
BlueRaja - Danny Pflughoeft
Apologies, I was wrong about the uncompressed time... I've re-run my uncompressed test, and the time is 4m14.972s. This is still longer than the equivalent Perl run time to process the gzipped file, and the uncompress process takes quite a long time on this hardware. For reference, the file is about 25,000,000 lines (around 2.5GB).
Anonymouslemming
A: 

It took your computer 10 minutes? It must be your hardware. I wrote this function to write 5 million lines:

def write():
    fout = open('log.txt', 'w')
    for i in range(5000000):
        fout.write(str(i/3.0) + "\n")
    fout.close()

Then I read it with a program much like yours:

def read():
    fin = open('log.txt', 'r')
    counter = 0
    for line in fin:
        counter += 1
        if counter % 1000000 == 0:
            print counter
    fin.close()

It took my computer about 3 seconds to read all 5 million lines.

dln385
But how long did it take to read with Perl on your hardware? It's the relative, not absolute, performance the OP was asking about.
Corey Porter
We develop on virtual machines and deploy to physical machines. Like @Cory said, I'm not worried about standalone performance; I'm worried about the difference between Python and Perl here.
Anonymouslemming
@Anonymouslemming You should take any performance number that touches the disk within a VM with a grain of salt. Disk IO is by far the weakest part of VMs. I've seen VMs lock up for hours on an fsync.
mikelikespie
+4  A: 

If you Google "why is python gzip slow" you'll find plenty of discussion of this, including patches for improvements in Python 2.7 and 3.2. In the meantime, use zcat as you did in Perl, which is wicked fast. Your (first) function takes me about 4.19s with a 5MB compressed file, and the second function takes 0.78s. However, I don't know what's going on with your uncompressed files. If I uncompress the log files (Apache logs) and run the two functions on them with a simple Python open(file) and Popen('cat'), Python is faster (0.17s) than cat (0.48s).

#!/usr/bin/python

import gzip
from subprocess import PIPE, Popen
import sys
import timeit

#pathToLog = 'big.log.gz' # 50M compressed (*10 uncompressed)
pathToLog = 'small.log.gz' # 5M ""

def test_ori():
    counter = 0
    f = gzip.open(pathToLog, 'r')
    for line in f:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line
    f.close()

def test_new():
    counter = 0
    content = Popen(["zcat", pathToLog], stdout=PIPE).communicate()[0].split('\n')
    for line in content:
        counter = counter + 1
        if (counter % 100000 == 0): # 1000000
            print counter, line

if '__main__' == __name__:
    to = timeit.Timer('test_ori()', 'from __main__ import test_ori')
    print "Original function time", to.timeit(1)

    tn = timeit.Timer('test_new()', 'from __main__ import test_new')
    print "New function time", tn.timeit(1)
Reagle
Quick question ... What version of Python ships with subprocess? On my box I am getting:

Traceback (most recent call last):
  File "./fileperformance.py", line 4, in ?
    from subprocess import PIPE, Popen
ImportError: No module named subprocess
Anonymouslemming
Ah - that's very unhelpful... Ships with 2.4 and RHEL4 only has 2.3.4 :(
Anonymouslemming
Well, you could just use backticks or os.system() then.
Reagle
Or rather, os.popen()
janneb
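
A sketch of that os.popen() fallback for older Pythons without the subprocess module (zcat is assumed to be available; pathToLog is the OP's log path):

import os

counter = 0
f = os.popen("zcat " + pathToLog)
for line in f:
    counter = counter + 1
f.close()
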
+1  A: 

I spent a while on this. Hopefully this code will do the trick. It uses zlib and no external calls.

The gunzipchunks function reads the compressed gzip file in chunks and yields the decompressed data (it's a generator).

The gunziplines function consumes those decompressed chunks and yields one line at a time (another generator).

Finally, the gunziplinescounter function gives you what you're looking for.

Cheers!

import zlib

file_name = 'big.txt.gz'
#file_name = 'mini.txt.gz'

#for i in gunzipchunks(file_name): print i
def gunzipchunks(file_name,chunk_size=4096):
    inflator = zlib.decompressobj(16+zlib.MAX_WBITS)
    f = open(file_name,'rb')
    while True:
        packet = f.read(chunk_size)
        if not packet: break
        to_do = inflator.unconsumed_tail + packet
        while to_do:
            decompressed = inflator.decompress(to_do, chunk_size)
            if not decompressed:
                to_do = None
                break
            yield decompressed
            to_do = inflator.unconsumed_tail
    leftovers = inflator.flush()
    if leftovers: yield leftovers
    f.close()

#for i in gunziplines(file_name): print i
def gunziplines(file_name,leftovers="",line_ending='\n'):
    for chunk in gunzipchunks(file_name): 
        chunk = "".join([leftovers,chunk])
        while line_ending in chunk:
            line, leftovers = chunk.split(line_ending,1)
            yield line
            chunk = leftovers
    if leftovers: yield leftovers

def gunziplinescounter(file_name):
    for counter,line in enumerate(gunziplines(file_name)):
        if (counter % 1000000 != 0): continue
        print "%12s: %10d" % ("checkpoint", counter)
    print "%12s: %10d" % ("final result", counter)
    print "DEBUG: last line: [%s]" % (line)

gunziplinescounter(file_name)

This should run a whole lot faster than using the built-in gzip module on extremely large files.
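
For example, a usage sketch echoing the original Perl filter (file_name is the same variable as above):

for line in gunziplines(file_name):
    if 'login result: Successful' in line:
        print line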

Unblogworthy