I need to read some very large text files (100+ MB), process every line with a regex, and store the data in a structure. My structure inherits from defaultdict and has a read(self) method that reads the file self.file_name.

Look at this very simple (but not real) example; here I'm not using a regex, just splitting lines:


import multiprocessing
from collections import defaultdict

def SingleContainer():
    return list()

class Container(defaultdict):
    """
    This class stores odd lines in self["odd"] and even lines in self["even"].
    It is trivial, but it's only an example. In the real case the class
    has additional methods that do computations on the data it has read.
    """
    def __init__(self, file_name):
        if not isinstance(file_name, str):
            raise AttributeError("%s is not a string" % file_name)
        defaultdict.__init__(self, SingleContainer)
        self.file_name = file_name
        self.readen_lines = 0
    def read(self):
        f = open(self.file_name)
        print "start reading file %s" % self.file_name
        for line in f:
            self.readen_lines += 1
            values = line.split()
            # even/odd bookkeeping based on the 1-based line counter
            key = {0: "even", 1: "odd"}[self.readen_lines % 2]
            self[key].append(values)
        f.close()
        print "read %d lines from file %s" % (self.readen_lines, self.file_name)

def do(file_name):
    container = Container(file_name)
    container.read()
    return container.items()

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    pool = multiprocessing.Pool(len(file_names))
    result = pool.map(do, file_names)
    pool.close()
    pool.join()
    print "Finish"      

At the end I need to join all the results into a single Container. It is important that the order of the lines is preserved. My approach is too slow when returning the values. Is there a better solution? I'm using Python 2.6 on Linux.
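
For reference, the merge step I have in mind is roughly the following sketch (not my real code; "merged" is just a label). Since pool.map returns the results in the same order as file_names, appending in that order keeps the per-file line order:

merged = Container("merged")           # the file_name argument is only used as a label here
for items in result:                   # result comes from pool.map(do, file_names)
    for key, values in items:
        merged[key].extend(values)     # values stay in their original per-file order
print "merged %d even and %d odd lines" % (len(merged["even"]), len(merged["odd"]))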

A: 

Multiprocessing is more suited to CPU- or memory-oriented processes since the seek time of rotational drives kills performance when switching between files. Either load your log files into a fast flash drive or some sort of memory disk (physical or virtual), or give up on multiprocessing.

Ignacio Vazquez-Abrams
My problem is CPU-bound, not I/O-bound. In this example I'm splitting lines, but in the real case I'm working with a long, complex regex, and the I/O time (seek, ...) is much smaller than the CPU time.
wiso
A: 

You're creating a pool with as many workers as files. That may be too many. Usually, I aim to have the number of workers around the same as the number of cores.

The simple fact is that your final step is going to be a single process merging all the results together. There is no avoiding this, given your problem description. This is known as a barrier synchronization: all tasks have to reach the same point before any can proceed.

You should probably run this program multiple times, or in a loop, passing a different value to multiprocessing.Pool() each time, starting at 1 and going to the number of cores. Time each run, and see which worker count does best.

The result will depend on how CPU-intensive (as opposed to disk-intensive) your task is. I would not be surprised if 2 were best if your task is about half CPU and half disk, even on an 8-core machine.
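
Something along these lines, for example (only a sketch; your_script is a hypothetical module name for the code in your question):

import time
import multiprocessing
from your_script import do   # hypothetical import of the do() from the question

def timed_run(n_workers, file_names):
    # time one full pool.map pass with a given number of workers
    pool = multiprocessing.Pool(n_workers)
    start = time.time()
    pool.map(do, file_names)
    pool.close()
    pool.join()
    return time.time() - start

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    for n in range(1, multiprocessing.cpu_count() + 1):
        print "%d worker(s): %.2f seconds" % (n, timed_run(n, file_names))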

Mike DeSimone
Yes, I already did that. My choice was not random: I measured the time without the return line, and the best result was when the number of processes equals the number of files, even when that is greater than the number of cores.
wiso
Then I don't see how you can do much better. The killer is: "It is important that the order of the lines is preserved." That can only be done with one input at a time, even if you preprocess each file independently. Your alternative would be to have each worker write its output to a file with a suffix, and have whatever consumes these files read them in order, so the merge gets eliminated (see the sketch below).
Mike DeSimone
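
Roughly like this, for example (only a sketch; do_to_file and the ".parsed" suffix are made-up names, and Container is the class from the question):

import cPickle

def do_to_file(file_name):
    # same work as do(), but dump the result to a suffixed file
    # instead of returning it through the multiprocessing pipe
    container = Container(file_name)
    container.read()
    out_name = file_name + ".parsed"
    out = open(out_name, "wb")
    cPickle.dump(container.items(), out, cPickle.HIGHEST_PROTOCOL)
    out.close()
    return out_name   # only a short string goes back to the parent

# the consumer then loads the ".parsed" files in the order of file_names,
# so the per-file line order is preserved and the big merge disappears
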
+2  A: 

You're probably hitting two problems.

One of them was mentioned: you're reading multiple files at once. Those reads will end up being interleaved, causing disk thrashing. You want to read whole files at once, and then only multithread the computation on the data.

Second, you're hitting the overhead of Python's multiprocessing module. It's not actually using threads; it starts multiple processes and serializes the results back through a pipe. That's very slow for bulk data; in fact, it seems to be slower than the work you're doing in the worker (at least in this example). This is the real-world problem caused by the GIL.

If I modify do() to return None instead of container.items() to disable the extra data copy, this example is faster than a single thread, as long as the files are already cached:

Two threads: 0.36 elapsed, 168% CPU

One thread (replace pool.map with map): 0:00.52 elapsed, 98% CPU

Unfortunately, the GIL problem is fundamental and can't be worked around from inside Python.
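
For reference, the modification I timed is just this one-line change to do() (a sketch):

def do(file_name):
    container = Container(file_name)
    container.read()
    return None   # don't pickle the bulk results back through the pipe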

Glenn Maynard
Yes, this is the problem: returning the data. I'm using multiprocessing and not multithreading because of the GIL. But I want to optimize my program to use all the cores of my CPU! If I measure the time between "start reading file" and "read %d lines" (ignoring the return time), the multiprocessing version is 2 times faster than the single-process version (I have 2 cores). Now, what about shared memory? I looked at the multiprocessing.Manager class, but I want to share a structure more complex than a dict.
wiso
I haven't used Manager, but it looks like it proxies every manipulation of the data, so I suspect it's even slower. You can use shared memory to share simple blocks of memory, but not native Python types. You might want to look for other optimizations, but without any actual code I can't make any suggestions.
Glenn Maynard
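
To show what I mean by "simple blocks of memory", something like this is about all shared memory gives you directly (an illustrative sketch, not tied to your code; it shares a flat array of C ints, not nested lists of strings):

import multiprocessing

def count_lines(file_name, slot, counts):
    n = 0
    for line in open(file_name):
        n += 1
    counts[slot] = n   # a plain C int living in shared memory

if __name__ == "__main__":
    file_names = ["r1_200909.log", "r1_200910.log"]
    counts = multiprocessing.Array('i', len(file_names))   # shared array of C ints
    procs = [multiprocessing.Process(target=count_lines, args=(f, i, counts))
             for i, f in enumerate(file_names)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print "line counts:", list(counts)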
Maybe a solution could be to rewrite the read function in C++ and use real multithreading there? With this approach, could I bypass the problem of sharing data between processes (the pipe)?
wiso
You'll always need to hold the GIL while you construct the Python data structures, and it's a lot more work (on your part) to do the parsing and construct the results in C than in Python. I can't say whether this would be a good idea or not, but it sounds like a mess.
Glenn Maynard