views: 77
answers: 4

I asked a related but very general question earlier (see especially this response).

This question is very specific. This is all the code I care about:

result = {}
for line in open('input.txt'):
  key, value = parse(line)
  result[key] = value

The function parse is completely self-contained (i.e., doesn't use any shared resources).

I have an Intel i7-920 CPU (4 cores, 8 threads; I think the thread count is the more relevant number, but I'm not sure).

What can I do to make my program use all the parallel capabilities of this CPU?

I assume I can open this file for reading in 8 different threads without much performance penalty since disk access time is small relative to the total time.

+3  A: 
  1. split the file in 8 smaller files
  2. launch a separate script to process each file
  3. join the results (a minimal sketch of this workflow follows the comments below)

Why that's the best way...

  • It's simple and easy - you don't have to write the code any differently than for linear processing.
  • You get the best performance by launching a small number of long-running processes.
  • The OS handles context switching and I/O multiplexing for you, so you don't have to worry about that (and it does a good job).
  • You can scale to multiple machines without changing the code at all.
  • ...
nosklo
Most effective way to do step 2: bash.
Rafe Kettler
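For concreteness, here is a minimal sketch of the split/launch/join workflow above, using only the standard library. The chunk count is arbitrary, and worker.py is a hypothetical helper script assumed to parse its chunk file and pickle the resulting dict to <chunk>.out:

import pickle
import subprocess

NUM_CHUNKS = 8  # assumption: one chunk per hardware thread

# 1. split the file into NUM_CHUNKS smaller files
lines = open('input.txt').readlines()
chunk_size = (len(lines) + NUM_CHUNKS - 1) // NUM_CHUNKS
chunk_names = []
for i in range(NUM_CHUNKS):
    name = 'chunk_%d.txt' % i
    with open(name, 'w') as chunk:
        chunk.writelines(lines[i * chunk_size:(i + 1) * chunk_size])
    chunk_names.append(name)

# 2. launch a separate worker process per chunk; worker.py is a hypothetical
#    script that parses its input file and pickles a dict to <chunk>.out
procs = [subprocess.Popen(['python', 'worker.py', name]) for name in chunk_names]
for p in procs:
    p.wait()

# 3. join the results written by the workers
result = {}
for name in chunk_names:
    result.update(pickle.load(open(name + '.out', 'rb')))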
+1  A: 

CPython does not easily provide the threading model you are looking for: the GIL keeps only one thread executing Python bytecode at a time, so threads won't use multiple cores for CPU-bound work. You can get something similar using the multiprocessing module and a process pool.

Such a solution could look something like this:

import multiprocessing

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = {}
    for line in lines:
        k, v = parse(line)  # parse() is the function from the question
        result[k] = v
    return result

if __name__ == '__main__':
    # configurable options; different values may work better
    numworkers = 8
    numlines = 100

    lines = open('input.txt').readlines()

    # create the process pool
    pool = multiprocessing.Pool(processes=numworkers)

    # map chunks of numlines lines onto the workers, collecting one dict per chunk
    result_list = pool.map(worker,
        (lines[i:i + numlines] for i in xrange(0, len(lines), numlines)))

    # reduce the per-chunk dicts into a single dict
    result = {}
    for partial in result_list:
        result.update(partial)
TokenMacGuy
and using processes is better anyway for this use case
nosklo
Code using `multiprocessing` performs significantly better on operating systems with `fork` (Linux) than on those without it (Windows) when the amount of shared state (here, the dict returned by `worker()`) is large, because on platforms without `fork` shared data must be pickled, sent over a pipe by the child process, and unpickled in the parent process.
TokenMacGuy
A: 

You can use the multiprocessing module, but if parse() is quick, you won't get much performance improvement that way: the overhead of shipping lines to the worker processes and results back can eat up most of the time saved.

kindall
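A rough way to check whether parallelizing pays off for a given parse() is to time a serial run against Pool.map. In the sketch below, parse_line is just a stand-in for the real parse() (it assumes key=value lines), and chunksize=1000 is an arbitrary value chosen to amortize the per-item inter-process overhead:

import multiprocessing
import time

def parse_line(line):
    # stand-in for the parse() function from the question
    key, _, value = line.partition('=')
    return key.strip(), value.strip()

if __name__ == '__main__':
    lines = open('input.txt').readlines()

    # serial baseline
    start = time.time()
    serial_result = dict(parse_line(line) for line in lines)
    serial_seconds = time.time() - start

    # parallel version; a large chunksize cuts down per-item IPC overhead
    pool = multiprocessing.Pool()
    start = time.time()
    parallel_result = dict(pool.map(parse_line, lines, chunksize=1000))
    parallel_seconds = time.time() - start

    assert serial_result == parallel_result
    print 'serial: %.2fs  parallel: %.2fs' % (serial_seconds, parallel_seconds)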
A: 

As TokenMacGuy said, you can use the multiprocessing module. If you really need to parse a massive amount of data, you should check out the Disco project.

Disco is a distributed computing framework based on the MapReduce paradigm. It is open source and was developed by Nokia Research Center to solve real problems in handling massive amounts of data.

It really scales for jobs where your parse() function is "pure" (i.e., doesn't use any shared resources) and CPU intensive. I tested a job on a single core and then compared that to running it on 3 hosts with 8 cores each; it actually ran 24 times faster on the Disco cluster (note: tested on an unreasonably CPU-intensive job).

Reef