views: 77
answers: 4

I asked a related but very general question earlier (see especially this response).

This question is very specific. This is all the code I care about:

result = {}
for line in open('input.txt'):
  key, value = parse(line)
  result[key] = value

The function parse is completely self-contained (i.e., doesn't use any shared resources).

I have an Intel i7-920 CPU (4 cores, 8 threads; I think the thread count is the more relevant number, but I'm not sure).

What can I do to make my program use all the parallel capabilities of this CPU?

I assume I can open this file for reading in 8 different threads without much performance penalty since disk access time is small relative to the total time.

+3  A: 
  1. split the file in 8 smaller files
  2. launch a separate script to process each file
  3. join the results (a minimal sketch of this workflow follows the comments below)

Why that's the best way...

  • It's simple and easy - you don't have to write the code any differently than for linear processing.
  • You get the best performance by launching a small number of long-running processes.
  • The OS handles context switching and I/O multiplexing for you, so you don't have to worry about that (and it does a good job).
  • You can scale to multiple machines without changing the code at all.
  • ...
nosklo
Most effective way to do step 2: bash.
Rafe Kettler
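For concreteness, here is a minimal sketch of the split/launch/join workflow above, using only the standard library. The chunk count is arbitrary, and worker.py is a hypothetical helper script assumed to parse its chunk file and pickle the resulting dict to <chunk>.out:

import pickle
import subprocess

NUM_CHUNKS = 8  # assumption: one chunk per hardware thread

# 1. split the file into NUM_CHUNKS smaller files
lines = open('input.txt').readlines()
chunk_size = (len(lines) + NUM_CHUNKS - 1) // NUM_CHUNKS
chunk_names = []
for i in range(NUM_CHUNKS):
    name = 'chunk_%d.txt' % i
    with open(name, 'w') as chunk:
        chunk.writelines(lines[i * chunk_size:(i + 1) * chunk_size])
    chunk_names.append(name)

# 2. launch a separate worker process per chunk; worker.py is a hypothetical
#    script that parses its input file and pickles a dict to <chunk>.out
procs = [subprocess.Popen(['python', 'worker.py', name]) for name in chunk_names]
for p in procs:
    p.wait()

# 3. join the results written by the workers
result = {}
for name in chunk_names:
    result.update(pickle.load(open(name + '.out', 'rb')))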
+1  A: 

CPython does not easily provide the threading model you are looking for: the GIL keeps only one thread executing Python bytecode at a time, so threads won't use multiple cores for CPU-bound work. You can get something similar using the multiprocessing module and a process pool.

Such a solution could look something like this:

import multiprocessing

def worker(lines):
    """Make a dict out of the parsed, supplied lines"""
    result = {}
    for line in lines:
        k, v = parse(line)  # parse() is the function from the question
        result[k] = v
    return result

if __name__ == '__main__':
    # configurable options; different values may work better
    numworkers = 8
    numlines = 100

    lines = open('input.txt').readlines()

    # create the process pool
    pool = multiprocessing.Pool(processes=numworkers)

    # map chunks of numlines lines onto the workers, collecting one dict per chunk
    result_list = pool.map(worker,
        (lines[i:i + numlines] for i in xrange(0, len(lines), numlines)))

    # reduce the per-chunk dicts into a single dict
    result = {}
    for partial in result_list:
        result.update(partial)
TokenMacGuy
and using processes is better anyway for this use case
nosklo
Code using `multiprocessing` performs significantly better on operating systems with `fork` (Linux) than on those without it (Windows) when the amount of shared state (here, the dict returned by `worker()`) is large, because on platforms without `fork` shared data must be pickled, sent over a pipe by the child process, and unpickled in the parent process.
TokenMacGuy
A: 

You can use the multiprocessing module, but if parse() is quick, you won't get much performance improvement that way: the overhead of shipping lines to the worker processes and results back can eat up most of the time saved.

kindall
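A rough way to check whether parallelizing pays off for a given parse() is to time a serial run against Pool.map. In the sketch below, parse_line is just a stand-in for the real parse() (it assumes key=value lines), and chunksize=1000 is an arbitrary value chosen to amortize the per-item inter-process overhead:

import multiprocessing
import time

def parse_line(line):
    # stand-in for the parse() function from the question
    key, _, value = line.partition('=')
    return key.strip(), value.strip()

if __name__ == '__main__':
    lines = open('input.txt').readlines()

    # serial baseline
    start = time.time()
    serial_result = dict(parse_line(line) for line in lines)
    serial_seconds = time.time() - start

    # parallel version; a large chunksize cuts down per-item IPC overhead
    pool = multiprocessing.Pool()
    start = time.time()
    parallel_result = dict(pool.map(parse_line, lines, chunksize=1000))
    parallel_seconds = time.time() - start

    assert serial_result == parallel_result
    print 'serial: %.2fs  parallel: %.2fs' % (serial_seconds, parallel_seconds)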
A: 

As TokenMacGuy said, you can use the multiprocessing module. If you really need to parse a massive amount of data, you should check out the Disco project.

Disco is a distributed computing framework based on the MapReduce paradigm. It is open source and was developed by Nokia Research Center to solve real problems in handling massive amounts of data.

It really scales for jobs where your parse() function is "pure" (i.e., doesn't use any shared resources) and CPU intensive. I tested a job on a single core and then compared that to running it on 3 hosts with 8 cores each; it actually ran 24 times faster on the Disco cluster (note: tested on an unreasonably CPU-intensive job).

Reef