I am currently designing an application that has one module which will load large amounts of data from a database and reduce it to a much smaller set by various calculations depending on the circumstances.

Many of the more intensive operations behave deterministically and would lend themselves to parallel processing.

Provided I have a loop that iterates over a large number of data chunks arriving from the db and, for each one, calls a deterministic function without side effects, how would I make it so that the program does not wait for the function to return but instead kicks off the next calls, so they can be processed in parallel? A naive approach that demonstrates the principle would do me for now.

I have read Google's MapReduce paper, and while I could use the overall principle in a number of places, I won't, for now, target large clusters; rather, it's going to be a single multi-core or multi-CPU machine for version 1.0. So currently, I'm not sure whether I could actually use an existing library or would have to roll a dumbed-down basic version myself.

I am at an early stage of the design process, and so far I am targeting C-something (for the speed-critical bits) and Python (for the productivity-critical bits) as my languages. If there are compelling reasons, I might switch, but so far I am content with my choice.

Please note that I'm aware it might take longer to retrieve the next chunk from the database than to process the current one, in which case the whole process would be I/O-bound. For now, however, I'll assume it isn't and, in practice, use a db cluster, memory caching, or something else to avoid being I/O-bound at this point.

+3  A: 

Well, if .NET is an option, they have put a lot of effort into parallel computing.

EBGreen
+3  A: 

You can implement the algorithm from Google's MapReduce without having physically separate machines. Just consider each of those "machines" to be "threads." Threads are automatically distributed on multi-core machines.
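As a rough single-machine illustration of that idea, here is a minimal sketch in C with pthreads (not Google's library; the input data, the squaring "map" step, and the summing "reduce" step are all made-up placeholders). Each thread maps over its own slice of the input, and the main thread reduces the per-thread partials once they have joined:

#include <pthread.h>
#include <stdio.h>

#define N_ITEMS   1000
#define N_THREADS 4

static int data[N_ITEMS];        /* the input "records"                 */
static long partial[N_THREADS];  /* one partial result per map thread   */

struct slice { int begin, end, id; };

/* map phase: each thread processes its own slice independently */
static void *map_worker(void *arg)
{
    struct slice *s = arg;
    long acc = 0;
    for (int i = s->begin; i < s->end; i++)
        acc += (long)data[i] * data[i];   /* placeholder "map" step */
    partial[s->id] = acc;
    return NULL;
}

int main(void)
{
    pthread_t threads[N_THREADS];
    struct slice slices[N_THREADS];
    int chunk = N_ITEMS / N_THREADS;

    for (int i = 0; i < N_ITEMS; i++)
        data[i] = i;

    for (int t = 0; t < N_THREADS; t++) {
        slices[t].begin = t * chunk;
        slices[t].end   = (t == N_THREADS - 1) ? N_ITEMS : (t + 1) * chunk;
        slices[t].id    = t;
        pthread_create(&threads[t], NULL, map_worker, &slices[t]);
    }

    /* reduce phase: combine the per-thread partials after all maps finish */
    long total = 0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(threads[t], NULL);
        total += partial[t];
    }
    printf("sum of squares: %ld\n", total);
    return 0;
}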

Jason Cohen
+2  A: 

If you're working with a compiler that supports it, I would suggest taking a look at http://www.openmp.org for a way of annotating your code so that certain loops are parallelized.

It does a lot more as well, and you might find it very helpful.

Their web page reports that gcc 4.2 will support OpenMP, for example.
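For a flavour of the annotation style, here is a minimal sketch, assuming a compiler with OpenMP support (e.g. gcc -fopenmp); the loop body is a made-up placeholder:

#include <stdio.h>

#define N 1000000

int main(void)
{
    static double results[N];

    /* a single annotation asks the compiler to divide the loop
       iterations among a team of threads */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        results[i] = i * 0.5;   /* placeholder per-chunk work */

    printf("results[42] = %f\n", results[42]);
    return 0;
}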

Thomas Kammeyer
+2  A: 

I might be missing something here, but this seems fairly straightforward using pthreads.

Set up a small thread pool with N threads in it and have one thread control them all.

The master thread simply sits in a loop doing something like:

  1. Get a data chunk from the DB
  2. Find the next free thread; if no thread is free, wait
  3. Hand the chunk over to that worker thread
  4. Go back and get the next chunk from the DB

In the meantime, the worker threads sit and do:

  1. Mark myself as free
  2. Wait for the master thread to give me a chunk of data
  3. Process the chunk of data
  4. Mark myself as free again

The method by which you implement this can be as simple as two mutex-controlled arrays: one holds the worker threads (the thread pool), and the other indicates whether each corresponding thread is free or busy. A sketch along those lines follows below.

Tweak N to your liking ...
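A minimal sketch of that scheme, assuming one mutex-and-condition-variable "mailbox" per worker; get_chunk_from_db and process are made-up placeholders, and the master simply polls the pool for a free worker, which is naive but demonstrates the principle:

#include <pthread.h>
#include <stdio.h>

#define N_WORKERS 4
#define N_CHUNKS  20

struct worker {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t  wakeup;
    int busy;     /* 0 = free, 1 = has a chunk to process          */
    int done;     /* set by the master when there are no more chunks */
    int chunk;    /* the "data chunk"; a real one would be a struct */
};

static struct worker workers[N_WORKERS];

static int  get_chunk_from_db(int i) { return i; }               /* placeholder */
static void process(int chunk) { printf("processed %d\n", chunk); } /* placeholder */

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->busy && !w->done)      /* wait for the master to hand us work */
            pthread_cond_wait(&w->wakeup, &w->lock);
        if (w->done && !w->busy)
            break;
        int chunk = w->chunk;
        pthread_mutex_unlock(&w->lock);
        process(chunk);                   /* do the work outside the lock */
        pthread_mutex_lock(&w->lock);
        w->busy = 0;                      /* mark myself as free again */
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N_WORKERS; i++) {
        pthread_mutex_init(&workers[i].lock, NULL);
        pthread_cond_init(&workers[i].wakeup, NULL);
        workers[i].busy = workers[i].done = 0;
        pthread_create(&workers[i].thread, NULL, worker_main, &workers[i]);
    }

    /* master loop: hand each chunk to the first free worker */
    for (int c = 0; c < N_CHUNKS; c++) {
        int chunk = get_chunk_from_db(c);
        for (int i = 0; ; i = (i + 1) % N_WORKERS) {
            struct worker *w = &workers[i];
            pthread_mutex_lock(&w->lock);
            if (!w->busy) {
                w->chunk = chunk;
                w->busy  = 1;
                pthread_cond_signal(&w->wakeup);
                pthread_mutex_unlock(&w->lock);
                break;
            }
            pthread_mutex_unlock(&w->lock);
        }
    }

    for (int i = 0; i < N_WORKERS; i++) {   /* tell everyone to shut down */
        pthread_mutex_lock(&workers[i].lock);
        workers[i].done = 1;
        pthread_cond_signal(&workers[i].wakeup);
        pthread_mutex_unlock(&workers[i].lock);
        pthread_join(workers[i].thread, NULL);
    }
    return 0;
}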

Christian
+3  A: 

If you still plan on using Python, you might want to have a look at Processing (its API later entered the standard library as the multiprocessing module). It uses processes rather than threads for parallel computing (because of the Python GIL) and provides classes for distributing "work items" onto several processes. Using the Pool class, you can write code like the following:

import processing

def worker(i):
    return i*i

num_workers = 2
pool = processing.Pool(num_workers)
result = pool.imap(worker, range(100000))

This is a parallel version of itertools.imap that distributes the calls across the worker processes. You can also use the apply_async method of the pool and store the lazy result objects in a list:

results = []
for i in range(10000):
    results.append(pool.apply_async(worker, (i,)))  # args are passed as a tuple
values = [r.get() for r in results]  # get() blocks until each result is ready

For further reference, see the documentation of the Pool class.

Gotchas:

  • processing uses fork() on Unix, so you have to be careful on Win32, where there is no fork
  • objects transferred between processes need to be pickleable
  • if the workers are relatively fast, you can tweak chunksize, i.e. the number of work items sent to a worker process in one batch
  • processing.Pool uses a background thread
Torsten Marek
A: 

Java provides the same kind of thread pool (see java.util.concurrent). Note, though, that it is the tasks submitted to the pool, not the threads themselves, that can be serialised, sent to other computers, and deserialised to run there.

A: 

I have developed a MapReduce library for multi-threaded/multi-core use on a single server. Everything is taken care of by the library; the user just has to implement Map and Reduce. It is positioned as a candidate Boost library, but has not yet been accepted as a formal lib. Check out http://www.craighenderson.co.uk/mapreduce

CraigH
A: 

You may be interested in examining the code of libdispatch, which is the open source implementation of Apple's Grand Central Dispatch.

mouviciel
A: 

Intel's TBB or boost::mpi might be of interest to you also.

piotr