views:

1770

answers:

14

I am working on a program which manipulates images of different sizes. Many of these manipulations read pixel data from an input and write to a separate output (e.g. blur). This is done on a per-pixel basis.

Such image manipulations are very CPU-intensive. I would like to use multithreading to speed things up. How would I do this? I was thinking of creating one thread per row of pixels.

I have several requirements:

  • Executable size must be minimized. In other words, I can't use massive libraries. What's the most lightweight, portable threading library for C/C++?
  • I was thinking of having a function forEachRow(fp*) which runs a thread for each row, or even a forEachPixel(fp*) where fp operates on a single pixel in its own thread. Which is best?
    • Should I use normal functions or functors or functionoids or some lambda functions or ... something else?
    • Some operations use optimizations which require information from the previous pixel processed. This makes forEachRow favorable. Would using forEachPixel be better even considering this?
  • Would I need to lock my read-only and write-only arrays?
    • The input is only read from, but many operations require input from more than one pixel in the array.
    • The output is only written once per pixel.
  • Speed is also important (of course), but optimizing executable size takes precedence.

Thanks.

More information on this topic for the curious: C++ Parallelization Libraries: OpenMP vs. Thread Building Blocks

+4  A: 

I would recommend boost::thread and boost::gil (generic image library). Because there are quite a lot of templates involved, I'm not sure whether the code size will still be acceptable for you. But it's part of boost, so it is probably worth a look.

Johannes Schaub - litb
I would +1, but I don't have any votes left. xD I'll look into boost::thread. boost::gil doesn't seem to be what I'm looking for. We already have a stable image framework optimized for code size.
strager
+1  A: 

I don't think you want to have one thread per row. There can be a lot of rows, and you will spend a lot of memory/CPU resources just launching/destroying the threads and for the CPU to switch from one to the other. Moreover, if you have P processors with C cores each, you probably won't gain much with more than C*P threads.

I would advise you to use a fixed number of worker threads, say N, and use the main thread of your application to distribute the rows to them, or let them simply pull instructions from a "job queue". When a thread has finished with a row, it can check the queue for another row to do.

As for libraries, you can use boost::thread, which is quite portable and not too heavyweight.

ckarmann
Thanks for your answer. I never considered a job queue; good idea. Also, I forgot that there's a hardware limit as well -- thanks for pointing that out. boost::thread::hardware_concurrency is what I need, correct?
strager
I think OpenMP would be an easier way to do this. It'll do all the work of spawning threads and divvying up loop iterations for you.
tgamblin
A: 

One thread per pixel row is insane; it's best to have around n-1 to 2n threads (for n CPUs) and make each one loop, fetching one job unit at a time (which may be one row, or some other kind of partition).

On Unix-like systems, use pthreads; it's simple and lightweight.

Javier
+1  A: 

Can I ask which platform you're writing this for? I'm guessing that because executable size is an issue you're not targeting a desktop machine. In that case, does the platform have multiple cores or hyperthreading? If not, then adding threads to your application could have the opposite effect and slow it down...

We are targeting all kinds of platforms, ranging from desktop computers to handheld devices. Multithreading is important for the desktops, obviously. Size is an issue for reasons relating to packaging, not hardware.
strager
You might want to consider offering different solutions for different architectures. Multithreaded programs are faster if you do have the cores; on a handheld (where performance is even more important), they will slow down the processing.
David Rodríguez - dribeas
@dribeas, I do realize that multithreading on handhelds doesn't increase performance (and will probably decrease performance). That's what #ifdef is for, right? =]
strager
+1  A: 

As a bit of a left-field idea...

What systems are you running this on? Have you thought of using the GPU in your PCs?

Nvidia has the CUDA API for this sort of thing.

Paul
Using GLSL and other shader APIs has been considered, but the project manager doesn't want that to be used. The project may be ported to devices without these technologies, and we want to maintain portability.
strager
A: 

Maybe write your own tiny library which implements a few standard threading functions using #ifdef's for every platform? There really isn't much to it, and that would reduce the executable size way more than any library you could use.

Update: And for work distribution - split your image into pieces and give each thread a piece. So that when it's done with the piece, it's done. This way you avoid implementing job queues that will further increase your executable's size.

Vilx-
+8  A: 

If your compiler supports OpenMP (I know VC++ 8.0 and 9.0 do, as does gcc), it can make things like this much easier to do.

You don't just want to make a lot of threads - there's a point of diminishing returns where adding new threads slows things down as you start getting more and more context switches. At some point, using too many threads can actually make the parallel version slower than just using a linear algorithm. The optimal number of threads is a function of the number of cpus/cores available, and the percentage of time each thread spends blocked on things like I/O. Take a look at this article by Herb Sutter for some discussion on parallel performance gains.

OpenMP lets you easily adapt the number of threads created to the number of CPUs available. Using it (especially in data-processing cases) often involves simply putting a few #pragma omp directives into existing code, and letting the compiler handle creating threads and synchronization.

In general - as long as data isn't changing, you won't have to lock read-only data. If you can be sure that each pixel slot will only be written once and you can guarantee that all the writing has been completed before you start reading from the result, you won't have to lock that either.

For OpenMP, there's no need to do anything special as far as functors / function objects. Write it whichever way makes the most sense to you. Here's an image-processing example from Intel (converts rgb to grayscale):

#pragma omp parallel for
for (int i = 0; i < numPixels; i++)
{
   pGrayScaleBitmap[i] = (unsigned char)
       (pRGBBitmap[i].red * 0.299 +
        pRGBBitmap[i].green * 0.587 +
        pRGBBitmap[i].blue * 0.114);
}

This automatically splits up into as many threads as you have CPUs, and assigns a section of the array to each thread.

Eclipse
I like the concept of OpenMP. Sadly, I don't have a compiler which supports it. I'll have to ask my project manager if using compiler-specific options is okay. OpenMP code runs unthreaded on non-OpenMP compilers, correct?
strager
Yes. It should.
tgamblin
Just tested on gcc-4.3.2; works like a charm. Many thanks! (Hopefully it'll work well for Windows and Mac OS X as well...)
strager
You're wrong about the writing case. See my post below under "fencepost" issues where one thread writes part of a memory word, and another thread writes the other part. Different bytes for each thread, but part of the same memory word.
Mr.Ree
@mrree.myopenid.com, My data is all word-aligned (and not "packed") so this shouldn't be an issue, right?
strager
Such errors may be low-probability. They may not even be possible. But given the potential, and the outright impossibility of deterministic reproduction, let alone debugging, I would err on the side of caution.
Mr.Ree
Put another way: In theory you should be safe. Then again, in theory, there is no difference between theory and practice. In practice, there usually is.
Mr.Ree
A: 
Seriously? You don't think there is inherent parallelism in image processing? Do you know what GPUs do?
tgamblin
Each image manipulation takes anywhere from 8 seconds to 6 minutes on my 1.7GHz 64-bit system, depending on the manipulation chosen and the parameters. I would really like to bring that 6 minutes down. Even 3 minutes (my system is dual core) is acceptable.
strager
I have to agree with @tgamblin's comment. GPUs have literally hundreds of stream processors operating to render the final image. These have many inputs (texture data, lighting parameters, and shader code, for example) and produce one output (a texel). I am doing something similar.
strager
Using assembly isn't really an option -- I need portable code. Also, assembly is more difficult to maintain. I'm looking for a clean solution which doesn't require much code to be changed to implement the optimization.
strager
+6  A: 

Don't embark on threading lightly! The race conditions can be a major pain in the arse to figure out. Especially if you don't have a lot of experience with threads! (You've been warned: Here be dragons! Big hairy non-deterministic impossible-to-reliably-reproduce dragons!)

Do you know what deadlock is? How about Livelock?

That said...


As ckarmann and others have already suggested: Use a work-queue model. One thread per CPU core. Break the work up into N chunks. Make the chunks reasonably large, like many rows. As each thread becomes free, it snags the next work chunk off the queue.

In the simplest IDEAL version, you have N cores, N threads, and N subparts of the problem with each thread knowing from the start exactly what it's going to do.

But that doesn't usually happen in practice due to the overhead of starting/stopping threads. You really want the threads to already be spawned and waiting for action. (E.g. Through a semaphore.)

The work-queue model itself is quite powerful. It lets you parallelize things like quick-sort, which normally doesn't parallelize across N threads/cores gracefully.


More threads than cores? You're just wasting overhead. Each thread has overhead. Even at #threads=#cores, you will never achieve a perfect Nx speedup factor.

One thread per row would be very inefficient! One thread per pixel? I don't even want to think about it. (That per-pixel approach makes a lot more sense when playing with vectorized processor units like they had on the old Crays. But not with threads!)


Libraries? What's your platform? Under Unix/Linux/g++ I'd suggest pthreads & semaphores. (Pthreads is also available under Windows with a Microsoft compatibility layer. But, ugh, I don't really trust it! Cygwin might be a better choice there.)

Under Unix/Linux, man:

* pthread_create, pthread_detach.
* pthread_mutexattr_init, pthread_mutexattr_settype, pthread_mutex_init,
* pthread_mutexattr_destroy, pthread_mutex_destroy, pthread_mutex_lock,
* pthread_mutex_trylock, pthread_mutex_unlock, pthread_mutex_timedlock.
* sem_init, sem_destroy, sem_post, sem_wait, sem_trywait, sem_timedwait.

Some folks like pthreads' condition variables. But I always preferred POSIX 1003.1b semaphores. They handle the situation where you want to signal another thread BEFORE it starts waiting somewhat better. Or where another thread is signaled multiple times.

Oh, and do yourself a favor: Wrap your thread/mutex/semaphore pthread calls into a couple of C++ classes. That will simplify matters a lot!


Would I need to lock my read-only and write-only arrays?

It depends on your precise hardware & software. Usually read-only arrays can be freely shared between threads. But there are cases where that is not so.

Writing is much the same. Usually, as long as only one thread is writing to each particular memory spot, you are ok. But there are cases where that is not so!

Writing is more troublesome than reading as you can get into these weird fencepost situations. Memory is often written as words not bytes. When one thread writes part of the word, and another writes a different part, depending on the exact timing of which thread does what when (e.g. nondeterministic), you can get some very unpredictable results!

I'd play it safe: Give each thread its own copy of the read and write areas. After they are done, copy the data back. All under mutex, of course.

Unless you are talking about gigabytes of data, memory blits are very fast. That couple of microseconds of performance time just isn't worth the debugging nightmare.

If you were to share one common data area between threads using mutexes, the collision/waiting mutex inefficiencies would pile up and devastate your efficiency!


Look, clean data boundaries are the essence of good multi-threaded code. When your boundaries aren't clear, that's when you get into trouble.

Similarly, it's essential to keep everything on the boundary mutexed! And to keep the mutexed areas short!

Try to avoid locking more than one mutex at the same time. If you do lock more than one mutex, always lock them in the same order!

Where possible use ERROR-CHECKING or RECURSIVE mutexes. FAST mutexes are just asking for trouble, with very little actual (measured) speed gain.

If you get into a deadlock situation, run it in gdb, hit ctrl-c, visit each thread and backtrace. You can find the problem quite quickly that way. (Livelock is much harder!)


One final suggestion: Build it single-threaded, then start optimizing. On a single-core system, you may find yourself gaining more speed from things like foo[i++]=bar ==> *(foo++)=bar than from threading.


Addendum: What I said about keeping mutexed areas short up above? Consider two threads: (Given a global shared mutex object of a Mutex class.)

/*ThreadA:*/ while(1){  mutex.lock();  printf("a\n");  usleep(100000); mutex.unlock(); }
/*ThreadB:*/ while(1){  mutex.lock();  printf("b\n");  usleep(100000); mutex.unlock(); }

What will happen?

Under my version of Linux, one thread will run continuously and the other will starve. Very very rarely they will change places when a context swap occurs between mutex.unlock() and mutex.lock().


Addendum: In your case, this is unlikely to be an issue. But with other problems one may not know in advance how long a particular work-chunk will take to complete. Breaking a problem down into 100 parts (instead of 4 parts) and using a work-queue to split it up across 4 cores smooths out such discrepancies.

If one work-chunk takes 5 times longer to complete than another, well, it all evens out in the end. Though with too many chunks, the overhead of acquiring new work-chunks creates noticeable delays. It's a problem-specific balancing act.

Mr.Ree
Thanks for your post; you've cleared up a lot of questions I've had with threading in general. Yes, we are dealing with "gigabytes of data" (more like 64MB blocks), which are all word aligned. I'm going with OpenMP for individual image processing, but when the time comes I'll use "normal" threads.
strager
A: 

Regardless of the threading model you choose (boost, pthread, native threads, etc.), I think you should consider a thread pool as opposed to a thread per row. Threads in a thread pool are very cheap to "start" since they are already created as far as the OS is concerned; it's just a matter of giving them something to do.

Basically, you could have say 4 threads in your pool. Then in a serial fashion, for each pixel, tell the next thread in the thread pool to process the pixel. This way you are effectively processing no more than 4 pixels at a time. You could make the size of the pool based either on user preference or on the number of CPUs the system reports.

This is by far the simplest way IMHO to add threading to a SIMD task.

Evan Teran
Sounds good. I'll implement this alongside an OpenMP implementation and compare. Thanks for your clarification on how thread pools work. =]
strager
Just a note, it's entirely possible (I honestly don't know) that OpenMP uses thread pools already to a reasonable degree. So OpenMP is likely going to work nicely for you as well.
Evan Teran
+1  A: 

To optimize simple image transformations, you are far better off using SIMD vector math than trying to multi-thread your program.

HUAGHAGUAH
I do use SIMD (e.g. SSE) operations for simple transforms. However, normal mapping and blur are a little more complicated and require more computation.
strager
A: 

It is very possible that the bottleneck is not the CPU but memory bandwidth, in which case multithreading WON'T help a lot. Try to minimize memory accesses and work on limited memory blocks, so that more data can be cached. I had a similar problem a while ago, and I decided to optimize my code to use SSE instructions. The speed increase was almost 4x per single thread! Sincerely, 0xDEAD BEEF

0xDEAD BEEF
A: 

Since your compiler doesn't support OpenMP, another option is a library approach: both Intel's Threading Building Blocks and Microsoft's Concurrency Runtime (VS 2010) are available.

There is also a set of interfaces called the Parallel Pattern Library, supported by both libraries, which includes a templated parallel_for call. So instead of:

#pragma omp parallel for
for (i = 0; i < numPixels; i++)
{ ... }

you would write:

parallel_for(0,numPixels,1,ToGrayScale());

where ToGrayScale is a functor or pointer to a function. (Note: if your compiler supports lambda expressions, which it likely doesn't, you can inline the functor as a lambda expression.)

parallel_for(0, numPixels, 1, [&](int i)
{
   pGrayScaleBitmap[i] = (unsigned char)
       (pRGBBitmap[i].red * 0.299 +
        pRGBBitmap[i].green * 0.587 +
        pRGBBitmap[i].blue * 0.114);
});

-Rick

Rick
A: 

I think a map/reduce framework would be the ideal thing to use in this situation. You can use Hadoop Streaming with your existing C++ application.

Just implement the map and reduce jobs.

As you said, you can use row-level manipulations as a map task and combine the row-level results into the final image in the reduce task.

Hope this is useful.

Algorist