views:

177

answers:

5

Hi all,

I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.

My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?

Ideally, the library/approach I'm looking for should meet these criteria:

  1. Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
  2. Be fast to implement. I've already written this code and it works in a serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that that might be a fantasy. If I just rewrite it using a different paradigm (ie, as shaders or something), then that would be fine too.
  3. Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
  4. Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
  5. Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.

What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.

I'm leaning right now to something like shaders and opengl 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way-- those 2k operations require accessing all the neighboring pixels in a lot of ways.

A: 

Have you seen Intel's (Open Source) Threading Building Blocks?

sbi
I have not, I shall check it out.
mmr
+1  A: 

Easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores). Then just run a different process for each part.

In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor specific and both NVidia and ATI want to support it.

In general, since you don't need to share too much data, the process if really pretty straightforward.

CookieOfFortune
I'll take a look at it. Does OpenCL require that I specify the number of cores? I'm kind of hoping to just break everything into 'work units' and leave it at that.
mmr
Well, just develop your algorithm to be operable with an arbitrary number of cores.
CookieOfFortune
A: 

I haven't used it, but take a look at Cilk. One of the big wigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely/respected used Algorithms book on the planet. I think it caters well to your requirements.

From my brief readings, all you have to do is "tag" your existing code and then run it thru their compiler which will automatically/seamlessly parallelize the code. This is their big selling point, so you dont need to start from scratch with parallelism in mind, unlike other options (like OpenMP).

imran.fanaswala
A: 

Hi

If you already have a working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries / languages / systems / whatever, is that you can parallelise a loop at a time which means that you can get useful speed-up without having to re-write or, worse, re-design, your program.

In terms of your requirements:

  1. OpenMP is much used in high-performance computing, there's a lot of 'weight' behind it and an active development community -- www.openmp.org.

  2. Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.

  3. OpenMP implements a shared-memory approach to parallel computing, so a big plus in the 'don't need to understand hardware' argument. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available, another plus.

  4. Runs on the hardware you already have, no need for expensive, or cheap, additional graphics cards.

  5. Yep, there are implementations for Windows systems.

Of course, if you were unwise enough to have not chosen C, C++ or Fortran in the beginning a lot of this advice will only apply after you have re-written it into one of those languages !

Regards

Mark

High Performance Mark
+1  A: 

I would also recommend Threading Building Blocks. We use this with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.

Threading Building Blocks(TBB) is similar to both OpenMP and Cilk. And it uses OpenMP to do the multithreading, it is just wrapped in a simpler interface. With it you don't have to worry about how many threads to make, you just define tasks. It will split the tasks, if it can, to keep everything busy and it does the load balancing for you.

Intel Integrated Performance Primitives(Ipp) has optimized libraries for vision. Most of which are multithreaded. For the functions we need that aren't in the IPP we thread them using TBB.

Using these, we obtain the best result when we use the IPP method for creating the images. What it does is it pads each row so that any given cache line is entirely contained in one row. Then we don't divvy up a row in the image across threads. That way we don't have false sharing from two threads trying to write to the same cache line.

Ed_S
I'm familiar with IPP, but found the multithreading/tiling code to be less than useful for me (my images are all ushorts, not uint8). If TBB fixes this problem, then that's pretty exciting...
mmr
We use images that are ushorts as well as uint8. TBB works with either one. In fact, most of the functions that we have written our selves are templates that accept both types and use TBB.
Ed_S
To clarify, IPP has memory allocator for the different types. These allocators make sure that every cache line is entirely contained in one row. They do this by padding the end of each row so that the cache line is full. It wastes a little bit of memory, but on a 1025 wide by 1024 tall image it is only 3%. This is the worse case. Most often you will waste less memory.
Ed_S