Sorry for the vague topic question, but I'm working on some academic video processing routines. The algorithms are written in MATLAB, and while that's fine for development purposes, it processes video at about 60 seconds per frame, or around 0.0166 fps. Needless to say, this won't be sufficient for demos and such, so my summer job is to convert the routine to something that runs drastically faster.

I have rewritten the slowest portion of the code for CUDA, NVIDIA's GPGPU solution. However, there is also a large portion of the code that seems better done on the CPU, as it is relatively serial. The problem is, the machine I was given has two Xeon processors with 8 logical cores total, and it seems a shame to bottleneck the GPU code by writing for only a single core. The video conversion process is convenient in that each frame does not depend on any other frame, so I was thinking some kind of asynchronous queue/stream would be best.

Here lies my question: what would be the best way to achieve this type of parallelism with the best ratio of effort to speed yield?

Some of the solutions I've looked at are OpenMP, the .NET TPL, and plain pthreads.
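Of those, OpenMP looks like the lowest-effort fit for the frame-independent loop I described. A minimal sketch of what I mean, where `process_frame` is a hypothetical stand-in for the real per-frame routine (here it just doubles every pixel so the output is easy to check):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the real per-frame routine; it just
// doubles every pixel so the result is easy to verify.
std::vector<float> process_frame(const std::vector<float>& frame) {
    std::vector<float> out(frame.size());
    for (std::size_t i = 0; i < frame.size(); ++i)
        out[i] = frame[i] * 2.0f;
    return out;
}

// Since frames are independent, the frame loop is the natural unit of
// parallelism: one pragma distributes iterations across the cores.
// If OpenMP is not enabled at compile time, the pragma is ignored and
// the loop runs serially, producing identical results.
std::vector<std::vector<float>> process_video(
        const std::vector<std::vector<float>>& frames) {
    std::vector<std::vector<float>> out(frames.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(frames.size()); ++i)
        out[i] = process_frame(frames[i]);
    return out;
}
```

The `schedule(dynamic)` clause hands frames out to threads as they finish, which seems like a reasonable guess when frames take uneven amounts of time.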

I only have basic exposure to asynchronous programming, so I would rather use a library than mess around with mutexes and barriers and shoot myself in the foot multiple times. I don't mind learning, because that's one of my goals for this summer, but parallelism is hard. However, if the speed difference is very noticeable, I'm willing to pull my hair out for a couple of weeks. :P

Thanks in advance.

+2  A: 

If maximizing the ratio of speedup to effort is your goal, I would recommend looking at the TPL in .NET. It is probably the simplest way to implement this. Depending on what your code is doing, you could either form a pipeline or even just use Parallel.For (or ForEach) over the frames.

That being said, if you want to stick to native, non-managed code, a good option might be Microsoft's new Parallel Patterns Library or Intel's Threading Building Blocks. They both have similar constructs to the new TPL, especially for data parallelism, and would make this fairly easy to parallelize, as long as "each frame does not depend on other frames" remains true.

Reed Copsey
Beat me to it! Ade Miller had a good talk at TechEd on the parallelism options available with .Net 4.0: http://www.msteched.com/2010/NorthAmerica/ARC205
Mathias
PPL seems like just what I've been looking for, thanks. One question, though: does using managed code like C# noticeably slow down code like mine, where it's mainly floating-point arithmetic over large arrays? I like the much simplified programming environment, but with things like video processing, I'm always hesitant because of fears of garbage collection and bounds-checking overhead. It might just be old C programming paranoia, though. :\
Xzhsh
Xzhsh: I personally use C# and managed code for scientific data processing in my "day job". It does very well, but the performance characteristics are different from native code, so you have to adapt your thinking to compensate. I wouldn't personally worry about the GC being an issue, but array bounds checking can slow you down (it can be disabled). In most cases, though, careful profiling and "good" managed code can be as fast as (and often faster than) native code.
Reed Copsey
Thanks Reed, I'll remember that! I've never used the .NET framework much beyond a few GUI features, and I was wondering if/when to learn it. Looks like it will be soon.
Xzhsh
+1  A: 

My advice would be to approach this in a step-wise fashion.

  1. First, prove that you have a functional non-MATLAB implementation. This is non-trivial, and frankly, I think you should plan on spending 100% of your brain cycles on correctness before you think about performance.

  2. Partition your solution: prove that you can take the routine you believe is decoupled from the rest of the implementation and isolate it syntactically from the rest of the code. For example, if you were writing a ray tracer, you could isolate the math that results from a single viewpoint shooting a ray through a single pixel into the common environment. This is also non-trivial, as it requires you to think about what is actually shared (e.g., the geometry of the environment, texture maps, etc.) and what is specific to a unique situation (e.g., the ray from eye to pixel). Performance profiling is your friend here.

  3. Identify the syntax of the libraries or frameworks you are interested in that is required to create threads/processes, launch them in parallel, and join their results after completion. Note: you'll still need mutual exclusion on shared data, etc. In the Java world, for example, this would be java.util.concurrent.

  4. Try creating two (and only two) threads to split your work in half. Write benchmarks that let you measure your initial solution against the N=2 solution, and profile the hell out of the results.

  5. Only then should you even think about further parallelization.
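The N=2 experiment from step 4 can be sketched very compactly. This is a hypothetical workload (a sum of squares, standing in for the real serial half of the video routine), split in half across two OpenMP sections; the first thing a benchmark should verify is that the parallel answer matches the serial one:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical workload standing in for the serial half of the
// video routine: a sum of squares over a slice of a large buffer.
double sum_squares(const std::vector<double>& data,
                   std::size_t lo, std::size_t hi) {
    double acc = 0.0;
    for (std::size_t i = lo; i < hi; ++i)
        acc += data[i] * data[i];
    return acc;
}

// Step 4: split the work in half, give each half to one of two
// threads, and join the partial results. Without OpenMP enabled the
// sections simply run one after the other, so the result is the same
// either way -- exactly the correctness property to check first.
double sum_squares_two_threads(const std::vector<double>& data) {
    const std::size_t mid = data.size() / 2;
    double partial[2] = {0.0, 0.0};
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        partial[0] = sum_squares(data, 0, mid);
        #pragma omp section
        partial[1] = sum_squares(data, mid, data.size());
    }
    return partial[0] + partial[1];
}
```

Timing the serial call against the two-thread version over a realistically sized buffer gives you the N=2 data point before you invest in anything fancier.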

If you follow steps like these, you'll (a) succeed at your actual task (the port from MATLAB), (b) have something that works to some known performance metrics, and (c) have a clear path forward if you'd like to further exploit parallelization opportunities.

Bob Cross
Thanks for the tip, Bob! I have already ported the routines over to mostly C, and the parallelism would only be between frames, which are completely independent. I like your advice, though, and will be sure to keep it in mind for my next project.
Xzhsh
@Xzhsh, FYI, in my own graphics work the best parallelism was actually within each frame rather than assigning whole frames to individual processors. The shared environment strongly motivated partitioning the pixels of a single frame across the various threads to speed up calculating that frame (it was a ray tracer, after all). The choice between the two approaches is probably another good topic for you to investigate as part of your project.
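The within-frame approach Bob describes can be sketched just as briefly: instead of a thread per frame, the pixels of one frame are partitioned across the worker threads. Here `shade` is a hypothetical per-pixel kernel (it just brightens by a constant so the result is checkable):

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-pixel kernel; the real one would be the
// shading/filtering math.
inline float shade(float p) { return p + 1.0f; }

// Within-frame parallelism: the pixels of a single frame are
// partitioned across the worker threads, so even one frame
// finishes faster. Serial fallback if OpenMP is disabled.
void process_one_frame(std::vector<float>& pixels) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(pixels.size()); ++i)
        pixels[i] = shade(pixels[i]);
}
```

Which granularity wins depends on how many frames are in flight and how much state the threads share, which is why it is worth measuring both.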
Bob Cross