Sorry for the vague topic question, but I'm working on some academic video processing routines. The algorithms are written in MATLAB, and while that's fine for development purposes, the code processes video at about 60 seconds per frame, or roughly 0.0167 fps. Needless to say, this won't be sufficient for demos and such, so my summer job is to convert the routine to something that will run drastically faster.
I have rewritten the slowest portion of the code for CUDA, NVIDIA's GPGPU solution. However, there is also a large portion of the code that seems better suited to the CPU, as it is relatively serial. The problem is, the machine I was given has 2 Xeon processors, with 8 logical cores total, and it seems a shame to bottleneck the GPU code by coding for only a single core. The video conversion process is functional in that each frame does not depend on other frames, so I was thinking some kind of asynchronous queue/stream would be best.
Here lies my question: what would be the best way to achieve this type of parallelism with the best ratio of effort to speed yield?
Some of the solutions I've looked at are OpenMP, the .NET TPL (Task Parallel Library), and plain pthreads.
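To make the question concrete, here is a minimal sketch of what the OpenMP route might look like in C++, relying on the fact that frames are independent. It is only an illustration under assumptions: `Frame`, `process_frame`, and `process_video` are hypothetical names, and the pixel inversion is a placeholder for the real serial CPU stage, not my actual algorithm.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical frame type: a flat buffer of 8-bit grayscale pixels.
using Frame = std::vector<std::uint8_t>;

// Hypothetical stand-in for the serial per-frame CPU stage (here: invert each pixel).
Frame process_frame(const Frame& in) {
    Frame out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = static_cast<std::uint8_t>(255 - in[i]);
    }
    return out;
}

// Process every frame independently. With OpenMP enabled (e.g. -fopenmp on GCC/Clang)
// the iterations are spread across cores; without it the pragma is ignored and the
// loop runs serially with the same result.
std::vector<Frame> process_video(const std::vector<Frame>& frames) {
    std::vector<Frame> results(frames.size());
    const long n = static_cast<long>(frames.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < n; ++i) {
        // Each iteration reads one input frame and writes only its own output slot,
        // so no locking is needed.
        results[i] = process_frame(frames[i]);
    }
    return results;
}
```

If something this simple is roughly what the library approach comes down to, then the real question is whether pthreads or a hand-rolled queue would buy enough extra speed to justify the added complexity.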
I only have basic exposure to asynchronous programming, so I would rather use a library than mess around with mutexes and barriers and shoot myself in the foot multiple times. I don't mind learning, because that's one of my goals for this summer, but at the same time, parallelism is hard. However, if the speed difference is actually very noticeable, I'm willing to pull my hair out for a couple of weeks. :P
Thanks in advance.