GDIPlus blend functions use bitmaps whose RGB channels are premultiplied by alpha, for efficiency. However, premultiplying by alpha is very costly, since you have to process each pixel one by one.

It seems that this would be a good candidate for SSE assembly. Is there someone here who would be willing to share their implementation? I know that this is hard work, which is why I'm asking. I'm not trying to steal your work. You'll have all my consideration if you can share it.

Edit: I'm not trying to do alpha blending in software. I'm trying to premultiply each color component of each pixel in an image by its alpha. I'm doing this because alpha blending is defined by the formula dst = src*src.alpha + dst*(1 - src.alpha), whereas the Win32 AlphaBlend function implements dst = src + dst*(1 - src.alpha) for optimization reasons. To get the correct result, src must already be equal to src*src.alpha before calling AlphaBlend.
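To make the premultiplication step concrete, here is a minimal scalar sketch in C. It assumes 32-bit pixels read as 0xAARRGGBB words (my assumption about the bitmap layout), and the names `div255` and `premultiply_argb` are made up for illustration; they are not part of any Win32 or GDIPlus API.

```c
#include <stddef.h>
#include <stdint.h>

/* Rounded division by 255, valid for 0 <= x <= 255 * 255. */
static uint32_t div255(uint32_t x)
{
    x += 128;
    return (x + (x >> 8)) >> 8;
}

/* Premultiply R, G and B by alpha, in place.
   Assumes each pixel is stored as 0xAARRGGBB. */
void premultiply_argb(uint32_t *pixels, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        uint32_t p = pixels[i];
        uint32_t a = p >> 24;
        uint32_t r = div255(((p >> 16) & 0xFF) * a);
        uint32_t g = div255(((p >> 8) & 0xFF) * a);
        uint32_t b = div255((p & 0xFF) * a);
        pixels[i] = (a << 24) | (r << 16) | (g << 8) | b;
    }
}
```

The alpha channel itself is left untouched; only the color channels are scaled, which is what the src = src*src.alpha step requires.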

It would take me a while to write, as I know little about assembly, so I was asking whether someone would like to share their implementation. SSE would be great, since according to the paper the gain for software alpha blending is 300%.

+2  A: 

There's a good article found here. It's a bit old, but you might find something useful in the section where it uses MMX to implement alpha blending. This could easily be translated to SSE instructions to take advantage of the larger register size (128 bits).

MMX Enhanced Alpha Blending

Intel Application Notes here, with source code

Using MMX™ Instructions to Implement Alpha Blending
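For the premultiplication the question asks about, a rough SSE2 version could look like the sketch below. It is not taken from the linked articles; it uses compiler intrinsics instead of hand-written assembly, assumes pixels stored as 0xAARRGGBB words, and falls back to plain C when SSE2 is not available at compile time. All function and macro names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> /* SSE2 intrinsics */
#define PREMUL_HAVE_SSE2 1
#endif

/* Scalar path for one 0xAARRGGBB pixel, also used for leftover pixels. */
static uint32_t premul_pixel_scalar(uint32_t p)
{
    uint32_t a = p >> 24, out = p & 0xFF000000u;
    for (int shift = 0; shift <= 16; shift += 8) {
        uint32_t x = ((p >> shift) & 0xFF) * a + 128;
        out |= ((x + (x >> 8)) >> 8) << shift; /* rounded x / 255 */
    }
    return out;
}

#ifdef PREMUL_HAVE_SSE2
/* Eight 16-bit lanes: round(c * a / 255) for c and a in 0..255. */
static __m128i mul_div255_epu16(__m128i c, __m128i a)
{
    __m128i x = _mm_add_epi16(_mm_mullo_epi16(c, a), _mm_set1_epi16(128));
    return _mm_srli_epi16(_mm_add_epi16(x, _mm_srli_epi16(x, 8)), 8);
}

/* Copy each pixel's alpha (lanes 3 and 7) into its B, G and R lanes and
   force the alpha lane itself to 255 so that alpha comes out unchanged. */
static __m128i spread_alpha(__m128i px16)
{
    const __m128i alpha_255 = _mm_set_epi16(0xFF, 0, 0, 0, 0xFF, 0, 0, 0);
    __m128i a = _mm_shufflelo_epi16(px16, 0xFF);
    a = _mm_shufflehi_epi16(a, 0xFF);
    return _mm_or_si128(a, alpha_255);
}
#endif

/* Premultiply in place, four pixels per iteration where SSE2 is available. */
void premultiply_argb_sse2(uint32_t *pixels, size_t count)
{
    size_t i = 0;
#ifdef PREMUL_HAVE_SSE2
    const __m128i zero = _mm_setzero_si128();
    for (; i + 4 <= count; i += 4) {
        __m128i p = _mm_loadu_si128((const __m128i *)(pixels + i));
        __m128i lo = _mm_unpacklo_epi8(p, zero); /* pixels 0-1 as 16-bit lanes */
        __m128i hi = _mm_unpackhi_epi8(p, zero); /* pixels 2-3 as 16-bit lanes */
        lo = mul_div255_epu16(lo, spread_alpha(lo));
        hi = mul_div255_epu16(hi, spread_alpha(hi));
        _mm_storeu_si128((__m128i *)(pixels + i), _mm_packus_epi16(lo, hi));
    }
#endif
    for (; i < count; ++i)
        pixels[i] = premul_pixel_scalar(pixels[i]);
}
```

Unaligned loads and stores are used so the buffer needs no particular alignment; whether this actually beats a plain loop on a given machine is something you would have to measure.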

Indeera
A: 

You may want to have a look at the Eigen C++ template library. It allows you to use high level C++ code that uses optimized assembler with support for SSE/Altivec.

Fast. (See benchmark).
Expression templates allow to intelligently remove temporaries and enable lazy evaluation, when that is appropriate -- Eigen takes care of this automatically and handles aliasing too in most cases. Explicit vectorization is performed for the SSE (2 and later) and AltiVec instruction sets, with graceful fallback to non-vectorized code. Expression templates allow to perform these optimizations globally for whole expressions. With fixed-size objects, dynamic memory allocation is avoided, and the loops are unrolled when that makes sense. For large matrices, special attention is paid to cache-friendliness.

Elegant. (See API showcase).
The API is extremely clean and expressive, thanks to expression templates. Implementing an algorithm on top of Eigen feels like just copying pseudocode. You can use complex expressions and still rely on Eigen to produce optimized code: there is no need for you to manually decompose expressions into small steps.

lothar
A: 

Processing each pixel is not expensive with the native Win32 GDI APIs.
See MSDN.