GDIPlus blend functions use bitmaps whose RGB channels are premultiplied by alpha, for efficiency. However, premultiplying by alpha is very costly, since you have to process each pixel one by one.
It seems that this would be a good candidate for SSE assembly. Is there someone here who would be willing to share their implementation? I know that this is hard work, which is why I'm asking. I'm not trying to steal your work. You'll get all my consideration for sharing this if you can.
Edit: I'm not trying to do alpha blending in software. I'm trying to premultiply each color component of each pixel in an image by its alpha. I'm doing this because alpha blending is done with the formula dst = src*src.alpha + dst*(1 - src.alpha), whereas the Win32 AlphaBlend function implements dst = src + dst*(1 - src.alpha) for optimisation reasons. To get the correct result, src must already be equal to src*src.alpha before calling AlphaBlend.
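To make the operation concrete, here is roughly what the plain C version of the premultiply step looks like. This is only a sketch: the names `premul_channel` and `premultiply_scalar` are made up for illustration, and I'm assuming 32-bit pixels stored in memory as B, G, R, A bytes with 8 bits per channel.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns round(c * a / 255) using shifts instead of a division. */
static uint8_t premul_channel(uint8_t c, uint8_t a)
{
    unsigned t = (unsigned)c * a + 128u;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Premultiplies n pixels in place.
   Assumed layout (not confirmed for every bitmap format):
   4 bytes per pixel in B, G, R, A order. Alpha is left unchanged. */
void premultiply_scalar(uint8_t *pixels, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t *p = pixels + 4 * i;
        uint8_t a = p[3];
        p[0] = premul_channel(p[0], a);
        p[1] = premul_channel(p[1], a);
        p[2] = premul_channel(p[2], a);
    }
}
```

This is the per-pixel loop I would like to speed up.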
It would take me quite a bit of time to write, as I know little about assembly, so I was asking if someone would like to share their implementation. SSE would be great, since in the paper the reported gain for software alpha blending is 300%.
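For reference, this is the kind of thing I have in mind, written with SSE2 intrinsics rather than hand-written assembly. Treat it as an unbenchmarked sketch under assumptions: the function names are hypothetical, pixels are assumed to be B, G, R, A bytes, and the code falls back to a plain loop when the compiler does not define `__SSE2__`. It processes four pixels per iteration and keeps alpha intact by multiplying the alpha lane by 255.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>

/* v holds two pixels widened to 16-bit lanes: B G R A B G R A.
   Returns round(lane * alpha / 255) per lane, with alpha preserved. */
static __m128i premul_two_pixels(__m128i v, __m128i alpha_keep, __m128i bias)
{
    /* Broadcast each pixel's alpha to its four lanes. */
    __m128i a = _mm_shufflelo_epi16(v, _MM_SHUFFLE(3, 3, 3, 3));
    a = _mm_shufflehi_epi16(a, _MM_SHUFFLE(3, 3, 3, 3));
    /* Force the multiplier of the alpha lanes to 255 so alpha stays as is. */
    a = _mm_or_si128(a, alpha_keep);
    __m128i t = _mm_add_epi16(_mm_mullo_epi16(v, a), bias);
    t = _mm_add_epi16(t, _mm_srli_epi16(t, 8));
    return _mm_srli_epi16(t, 8);
}
#endif

/* Premultiplies n pixels in place (assumed B, G, R, A byte order). */
void premultiply_sse2(uint8_t *pixels, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128i bias = _mm_set1_epi16(128);
    const __m128i alpha_keep = _mm_set_epi16(0xFF, 0, 0, 0, 0xFF, 0, 0, 0);
    for (; i + 4 <= n; i += 4) {
        __m128i *p = (__m128i *)(pixels + 4 * i);
        __m128i v = _mm_loadu_si128(p);             /* unaligned load */
        __m128i lo = _mm_unpacklo_epi8(v, zero);    /* pixels 0 and 1 */
        __m128i hi = _mm_unpackhi_epi8(v, zero);    /* pixels 2 and 3 */
        lo = premul_two_pixels(lo, alpha_keep, bias);
        hi = premul_two_pixels(hi, alpha_keep, bias);
        _mm_storeu_si128(p, _mm_packus_epi16(lo, hi));
    }
#endif
    /* Remaining pixels (or all of them without SSE2), one at a time. */
    for (; i < n; ++i) {
        uint8_t *q = pixels + 4 * i;
        for (int c = 0; c < 3; ++c) {
            unsigned t = (unsigned)q[c] * q[3] + 128u;
            q[c] = (uint8_t)((t + (t >> 8)) >> 8);
        }
    }
}
```

I have no idea whether this gets anywhere near the speedup a tuned assembly version would give, which is why I'm asking for a real implementation.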