ansaurus

Question

Answer 1

+1 A:

This kind of neighbourhood operation was always a pain with SSE, until SSE3.5 (aka SSSE3) came along, and PALIGNR (_mm_alignr_epi8) was introduced.

If you need backward compatibility with SSE2/SSE3 though, you can write an equivalent macro or inline function which emulates _mm_alignr_epi8 for SSE2/SSE3 and which drops through to _mm_alignr_epi8 when targeting SSE3.5/SSE4.
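A minimal sketch of such a fallback, assuming GCC/Clang-style feature macros (`__SSSE3__`). The name ALIGNR_EPI8 is made up for this example; it is not part of any header. The SSE2 path builds the result from two whole-register byte shifts and an OR:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#ifdef __SSSE3__
#include <tmmintrin.h>   /* SSSE3: _mm_alignr_epi8 */
#endif

/* ALIGNR_EPI8(hi, lo, n): hypothetical helper name.
 * Returns bytes n..n+15 of the 32-byte concatenation hi:lo.
 * n must be a compile-time constant in the range 0..16. */
#ifdef __SSSE3__
#define ALIGNR_EPI8(hi, lo, n) _mm_alignr_epi8((hi), (lo), (n))
#else
/* SSE2 fallback: shift lo down by n bytes, shift hi up by 16-n bytes,
 * then merge the two halves. */
#define ALIGNR_EPI8(hi, lo, n) \
    _mm_or_si128(_mm_srli_si128((lo), (n)), _mm_slli_si128((hi), 16 - (n)))
#endif
```

For 16-bit pixels, ALIGNR_EPI8(next, cur, 2) would give the "right neighbour" vector and ALIGNR_EPI8(cur, prev, 14) the "left neighbour" vector.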

Another approach is to use misaligned loads to get the shifted data - this is relatively expensive on older CPUs (roughly twice the latency and half the throughput of aligned loads) but this may be acceptable depending on how much computation you're doing per load. It also has the benefit that on current Intel CPUs (Core i7) misaligned loads have no penalty compared to aligned loads, so your code will be quite efficient on Core i7 et al.
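To illustrate the misaligned-load approach, here is a rough sketch. The filter itself is an assumption, since the question's code isn't reproduced here: a 3-tap (left + 2*centre + right) / 4 smoothing of 16-bit pixels, with pixel values assumed small enough (at most 16383) that the sum fits in 16 bits. The function name is made up.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-tap filter: dst[i] = (src[i-1] + 2*src[i] + src[i+1]) / 4
 * for the interior pixels 1..n-2. The border pixels dst[0] and dst[n-1]
 * are left untouched. Assumes pixel values <= 16383 (no 16-bit overflow). */
void smooth3_unaligned(const uint16_t *src, uint16_t *dst, size_t n)
{
    size_t i = 1;
    if (n < 3)
        return;

    /* Vector loop: 8 pixels per iteration. The right-hand load reads
     * src[i+1..i+8], so it needs i + 8 <= n - 1. */
    for (; i + 9 <= n; i += 8) {
        __m128i left   = _mm_loadu_si128((const __m128i *)(src + i - 1));
        __m128i centre = _mm_loadu_si128((const __m128i *)(src + i));
        __m128i right  = _mm_loadu_si128((const __m128i *)(src + i + 1));

        __m128i sum = _mm_add_epi16(_mm_add_epi16(left, right),
                                    _mm_add_epi16(centre, centre));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_srli_epi16(sum, 2));
    }

    /* Scalar tail for the remaining interior pixels. */
    for (; i + 1 < n; i++)
        dst[i] = (uint16_t)((src[i - 1] + 2 * src[i] + src[i + 1]) >> 2);
}
```

Each output vector costs three loads, two of them misaligned, and no shuffles, which is why this only pays off if misaligned loads are cheap on the target CPU.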

Paul R 2010-06-28 07:41:02

I noticed alignr already, but as you suspected, I want to be compatible with SSE2. I think SSE2 is a good "lowest common denominator" when it comes to SIMD on x86, and if SSE2 alone will give me a satisfying speedup, I won't bother implementing anything more advanced.

dietr 2010-06-28 09:39:02

Answer 2

+1 A:

I suggest keeping the neighbouring pixels in SSE registers. That is, keep the result of the _mm_slli_si128 / _mm_srli_si128 in an SSE variable, and eliminate all of the inserts and extracts. My reasoning is that on older CPUs, the insert/extract instructions require communication between the SSE units and the general-purpose units, which is much slower than keeping the calculation within SSE, even if it spills over to the L1 cache.

Once that is done, there should be only four 16-bit shifts (_mm_slli_si128, _mm_srli_si128, not counting the division shift). My suggestion is to benchmark your code at that point, because by then it may already have hit the memory bandwidth limit, which means you can't optimize any further.
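Something along these lines is what I have in mind. It is only a sketch under assumptions: the filter is taken to be a 3-tap (left + 2*centre + right) / 4 on 16-bit pixels, the row length is a multiple of 8, pixels outside the row count as zero, and values are at most 16383 so the sums don't overflow. The function name is made up.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 3-tap filter that builds the neighbour vectors with
 * whole-register byte shifts instead of insert/extract.
 * n must be a multiple of 8; pixels outside the row are treated as 0. */
void smooth3_shifts(const uint16_t *src, uint16_t *dst, size_t n)
{
    if (n == 0)
        return;

    __m128i prev = _mm_setzero_si128();
    /* If the buffers are 16-byte aligned, _mm_load_si128 and
     * _mm_store_si128 can be used instead of the unaligned forms. */
    __m128i cur = _mm_loadu_si128((const __m128i *)src);

    for (size_t i = 0; i < n; i += 8) {
        __m128i next = (i + 8 < n)
            ? _mm_loadu_si128((const __m128i *)(src + i + 8))
            : _mm_setzero_si128();

        /* left[k] = pixel k-1: shift cur up one pixel (2 bytes) and
         * bring in the last pixel of prev. */
        __m128i left = _mm_or_si128(_mm_slli_si128(cur, 2),
                                    _mm_srli_si128(prev, 14));
        /* right[k] = pixel k+1: shift cur down one pixel and bring in
         * the first pixel of next. */
        __m128i right = _mm_or_si128(_mm_srli_si128(cur, 2),
                                     _mm_slli_si128(next, 14));

        __m128i sum = _mm_add_epi16(_mm_add_epi16(left, right),
                                    _mm_add_epi16(cur, cur));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_srli_epi16(sum, 2));

        /* Each block of 8 pixels is loaded from memory only once. */
        prev = cur;
        cur = next;
    }
}
```

That comes to four byte shifts per output vector plus the final divide-by-4 shift, and nothing ever leaves the SSE registers.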

If the image is large (bigger than the L2 cache) and the output won't be read back soon, try using MOVNTDQ (_mm_stream_si128) to write it back. MOVNTDQ is an SSE2 instruction, so it fits your SSE2 baseline; note that the destination address must be 16-byte aligned.
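A rough sketch of a streaming write-back, assuming the destination is 16-byte aligned and the pixel count is a multiple of 8. The function name is made up, and the plain copy stands in for whatever filter result you are storing:

```c
#include <emmintrin.h>   /* SSE2 intrinsics (also provides _mm_sfence) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes n 16-bit pixels to dst with non-temporal
 * stores, bypassing the cache. dst must be 16-byte aligned and n must be
 * a multiple of 8; src may have any alignment. */
void store_row_streaming(uint16_t *dst, const uint16_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(dst + i), v);   /* MOVNTDQ */
    }
    /* Make the streamed data globally visible before anything else
     * relies on it. */
    _mm_sfence();
}
```

Whether this actually helps depends on the image size and on what reads the output next, so it is worth benchmarking both ways.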

SIMD tutorial:

Some SIMD guru websites:

rwong 2010-06-30 08:11:32

SIMD/SSE newbie: simple image filtering
