ansaurus

Question

transpose for 8 registers of 16-bit elements on SSE2/SSSE3

Answer 1

+4 A:

Yes, you can do it in 24 instructions total:

8 x _mm_unpacklo_epi16/_mm_unpackhi_epi16 (PUNPCKLWD/PUNPCKHWD)
8 x _mm_unpacklo_epi32/_mm_unpackhi_epi32 (PUNPCKLDQ/PUNPCKHDQ)
8 x _mm_unpacklo_epi64/_mm_unpackhi_epi64 (PUNPCKLQDQ/PUNPCKHQDQ)

Let me know if you need more details, but it's fairly obvious.

Paul R 2010-03-25 19:22:16

Nice one, mate! By any chance could you point me in some direction where to find more of the basic transformations with SSE?

alecco 2010-03-25 21:00:40

@aleccolocco: there's not a lot of good material on SSE out there, unfortunately, at least for the more advanced topics. I recommend looking at AltiVec resources (e.g. on developer.apple.com) - a lot of AltiVec techniques translate easily to SSE.

Paul R 2010-03-25 22:19:44

Good news: I managed to do it. Bad news: only 5% performance gain measuring on 1M elements. But thanks, I've learned some cool SSE tricks!

alecco 2010-03-25 23:16:31

@aleccolocco: if you're just doing a memory-to-memory transpose and nothing else then your performance may well be limited by memory bandwidth etc - in general you'll get much better overall performance if you can combine the transpose with other operations. Also note that SSE performance varies *hugely* between different CPU families: e.g. before Core 2 Duo = abysmal, Core 2 Duo = good, Core i7 = *rocks* !

Paul R 2010-03-25 23:39:26

@Paul R Yeah. I'm implementing "Efficient implementation of sorting on multi-core SIMD CPU architecture" and a few other things. My notebook's Merom seems to be 8x slower than their Xeon Penryn and don't want to even know how faster it would be on i7. Still, the 1M elements should be only 2MB and well inside the L2 here (so it's not bandwidth, I think.) Cheers!

alecco 2010-03-25 23:56:54

@aleccolocco: OK, yes, it's probably your CPU that's the limiting factor then. Note also though that instruction scheduling may be an issue if you're using asm - you'll probably get better results using C instrinsics (_mm_unpackX_XXX) and let the C compiler do the scheduling. ICC is the best compiler for this, followed by gcc, followed by the execrable Visual Studio. Also if you can run on a 64 bit CPU then compile with -m64 so that you get 16 SSE registers rather than 8.

Paul R 2010-03-26 06:55:10

ansaurus

tags:

views:

answers:

transpose for 8 registers of 16-bit elements on SSE2/SSSE3

related questions