views:

145

answers:

3

Given a register of 4 bytes (or 16 for SIMD), there has to be an efficient way to sort the bytes in-register with a few instructions.

Thanks in advance.

+1  A: 

All sorting algorithms require "swapping" values from one place to another. Since you're talking about a literal CPU register, that means any sort would need another register to use as a temporary place to hold the bytes being swapped.

I've never seen a chip with a built-in method for sorting bytes within a register. Not saying it hasn't been done, but I can't think of many uses for such an instruction.

richardtallent
I meant as sort the bytes in a register, of course have to use at least another register. Sorry for the misunderstanding.
alecco
Actually there is a way for in-register sorting using CMPXCHG using eax register and rotating it, as a friend who is quite knowledgeable in x86 showed me. Little gain from it, but it is possible. Also CMPXCHG is quite slow.
alecco
All SIMD architectures that I've used have such instructions.
alex strange
+3  A: 

Look up an efficient sorting network for N = the number of bytes you care about (4 or 16). Convert that to a sequence of compare and exchange instructions. (For N=16 that'll be more than 'a few', though.)

Darius Bacon
Thanks. I'm looking for an asm efficient solution.Oh, please note I said a "few instructions" and not a "few cycles" ;)
alecco
Ah, I see that the paper you linked to takes just this approach, using SSE2 instructions. Cool.
Darius Bacon
Yeah, I didn't want to be too verbose, as I was hoping for some sort of bit hack magic with asm. In fact I was looking for this reading "Efficient Implementation of Sorting on Multi-Core SIMD CPU Architecture" (Chhugani,.. 2008), but got frustrated with the instructions for the algorithm:1) a) Perform In-Register Sort to obtain sorted sequences of length K.I guess for researchers at Intel that's a "duh" procedure, but not for me! (I'm still not sure they do the whole 17-19 instruction procedure to sort a register.)[Note: sorry, didn't up-vote you because of lack of karma]
alecco
I learned something from a skim of the 2007 paper -- reward enough. :-)
Darius Bacon
By the way, there's a very efficient simultaneous 4 register sort on the original (2008) paper. In my face, my bad.
alecco
+3  A: 

Found it! It's in the 2007 paper "Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms" by Furtak, Amaral, and Niewiadomski. Section 4.

It uses 4 SSE registers, has 12 steps, and runs in 19 instructions including load and store.

The same paper has some excellent work on dynamically making sorting networks with SIMD.

alecco
Link to PDF: http://www.cs.ualberta.ca/~amaral/papers/furtak-spaa07.pdf
alecco