I have an array of uint values in C#. After checking that the program is running on a little-endian machine, I want to convert the data to big-endian. Because the amount of data can become very large, but the number of elements is always even, I was thinking of treating each pair of uint values as a single ulong for better performance and implementing the conversion in assembly, so I am looking for a very fast (the fastest possible) assembler algorithm to convert little-endian to big-endian.

+5  A: 

For a large amount of data, the bswap instruction (available in Visual C++ under the _byteswap_ushort, _byteswap_ulong, and _byteswap_uint64 intrinsics) is the way to go. This will even outperform handwritten assembly. These are not available in pure C# without P/Invoke, so:

  1. Only use this if you have a lot of data to byte swap.
  2. You should seriously consider writing your lowest level application I/O in managed C++ so you can do your swapping before ever bringing the data into a managed array. You already have to write a C++ library, so there's not much to lose and you sidestep all the P/Invoke-related performance issues for low-complexity algorithms operating on large datasets.

PS: Many people are unaware of the byte-swap intrinsics. Their performance is astonishing, doubly so for floating-point data, since the swap simply treats the bytes as integers. There is no way to beat them without hand-coding your register loads for every single byte-swap use case, and if you try that, you'll probably take a bigger hit in the optimizer than you'll ever gain.
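To illustrate the approach this answer recommends (this sketch is not part of the original answer), here is a minimal C++ loop using the MSVC `_byteswap_uint64` intrinsic, with the GCC/Clang `__builtin_bswap64` equivalent as a fallback; each call typically compiles to a single BSWAP instruction per word on x86:

```cpp
#include <cstdint>
#include <cstddef>

#ifdef _MSC_VER
#include <intrin.h>
#define BSWAP64(x) _byteswap_uint64(x)
#else
#define BSWAP64(x) __builtin_bswap64(x)  // GCC/Clang equivalent
#endif

// Byte-swap a buffer of 64-bit words in place.
void swap_buffer(uint64_t* data, size_t count) {
    for (size_t i = 0; i < count; ++i)
        data[i] = BSWAP64(data[i]);
}
```

Note that swapping the data as 64-bit words also exchanges the positions of each pair of 32-bit values, which matches the questioner's "two uint as one ulong" idea only if that reordering is actually wanted.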

280Z28
A: 

I was thinking to consider two uint types as an ulong type

Well, that would also swap the two uint values, which might not be desirable...

You could try some C# code in unsafe mode, which may actually perform well enough. Like:

public static unsafe void SwapInts(uint[] data) {
   int cnt = data.Length;
   fixed (uint* d = data) {
      byte* p = (byte*)d;      // walk the array as raw bytes
      while (cnt-- > 0) {
         byte a = *p;          // save byte 0
         p++;
         byte b = *p;          // save byte 1
         *p = *(p + 1);        // byte 1 = byte 2
         p++;
         *p = b;               // byte 2 = old byte 1
         p++;
         *(p - 3) = *p;        // byte 0 = byte 3
         *p = a;               // byte 3 = old byte 0
         p++;                  // advance to the next uint
      }
   }
}

On my computer the throughput is around 2 GB per second.
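For comparison (this sketch is not part of the original answer), the same per-byte shuffle can be written in C++ without pointer arithmetic gymnastics, reversing each 4-byte group in place:

```cpp
#include <cstdint>
#include <cstddef>
#include <utility>

// Same idea as the unsafe C# loop above: view the array as raw
// bytes and reverse each 4-byte group in place.
void swap_uints(uint32_t* data, size_t count) {
    unsigned char* p = reinterpret_cast<unsigned char*>(data);
    for (size_t i = 0; i < count; ++i, p += 4) {
        std::swap(p[0], p[3]);
        std::swap(p[1], p[2]);
    }
}
```

Because the swap operates on the bytes in memory, the resulting integer value is the byte-reversed one regardless of the host's endianness.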

Guffa
+2  A: 

You may want to simply rethink the problem; this should not be a bottleneck. Take the naive shift-and-mask algorithm (written in CIL assembly, just for fun). Let's assume the number we want is in local variable 0:

LDLOC 0
SHL 24
LDLOC 0
LDC.i4 0x0000ff00
AND
SHL 8
OR
LDLOC 0
LDC.i4 0x00ff0000
AND
SHR.UN 8
OR
LDLOC 0
SHR.UN 24
OR

At most that's 13 (x86) assembly instructions per number (and the JIT compiler will most likely do even better with clever register use). It doesn't get more naive than that.
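The CIL sequence above is the classic shift-and-mask swap; as a sketch, the same computation in C++ looks like this:

```cpp
#include <cstdint>

// Classic shift-and-mask byte swap of a 32-bit value: four shifts,
// two masks, three ORs, mirroring the naive CIL algorithm above.
uint32_t swap32(uint32_t x) {
    return (x << 24)
         | ((x & 0x0000ff00u) << 8)
         | ((x & 0x00ff0000u) >> 8)
         | (x >> 24);
}
```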

Now, compare that to the costs of

  • Getting the data loaded in (including whatever peripherals you are working with!)
  • Manipulation of the data (doing comparisons, for instance)
  • Outputting the result (whatever it is)

If 13 instructions per number is a significant chunk of your execution time, then you are doing a VERY high performance task and should have your input in the correct format! You also probably would not be using a managed language because you would want far more control over buffers of data and what-not, and no extra array bounds checking.

If that array of data comes across a network, I would expect the cost of managing sockets to far exceed a mere byte-order flip; if it comes from disk, consider pre-flipping it before this program runs.

RH