views: 325
answers: 6

I have this C# method which I'm trying to optimize:

// assume arrays are same dimensions
private void DoSomething(int[] bigArray1, int[] bigArray2)
{
    int data1;
    byte A1, B1, C1, D1;
    int data2;
    byte A2, B2, C2, D2;
    for (int i = 0; i < bigArray1.Length; i++)
    {
        data1 = bigArray1[i];
        data2 = bigArray2[i];

        A1 = (byte)(data1 >> 0);
        B1 = (byte)(data1 >> 8);
        C1 = (byte)(data1 >> 16);
        D1 = (byte)(data1 >> 24);

        A2 = (byte)(data2 >> 0);
        B2 = (byte)(data2 >> 8);
        C2 = (byte)(data2 >> 16);
        D2 = (byte)(data2 >> 24);

        A1 = A1 > A2 ? A1 : A2;
        B1 = B1 > B2 ? B1 : B2;
        C1 = C1 > C2 ? C1 : C2;
        D1 = D1 > D2 ? D1 : D2;

        bigArray1[i] = (A1 << 0) | (B1 << 8) | (C1 << 16) | (D1 << 24); 
    }
}

The function basically compares two int arrays. For each pair of matching elements, the method compares each individual byte value and takes the larger of the two. The element in the first array is then assigned a new int value constructed from the 4 largest byte values (irrespective of source).

I think I have optimized this method as much as possible in C# (probably I haven't, of course - suggestions on that score are welcome as well). My question is, is it worth it for me to move this method to an unmanaged C DLL? Would the resulting method execute faster (and how much faster), taking into account the overhead of marshalling my managed int arrays so they can be passed to the method?
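
For reference, the interop boundary I have in mind would be declared something like this (the DLL and function names are just placeholders; since int[] is blittable, my understanding is that the marshaller pins the arrays and passes pointers rather than copying them):

// requires: using System.Runtime.InteropServices;
[DllImport("ArrayOps.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void DoSomethingNative(int[] bigArray1, int[] bigArray2, int length);

// called as: DoSomethingNative(bigArray1, bigArray2, bigArray1.Length);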

If doing this would get me, say, a 10% speed improvement, then it would not be worth my time for sure. If it was 2 or 3 times faster, then I would probably have to do it.

Note: please, no "premature optimization" comments, thanks in advance. This is simply "optimization".

Update: I realized that my code sample didn't capture everything I'm trying to do in this function, so here is an updated version:

private void DoSomethingElse(int[] dest, int[] src, double pos, 
    double srcMultiplier)
{
    int rdr;
    byte destA, destB, destC, destD;
    double rem = pos - Math.Floor(pos);
    double recipRem = 1.0 - rem;
    byte srcA1, srcA2, srcB1, srcB2, srcC1, srcC2, srcD1, srcD2;
    for (int i = 0; i < src.Length - 1; i++)
    {
        // get destination values
        rdr = dest[(int)pos + i];
        destA = (byte)(rdr >> 0);
        destB = (byte)(rdr >> 8);
        destC = (byte)(rdr >> 16);
        destD = (byte)(rdr >> 24);
        // get bracketing source values
        rdr = src[i];
        srcA1 = (byte)(rdr >> 0);
        srcB1 = (byte)(rdr >> 8);
        srcC1 = (byte)(rdr >> 16);
        srcD1 = (byte)(rdr >> 24);
        rdr = src[i + 1];
        srcA2 = (byte)(rdr >> 0);
        srcB2 = (byte)(rdr >> 8);
        srcC2 = (byte)(rdr >> 16);
        srcD2 = (byte)(rdr >> 24);
        // interpolate (simple linear) and multiply
        srcA1 = (byte)(((double)srcA1 * recipRem) + 
            ((double)srcA2 * rem) * srcMultiplier);
        srcB1 = (byte)(((double)srcB1 * recipRem) +
            ((double)srcB2 * rem) * srcMultiplier);
        srcC1 = (byte)(((double)srcC1 * recipRem) +
            ((double)srcC2 * rem) * srcMultiplier);
        srcD1 = (byte)(((double)srcD1 * recipRem) +
            ((double)srcD2 * rem) * srcMultiplier);
        // bytewise best-of
        destA = srcA1 > destA ? srcA1 : destA;
        destB = srcB1 > destB ? srcB1 : destB;
        destC = srcC1 > destC ? srcC1 : destC;
        destD = srcD1 > destD ? srcD1 : destD;
        // convert bytes back to int
        dest[(int)pos + i] = (destA << 0) | (destB << 8) |
            (destC << 16) | (destD << 24);
    }
}

Essentially this does the same thing as the first method, except in this one the second array (src) is always smaller than the first (dest), and the second array is positioned fractionally relative to the first (meaning that instead of being positioned at, say, 10 relative to dest, it can be positioned at 10.682791).

To achieve this, I have to interpolate between two bracketing values in the source (say, 10 and 11 in the above example, for the first element) and then compare the interpolated bytes with the destination bytes.

I suspect that the multiplication involved in this function is substantially more costly than the byte comparisons, so that part may be a red herring (sorry). Also, even if the comparisons are still somewhat expensive relative to the multiplications, there is another wrinkle: this system can actually be multi-dimensional, meaning that instead of comparing 1-dimensional arrays, the arrays could be 2-, 5- or however-many-dimensional. In that case I'm assuming the time taken to calculate interpolated values would eventually dwarf the time taken by the final bytewise comparison of 4 bytes.

How expensive is the multiplication here relative to the bit-shifting, and is this the kind of operation that could be sped up by being offloaded to a C DLL (or even an assembly DLL, although I'd have to hire somebody to create that for me)?
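
In case it's relevant, a minimal harness for timing these methods (Release build, run outside the debugger) looks something like this; the array sizes are arbitrary:

// (assumes this runs in the same class that defines DoSomethingElse)
int[] dest = new int[25000000];
int[] src = new int[1000000];
var rand = new Random(1);
for (int i = 0; i < dest.Length; i++) dest[i] = rand.Next();
for (int i = 0; i < src.Length; i++) src[i] = rand.Next();

var sw = System.Diagnostics.Stopwatch.StartNew();
DoSomethingElse(dest, src, 10.682791, 0.5);
sw.Stop();
Console.WriteLine("{0} ms", sw.ElapsedMilliseconds);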

A: 

You might like to look at the BitConverter class - can't remember if it is the right endianness for the particular conversion you're trying to do, but worth knowing about anyway.
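
Roughly, that approach would look like this (note that GetBytes returns a new 4-byte array on every call):

    private static void DoSomethingWithBitConverter(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            // GetBytes allocates a fresh byte[4] each time it's called
            byte[] b1 = BitConverter.GetBytes(bigArray1[i]);
            byte[] b2 = BitConverter.GetBytes(bigArray2[i]);
            for (int j = 0; j < 4; j++)
            {
                if (b2[j] > b1[j])
                    b1[j] = b2[j];
            }
            bigArray1[i] = BitConverter.ToInt32(b1, 0);
        }
    }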

David M
In this case the function call overhead to call the BitConverter function would almost certainly overshadow any gain you would see from it.
Gabe
Actually, my code originally used BitConverter for this, but it's much slower than bit-shifting as in the above examples. I think (but do not know) that it's mainly because BitConverter.GetBytes() has to allocate a new byte array and return it every time it's called.
MusiGenesis
+2  A: 

What about this?

    private void DoSomething(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            bigArray1[i] = (int)(
                Math.Max(data1 & 0x000000FF, data2 & 0x000000FF) |
                Math.Max(data1 & 0x0000FF00, data2 & 0x0000FF00) |
                Math.Max(data1 & 0x00FF0000, data2 & 0x00FF0000) |
                Math.Max(data1 & 0xFF000000, data2 & 0xFF000000));
        }
    }

It has a lot less bit shifting in it. You might find the calls to Math.Max aren't inlined if you profile it. In such a case, you'd just make the method more verbose.

I haven't tested this code as I don't have an IDE with me. I reckon it does what you want though.

If this still doesn't perform as you'd expect, you could try using pointer arithmetic in an unsafe block, but I seriously doubt that you'd see a gain. Code like this is unlikely to be faster if you extern to it, from everything I've read. But don't take my word for it. Measure, measure, measure.

Good luck.

Drew Noakes
My real code actually does some additional math operations within the loop (I left them out for "clarity" - :) ), so my question was a more general one about how much faster an array-processing method would be as a C DLL. I can't really measure how much faster it would be without actually writing it, but I was asking here to find out if it's worth the effort or not.
MusiGenesis
I think you need to double up the masks (0x000F should be 0x000000FF). Also since `int` is a signed 32-bit integer, the final comparison (with mask 0xFF000000) may cause issues; replacing `int` with `uint` throughout is probably the way forward. But the principle is sound :-)
psmears
@psmears: would it be faster than just bit-shifting?
MusiGenesis
@psmears -- you're right, thanks.
Drew Noakes
@MusiGenesis -- I guess you'd have to look at the Intel (or other chip manufacturer) charts and see the cycle counts for the AND/OR/SHIFT operations. You could also use VS to show the optimised JITted assembly code, though it takes some setting up. There's a good article about this somewhere on codeproject.com.
Drew Noakes
Unsafe code here will give you a slight performance improvement. While array-bounds checks are eliminated for accessing `bigArray1` (because the compiler knows that `i` will never go out of bounds, checks are unnecessary), it will use bounds checks for `bigArray2` because it can't know whether or not `i` will get bigger than the length of `bigArray2`. It won't be much faster, but still. As for other optimizations: isn't this method embarrassingly parallel? I think you'll see a 2x speed-up from using `Parallel.For` on a dual-core.
JulianR
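
For reference, the `Parallel.For` version suggested in the comment above might look like this (requires .NET 4; a sketch, not benchmarked):

    private static void DoSomethingParallel(int[] bigArray1, int[] bigArray2)
    {
        // each iteration only touches index i, so iterations are independent
        System.Threading.Tasks.Parallel.For(0, bigArray1.Length, i =>
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            bigArray1[i] = (int)(
                Math.Max(data1 & 0x000000FF, data2 & 0x000000FF) |
                Math.Max(data1 & 0x0000FF00, data2 & 0x0000FF00) |
                Math.Max(data1 & 0x00FF0000, data2 & 0x00FF0000) |
                Math.Max(data1 & 0xFF000000, data2 & 0xFF000000));
        });
    }
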
It appears that the calls to Math.Max are not inlined, as this code is about 50% slower than the original.
Jim Mischel
I tried running the above, and for me, under .NET 4, the timings with 100 million items are: 1.05 seconds for the original (x86), 0.80 for the original (x64), 0.75 for the above (x86), 0.63 for the above (x64). So the above version running under x64 is significantly faster than the original. The calls to Math.Max have definitely been inlined for me. A direct port of the original function to VC++ 2010 performed identically to the C#.
JulianR
@JulianR: I think you left an earlier comment about this being a good candidate for parallel processing, and you're right. Unfortunately, it's a Windows Mobile app.
MusiGenesis
+2  A: 

I don't see any way of speeding up this code by means of clever bit tricks.

If you really want this code to be faster, the only way I see of significantly (>2x or so) speeding it up on the x86 platform is to go for an assembler/intrinsics implementation. SSE has the PCMPGTB instruction, which:

"Performs a SIMD compare for the greater value of the packed bytes, words, or doublewords in the destination operand (first operand) and the source operand (second operand). If a data element in the destination operand is greater than the corresponding data element in the source operand, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s."

An XMM register fits four 32-bit ints, so you could loop over your arrays reading the values, getting the mask, ANDing the first input with the mask and the second one with the inverted mask, and ORing the two results together.

On the other hand, maybe you can reformulate your algorithm so that you don't need to pick the larger bytes, but could, for example, take the AND of the operands instead? Just a thought; it's hard to see whether that would work without seeing the actual algorithm.

Krystian
+2  A: 

The function below uses unsafe code to treat the integer arrays as arrays of bytes so that there's no need for bit twiddling.

    private static void DoOtherThing(int[] bigArray1, int[] bigArray2)
    {
        unsafe
        {
            fixed (int* p1 = bigArray1, p2=bigArray2)
            {
                byte* b1 = (byte*)p1;
                byte* b2 = (byte*)p2;
                byte* bend = (byte*)(&p1[bigArray1.Length]);
                while (b1 < bend)
                {
                    if (*b1 < *b2)
                    {
                        *b1 = *b2;
                    }
                    ++b1;
                    ++b2;
                }
            }
        }
    }

On my machine running under the debugger in Release mode against arrays of 25 million ints, this code is about 29% faster than your original. However, running standalone, there is almost no difference in runtime. Sometimes your original code is faster, and sometimes the new code is faster.

Approximate numbers:

          Debugger  Standalone
Original  1,400 ms    700 ms
My code     975 ms    700 ms

And, yes, I did compare the results to ensure that the functions do the same thing.

I'm at a loss to explain why my code isn't faster, since it's doing significantly less work.

Given these results, I doubt that you could improve things by going to native code. As you say, the overhead of marshaling the arrays would likely eat up any savings you might realize in the processing.

The following modification to your original code, though, is 10% to 20% faster.

    private static void DoSomething(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            var A1 = data1 & 0xff;
            var B1 = data1 & 0xff00;
            var C1 = data1 & 0xff0000;
            var D1 = data1 & 0xff000000;

            var A2 = data2 & 0xff;
            var B2 = data2 & 0xff00;
            var C2 = data2 & 0xff0000;
            var D2 = data2 & 0xff000000;

            if (A2 > A1) A1 = A2;
            if (B2 > B1) B1 = B2;
            if (C2 > C1) C1 = C2;
            if (D2 > D1) D1 = D2;

            bigArray1[i] = (int)(A1 | B1 | C1 | D1);
        }
    }
Jim Mischel
+1 for beating me to it :) With a nearly identical version of the above code and 50M ints, I see a 4x speedup in standalone Release. I wonder what we are doing differently. Are you accidentally measuring the array creation and comparison? I used one executable to test both pieces of code back-to-back, so maybe some form of caching interfered with my results.
Dan Terry
I get a 4x speedup if I pass the modified bigArray1 to the second function, after calling the first function. But if I use a copy of the array, the times are the same. And, no, I didn't measure the array creation and comparison.
Jim Mischel
I found my very dumb mistake (when I recreated array1, I was using a smaller length from a previous test), and now my numbers are on par with yours. On my machine/setup, the unsafe version is actually running slightly slower than the safe version. Thanks Jim!
Dan Terry
No problem, Terry. I'm still trying to figure out why the unsafe version is slower at all. I'm thinking it's due to runtime memory checks on write.
Jim Mischel
I think we are not seeing a speedup because we are looping over byte pointers rather than int pointers. So for every item in the input array, we have to do 4 comparisons plus 8 increments just to manage the while loop. Looping over int pointers takes those numbers to 1 and 2 respectively. This change gets me to a 2x speedup over the original posters code (standalone, Release, data dependent because the data determines the number of writes)... barring any dumb mistakes.
Dan Terry
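
A rough sketch of that int-pointer variant (untested; same caveats as above):

    private static unsafe void DoOtherThingIntPointers(int[] bigArray1, int[] bigArray2)
    {
        fixed (int* p1 = bigArray1, p2 = bigArray2)
        {
            int* end = p1 + bigArray1.Length;
            int* i1 = p1;
            int* i2 = p2;
            while (i1 < end)
            {
                // view the current int as four bytes and keep the larger of each
                byte* b1 = (byte*)i1;
                byte* b2 = (byte*)i2;
                if (b2[0] > b1[0]) b1[0] = b2[0];
                if (b2[1] > b1[1]) b1[1] = b2[1];
                if (b2[2] > b1[2]) b1[2] = b2[2];
                if (b2[3] > b1[3]) b1[3] = b2[3];
                ++i1;
                ++i2;
            }
        }
    }
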
Dan: Yes, we're doing four times the increments and, on average, 50% more writes to the array. But the original code does a whole lot of bit twiddling and assignments to local variables. I'm thinking it's the writes that are the problem, but haven't come up with a way to test yet . . .
Jim Mischel
Wow. Compiling to Release mode instead of Debug just sped my code up by about 2-3 times. I also got a small improvement by using `float` instead of `double`.
MusiGenesis
+7  A: 

Yes, the _mm_max_epu8() intrinsic does what you want. It chews through 16 bytes at a time. The pain-point is the arrays. SSE2 instructions require their arguments to be aligned at 16-byte addresses. You cannot get that out of the garbage-collected heap; it only promises 4-byte alignment. Even if you trick it by calculating an offset into the array that's 16-byte aligned, you'll lose when the garbage collector kicks in and moves the array.

You'll have to declare the arrays in the C/C++ code, using the __declspec(align(#)) declarator. Now you need to copy your managed arrays into those unmanaged ones. And the results back. Whether you are still ahead depends on details not easily seen in your question.

Hans Passant
Would pinning the arrays solve the issue with the GC moving them?
chibacity
+1 -- great answer. Pinning the arrays via `fixed` will stop the GC moving them, but I don't think you can influence 16-byte alignment within .NET, even with explicit struct layout.
Drew Noakes
Fixed arrays on the stack or stackalloc() could work. Over-allocate them by 12 bytes so you can bump up the byte* to 16-byte alignment. Just gotta keep the stack frame alive long enough.
Hans Passant
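
For illustration, the over-allocate-and-align idea applied to an unmanaged-heap buffer instead of the stack (a sketch only; the native call is hypothetical):

    // requires: using System; using System.Runtime.InteropServices;
    private static void RunAligned(int[] managed)
    {
        int byteCount = managed.Length * sizeof(int);
        // AllocHGlobal doesn't guarantee 16-byte alignment, so over-allocate
        // by 15 bytes and round the pointer up to the next 16-byte boundary
        IntPtr raw = Marshal.AllocHGlobal(byteCount + 15);
        try
        {
            IntPtr aligned = new IntPtr((raw.ToInt64() + 15) & ~15L);
            Marshal.Copy(managed, 0, aligned, managed.Length);
            // NativeByteMax(aligned, managed.Length);   // hypothetical SSE routine in the C DLL
            Marshal.Copy(aligned, managed, 0, managed.Length);
        }
        finally
        {
            Marshal.FreeHGlobal(raw);
        }
    }
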
+1, great, I forgot there is an instruction like this.
Krystian
This may be worth trying for me. I can live with having to declare the arrays inside the C code. Sorry I didn't include enough details in the question, but I also need to do some math operations on the byte values (like multiplication) before comparing. Given the number of calls I'm making to this method, the overall operation is taking 15-20 seconds, and I'm hoping to pare it down to 5 seconds or less. Is this possible with a switch to a C DLL? Am I looking at 3-4 times faster this way?
MusiGenesis
It should be a whole lot faster than x4. One instruction vs ~126.
Hans Passant
@Hans: thanks very much for all your suggestions. I added a new function example to my question, to show more accurately what I'm trying to do.
MusiGenesis
@Hans, as discussed elsewhere in the comments, `stackalloc` only works for arrays declared and used in the stack frame. A function like the one shown would need to copy the array for stackalloc, which would be a setback. If I don't understand what you mean, could you explain more? Also, how does it enforce 16 byte alignment? Perhaps if you can get the pointer in unsafe code, you could `mod 16` it and add whatever extra is needed before allocating again. Anyway, will be interesting to see what results come of this.
Drew Noakes
+1  A: 

Another option, if you're able to run Mono, is to use the Mono.Simd package. This provides access to the SIMD instruction set from within .NET. Unfortunately you can't just take the assembly and run it on MS's CLR, as the Mono runtime treats it in a special way at JIT time. The actual assembly contains regular IL (non-SIMD) 'simulations' of the SIMD operations as a fall-back, in case the hardware does not support SIMD instructions.

You also need to be able to express your problem using the types that the API consumes, as far as I can make out.

Miguel de Icaza announced the capability in a blog post back in November 2008. Pretty cool stuff. Hopefully it will be added to the ECMA standard and MS can add it to their CLR.

Drew Noakes