views: 325
answers: 6

I have this C# method which I'm trying to optimize:

// assume arrays are same dimensions
private void DoSomething(int[] bigArray1, int[] bigArray2)
{
    int data1;
    byte A1, B1, C1, D1;
    int data2;
    byte A2, B2, C2, D2;
    for (int i = 0; i < bigArray1.Length; i++)
    {
        data1 = bigArray1[i];
        data2 = bigArray2[i];

        A1 = (byte)(data1 >> 0);
        B1 = (byte)(data1 >> 8);
        C1 = (byte)(data1 >> 16);
        D1 = (byte)(data1 >> 24);

        A2 = (byte)(data2 >> 0);
        B2 = (byte)(data2 >> 8);
        C2 = (byte)(data2 >> 16);
        D2 = (byte)(data2 >> 24);

        A1 = A1 > A2 ? A1 : A2;
        B1 = B1 > B2 ? B1 : B2;
        C1 = C1 > C2 ? C1 : C2;
        D1 = D1 > D2 ? D1 : D2;

        bigArray1[i] = (A1 << 0) | (B1 << 8) | (C1 << 16) | (D1 << 24); 
    }
}

The function basically compares two int arrays. For each pair of matching elements, the method compares each individual byte value and takes the larger of the two. The element in the first array is then assigned a new int value constructed from the 4 largest byte values (irrespective of source).

I think I have optimized this method as much as possible in C# (probably I haven't, of course - suggestions on that score are welcome as well). My question is, is it worth it for me to move this method to an unmanaged C DLL? Would the resulting method execute faster (and how much faster), taking into account the overhead of marshalling my managed int arrays so they can be passed to the method?
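
For reference, the interop boundary I have in mind would be declared something like this (the DLL and function names are just placeholders; since int[] is blittable, my understanding is that the marshaller pins the arrays and passes pointers rather than copying them):

// requires: using System.Runtime.InteropServices;
[DllImport("ArrayOps.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void DoSomethingNative(int[] bigArray1, int[] bigArray2, int length);

// called as: DoSomethingNative(bigArray1, bigArray2, bigArray1.Length);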

If doing this would get me, say, a 10% speed improvement, then it would not be worth my time for sure. If it was 2 or 3 times faster, then I would probably have to do it.

Note: please, no "premature optimization" comments, thanks in advance. This is simply "optimization".

Update: I realized that my code sample didn't capture everything I'm trying to do in this function, so here is an updated version:

private void DoSomethingElse(int[] dest, int[] src, double pos, 
    double srcMultiplier)
{
    int rdr;
    byte destA, destB, destC, destD;
    double rem = pos - Math.Floor(pos);
    double recipRem = 1.0 - rem;
    byte srcA1, srcA2, srcB1, srcB2, srcC1, srcC2, srcD1, srcD2;
    for (int i = 0; i < src.Length - 1; i++)
    {
        // get destination values
        rdr = dest[(int)pos + i];
        destA = (byte)(rdr >> 0);
        destB = (byte)(rdr >> 8);
        destC = (byte)(rdr >> 16);
        destD = (byte)(rdr >> 24);
        // get bracketing source values
        rdr = src[i];
        srcA1 = (byte)(rdr >> 0);
        srcB1 = (byte)(rdr >> 8);
        srcC1 = (byte)(rdr >> 16);
        srcD1 = (byte)(rdr >> 24);
        rdr = src[i + 1];
        srcA2 = (byte)(rdr >> 0);
        srcB2 = (byte)(rdr >> 8);
        srcC2 = (byte)(rdr >> 16);
        srcD2 = (byte)(rdr >> 24);
        // interpolate (simple linear) and multiply
        srcA1 = (byte)(((double)srcA1 * recipRem) + 
            ((double)srcA2 * rem) * srcMultiplier);
        srcB1 = (byte)(((double)srcB1 * recipRem) +
            ((double)srcB2 * rem) * srcMultiplier);
        srcC1 = (byte)(((double)srcC1 * recipRem) +
            ((double)srcC2 * rem) * srcMultiplier);
        srcD1 = (byte)(((double)srcD1 * recipRem) +
            ((double)srcD2 * rem) * srcMultiplier);
        // bytewise best-of
        destA = srcA1 > destA ? srcA1 : destA;
        destB = srcB1 > destB ? srcB1 : destB;
        destC = srcC1 > destC ? srcC1 : destC;
        destD = srcD1 > destD ? srcD1 : destD;
        // convert bytes back to int
        dest[(int)pos + i] = (destA << 0) | (destB << 8) |
            (destC << 16) | (destD << 24);
    }
}

Essentially this does the same thing as the first method, except in this one the second array (src) is always smaller than the first (dest), and the second array is positioned fractionally relative to the first (meaning that instead of being positioned at, say, 10 relative to dest, it can be positioned at 10.682791).

To achieve this, I have to interpolate between two bracketing values in the source (say, 10 and 11 in the above example, for the first element) and then compare the interpolated bytes with the destination bytes.

I suspect that the multiplication involved in this function is substantially more costly than the byte comparisons, so that part may be a red herring (sorry). Also, even if the comparisons are still somewhat expensive relative to the multiplications, there is another wrinkle: this system can actually be multi-dimensional, meaning that instead of comparing 1-dimensional arrays, the arrays could be 2-, 5- or however-many-dimensional. In that case I'm assuming the time taken to calculate interpolated values would eventually dwarf the time taken by the final bytewise comparison of 4 bytes.

How expensive is the multiplication here relative to the bit-shifting, and is this the kind of operation that could be sped up by being offloaded to a C DLL (or even an assembly DLL, although I'd have to hire somebody to create that for me)?
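
In case it's relevant, a minimal harness for timing these methods (Release build, run outside the debugger) looks something like this; the array sizes are arbitrary:

// (assumes this runs in the same class that defines DoSomethingElse)
int[] dest = new int[25000000];
int[] src = new int[1000000];
var rand = new Random(1);
for (int i = 0; i < dest.Length; i++) dest[i] = rand.Next();
for (int i = 0; i < src.Length; i++) src[i] = rand.Next();

var sw = System.Diagnostics.Stopwatch.StartNew();
DoSomethingElse(dest, src, 10.682791, 0.5);
sw.Stop();
Console.WriteLine("{0} ms", sw.ElapsedMilliseconds);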

A: 

You might like to look at the BitConverter class - can't remember if it is the right endianness for the particular conversion you're trying to do, but worth knowing about anyway.
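
Roughly, that approach would look like this (note that GetBytes returns a new 4-byte array on every call):

    private static void DoSomethingWithBitConverter(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            // GetBytes allocates a fresh byte[4] each time it's called
            byte[] b1 = BitConverter.GetBytes(bigArray1[i]);
            byte[] b2 = BitConverter.GetBytes(bigArray2[i]);
            for (int j = 0; j < 4; j++)
            {
                if (b2[j] > b1[j])
                    b1[j] = b2[j];
            }
            bigArray1[i] = BitConverter.ToInt32(b1, 0);
        }
    }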

David M
In this case the function call overhead to call the BitConverter function would almost certainly overshadow any gain you would see from it.
Gabe
Actually, my code originally used BitConverter for this, but it's much slower than bit-shifting as in the above examples. I think (but do not know) that it's mainly because BitConverter.GetBytes() has to allocate a new byte array and return it every time it's called.
MusiGenesis
+2  A: 

What about this?

    private void DoSomething(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            bigArray1[i] = (int)(
                Math.Max(data1 & 0x000000FF, data2 & 0x000000FF) |
                Math.Max(data1 & 0x0000FF00, data2 & 0x0000FF00) |
                Math.Max(data1 & 0x00FF0000, data2 & 0x00FF0000) |
                Math.Max(data1 & 0xFF000000, data2 & 0xFF000000));
        }
    }

It has a lot less bit shifting in it. You might find the calls to Math.Max aren't inlined if you profile it. In such a case, you'd just make the method more verbose.

I haven't tested this code as I don't have an IDE with me. I reckon it does what you want though.

If this still doesn't perform as you'd expect, you could try using pointer arithmetic in an unsafe block, but I seriously doubt that you'd see a gain. Code like this is unlikely to be faster if you extern to it, from everything I've read. But don't take my word for it. Measure, measure, measure.

Good luck.

Drew Noakes
My real code actually does some additional math operations within the loop (I left them out for "clarity" - :) ), so my question was a more general one about how much faster an array-processing method would be as a C DLL. I can't really measure how much faster it would be without actually writing it, but I was asking here to find out if it's worth the effort or not.
MusiGenesis
I think you need to double up the masks (0x000F should be 0x000000FF). Also since `int` is a signed 32-bit integer, the final comparison (with mask 0xFF000000) may cause issues; replacing `int` with `uint` throughout is probably the way forward. But the principle is sound :-)
psmears
@psmears: would it be faster than just bit-shifting?
MusiGenesis
@psmears -- you're right, thanks.
Drew Noakes
@MusiGenesis -- I guess you'd have to look at the Intel (or other chip manufacturer) charts and see the cycle counts for the AND/OR/SHIFT operations. You could also use VS to show the optimised JITted assembly code, though it takes some setting up. There's a good article about this somewhere on codeproject.com.
Drew Noakes
Unsafe code here will give you a slight performance improvement. While array-bounds checks are eliminated for accessing `bigArray1` (because the compiler knows that `i` will never go out of bounds, checks are unnecessary), it will use bounds checks for `bigArray2` because it can't know whether or not `i` will get bigger than the length of `bigArray2`. It won't be much faster, but still. As for other optimizations: isn't this method embarrassingly parallel? I think you'll see a 2x speed-up from using `Parallel.For` on a dual-core.
JulianR
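
For reference, the `Parallel.For` version suggested in the comment above might look like this (requires .NET 4; a sketch, not benchmarked):

    private static void DoSomethingParallel(int[] bigArray1, int[] bigArray2)
    {
        // each iteration only touches index i, so iterations are independent
        System.Threading.Tasks.Parallel.For(0, bigArray1.Length, i =>
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            bigArray1[i] = (int)(
                Math.Max(data1 & 0x000000FF, data2 & 0x000000FF) |
                Math.Max(data1 & 0x0000FF00, data2 & 0x0000FF00) |
                Math.Max(data1 & 0x00FF0000, data2 & 0x00FF0000) |
                Math.Max(data1 & 0xFF000000, data2 & 0xFF000000));
        });
    }
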
It appears that the calls to Math.Max are not inlined, as this code is about 50% slower than the original.
Jim Mischel
I tried running the above, and for me, under .NET 4, the timings with 100 million items are: 1.05 seconds for the original (x86), 0.80 for the original (x64), 0.75 for the above (x86), 0.63 for the above (x64). So the above version running under x64 is significantly faster than the original. The calls to Math.Max have definitely been inlined for me. A direct port of the original function to VC++ 2010 performed identically to the C#.
JulianR
@JulianR: I think you left an earlier comment about this being a good candidate for parallel processing, and you're right. Unfortunately, it's a Windows Mobile app.
MusiGenesis
+2  A: 

I don't see any way of speeding up this code by means of clever bit tricks.

If you really want this code to be faster, the only way I see of significantly (>2x or so) speeding it up on the x86 platform is to go for an assembler/intrinsics implementation. SSE has the PCMPGTB instruction, which:

"Performs a SIMD compare for the greater value of the packed bytes, words, or doublewords in the destination operand (first operand) and the source operand (second operand). If a data element in the destination operand is greater than the corresponding data element in the source operand, the corresponding data element in the destination operand is set to all 1s; otherwise, it is set to all 0s."

An XMM register fits four 32-bit ints, so you could loop over your arrays reading the values, getting the mask, ANDing the first input with the mask and the second one with the inverted mask, and ORing the two results together.

On the other hand, maybe you can reformulate your algorithm so that you don't need to pick the larger bytes, but could, for example, take the AND of the operands instead? Just a thought; it's hard to see whether that would work without seeing the actual algorithm.

Krystian
+2  A: 

The function below uses unsafe code to treat the integer arrays as arrays of bytes so that there's no need for bit twiddling.

    private static void DoOtherThing(int[] bigArray1, int[] bigArray2)
    {
        unsafe
        {
            fixed (int* p1 = bigArray1, p2=bigArray2)
            {
                byte* b1 = (byte*)p1;
                byte* b2 = (byte*)p2;
                byte* bend = (byte*)(&p1[bigArray1.Length]);
                while (b1 < bend)
                {
                    if (*b1 < *b2)
                    {
                        *b1 = *b2;
                    }
                    ++b1;
                    ++b2;
                }
            }
        }
    }

On my machine running under the debugger in Release mode against arrays of 25 million ints, this code is about 29% faster than your original. However, running standalone, there is almost no difference in runtime. Sometimes your original code is faster, and sometimes the new code is faster.

Approximate numbers:

          Debugger  Standalone
Original  1,400 ms    700 ms
My code     975 ms    700 ms

And, yes, I did compare the results to ensure that the functions do the same thing.

I'm at a loss to explain why my code isn't faster, since it's doing significantly less work.

Given these results, I doubt that you could improve things by going to native code. As you say, the overhead of marshaling the arrays would likely eat up any savings you might realize in the processing.

The following modification to your original code, though, is 10% to 20% faster.

    private static void DoSomething(int[] bigArray1, int[] bigArray2)
    {
        for (int i = 0; i < bigArray1.Length; i++)
        {
            var data1 = (uint)bigArray1[i];
            var data2 = (uint)bigArray2[i];

            var A1 = data1 & 0xff;
            var B1 = data1 & 0xff00;
            var C1 = data1 & 0xff0000;
            var D1 = data1 & 0xff000000;

            var A2 = data2 & 0xff;
            var B2 = data2 & 0xff00;
            var C2 = data2 & 0xff0000;
            var D2 = data2 & 0xff000000;

            if (A2 > A1) A1 = A2;
            if (B2 > B1) B1 = B2;
            if (C2 > C1) C1 = C2;
            if (D2 > D1) D1 = D2;

            bigArray1[i] = (int)(A1 | B1 | C1 | D1);
        }
    }
Jim Mischel
+1 for beating me to it :) With a nearly identical version of the above code and 50M ints, I see a 4x speedup in standalone Release. I wonder what we are doing differently. Are you accidentally measuring the array creation and comparison? I used one executable to test both pieces of code back-to-back, so maybe some form of caching interfered with my results.
Dan Terry
I get a 4x speedup if I pass the modified bigArray1 to the second function, after calling the first function. But if I use a copy of the array, the times are the same. And, no, I didn't measure the array creation and comparison.
Jim Mischel
I found my very dumb mistake (when I recreated array1, I was using a smaller length from a previous test), and now my numbers are on par with yours. On my machine/setup, the unsafe version is actually running slightly slower than the safe version. Thanks Jim!
Dan Terry
No problem, Terry. I'm still trying to figure out why the unsafe version is slower at all. I'm thinking it's due to runtime memory checks on write.
Jim Mischel
I think we are not seeing a speedup because we are looping over byte pointers rather than int pointers. So for every item in the input array, we have to do 4 comparisons plus 8 increments just to manage the while loop. Looping over int pointers takes those numbers to 1 and 2 respectively. This change gets me to a 2x speedup over the original posters code (standalone, Release, data dependent because the data determines the number of writes)... barring any dumb mistakes.
Dan Terry
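
A rough sketch of that int-pointer variant (untested; same caveats as above):

    private static unsafe void DoOtherThingIntPointers(int[] bigArray1, int[] bigArray2)
    {
        fixed (int* p1 = bigArray1, p2 = bigArray2)
        {
            int* end = p1 + bigArray1.Length;
            int* i1 = p1;
            int* i2 = p2;
            while (i1 < end)
            {
                // view the current int as four bytes and keep the larger of each
                byte* b1 = (byte*)i1;
                byte* b2 = (byte*)i2;
                if (b2[0] > b1[0]) b1[0] = b2[0];
                if (b2[1] > b1[1]) b1[1] = b2[1];
                if (b2[2] > b1[2]) b1[2] = b2[2];
                if (b2[3] > b1[3]) b1[3] = b2[3];
                ++i1;
                ++i2;
            }
        }
    }
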
Dan: Yes, we're doing four times the increments and, on average, 50% more writes to the array. But the original code does a whole lot of bit twiddling and assignments to local variables. I'm thinking it's the writes that are the problem, but haven't come up with a way to test yet . . .
Jim Mischel
Wow. Compiling to Release mode instead of Debug just sped my code up by about 2-3 times. I also got a small improvement by using `float` instead of `double`.
MusiGenesis
+7  A: 

Yes, the _mm_max_epu8() intrinsic does what you want. It chews through 16 bytes at a time. The pain-point is the arrays. SSE2 instructions require their arguments to be aligned at 16-byte addresses. You cannot get that out of the garbage-collected heap; it only promises 4-byte alignment. Even if you trick it by calculating an offset into the array that's 16-byte aligned, you'll lose when the garbage collector kicks in and moves the array.

You'll have to declare the arrays in the C/C++ code, using the __declspec(align(#)) declarator. Now you need to copy your managed arrays into those unmanaged ones. And the results back. Whether you are still ahead depends on details not easily seen in your question.

Hans Passant
Would pinning the arrays solve the issue with the GC moving them?
chibacity
+1 -- great answer. Pinning the arrays via `fixed` will stop the GC moving them, but I don't think you can influence 16-byte alignment within .NET, even with explicit struct layout.
Drew Noakes
Fixed arrays on the stack or stackalloc() could work. Over-allocate them by 12 bytes so you can bump up the byte* to 16-byte alignment. Just gotta keep the stack frame alive long enough.
Hans Passant
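
For illustration, the over-allocate-and-align idea applied to an unmanaged-heap buffer instead of the stack (a sketch only; the native call is hypothetical):

    // requires: using System; using System.Runtime.InteropServices;
    private static void RunAligned(int[] managed)
    {
        int byteCount = managed.Length * sizeof(int);
        // AllocHGlobal doesn't guarantee 16-byte alignment, so over-allocate
        // by 15 bytes and round the pointer up to the next 16-byte boundary
        IntPtr raw = Marshal.AllocHGlobal(byteCount + 15);
        try
        {
            IntPtr aligned = new IntPtr((raw.ToInt64() + 15) & ~15L);
            Marshal.Copy(managed, 0, aligned, managed.Length);
            // NativeByteMax(aligned, managed.Length);   // hypothetical SSE routine in the C DLL
            Marshal.Copy(aligned, managed, 0, managed.Length);
        }
        finally
        {
            Marshal.FreeHGlobal(raw);
        }
    }
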
+1, great, I forgot there is an instruction like this.
Krystian
This may be worth trying for me. I can live with having to declare the arrays inside the C code. Sorry I didn't include enough details in the question, but I also need to do some math operations on the byte values (like multiplication) before comparing. Given the number of calls I'm making to this method, the overall operation is taking 15-20 seconds, and I'm hoping to pare it down to 5 seconds or less. Is this possible with a switch to a C DLL? Am I looking at 3-4 times faster this way?
MusiGenesis
It should be a whole lot faster than x4. One instruction vs ~126.
Hans Passant
@Hans: thanks very much for all your suggestions. I added a new function example to my question, to show more accurately what I'm trying to do.
MusiGenesis
@Hans, as discussed elsewhere in the comments, `stackalloc` only works for arrays declared and used in the stack frame. A function like the one shown would need to copy the array for stackalloc, which would be a setback. If I don't understand what you mean, could you explain more? Also, how does it enforce 16 byte alignment? Perhaps if you can get the pointer in unsafe code, you could `mod 16` it and add whatever extra is needed before allocating again. Anyway, will be interesting to see what results come of this.
Drew Noakes
+1  A: 

Another option, if you're able to run Mono, is to use the Mono.Simd package. This provides access to the SIMD instruction set from within .NET. Unfortunately you can't just take the assembly and run it on MS's CLR, as the Mono runtime treats it in a special way at JIT time. The actual assembly contains regular IL (non-SIMD) 'simulations' of the SIMD operations as a fall-back, in case the hardware does not support SIMD instructions.

You also need to be able to express your problem using the types that the API consumes, as far as I can make out.

Miguel de Icaza announced the capability in a blog post back in November 2008. Pretty cool stuff. Hopefully it will be added to the ECMA standard and MS can add it to their CLR.

Drew Noakes