I have this C# method which I'm trying to optimize:
// assume arrays are same dimensions
private void DoSomething(int[] bigArray1, int[] bigArray2)
{
    int data1;
    byte A1, B1, C1, D1;
    int data2;
    byte A2, B2, C2, D2;
    for (int i = 0; i < bigArray1.Length; i++)
    {
        data1 = bigArray1[i];
        data2 = bigArray2[i];
        // unpack both ints into their four bytes
        A1 = (byte)(data1 >> 0);
        B1 = (byte)(data1 >> 8);
        C1 = (byte)(data1 >> 16);
        D1 = (byte)(data1 >> 24);
        A2 = (byte)(data2 >> 0);
        B2 = (byte)(data2 >> 8);
        C2 = (byte)(data2 >> 16);
        D2 = (byte)(data2 >> 24);
        // keep the larger of each byte pair
        A1 = A1 > A2 ? A1 : A2;
        B1 = B1 > B2 ? B1 : B2;
        C1 = C1 > C2 ? C1 : C2;
        D1 = D1 > D2 ? D1 : D2;
        // repack the four winners into the first array
        bigArray1[i] = (A1 << 0) | (B1 << 8) | (C1 << 16) | (D1 << 24);
    }
}
The function basically compares two int arrays. For each pair of corresponding elements, the method compares each individual byte value and takes the larger of the two. The element in the first array is then assigned a new int value constructed from the four winning bytes (irrespective of source).
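For what it's worth, one C#-level alternative I have sketched (but not benchmarked, so treat it as a rough idea rather than a measured win) is a branch-free per-byte max that never unpacks the bytes into locals, using plain integer arithmetic on the packed values:

// Branch-free per-byte unsigned max of two packed ints (a SWAR-style
// trick). Sketch only - I haven't measured whether this beats the
// unpack/compare/repack loop above.
private static int ByteWiseMax(int x, int y)
{
    uint a = (uint)x, b = (uint)y;
    // split into even bytes (0, 2) and odd bytes (1, 3) so each byte
    // gets 8 bits of headroom for the arithmetic below
    uint aEven = a & 0x00FF00FFu, aOdd = (a >> 8) & 0x00FF00FFu;
    uint bEven = b & 0x00FF00FFu, bOdd = (b >> 8) & 0x00FF00FFu;
    // per 16-bit lane, (x + 0x100 - y) has bit 8 set iff x >= y
    uint geEven = (((aEven + 0x01000100u) - bEven) >> 8) & 0x00010001u;
    uint geOdd = (((aOdd + 0x01000100u) - bOdd) >> 8) & 0x00010001u;
    // stretch the 0/1 flags into 0x00/0xFF byte masks
    uint maskEven = geEven * 0xFFu;
    uint maskOdd = geOdd * 0xFFu;
    // select the winning byte in every lane, then re-interleave
    uint even = (aEven & maskEven) | (bEven & ~maskEven);
    uint odd = (aOdd & maskOdd) | (bOdd & ~maskOdd);
    return (int)(even | (odd << 8));
}

With that helper, the loop body collapses to bigArray1[i] = ByteWiseMax(bigArray1[i], bigArray2[i]);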
That sketch aside, I think I have optimized this method as much as possible in C# (probably I haven't, of course - suggestions on that score are still welcome). My question is: is it worth it for me to move this method to an unmanaged C DLL? Would the resulting method execute faster (and by how much), taking into account the overhead of marshalling my managed int arrays so they can be passed to the method?
If doing this would get me, say, a 10% speed improvement, then it would not be worth my time for sure. If it were 2 or 3 times faster, then I would probably have to do it.
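For concreteness, the interop boundary I have in mind would look something like this (the DLL and export names are made up; the relevant point, as I understand it, is that int[] is blittable, so the marshaller pins the arrays and passes pointers rather than copying anything per call):

// requires: using System.Runtime.InteropServices;
// Hypothetical declaration for a native "bytemax.dll" exporting
// void ByteMax(int* a, const int* b, int n) that runs the same loop in C.
// Because int[] is blittable, both arrays are pinned for the duration of
// the call and modified in place - no marshalling copy is made.
[DllImport("bytemax.dll", CallingConvention = CallingConvention.Cdecl)]
private static extern void ByteMax(int[] bigArray1, int[] bigArray2,
    int length);

// usage: ByteMax(bigArray1, bigArray2, bigArray1.Length);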
Note: please, no "premature optimization" comments, thanks in advance. This is simply "optimization".
Update: I realized that my code sample didn't capture everything I'm trying to do in this function, so here is an updated version:
private void DoSomethingElse(int[] dest, int[] src, double pos,
    double srcMultiplier)
{
    int rdr;
    byte destA, destB, destC, destD;
    double rem = pos - Math.Floor(pos);
    double recipRem = 1.0 - rem;
    byte srcA1, srcA2, srcB1, srcB2, srcC1, srcC2, srcD1, srcD2;
    int ipos = (int)pos;
    // stop one element short: the interpolation reads src[i + 1]
    for (int i = 0; i < src.Length - 1; i++)
    {
        // get destination values
        rdr = dest[ipos + i];
        destA = (byte)(rdr >> 0);
        destB = (byte)(rdr >> 8);
        destC = (byte)(rdr >> 16);
        destD = (byte)(rdr >> 24);
        // get bracketing source values
        rdr = src[i];
        srcA1 = (byte)(rdr >> 0);
        srcB1 = (byte)(rdr >> 8);
        srcC1 = (byte)(rdr >> 16);
        srcD1 = (byte)(rdr >> 24);
        rdr = src[i + 1];
        srcA2 = (byte)(rdr >> 0);
        srcB2 = (byte)(rdr >> 8);
        srcC2 = (byte)(rdr >> 16);
        srcD2 = (byte)(rdr >> 24);
        // interpolate (simple linear), then scale the whole result;
        // note the parentheses so srcMultiplier applies to the full
        // interpolated value, not just the second term
        srcA1 = (byte)((((double)srcA1 * recipRem) +
            ((double)srcA2 * rem)) * srcMultiplier);
        srcB1 = (byte)((((double)srcB1 * recipRem) +
            ((double)srcB2 * rem)) * srcMultiplier);
        srcC1 = (byte)((((double)srcC1 * recipRem) +
            ((double)srcC2 * rem)) * srcMultiplier);
        srcD1 = (byte)((((double)srcD1 * recipRem) +
            ((double)srcD2 * rem)) * srcMultiplier);
        // bytewise best-of
        destA = srcA1 > destA ? srcA1 : destA;
        destB = srcB1 > destB ? srcB1 : destB;
        destC = srcC1 > destC ? srcC1 : destC;
        destD = srcD1 > destD ? srcD1 : destD;
        // convert bytes back to int, writing to the same slot we read
        dest[ipos + i] = (destA << 0) | (destB << 8) |
            (destC << 16) | (destD << 24);
    }
}
Essentially this does the same thing as the first method, except in this one the second array (src) is always smaller than the first (dest), and the second array is positioned fractionally relative to the first (meaning that instead of being positioned at, say, 10 relative to dest, it can be positioned at 10.682791).
To achieve this, I have to interpolate between two bracketing values in the source (say, 10 and 11 in the above example, for the first element) and then compare the interpolated bytes with the destination bytes.
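To make the weighting concrete: with pos = 10.682791, rem = 0.682791 and recipRem = 0.317209, so a source byte pair of, say, 100 and 200 interpolates to 100 * 0.317209 + 200 * 0.682791 = 168.28, which is then scaled by srcMultiplier and truncated by the cast to byte.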
I suspect that the multiplication involved in this function is substantially more costly than the byte comparisons, so that part may be a red herring (sorry). Also, even if the comparisons are still somewhat expensive relative to the multiplications, there is a further complication: this system can actually be multi-dimensional, so instead of comparing 1-dimensional arrays, the arrays could be 2-, 5- or whatever-dimensional. In that case, I assume, the time taken to calculate the interpolated values would dwarf the time taken by the final bytewise comparison of 4 bytes.
How expensive is the multiplication here relative to the bit-shifting, and is this the kind of operation that could be sped up by being offloaded to a C DLL (or even an assembly DLL, although I'd have to hire somebody to create that for me)?
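One idea I keep coming back to (again a sketch, not benchmarked): since rem, recipRem and srcMultiplier are all constant for the whole loop, they can be folded once into two integer weights, replacing every per-byte double multiply (and the int-to-double conversions) with integer multiplies and a shift. This matches the "interpolate, then scale" order of the corrected code above:

// 8.8 fixed-point weights, hoisted out of the loop. Assumes
// srcMultiplier is small enough that 255 * (w1 + w2) fits in an int.
int w1 = (int)(recipRem * srcMultiplier * 256.0 + 0.5);
int w2 = (int)(rem * srcMultiplier * 256.0 + 0.5);

// then, per byte, instead of the double arithmetic:
int v = (srcA1 * w1 + srcA2 * w2) >> 8;   // back to 0..n byte scale
byte b = v > 255 ? (byte)255 : (byte)v;   // clamp rather than wrap

Whether integer math actually beats the doubles on a modern CPU is exactly the kind of thing I would have to measure, but it does shrink the inner loop to multiplies, adds and shifts, which presumably also makes it a more attractive candidate for a C or assembly port.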