I was curious about the overhead of using the operators +
and *
for math on a large structure vs. a small one. So I made two structs: one Small
with a single double field (8 bytes) and one Big
with 10 doubles (80 bytes). In all my operations I only manipulate one field called x
.
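For reference, the two structs look essentially like this (a minimal sketch; the exact definitions, including the field names of Big, are in the linked zip):

```csharp
// Sketch of the two structs under test. Only x is ever touched;
// the extra fields of Big exist purely to inflate its size to 80 bytes.
public struct Small
{
    public double x;                        // 8 bytes total
    public Small(double x) { this.x = x; }
}

public struct Big
{
    public double x;                        // the only field used in the math
    public double f1, f2, f3, f4, f5, f6, f7, f8, f9; // padding, 80 bytes total
    public Big(double x) : this() { this.x = x; }
}
```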
First I defined mathematical operators in both structures, like
public static Small operator +(Small a, Small b)
{
    return new Small(a.x + b.x);
}

public static Small operator *(double x, Small a)
{
    return new Small(x * a.x);
}
which, as expected, spend a lot of time copying fields around on the stack. I ran 5,000,000 iterations of a mathematical operation and got the slowdown I suspected (roughly 3x for Big):
public double TestSmall()
{
    pt.Start();                            // pt = performance timing object
    Small r = new Small(rnd.NextDouble()); // rnd = random number generator
    for (int i = 0; i < N; i++)
    {
        a = 0.6 * a + 0.4 * r;             // a is a member field of type Small
    }
    pt.Stop();
    return pt.ElapsedSeconds;
}
Results from a Release build (in seconds):
Small=0.33940 Big=0.98909 Big is Slower by x2.91
Now for the interesting part. I defined the same operations as static methods taking ref
arguments:
public static void Add(ref Small a, ref Small b, ref Small res)
{
    res.x = a.x + b.x;
}

public static void Scale(double x, ref Small a, ref Small res)
{
    res.x = x * a.x;
}
and ran the same number of iterations on this test code:
public double TestSmall2()
{
    pt.Start();                             // pt = performance timing object
    Small a1 = new Small();                 // local temporary
    Small a2 = new Small();                 // local temporary
    Small r = new Small(rnd.NextDouble());  // rnd = random number generator
    for (int i = 0; i < N; i++)
    {
        Small.Scale(0.6, ref a, ref a1);
        Small.Scale(0.4, ref r, ref a2);
        Small.Add(ref a1, ref a2, ref a);
    }
    pt.Stop();
    return pt.ElapsedSeconds;
}
And the results (in seconds):
Small=0.11765 Big=0.07130 Big is Slower by x0.61
So compared to the copy-intensive operators I get speedups of about x3 (Small) and x14 (Big), which is great. But compare the Small struct times to the Big ones and you will see that Small is now about 60% slower than Big.
Can anyone explain this? Does it have to do with the CPU pipeline, where spatially separating the operands in memory makes prefetching the data more efficient?
If you want to try this for yourself grab the code from my dropbox http://dl.dropbox.com/u/11487099/SmallBigCompare.zip