tags:

views:

65

answers:

5

I was curious on the overhead of a large structure vs. a small structure in using operators + and * for math. So I made two struct, one Small with 1 double field (8 bytes) and one Big with 10 doubles (80 bytes). In all my operations I only manipulate one field called x.

First I defined in both structures mathematical operators like

public static Small operator +(Small a, Small b)
{
    return new Small(a.x + b.x);
}
public static Small operator *(double x, Small a)
{
    return new Small(x * a.x);
}

which as expected use up a lot of memory in the stack for copying fields around. I run 5,000,000 iterations of a mathematical operation and got what I suspected (3 times slowdown).

public double TestSmall()
{
    pt.Start(); // pt = performance timing object
    Small r = new Small(rnd.NextDouble()); //rnd = Random number generator
    for (int i = 0; i < N; i++)
    {
        a = 0.6 * a + 0.4 * r;   // a is a local field of type Small
    }
    pt.Stop();
    return pt.ElapsedSeconds;
}

results from Release code (in seconds)

Small=0.33940 Big=0.98909     Big is Slower by x2.91

Now for the interesting part. I define the same operations with static methods with ref arguments

public static void Add(ref Small a, ref Small b, ref Small res)
{
    res.x = a.x + b.x;
}
public static void Scale(double x, ref Small a, ref Small res)
{
    res.x = x * a.x;
}

and run the same number of iterations on this test code:

public double TestSmall2()
{
    pt.Start(); // pt = performance timing object
    Small a1 = new Small(); // local
    Small a2 = new Small(); // local
    Small r = new Small(rnd.NextDouble()); //rdn = Random number generator
    for (int i = 0; i < N; i++)
    {
        Small.Scale(0.6, ref a, ref a1);
        Small.Scale(0.4, ref r, ref a2);
        Small.Add(ref a1, ref a2, ref a);
    }
    pt.Stop();
    return pt.ElapsedSeconds;
}

And the results show (in seconds)

Small=0.11765 Big=0.07130     Big is Slower by x0.61

So compared to the mem-copy intensive operators I get a speedup of x3 and x14 which is great, but compare the Small struct times to the Big and you will see that Small is 60% slower than Big.

Can anyone explain this? Does it have to do with CPU pipeline and separating out operations in (spatially) memory makes for more efficient pre-fetch of data?

If you want to try this for yourself grab the code from my dropbox http://dl.dropbox.com/u/11487099/SmallBigCompare.zip

+1  A: 

I can't reproduce your results. On my box, the "ref" version has basically the same performance for Big and Small, within tolerance.

(Running Release mode without the debugger attached, with 10 or 100 times as many iterations just to try to get a nice long run.)

Have you tried running your version for lots of iterations? Is it possible that while the tests are running, your CPU is gradually increasing its clock speed (as it spots that it's having to work hard)?

Jon Skeet
Same here, the operator version is equally fast as the ref version. I think he is running the release version in the debugger. If I do that, I get results similar to jalexiou's.
jdv
I have added a long calculation in the benning of the code to "warm-up" the CPU's. Thanks for the suggestion because I now get the same timing .. very interesting.
jalexiou
+3  A: 

There appear to be a couple of flaws in your benchmark.

  1. Use Stopwatch instead of the PerformanceTimer type. I'm not familiar with the latter and it appears to be a 3rd party component. It's particularly troubling that it's measuring time in EllapsedSeconds instead of EllapsedMilliseconds.
  2. Should run each test twice and count only the second to eliminate potential JIT costs
  3. Marshal.SizeOf is does not produce the actual size of the struct, just it's marshalling size.

After switching to Stopwatch I see the benchmark performing as expected by producing nearly equal times for both types in the static ref case.

JaredPar
PerformanceTimer is taken from <http://www.codeproject.com/csharp/highperformancetimercshar.asp> and uses the [DllImport("Kernel32.dll")] static extern bool QueryPerformanceCounter( out long lpPerformanceCount); [DllImport("Kernel32.dll")] static extern bool QueryPerformanceFrequency( out long lpFrequency);functions
jalexiou
@jalexiou: Yes, but I'd recommend using Stopwatch in modern code. The article predates Stopwatch (it's 8 years old!) - it's cleaner to use the built-in timer. I doubt that that caused any problems, but it's worth knowing about.
Jon Skeet
There are two things that I don't like about `Stopwatch`. a) The resolution is in integer miliseconds (it shoul'd be in micro or nanoseconds). b) The need to Reset() before each evaluation cycle make it a hasle sometimes. The performance counter is always running and I am just polling the current `tics`. My current resolution limit is 0.33ns (freq=3.0 GHz), but realistically I could only measure maybe 10-15ns.
jalexiou
A: 

Agreed with Jared, this is a benchmarking error.

The essence of the issue/discrepancy your seeing is a result of not running a 'warmup' on the tests. This ensures that all types and methods have been loaded in the CLR runtime. You should place a for loop around the main test and always run the benchmarks several times... watch for the change after the first set in the following results:

    Size of Small is 8 bytes
    Size of Big is 80 bytes

    5,000,000.00 Iterations
    Operator Results
      Small=523.00000       Big=1953.00000  Slower=x3.73
    StaticRef Results
      Small=2042.00000      Big=2125.00000  Slower=x1.04
      Small=x0.26   Big=x0.92

    5,000,000.00 Iterations
    Operator Results
      Small=2464.00000      Big=3510.00000  Slower=x1.42
    StaticRef Results
      Small=3578.00000      Big=3647.00000  Slower=x1.02
      Small=x0.69   Big=x0.96

    5,000,000.00 Iterations
    Operator Results
      Small=3921.00000      Big=4817.00000  Slower=x1.23
    StaticRef Results
      Small=4880.00000      Big=4944.00000  Slower=x1.01
      Small=x0.80   Big=x0.97
csharptest.net
A: 

I have a couple of suggestions.

  • Use the Stopwatch class. It uses the exact same Win32 APIs, but is already coded for you.
  • Increase the iteration count so that your benchmarks take at least 1s (or more) to run otherwise anomalies could crop up and dominate the time.
  • Consider the effects of the vshost.exe process. You will get different results for both the Debug and Release builds depending on whether your run the application standalone or through the Visual Studio host process.

When I ran your code I saw similiar results for the pass-by-ref test in all test scenarios. What really stuck out for me was just how much faster the smaller struct was in a Release build standalone (ie. not through vshost.exe).

Release build standalone:

Size of Small is 8 bytes
Size of Big is 80 bytes
50,000,000.00 Iterations
Operator Results
  Small=0.57173 Big=25.58988    Slower=x44.76

StaticRef Results
  Small=26.06602        Big=26.68569    Slower=x1.02
  Small=x0.02   Big=x0.96

Release build through vshost:

Size of Small is 8 bytes
Size of Big is 80 bytes
50,000,000.00 Iterations
Operator Results
  Small=4.56601 Big=35.33387    Slower=x7.74

StaticRef Results
  Small=37.94317        Big=39.64959    Slower=x1.04
  Small=x0.12   Big=x0.89
Brian Gideon
A: 

Thanks everyone for their input. Here are some final thoughts.

PerformanceCounter yields the same results as Stopwatch so that is a no issue.

Final results:

1. For small struct using an operator or by-ref yields the same performance
2. For big struct using by-ref is like 14x faster
3. Big struct is x20 slower than small struct for operators (as expected)
4. Big struct is about 50% slower than small struct with by-ref (still interesting)

So the final question is what is the mechanism that makes the Big struct slower with by-ref since no stack copying should occur?


Results from release executable external to visual studio.

Size of Small is 8 bytes
Size of Big is 80 bytes
5,000,000.00 Iterations
Warming up the CPU's

Using QueryPerformanceCounter
Operator Results
  Small=0.03545 Big=0.71519     Slower=x20.18

StaticRef Results
  Small=0.03526 Big=0.05194     Slower=x1.47
  Small=x1.01   Big=x13.77
jalexiou