views:

1098

answers:

13

Make sure you run outside of the IDE. That is key.

-edit- I LOVE SLaks comment. "The amount of misinformation in these answers is staggering." :D

Calm down guys. Pretty much all of you were wrong. I DID make optimizations. It turns out whatever optimizations I made wasn't good enough. I ran the code in GCC using gettimeofday (I'll paste code below) and used g++ -O2 file.cpp and got slightly faster results then C#. Maybe MS didn't create the optimizations needed in this specific case but after downloading and installing mingw I was tested and found the speed to be near identical. Justicle Seems to be right. I could have sworn I use clock on my PC and used that to count and found it was slower but problem solved. C++ speed isn't almost twice as slower in the MS compiler.

When my friend informed me of this I couldn't believe it. So I took his code and put some timers onto it.

Instead of Boo I used C#. I constantly got faster results in C#. Why? The .NET version was nearly half the time no matter what number I used.

C++ version (bad version):

#include <iostream>
#include <stdio.h>
#include <intrin.h>
#include <windows.h>
using namespace std;

int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    __int64 time = 0xFFFFFFFF;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
__int64 start = __rdtsc();
        int res = fib(n);
__int64 end = __rdtsc();
        cout << res << endl;
        cout << (float)(end-start)/1000000<<endl;
        break;
    }

    return 0;
}

C++ version (better version):

#include <iostream>
#include <stdio.h>
#include <intrin.h>
#include <windows.h>
using namespace std;

int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    __int64 time = 0xFFFFFFFF;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
        LARGE_INTEGER start, end, delta, freq;
        ::QueryPerformanceFrequency( &freq );
        ::QueryPerformanceCounter( &start );
        int res = fib(n);
        ::QueryPerformanceCounter( &end );
        delta.QuadPart = end.QuadPart - start.QuadPart;
        cout << res << endl;
        cout << ( delta.QuadPart * 1000 ) / freq.QuadPart <<endl;
break;
    }

    return 0;
}

C# version:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using System.Runtime.InteropServices;
using System.ComponentModel;
using System.Threading;
using System.IO;

using System.Diagnostics;

namespace fibCSTest
{
    class Program
    {
         static int fib(int n)
         {
            if (n < 2)return n;
            return fib(n - 1) + fib(n - 2);
         }

         static void Main(string[] args)
         {
             //var sw = new Stopwatch();
             //var timer = new PAB.HiPerfTimer();
             var timer = new Stopwatch();
             while (true)
             {
                 int n;
                 //cin >> n;
                 n = 41;
                 if (n < 0) break;
                 timer.Start();
                 int res = fib(n);
                 timer.Stop();
                 Console.WriteLine(res);
                 Console.WriteLine(timer.ElapsedMilliseconds);
                 break;
             }
         }
    }
}

GCC version:

#include <iostream>
#include <stdio.h>
#include <sys/time.h>
using namespace std;

int fib(int n)
{
    if (n < 2) return n;
    return fib(n - 1) + fib(n - 2);
}

int main()
{
    timeval start, end;
    while (1)
    {
        int n;
        //cin >> n;
        n = 41;
        if (n < 0) break;
        gettimeofday(&start, 0);
        int res = fib(n);
        gettimeofday(&end, 0);
        int sec = end.tv_sec - start.tv_sec;
        int usec = end.tv_usec - start.tv_usec;
        cout << res << endl;
        cout << sec << " " << usec <<endl;
        break;
    }

    return 0;
}
A: 

If that code is truly 1/2 the execution time then some possible reasons are:

  • Garbage collection speeds up execution of C# code over C++ code if that were happening anywhere in the above code.
  • The C# writing to the console may be buffered (C++ might not, or it might just not be as efficient)
jwwishart
1) No new so i doubt there is any GC work2) Timer doesnt include the console write/cout. and that wouldnt make c++ seconds slower.
acidzombie24
To clarify, recursion uses memory on the stack, not the heap, so garbage collection isn't used.
Brendan Long
A: 

It could be that the methods are pre-jitted at runtime prior to running the test...or that the Console is a wrapper around the API for outputting to console, when the C++'s code for cout is buffered..I guess..

Hope this helps, Best regards, Tom.

tommieb75
No; the console output is not being timed.
SLaks
what does "It could be that the methods are pre-jitted at runtime prior to running the test" mean?
Justicle
+4  A: 

Don't understand the answer with garbage collection or console buffering.

It could be that your timer mechanism in C++ is inherently flawed.

According to http://en.wikipedia.org/wiki/Rdtsc, it is possible that you get wrong benchmark results.

Quoted:

While this makes time keeping more consistent, it can skew benchmarks, where a certain amount of spin-up time is spent at a lower clock rate before the OS switches the processor to the higher rate. This has the effect of making things seem like they require more processor cycles than they normally would.

Moron
Along those lines, doesn't the rdtsc() return a generic "tick" count-- not necessarily the 1GHz frequency being assumed in the code?
Eric Pi
@Eric: You are probably right, but depends on what the OP actually has. Seems unlikely to be such a nice figure though.
Moron
Correct - RDSTC is a POS. Use QueryPerformanceCounter() and friends.
Justicle
Another RDTSC problem is that on some systems, it is not synchronized between multiple CPU cores/sockets. On such a system, when the OS migrates a thread to a different CPU, the value of RDTSC may appear to "time warp" forwards or backwards. Affinity can be used to work around this, but doesn't help with the clock ramping described in the answer.
bk1e
By the way, on Windows XP with the ACPI multiprocessor HAL, QueryPerformanceCounter() still uses RDTSC. On Vista/7, it uses HPET (a much better timer) when available.
bk1e
+3  A: 

Not saying that's the issue, but you may want to read How to: Use the High-Resolution Timer

Also see this... http://en.wikipedia.org/wiki/Comparison_of_Java_and_C%2B%2B#Performance

Several studies of mostly numerical benchmarks argue that Java could potentially be faster than C++ in some circumstances, for a variety of reasons:[8][9] Pointers make optimization difficult since they may point to arbitrary data, though many C++ compilers provide the C99 keyword restrict which corrects this problem.[10] Compared to C++ implementations which make unrestrained use of standard implementations of malloc/new for memory allocation, implementations of Java garbage collection may have better cache coherence as its allocations are generally made sequentially. * Run-time compilation can potentially use additional information available at run-time to optimise code more effectively, such as knowing what processor the code will be executed on.

It's about Java but begins to tackle the issue of Performance between C runtimes and JITed runtimes.

kervin
+4  A: 

I think the problem is your timing code in C++.

From the MS docs for __rdtsc:

Generates the rdtsc instruction, which returns the processor time stamp. The processor time stamp records the number of clock cycles since the last reset.

Perhaps try GetTickCount().

karoberts
+2  A: 

Maybe C# is able to unroll stack in recursive calls? I think it is also reduces number of computations.

aaa
Wrong on both counts, like most of the other answers.
SLaks
@SLaks, in this particular case he is right. It's inlining at least one call to fib.
MSN
A: 

I know that the .NET compiler has a Intel optimization.

Luiguis
What are you talking about?
SLaks
+1 But what about AMD?
FFire
It probably does. C++ compilers tend to have optimizations as well.
Justicle
A: 

Speculation 1

Garbage collection procedure might play a role.

In the C++ version all memory management would occur inline while the program is running, and that would count into the final time.

In .NET the Garbage Collector (GC) of the Common Language Runtime (CLR) is a separate process on a different thread and often cleans up your program after it's completed. Therefore your program will finish, the times will print out before memory is freed. Especially for small programs which usually won't be cleaned up at all until completion.

It all depends on details of the Garbage Collection implementation (and if it optimizes for the stack in the same way as the heap) but I assume this plays a partial role in the speed gains. If the C++ version was also optimized to not deallocate/clean up memory until after it finished (or push that step until after the program completed) then I'm sure you would see C++ speed gains.

To Test GC: To see the "delayed" .NET GC behaviour in action, put a breakpoint in some of your object's destructor/finalizer methods. The debugger will come alive and hit those breakpoints after the program is completed (yes, after Main is completed).

Speculation 2

Otherwise, the C# source code is compiled by the programmer down to IL code (Microsoft byte code instructions) and at runtime those are in turn compiled by the CLR's Just-In-Time compiler into an processor-specific instruction set (as with classic compiled programs) so there's really no reason a .NET program should be slower once it gets going and has run the first time.

John K
@SLaks: Try me. I'm curious to know if these are true or just more folklore?
John K
First of all, the term Garbage Collection does not apply to C++. I don't know enough about memory management to confirm or reject your answer.
SLaks
@SLaks: Fixed by saying C++ "memory management" instead of referring to a GC.
John K
There is no memory allocation or object creation in either version of the profiled code. Speculation 2 doesn't really say anything, its just a statement about how C# gets compiled.
Justicle
@Justicle: `new Stopwatch()` creates an object in the C# code - there is heap allocation. Statement #2 is long but serves to provide valuable insight into the author's question about "faster" especially if approaching the subject matter for the first time.
John K
The allocation of the StopWatch doesn't occur *within the profiled code*. I imagine there's all sorts of allocations going on before then as well. Speculation 2 doesn't add anything as to why the benchmarks would be different.
Justicle
+8  A: 
Justicle
It's faster on my machine when running the C# version without the debugger attached.
MSN
'RDSTC is a POS', LOL, how do you think `QueryPerformanceCounter` is implemented? If you dont know how to use RDSTC on your specific CPU/model, then it will be a POS.
leppie
@leppie. I don't know what you think, but in fact `QueryPerformanceCounter` is implemented to use any high performance counter on the system. If there's no high performance counter available, then it's probably no better than RDSTC http://msdn.microsoft.com/en-us/library/ms644900(VS.85).aspx#high_resolution
MarkJ
@MarkJ: And what do you see when you disassemble `RtlQueryPerformanceCounter` in NTDLL.dll? It calls RDSTC.
leppie
@leppie. It calls RDSTC if it detects the results are reliable. In other words it tries hard to avoid the situations when RDSTC is poor http://blogs.msdn.com/oldnewthing/archive/2005/09/02/459952.aspx
MarkJ
@MarkJ: From that post (that is very old) it tries to use RDTSC (just noticed everyone is misspelling this...) as preferred source. It mentions buggy hardware, and does not provide a list of this. I can guarantee you on my i7, it uses the CPU's timestamp counter. The Intel System Programming Guide also says this is good for Nehalem-based processors. My point still stands. RDTSC is only a POS if you dont know how to use it for your CPU.
leppie
@leppie the OP was comparing cycles to milliseconds.
Justicle
A: 

I think everyone here has missed the "secret ingredient" that makes all the difference: The JIT compiler knows exactly what the target architecture is, whereas a static compiler does not. Different x86 processors have very different architectures and pipelines, so a sequence of instructions that is the fastest possible on one CPU might be relatively slower on another.

In this case the Microsoft C++ compiler's optimization strategy was targeted to a different processor than the CPU acidzombie24 was actually using, but gcc chose instructions more suited to his CPU. On a newer, older, or different-manufacturer CPU it is likely Microsoft C++ would be faster than gcc.

JIT has the best potential of all: Since it knows exactly what CPU is being targeted it has the ability to generate the very best possible code in every situation. Thus C# is inherently (in the long term) likely to be faster than C++ for such code.

Having said this, I would guess that the fact that CLR's JIT picked a better instruction sequence than Microsoft C++ was more a matter of luck than knowing the architecture. This is evidenced by the fact that on Justicle's CPU the Microsoft C++ compiler selected a better instruction sequence than the CLR JIT compiler.

A note on _rdtsc vs QueryPerformanceCounter: Yes _rdtsc is broken, but when you're talking a 3-4 second operation and running it several times to validate consistent timing, any situation that causes _rdtsc to give bogus timings (such as processor speed changes or processor changes) should cause outlying values in the test data that will be thrown out, so assuming acidzombie24 did his original benchmarks properly I doubt the _rdtsc vs QueryPerformanceCounter question really had any impact.

Ray Burns
"I doubt the _rdtsc vs QueryPerformanceCounter question really had any impact.", well I actually ran the code, and it did. The existing timing code is just wrong. The OP verified this with a third method - 'GetTimeOfDay'
Justicle
@Justicle: I understand that. My point is that the calibration of _rdtsc vs QueryPerformanceCounter should have taken care of this. Either the calibration was not done properly (or at all) or there was some other error in the test methodology. My guess is that the calibration step was completely omitted. Did YOU try calibrating _rdtsc with Stopwatch()? If so, what did you discover? What ratios were found, and what variability? These need to be measured before blame can be assigned. My guess is that if they are measured, _rdtsc will be exonerated.
Ray Burns
i copied that timing code from elsewhere (here on SO) assuming its right. I didnt calibrate or do anything. Now i use Justicle code. But luckily the only important thing i timed so far was on linux using gettimeofday
acidzombie24
@ray yes, as I said, I ran the original code first, got results that where out by a factor of four, then switched to QueryPerformance* . So as I said, the original timing code is just wrong.
Justicle
Updated answer - explaining why the original timing code is wrong, benchmarks weren't done properly.
Justicle
+12  A: 

EDIT: TL/DR version: CLR JIT will inline one level of recursion, MSVC 8 SP1 will not without #pragma inline_recursion(on). And you should run the C# version outside of a debugger to get the fully optimized JIT.

I got similar results to acidzombie24 with C# vs. C++ using VS 2008 SP1 on a Core2 Duo laptop running Vista plugged in with "high performance" power settings (~1600 ms vs. ~3800 ms). It's kind of tricky to see the optimized JIT'd C# code, but for x86 it boils down to this:

00000000 55               push        ebp  
00000001 8B EC            mov         ebp,esp 
00000003 57               push        edi  
00000004 56               push        esi  
00000005 53               push        ebx  
00000006 8B F1            mov         esi,ecx 
00000008 83 FE 02         cmp         esi,2 
0000000b 7D 07            jge         00000014 
0000000d 8B C6            mov         eax,esi 
0000000f 5B               pop         ebx  
00000010 5E               pop         esi  
00000011 5F               pop         edi  
00000012 5D               pop         ebp  
00000013 C3               ret              
            return fib(n - 1) + fib(n - 2);
00000014 8D 7E FF         lea         edi,[esi-1] 
00000017 83 FF 02         cmp         edi,2 
0000001a 7D 04            jge         00000020 
0000001c 8B DF            mov         ebx,edi 
0000001e EB 19            jmp         00000039 
00000020 8D 4F FF         lea         ecx,[edi-1] 
00000023 FF 15 F8 2F 12 00 call        dword ptr ds:[00122FF8h] 
00000029 8B D8            mov         ebx,eax 
0000002b 4F               dec         edi  
0000002c 4F               dec         edi  
0000002d 8B CF            mov         ecx,edi 
0000002f FF 15 F8 2F 12 00 call        dword ptr ds:[00122FF8h] 
00000035 03 C3            add         eax,ebx 
00000037 8B D8            mov         ebx,eax 
00000039 4E               dec         esi  
0000003a 4E               dec         esi  
0000003b 83 FE 02         cmp         esi,2 
0000003e 7D 04            jge         00000044 
00000040 8B D6            mov         edx,esi 
00000042 EB 19            jmp         0000005D 
00000044 8D 4E FF         lea         ecx,[esi-1] 
00000047 FF 15 F8 2F 12 00 call        dword ptr ds:[00122FF8h] 
0000004d 8B F8            mov         edi,eax 
0000004f 4E               dec         esi  
00000050 4E               dec         esi  
00000051 8B CE            mov         ecx,esi 
00000053 FF 15 F8 2F 12 00 call        dword ptr ds:[00122FF8h] 
00000059 03 C7            add         eax,edi 
0000005b 8B D0            mov         edx,eax 
0000005d 03 DA            add         ebx,edx 
0000005f 8B C3            mov         eax,ebx 
00000061 5B               pop         ebx  
00000062 5E               pop         esi  
00000063 5F               pop         edi  
00000064 5D               pop         ebp  
00000065 C3               ret  

In contrast to the C++ generated code (/Ox /Ob2 /Oi /Ot /Oy /GL /Gr):

int fib(int n)
{ 
00B31000 56               push        esi  
00B31001 8B F1            mov         esi,ecx 
    if (n < 2) return n; 
00B31003 83 FE 02         cmp         esi,2 
00B31006 7D 04            jge         fib+0Ch (0B3100Ch) 
00B31008 8B C6            mov         eax,esi 
00B3100A 5E               pop         esi  
00B3100B C3               ret              
00B3100C 57               push        edi  
    return fib(n - 1) + fib(n - 2); 
00B3100D 8D 4E FE         lea         ecx,[esi-2] 
00B31010 E8 EB FF FF FF   call        fib (0B31000h) 
00B31015 8D 4E FF         lea         ecx,[esi-1] 
00B31018 8B F8            mov         edi,eax 
00B3101A E8 E1 FF FF FF   call        fib (0B31000h) 
00B3101F 03 C7            add         eax,edi 
00B31021 5F               pop         edi  
00B31022 5E               pop         esi  
} 
00B31023 C3               ret              

The C# version basically inlines fib(n-1) and fib(n-2). For a function that is so call heavy, reducing the numbeer of funcyion calls is the key to speed. Replacing fib with the following:

int fib(int n);

int fib2(int n) 
{ 
    if (n < 2) return n; 
    return fib(n - 1) + fib(n - 2); 
} 

int fib(int n)
{ 
    if (n < 2) return n; 
    return fib2(n - 1) + fib2(n - 2); 
} 

Gets it down to ~1900 ms. Incidentally, if I use #pragma inline_recursion(on) I get similar results with the original fib. Unrolling it one more level:

int fib(int n);

int fib3(int n) 
{ 
    if (n < 2) return n; 
    return fib(n - 1) + fib(n - 2); 
} 

int fib2(int n) 
{ 
    if (n < 2) return n; 
    return fib3(n - 1) + fib3(n - 2); 
} 

int fib(int n)
{ 
    if (n < 2) return n; 
    return fib2(n - 1) + fib2(n - 2); 
} 

Gets it down to ~1380 ms. Beyond that it tapers off.

So it appears that the CLR JIT for my machine will inline recursive calls one level, whereas the C++ compiler will not do that by default.

If only all performance critical code were like fib!

MSN
+1 Spot on - I was timing C# within the debugger, running on commandline improved speed quite a bit. I updated my answer with better results.
Justicle
Just curious - how did you the get the ASM for the JIT code? Also, I got no improvement with inline_recursion(on) and Ob2, but there was improvement via manually unrolling. I must be missing a compiler setting...
Justicle
@justicle, I added a `while (true) { }` and attached to the process. As for the C++ version, I turned on "Omit frame layout" and changed the calling convention to `__fastcall`.
MSN
BTW the gcc generated code is even hairier (it performed tail call optimization when I tried it some time ago.
jpalecek
c++ code doesn't have static keyword on fib function, while c# has, that's why timings are wrong.. code is badly translated
fazo
@fazo, specifying `static` on `fib` does not ultimately affect link time code generation in this case. It inlines some of the code into main, but that's about it. The expansion of one level of recursion is a far more effective optimization.
MSN
+1  A: 

One important thing to remember when comparing languages is that if you do a simple line-by-line translation, you're not comparing apples to apples.

What makes sense in one language may have horrible side effects in another. To really compare the performance characteristics you need a C# version and a C++, and the code for those versions may be very different. For example, in C# I wouldn't even use the same function signature. I'd go with something more like this:

IEnumerable<int> Fibonacci()
{
   int n1 = 0;
   int n2 = 1;

   yield return 1;
   while (true)
   {
      int n = n1 + n2;
      n1 = n2;
      n2 = n;
      yield return n;
   }
}

and then wrap that like this:

public static int fib(int n)
{
    return Fibonacci().Skip(n).First();
}

That will do much better, because it works from the bottom up to take advantage of the calculations in the last term to help build the next one, rather than two separate sets of recursive calls.

And if you really want screaming performance in C++ you can use meta-programming to make the compiler pre-compute your results like this:

template<int N> struct fibonacci
{
    static const int value = fibonacci<N - 1>::value + fibonacci<N - 2>::value;
};

template<> struct fibonacci<1>
{
    static const int value = 1;
};

template<> struct fibonacci<0>
{
    static const int value = 0;
};
Joel Coehoorn
Compile time fibonacci is certainly not apples to apples :-)
Justicle
A: 

you are calling static function in c# code which will be inlined, and in c++ you use nonstatic function. i have ~1.4 sec for c++. with g++ -O3 you can have 1.21 sec.

you just can't compare c# with c++ with badly translated code

fazo
badly translate? no way. In both cases the function can be seen outside of the file and class. The way you are using static sounds like the internal keyword in C# which isnt used (limiting the code to the file doesnt exist in C#). So the translation is pretty accurate. I am not going to say static in C++ wont give better performance which i know it could
acidzombie24
c# version clearly has: static int fib(int n)... and is a member of a class... ofcourse it's gonna be inlined. if you want to have the same code for c++ you have to make a class with member function fib - which by default will be inlined as c# is.
fazo