You are using floating point code! Compilers are rubbish with floating point code.
Here are some measurements I've made. I'm using DevStudio 2005 with default optimisations, and I changed the code slightly:
// added to the inner part of the loop
fn_value += j;
// added a dependency on fn_value so that the compiler doesn't optimise the
// whole loop down to nothing
printf("Time taken %lf - %f", (double) (end-start) / CLOCKS_PER_SEC, fn_value);
So, I get this running in about 5s.
Now, I changed the code a bit:
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    int fn_value = 0; /* beware: the exact sum, 3,222,656,250, overflows a 32-bit int */
    int n = 10, i, j;
    unsigned int k;
    clock_t start, end;

    start = clock();
    for (k = 0; k < 9765625; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = i; j < n; j++)
                fn_value += j;
        }
    }
    end = clock();
    printf("Time taken %lf - %d", (double) (end - start) / CLOCKS_PER_SEC, fn_value);
    return 0;
}
I changed fn_value to an int. It now takes about a second! So there's a four-second difference between adding ints and adding floats. I then wrote a version with IA32 FPU opcodes instead of the C code and got about 1.4 seconds, which isn't that much slower than using ints.
Then I used the C floating point version but made fn_value a double, and the time became 1.25s. Now, that surprised me. It beat the FPU opcode version, but, looking at the disassembly, the only difference is that the pure C version unrolled the inner loop.
Also, when using floats, the result is incorrect: once the running total grows large enough, a 32-bit float can no longer absorb the small values being added to it.
Here's my final test code:
#include <stdio.h>
#include <time.h>

void p1 ()
{
    double fn_value = 0; // if this is a float, the answer is slightly wrong
    int n = 10, i, j;
    unsigned int k;
    clock_t start, end;

    start = clock();
    __asm fldz;
    for (k = 0; k < 9765625; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = i; j < n; j++)
                __asm {
                    fiadd j
                }
        }
    }
    __asm fstp fn_value;
    end = clock();
    printf("p1: Time taken %lf - %lf\n", (double) (end - start) / CLOCKS_PER_SEC, (double) fn_value);
}

void p2 ()
{
    double fn_value = 0;
    int n = 10, i, j;
    unsigned int k;
    clock_t start, end;

    start = clock();
    for (k = 0; k < 9765625; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = i; j < n; j++)
                fn_value += j;
        }
    }
    end = clock();
    printf("p2: Time taken %lf - %lf\n", (double) (end - start) / CLOCKS_PER_SEC, (double) fn_value);
}

void p3 ()
{
    float fn_value = 0;
    int n = 10, i, j;
    unsigned int k;
    clock_t start, end;

    start = clock();
    for (k = 0; k < 9765625; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = i; j < n; j++)
                fn_value += j;
        }
    }
    end = clock();
    printf("p3: Time taken %lf - %lf\n", (double) (end - start) / CLOCKS_PER_SEC, (double) fn_value);
}

int main(int argc, char **argv)
{
    p1 ();
    p2 ();
    p3 ();
    return 0;
}
In summary, double appears to be faster than float here. However, we'd need to see the contents of that inner loop to judge whether changing the floating point type would provide any speed-up in your specific case.
UPDATE
The reason the float version is slower than the others is that the float version is constantly writing and reading the value to/from memory; the double and hand-written versions never write the value to RAM. Why does it do this? The main reason I can think of is to reduce fn_value to float precision between operations. Internally, the FPU is 80-bit, whereas a float is 32-bit (in this implementation of C). To keep the value within the range of a float, the compiler converts from 80-bit to 32-bit by writing and reading the value to/from RAM because, as far as I know, there is no FPU instruction that does this to a single FPU register. So, in order to keep the maths '32-bit' (of type float), it introduces a huge overhead. The compiler ignores the difference between the 80-bit FPU and the 64-bit double type and assumes the programmer wants the bigger type as much as possible.