I've been doing some profiling lately and I've encountered one case that is driving me nuts. The following is a piece of unsafe C# code which basically copies a source sample buffer to a target buffer with a different sample rate. As it is now, it takes up ~17% of the total processing time per frame. What I don't get is that if I use floats instead of doubles, the processing time rises to ~38%. Could someone please explain what's going on here?

Fast version (~17%)

double rateIncr = ...
double readOffset = ...
double offsetIncr = ...

float v = ... // volume

// Source and target buffers.
float* src = ...
float* tgt = ...

for( var c = 0; c < chunkCount; ++c)
{
    for( var s = 0; s < chunkSampleSize; ++s )
    {
        // Source sample            
        var iReadOffset = (int)readOffset;

        // Interpolate factor
        var k = (float)readOffset - iReadOffset;

        // Linearly interpolate 2 contiguous samples and write result to target.
        *tgt++ += (src[ iReadOffset ] * (1f - k) + src[ iReadOffset + 1 ] * k) * v;

        // Increment source offset.
        readOffset += offsetIncr;
    }
    // Increment sample rate
    offsetIncr += rateIncr;
}

Slow version (~38%)

float rateIncr = ...
float readOffset = ...
float offsetIncr = ...

float v = ... // volume

// Source and target buffers.
float* src = ...
float* tgt = ...

for( var c = 0; c < chunkCount; ++c)
{
    for( var s = 0; s < chunkSampleSize; ++s )
    {
        var iReadOffset = (int)readOffset;

        // The cast to float is removed
        var k = readOffset - iReadOffset;

        *tgt++ += (src[ iReadOffset ] * (1f - k) + src[ iReadOffset + 1 ] * k) * v;
        readOffset += offsetIncr;
    }
    offsetIncr += rateIncr;
}

Odd version (~22%)

float rateIncr = ...
float readOffset = ...
float offsetIncr = ...

float v = ... // volume

// Source and target buffers.
float* src = ...
float* tgt = ...

for( var c = 0; c < chunkCount; ++c)
{
    for( var s = 0; s < chunkSampleSize; ++s )
    {
        var iReadOffset = (int)readOffset;
        var k = readOffset - iReadOffset;

        // By just placing this test it goes down from 38% to 22%,
        // and the condition is NEVER met.
        if( (k != 0) && Math.Abs( k ) < 1e-38 )
        {
           Console.WriteLine( "Denormalized float?" );
        }

        *tgt++ += (src[ iReadOffset ] * (1f - k) + src[ iReadOffset + 1 ] * k) * v;
        readOffset += offsetIncr;
    }
    offsetIncr += rateIncr;
}

All I know by now is that I know nothing

+3  A: 

Perhaps there's a series of double to float conversions happening somewhere that's taking up the CPU time. Can you look at the output with an IL disassembler and see what it's actually doing?

ConcernedOfTunbridgeWells
If that were the case, wouldn't using doubles be slower than using floats?
Trap
Depends which way around the conversion is happening and what the optimiser actually does to the code. There is also a (small) possibility that the JIT on the runtime system is cocking something up. This is why I suggested disassembling the IL to see what was actually going on.
ConcernedOfTunbridgeWells
+4  A: 

Are you running this on a 64-bit or 32-bit processor? My experience has been that in some edge cases there are optimisations the CPU can make with low-level code like this when the size of your data matches the size of the registers (even though you might assume that two floats would fit neatly into a 64-bit register, you may still lose the optimisation benefit). You may find the situation reversed if you run it on a 32-bit system...

A quick search and the best I can do for a cite on this is a couple of posts on C++ game-development forums (it was during my one year in game dev that I noticed this myself, but then that was the only time I was profiling at this level). This post has some interesting disassembly results from a C++ method that may be applicable at a very low level.


Another thought:

This article from MSDN goes into a lot of the internal specifics of using floats in .NET primarily to address the problematic issue of float comparison. There is one interesting paragraph from it which sums up the CLR spec for handling float values:

This spec clearly had in mind the x87 FPU. The spec is basically saying that a CLR implementation is allowed to use an internal representation (in our case, the x87 80 bit representation) as long as there is no explicit storage to a coerced location (a class or value type field), that forces narrowing. Also, at any point, the IL stream may have conv.r4 and conv.r8 instructions, which will force the narrowing to happen.

So your floats may not actually be floats while operations are being performed on them; instead they could be 80-bit numbers on an x87 FPU, or anything else the compiler thinks is an optimisation or is required for calculation accuracy. Without looking at the IL you won't know for sure, but there could be many costly conversions when you are working with floats that you don't hit when you are using doubles. It's a shame that you can't define the required precision for floating-point operations in C# as you can through the /fp switches in C++, since that would stop the compiler from putting everything into a larger container before operating on it.
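As a hypothetical sketch of what that narrowing means at the language level (the 80-bit x87 intermediates themselves are only visible in the JIT-compiled machine code): a float-typed result is forced down to 24-bit precision, while a double intermediate keeps the extra bits.

```csharp
using System;

class NarrowingSketch
{
    static void Main()
    {
        // 2^24: beyond this, consecutive integers are no longer all
        // exactly representable in a 32-bit float.
        float a = 16777216f;

        // A float-typed sum: 2^24 + 1 is not representable at 24-bit
        // precision, so the result rounds back down to 2^24.
        float narrowed = a + 1f;
        Console.WriteLine(narrowed == a);        // True

        // Widening to double first keeps the extra bit.
        double wide = (double)a + 1f;
        Console.WriteLine(wide == 16777217.0);   // True
    }
}
```

The point is only that the precision the runtime carries through an intermediate decides what survives a store, which is exactly the freedom the quoted spec gives the CLR.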

Martin Harris
I double-checked; it's a 32-bit system.
Trap
A: 

The double-to-float conversion is probably slowing it down here:

(float)readOffset

Try making readOffset a float too.

leppie
It's quite the opposite: when I use floats only I even save a couple of casts, and the performance gets much worse.
Trap
+2  A: 

It is possible that your calculations cause float values to enter the 'denormal' state, which is very inefficient on most x86 processors. Denormal values are nonzero values so small that they fall below the smallest normal float. The same values sit comfortably inside the normal double range, so in that case the calculations stay efficient.

I can't be sure whether this applies to you, but it would certainly explain the behavior you're seeing.

http://en.wikipedia.org/wiki/Denormal
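As a minimal sketch of the threshold test (IsDenormal is a hypothetical helper name): single-precision denormals are nonzero values whose magnitude is below the smallest normal float, about 1.18e-38.

```csharp
using System;

class DenormalCheck
{
    // Nonzero but smaller in magnitude than the smallest normal float
    // (~1.175e-38, i.e. 2^-126) means the value is denormal (subnormal).
    public static bool IsDenormal(float x)
    {
        return x != 0f && Math.Abs(x) < 1.175494351e-38f;
    }

    static void Main()
    {
        Console.WriteLine(IsDenormal(1e-37f));        // False: still a normal float
        Console.WriteLine(IsDenormal(1e-39f));        // True: below the normal range
        Console.WriteLine(IsDenormal(float.Epsilon)); // True: the smallest denormal
    }
}
```

On newer .NET runtimes (Core 3.0+), float.IsSubnormal performs the same test directly.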

Frederik Slijkerman
If it were denormalized floats, wouldn't the sound also be screwed?
Trap
No, when interpreted as audio samples denormalized floats are practically equal to silence. The only problem (but a big one) with denormals is that they are so slow.
Frederik Slijkerman
Why not test for denormals in your inner loop? Just check if abs(x) < 1e-38 to see if x is a denormal (for single precision).
Frederik Slijkerman
Trap
Can't explain that... But you should remove (k>0), abs() will take care of that and also detect negative denormals.
Frederik Slijkerman
... on second thought, you should replace (k>0) by (k!=0), otherwise you'd detect regular zeroes.
Frederik Slijkerman
Oh, and check both src[iReadOffset] and the result of the calculation as well. I noticed you've used 'var' as the type -- I'm not familiar with C#, but you could try replacing that with 'float' explicitly.
Frederik Slijkerman
OK, looked at it again, and the only denormal possibility I see is if either src[iReadOffset]*(1f-k) or src[iReadOffset]*k becomes denormal. So check that instead. :-)
Frederik Slijkerman
I checked all possibilities and I didn't find any denormalized floats.
Trap
+1  A: 

One way to understand what is happening is to break into the debugger at this point in the code and look at the actual x86 instructions being executed. Without knowing how your C# translates into machine code, much of what might be suggested as the cause is just guesswork. Even looking at the IL probably won't tell you very much.

If you do this, you may want to start the program first and then attach the debugger later so that the JIT optimizations aren't disabled. You want to make sure you're looking at the code you're actually going to run, after all.

Curt Hagenlocher
+1  A: 

Considering that the bulk of your code doesn't deal with the three variables you switched between doubles and floats, and that you're talking about rather large changes in performance, I'd say the small changes in types and tests are enough to change your cache footprint and/or register usage.

I did some quick tests on my 32-bit machine here:

// NOTE: runnable - copy and paste into your own project
using System;
using System.Diagnostics;

class Program
{
    static int endVal = 32768;
    static int runCount = 100;

    static void Main(string[] args)
    {
        Stopwatch doublesw = Stopwatch.StartNew();
        for (int i = 0; i < runCount; ++i)
            doubleTest();
        doublesw.Stop();
        Console.WriteLine("Double: " + doublesw.ElapsedMilliseconds);

        Stopwatch floatsw = Stopwatch.StartNew();
        for (int i = 0; i < runCount; ++i)
            floatTest();
        floatsw.Stop();
        Console.WriteLine("Float: " + floatsw.ElapsedMilliseconds);
        Console.ReadLine();
    }

    // Repeatedly add a small double increment until the target is reached.
    static void doubleTest()
    {
        double value = 0;
        double incr = 0.001D;

        while (value < endVal)
        {
            value += incr;
        }
    }

    // Same loop in single precision.
    static void floatTest()
    {
        float value = 0;
        float incr = 0.001f;

        while (value < endVal)
        {
            value += incr;
        }
    }
}

and the results were:

Double: 12897
Float: 10059

Repeated tests showed float having a clear advantage over double. Now, this is a small program, and all those variables fit within the registers.

Unfortunately, there were enough missing parts in the code you supplied that I couldn't get a clean compile and a read of the assembly to see exactly what was going on, but judging from my (quick) testing, this is my answer.

(For me, the giveaway was your case #3 - adding code changes the footprint and your cache patterns - I've seen that kind of strangeness a couple of times in various languages.)

cyberconte
A: 

Just a short question about your profiling: all you're quoting are percentages. What about the absolute time the function needs?

If you use floats within your function and doubles in the surrounding code, extra time is spent converting between them, so the whole process takes longer. The inner function's own processing time stays constant, which means its percentage of the total drops.

Hope this makes sense. In short: if your whole process needs more total time, the percentage for a given function (whose own time hasn't changed) will go down.
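A toy calculation (with made-up numbers) of that effect: the function's own time stays fixed while the total changes, so its percentage moves on its own.

```csharp
using System;

class PercentageSketch
{
    // Percentage of the total taken by one part.
    public static double Share(double partMs, double totalMs)
    {
        return partMs / totalMs * 100.0;
    }

    static void Main()
    {
        // Hypothetical: the resampler itself always costs 2 ms.
        double functionMs = 2.0;

        // In a 10 ms frame it reads as 20%...
        Console.WriteLine(Share(functionMs, 10.0));  // 20

        // ...in a 20 ms frame, the same 2 ms reads as only 10%.
        Console.WriteLine(Share(functionMs, 20.0));  // 10
    }
}
```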

Oliver