views: 705 · answers: 5
Hi guys,

I am looking for the fastest way to de/interleave a buffer. To be more specific, I am dealing with audio data, so I am trying to optimize the time I spend on splitting/combining channels and FFT buffers.

Currently I am using a for loop with two index variables, one per array, so only addition operations, but all the managed array bounds checks mean it will never compare to a C pointer approach.

I like the Buffer.BlockCopy and Array.Copy methods, which cut a lot of time when I process channels, but there is no way to give an array a custom indexer.
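For reference, this is the kind of Buffer.BlockCopy call I mean. Note that the offsets and count are in bytes, and since it only moves contiguous bytes it can copy a whole channel quickly but cannot de-interleave on its own:

```csharp
using System;

class BlockCopyExample
{
    static void Main()
    {
        // Copy one already-split channel into a scratch FFT buffer in a single call.
        // Buffer.BlockCopy counts in bytes, so multiply by sizeof(float).
        float[] channel = { 0.1f, 0.2f, 0.3f, 0.4f };
        float[] fftBuffer = new float[channel.Length];
        Buffer.BlockCopy(channel, 0, fftBuffer, 0, channel.Length * sizeof(float));

        if (fftBuffer[2] != 0.3f)
            throw new Exception("copy failed");
    }
}
```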

I tried to build an array mask, essentially a fake array with a custom indexer, but that proved to be twice as slow when used in my FFT operation. I guess there are a lot of optimization tricks the compiler can pull when accessing an array directly, but access through a class indexer cannot be optimized.

I do not want an unsafe solution, although from the looks of it, that might be the only way to optimize this type of operation.

Thanks.

Here is the type of thing I'm doing right now:

private float[][] DeInterleave(float[] buffer, int channels)
{
    float[][] tempbuf = new float[channels][];
    int length = buffer.Length / channels;
    for (int c = 0; c < channels; c++)
    {
        tempbuf[c] = new float[length];
        for (int i = 0, offset = c; i < tempbuf[c].Length; i++, offset += channels)
            tempbuf[c][i] = buffer[offset];
    }
    return tempbuf;
}
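For completeness, a sketch of the combining direction (the mirror image of DeInterleave, with the strided access on the writes instead of the reads):

```csharp
// Hypothetical counterpart to DeInterleave: recombine per-channel
// buffers back into a single interleaved buffer.
static private float[] Interleave(float[][] channels)
{
    int channelCount = channels.Length;
    int length = channels[0].Length;
    float[] buffer = new float[channelCount * length];
    for (int c = 0; c < channelCount; c++)
        for (int i = 0, offset = c; i < length; i++, offset += channelCount)
            buffer[offset] = channels[c][i];
    return buffer;
}
```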
+1  A: 

As there's no built-in function to do that, plain array indexing is the fastest approach you could think of. Indexers and similar abstractions only make things worse by introducing method calls and preventing the JIT optimizer from eliminating bounds checks.

Anyway, I think your current method is the fastest non-unsafe solution you could be using. If performance really matters to you (and it usually does in signal processing applications), you can do the whole thing in unsafe C# (which is fast enough, probably comparable to C) and wrap it in a method that you'd call from your safe code.
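The wrapping could look like this sketch (my naming, not a finished implementation; compiling it requires the /unsafe compiler switch). The safe entry point owns all allocation, and the unsafe core only copies:

```csharp
// Safe public API; callers never see pointers.
public static float[][] DeInterleaveSafe(float[] buffer, int channels)
{
    float[][] result = new float[channels][];
    int length = buffer.Length / channels;
    for (int c = 0; c < channels; c++)
        result[c] = new float[length];
    DeInterleaveCore(buffer, result, channels, length);
    return result;
}

// The unsafe core, hidden behind the safe entry point above.
private static unsafe void DeInterleaveCore(float[] src, float[][] dst,
                                            int channels, int length)
{
    fixed (float* pSrc = src)
    {
        for (int c = 0; c < channels; c++)
        {
            float* s = pSrc + c;       // start at this channel's first sample
            fixed (float* pDst = dst[c])
            {
                float* d = pDst;
                for (int i = 0; i < length; i++, s += channels)
                    *d++ = *s;         // strided read, sequential write
            }
        }
    }
}
```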

Mehrdad Afshari
A: 

I think a lot of readers would question why you don't want an unsafe solution for something like audio processing. It's the type of thing that begs for hot-blooded optimization, and I would personally be unhappy knowing it's being forced through a VM.

There's no VM involved in safe code.
Henk Holterman
It's JIT-compiled. The problem is not interpretation but array bounds checks.
Mehrdad Afshari
I believe in safe code, and that it can match unsafe code through system-dependent optimization. The moment you go unsafe, you are optimizing for a specific system, and that destroys the whole point of using C#. If I wanted unsafe code, I would have used C++, but I want portability and speed at the same time. Basically, I'm trying to prove that things like signal processing can work just as fast in a managed language.
MaXKilleR
MaXKilleR: Unsafe can still be JIT-optimized. It's not native code by any means. It has most of the benefits of safe code.
Mehrdad Afshari
Unverifiable code cannot be optimized in the same way. There are many ways to write code with pointer arithmetic, and meaningful optimization is only achieved by developer modifications. C/C++ code is not machine code and thus is not native code (until you compile it), at least in my point of view. Besides, I want my code to run on any system regardless of security constraints.
MaXKilleR
Henk: The Common Language Runtime is a Virtual Machine.
+1  A: 

It's not going to get you a major performance boost (I measured roughly 20% on my machine), but you could consider some loop unrolling for common cases. If most of the time you have a relatively limited number of channels:

static private float[][] Alternative(float[] buffer, int channels)
{
    float[][] result = new float[channels][];
    int length = buffer.Length / channels;
    for (int c = 0; c < channels; c++)
        result[c] = new float[length];

    int i = 0;
    if (channels == 8)
    {
        for (int ix = 0; ix < length; ix++)
        {
            result[0][ix] = buffer[i++];
            result[1][ix] = buffer[i++];
            result[2][ix] = buffer[i++];
            result[3][ix] = buffer[i++];
            result[4][ix] = buffer[i++];
            result[5][ix] = buffer[i++];
            result[6][ix] = buffer[i++];
            result[7][ix] = buffer[i++];
        }
    }
    else
        for (int ix = 0; ix < length; ix++)
            for (int ch = 0; ch < channels; ch++)
                result[ch][ix] = buffer[i++];


    return result;
}

As long as you leave the general fallback variant in place it'll handle any number of channels, but you'll get a speed boost when the channel count matches one of the unrolled variants.

jerryjvl
Along these lines you might be able to dynamically generate the unrolled version on the fly....
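One hypothetical way to do that is with compiled expression trees (.NET 4's System.Linq.Expressions API; the class and method names here are my own invention): build a delegate whose inner loop body is unrolled to exactly the requested channel count, compile it once, and cache it.

```csharp
using System;
using System.Collections.Generic;
using System.Linq.Expressions;

static class Deinterleavers
{
    static readonly Dictionary<int, Action<float[], float[][], int>> cache =
        new Dictionary<int, Action<float[], float[][], int>>();

    // Build (and cache) a delegate whose inner loop is unrolled to
    // exactly 'channels' assignments per iteration.
    public static Action<float[], float[][], int> For(int channels)
    {
        Action<float[], float[][], int> del;
        if (cache.TryGetValue(channels, out del)) return del;

        var src = Expression.Parameter(typeof(float[]), "src");
        var dst = Expression.Parameter(typeof(float[][]), "dst");
        var len = Expression.Parameter(typeof(int), "length");
        var i = Expression.Variable(typeof(int), "i");    // block variables start at 0
        var ix = Expression.Variable(typeof(int), "ix");
        var done = Expression.Label("done");

        var step = new List<Expression>();
        for (int c = 0; c < channels; c++)
            step.Add(Expression.Assign(                   // dst[c][ix] = src[i++];
                Expression.ArrayAccess(
                    Expression.ArrayIndex(dst, Expression.Constant(c)), ix),
                Expression.ArrayIndex(src, Expression.PostIncrementAssign(i))));
        step.Add(Expression.PostIncrementAssign(ix));     // ix++;

        var loop = Expression.Loop(                       // while (ix < length) { ... }
            Expression.IfThenElse(
                Expression.LessThan(ix, len),
                Expression.Block(step),
                Expression.Break(done)),
            done);

        del = Expression.Lambda<Action<float[], float[][], int>>(
                  Expression.Block(new[] { i, ix }, loop), src, dst, len)
              .Compile();
        cache[channels] = del;
        return del;
    }
}
```

The first call per channel count pays the compilation cost, so this only pays off when the same delegate is reused across many buffers.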
Dolphin
I thought I would lose speed with that method, since my version accesses only one output array per iteration and the CLR will not need to reload sections of memory. As far as I know it loads sections of an array at a time, so accessing the next element in the next operation will be faster.
MaXKilleR
My quick test showed that for the 8 channels I use as an example and a fairly large buffer I gain about 20%. Not earth-shattering, but anything helps I guess? ... You may want to perform some realistic tests based on what you are actually doing though to make sure you get better performance for your scenario.
jerryjvl
+4  A: 

I ran some tests and here is the code I tested:

delegate(float[] inout)
{ // My Original Code
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    for (int c = 0; c < 2; c++)
    {
        tempbuf[c] = new float[length];
        for (int i = 0, offset = c; i < tempbuf[c].Length; i++, offset += 2)
            tempbuf[c][i] = inout[offset];
    }
}
delegate(float[] inout)
{ // jerryjvl's recommendation: loop unrolling
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    for (int c = 0; c < 2; c++)
        tempbuf[c] = new float[length];
    for (int ix = 0, i = 0; ix < length; ix++)
    {
        tempbuf[0][ix] = inout[i++];
        tempbuf[1][ix] = inout[i++];
    }

}
delegate(float[] inout)
{ // Unsafe Code
    unsafe
    {
        float[][] tempbuf = new float[2][];
        int length = inout.Length / 2;
        fixed (float* buffer = inout)
            for (int c = 0; c < 2; c++)
            {
                tempbuf[c] = new float[length];
                float* offset = buffer + c;
                fixed (float* buffer2 = tempbuf[c])
                {
                    float* p = buffer2;
                    for (int i = 0; i < length; i++, offset += 2)
                        *p++ = *offset;
                }
            }
    }
}
delegate(float[] inout)
{ // Modifying my original code to see if the compiler is not as smart as i think it is.
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    for (int c = 0; c < 2; c++)
    {
        float[] buf = tempbuf[c] = new float[length];
        for (int i = 0, offset = c; i < buf.Length; i++, offset += 2)
            buf[i] = inout[offset];
    }
}

and results: (buffer size = 2^17, number iterations timed per test = 200)

Average for test #1:      0.001286 seconds +/- 0.000026
Average for test #2:      0.001193 seconds +/- 0.000025
Average for test #3:      0.000686 seconds +/- 0.000009
Average for test #4:      0.000847 seconds +/- 0.000008

Average for test #1:      0.001210 seconds +/- 0.000012
Average for test #2:      0.001048 seconds +/- 0.000012
Average for test #3:      0.000690 seconds +/- 0.000009
Average for test #4:      0.000883 seconds +/- 0.000011

Average for test #1:      0.001209 seconds +/- 0.000015
Average for test #2:      0.001060 seconds +/- 0.000013
Average for test #3:      0.000695 seconds +/- 0.000010
Average for test #4:      0.000861 seconds +/- 0.000009

I got similar results on every run. Obviously the unsafe code is the fastest, but I was surprised to see that the CLR couldn't figure out that it can drop the index checks when dealing with a jagged array. Maybe someone can think of more ways to optimize my tests.

Edit: I tried loop unrolling with the unsafe code and it didn't have an effect. I also tried optimizing the loop unrolling method:

delegate(float[] inout)
{
    float[][] tempbuf = new float[2][];
    int length = inout.Length / 2;
    float[] tempbuf0 = tempbuf[0] = new float[length];
    float[] tempbuf1 = tempbuf[1] = new float[length];

    for (int ix = 0, i = 0; ix < length; ix++)
    {
        tempbuf0[ix] = inout[i++];
        tempbuf1[ix] = inout[i++];
    }
}

The results are also hit-and-miss compared to test #4, with about a 1% difference. Test #4 is my best option so far.

As I told jerryjvl, the problem is getting the CLR to skip the index check on the input buffer, since adding a second check (&& offset < inout.Length) only slows it down...

Edit 2: I ran the earlier tests inside the IDE, so here are the results outside it:

2^17 items, repeated 200 times
******************************************
Average for test #1:      0.000533 seconds +/- 0.000017
Average for test #2:      0.000527 seconds +/- 0.000016
Average for test #3:      0.000407 seconds +/- 0.000008
Average for test #4:      0.000374 seconds +/- 0.000008
Average for test #5:      0.000424 seconds +/- 0.000009

2^17 items, repeated 200 times
******************************************
Average for test #1:      0.000547 seconds +/- 0.000016
Average for test #2:      0.000732 seconds +/- 0.000020
Average for test #3:      0.000423 seconds +/- 0.000009
Average for test #4:      0.000360 seconds +/- 0.000008
Average for test #5:      0.000406 seconds +/- 0.000008


2^18 items, repeated 200 times
******************************************
Average for test #1:      0.001295 seconds +/- 0.000036
Average for test #2:      0.001283 seconds +/- 0.000020
Average for test #3:      0.001085 seconds +/- 0.000027
Average for test #4:      0.001035 seconds +/- 0.000025
Average for test #5:      0.001130 seconds +/- 0.000025

2^18 items, repeated 200 times
******************************************
Average for test #1:      0.001234 seconds +/- 0.000026
Average for test #2:      0.001319 seconds +/- 0.000023
Average for test #3:      0.001309 seconds +/- 0.000025
Average for test #4:      0.001191 seconds +/- 0.000026
Average for test #5:      0.001196 seconds +/- 0.000022

Test#1 = My Original Code
Test#2 = Optimized safe loop unrolling
Test#3 = Unsafe code - loop unrolling
Test#4 = Unsafe code
Test#5 = My Optimized Code

Looks like loop unrolling is not favorable. My optimized code is still my best option, at only a 10% difference compared to the unsafe code. If only I could tell the compiler that (i < buf.Length) implies (offset < inout.Length), it would drop the check on inout[offset] and I would basically get the unsafe performance.
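One stereo-only pattern I did not test above (a sketch; whether the JIT actually elides the checks with a stride of 2 depends on the runtime version) is to index the interleaved source against its own Length, which is the canonical shape the bounds-check eliminator recognizes:

```csharp
// Stereo-only sketch: the loop condition tests the interleaved source's
// own Length, the one bound check the safe versions above cannot hoist.
static float[][] DeInterleaveStereo(float[] inout)
{
    int length = inout.Length / 2;
    float[] left = new float[length];
    float[] right = new float[length];
    for (int i = 0; i < inout.Length; i += 2)
    {
        left[i >> 1] = inout[i];       // even samples -> left channel
        right[i >> 1] = inout[i + 1];  // odd samples  -> right channel
    }
    return new float[][] { left, right };
}
```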

MaXKilleR
I think at this stage the question goes back to 'fast enough' ;) ... if the perf now keeps up with your needs then I'd pick the cleanest, easiest to follow implementation, and maybe leave the most optimal version with it in a comment... or the other way around.
jerryjvl
Well, my original code was fast enough. I did not see any performance hits from decoding 3 MP3 files on the fly with different sample rates, resampling, mixing, and sending to winmm and OpenAL. However, since I started using bitwise operations instead of base-two math and replacing everything possible with Buffer.BlockCopy, I figured that knowing the best way to deal with this problem would ensure good performance on less powerful machines (e.g. Windows Mobile devices).
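Concretely, the bitwise-instead-of-base-two-math replacements I mean are these (valid for non-negative values and power-of-two divisors only):

```csharp
using System;

class BitTricks
{
    static void Main()
    {
        int samples = 44100;
        int half = samples >> 1;            // same as samples / 2
        int doubled = samples << 1;         // same as samples * 2
        int wrapped = samples & (1024 - 1); // same as samples % 1024 (power-of-two only)

        if (half != samples / 2 || doubled != samples * 2 || wrapped != samples % 1024)
            throw new Exception("mismatch");
    }
}
```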
MaXKilleR
+1  A: 

Maybe some unrolling in your own best answer:

delegate(float[] inout)
{
    unsafe
    {
        float[][] tempbuf = new float[2][];
        int length = inout.Length / 2;

        fixed (float* buffer = inout)
        {
            float* pbuffer = buffer;

            tempbuf[0] = new float[length];
            tempbuf[1] = new float[length];

            fixed (float* buffer0 = tempbuf[0])
            fixed (float* buffer1 = tempbuf[1])
            {
                float* pbuffer0 = buffer0;
                float* pbuffer1 = buffer1;

                for (int i = 0; i < length; i++)
                {
                    *pbuffer0++ = *pbuffer++;
                    *pbuffer1++ = *pbuffer++;
                }
            }
        }
    }
}

This might get a little more performance still.

jerryjvl
MaXKilleR
If you really need top performance, then you may need to check the IL produced from the best answer you have so far and see if there is anything redundant that can be trimmed out.
jerryjvl
PS: I assume you have been doing all measurements outside the IDE, right?
jerryjvl
I know what's redundant: the index check on the input buffer, since the JIT doesn't know that if (i < buf.Length) is true then (offset < inout.Length) must be true. That's the 20% difference in speed that the unsafe code has. I tested in the IDE. I didn't think that testing outside it would affect each test differently. I'll edit the test results.
MaXKilleR
The IDE does weird things to performance measurements... only reliable performance measurement is running a 'Release' app directly.
jerryjvl
I agree with jerryjvl. Compile as release for testing C# with pointers. Testing the speed in debug mode will give you atypical (slower) results.
exceptionerror