views: 393
answers: 4

I'm currently optimizing an application; one of the operations that is done very often is reading and writing binary data. I need two types of functions:

void Set(byte[] target, int index, int value);

int Get(byte[] source, int index);

These functions are needed for signed and unsigned short, int and long, in both big- and little-endian byte order.

Here are some examples I've made, but I need an evaluation of their advantages and disadvantages:

The first method uses Marshal to write the value into the memory of the byte[], the second uses plain pointers to accomplish this, and the third uses BitConverter and Buffer.BlockCopy:

unsafe void Set(byte[] target, int index, int value)
{
    fixed (byte* p = &target[0])
    {
        Marshal.WriteInt32(new IntPtr(p), index, value);
    }
}

unsafe void Set(byte[] target, int index, int value)
{
    int* p = &value;
    for (int i = 0; i < 4; i++)
    {
        target[index + i] = *((byte*)p + i);
    }
}

void Set(byte[] target, int index, int value)
{
    byte[] data = BitConverter.GetBytes(value);
    Buffer.BlockCopy(data, 0, target, index, data.Length);
}

And here are the Read/Get methods:

The first uses Marshal to read the value from the byte[], the second uses plain pointers, and the third uses BitConverter again:

unsafe int Get(byte[] source, int index)
{
    fixed (byte* p = &source[0])
    {
        return Marshal.ReadInt32(new IntPtr(p), index);
    }
}

unsafe int Get(byte[] source, int index)
{
    fixed (byte* p = &source[0])
    {
        return *(int*)(p + index);
    }
}

unsafe int Get(byte[] source, int index)
{
    return BitConverter.ToInt32(source, index);
}

Boundary checking needs to be done, but that isn't part of my question yet...

I would be pleased if someone could tell me what the best and fastest way would be in this case, or give me some other solutions to work on. A generic solution would be preferable.


I just did some performance testing; here are the results:

Set Marshal: 45 ms, Set Pointer: 48 ms, Set BitConverter: 71 ms
Get Marshal: 45 ms, Get Pointer: 26 ms, Get BitConverter: 30 ms

It seems that using pointers is the fastest way, but I think Marshal and BitConverter do some internal checking... can someone verify this?

+1  A: 

You should do some profiling on your code to reveal whether this is the bottleneck. Also, looking at your code, it appears that you are using .NET function calls to write a single value to a byte array, involving a pin on the memory and a call to unsafe code...

You might be much better off declaring a .NET System.IO.MemoryStream and seeking and writing to it, wherever possible using a stream writer to push your changes in, which should involve fewer function calls and won't require unsafe code. You'll find the pointer stuff much more useful in C# if you are doing things like DSP, where you need to perform a single operation on every value in an array, etc.
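
For illustration, a minimal sketch of that idea, assuming little-endian data (BinaryWriter and BinaryReader always use little-endian byte order); the class and buffer size are just for the example:

using System;
using System.IO;

static class StreamSketch
{
    static void Main()
    {
        // Wrap an existing buffer; the stream is writable but fixed-size.
        byte[] buffer = new byte[1024];
        var stream = new MemoryStream(buffer);
        var writer = new BinaryWriter(stream);
        var reader = new BinaryReader(stream);

        stream.Position = 100;      // seek to the target offset
        writer.Write(512);          // writes an Int32 (little-endian)

        stream.Position = 100;
        Console.WriteLine(reader.ReadInt32());   // 512
    }
}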

EDIT: Let me also mention that, depending on what you are doing, you might find that CPU caching comes into effect; if you can keep working on a single small area of memory that fits into the cache, you will end up with the best performance.

Spence
The problem is that it can be a bottleneck, because the application communicates with a load of different network devices and runs on low-cost machines; some of these devices use heavy protocols, others don't. Do you know a good way to profile the interfaces? The problem will be the variable delay of the network...
haze4real
+5  A: 

Important: if you only need the one endian, see the pointer magic by wj32 / dtb


Personally, I would be writing directly to a Stream (perhaps with some buffering), and re-using a shared buffer that I can generally assume is clean. Then you can make some shortcuts and assume index 0/1/2/3.

Certainly don't use BitConverter, as that can't be used for both little- and big-endian, which you require. I would also be inclined to just use bit-shifting rather than unsafe etc. It is actually the fastest, based on the following (so I'm glad that this is how I already do it in my code here; look for EncodeInt32Fixed):

Set1: 371ms
Set2: 171ms
Set3: 993ms
Set4: 91ms <==== bit-shifting ;-p

Code:

using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
static class Program
{
    static void Main()
    {
        const int LOOP = 10000000, INDEX = 100, VALUE = 512;
        byte[] buffer = new byte[1024];
        Stopwatch watch;

        watch = Stopwatch.StartNew();
        for (int i = 0; i < LOOP; i++)
        {
            Set1(buffer, INDEX, VALUE);
        }
        watch.Stop();
        Console.WriteLine("Set1: " + watch.ElapsedMilliseconds + "ms");

        watch = Stopwatch.StartNew();
        for (int i = 0; i < LOOP; i++)
        {
            Set2(buffer, INDEX, VALUE);
        }
        watch.Stop();
        Console.WriteLine("Set2: " + watch.ElapsedMilliseconds + "ms");

        watch = Stopwatch.StartNew();
        for (int i = 0; i < LOOP; i++)
        {
            Set3(buffer, INDEX, VALUE);
        }
        watch.Stop();
        Console.WriteLine("Set3: " + watch.ElapsedMilliseconds + "ms");

        watch = Stopwatch.StartNew();
        for (int i = 0; i < LOOP; i++)
        {
            Set4(buffer, INDEX, VALUE);
        }
        watch.Stop();
        Console.WriteLine("Set4: " + watch.ElapsedMilliseconds + "ms");

        Console.WriteLine("done");
        Console.ReadLine();
    }
    unsafe static void Set1(byte[] target, int index, int value)
    {
        fixed (byte* p = &target[0])
        {
            Marshal.WriteInt32(new IntPtr(p), index, value);
        }
    }

    unsafe static void Set2(byte[] target, int index, int value)
    {
        int* p = &value;
        for (int i = 0; i < 4; i++)
        {
            target[index + i] = *((byte*)p + i);
        }
    }

    static void Set3(byte[] target, int index, int value)
    {
        byte[] data = BitConverter.GetBytes(value);
        Buffer.BlockCopy(data, 0, target, index, data.Length);
    }
    static void Set4(byte[] target, int index, int value)
    {
        target[index++] = (byte)value;
        target[index++] = (byte)(value >> 8);
        target[index++] = (byte)(value >> 16);
        target[index] = (byte)(value >> 24);
    }
}
Marc Gravell
I don't think Stream would be a good solution; the problem is that seeks may be needed and the data is not always read and written sequentially. Another problem would be endianness...
haze4real
Fine - don't use the stream; but see `Set4` and the performance evidence...
Marc Gravell
I need to verify this first and may accept this answer as the solution. What about getting/reading?
haze4real
Same, but in reverse; `return ((int)buffer[index++]) | (((int)buffer[index++]) << 8) | (((int)buffer[index++]) << 16) | (((int)buffer[index]) << 24);` (or switch the shifts top-to-bottom to get other-endian). Note that we cast to int early, as int arithmetic is faster than byte arithmetic.
Marc Gravell
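
Written out in full, the reverse from that comment might look like the following (the big-endian variant is added here as an assumption, mirroring the swapped shifts mentioned above):

// Little-endian read, mirroring Set4; the bytes are promoted to int
// before the shifts are applied.
static int GetInt32LittleEndian(byte[] buffer, int index)
{
    return buffer[index]
        | (buffer[index + 1] << 8)
        | (buffer[index + 2] << 16)
        | (buffer[index + 3] << 24);
}

// Big-endian variant: the same shifts applied top-to-bottom in reverse.
static int GetInt32BigEndian(byte[] buffer, int index)
{
    return (buffer[index] << 24)
        | (buffer[index + 1] << 16)
        | (buffer[index + 2] << 8)
        | buffer[index + 3];
}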
I think the shifting solution is the best at this time; the big advantage would be the ease of endian swapping.
haze4real
+2  A: 

Pointers are the way to go. Pinning objects with the fixed keyword is extremely cheap, and you avoid the overhead of calling functions like WriteInt32 and BlockCopy. For a "generic solution" you can simply use void* and your own memcpy (since you're dealing with small amounts of data). However, pointers do not work with true generics.
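
As a rough illustration of the void*/memcpy idea (the method names are illustrative, not from the answer; native byte order only):

// Copies count bytes from an arbitrary value (via void*) into the target
// array; a tiny hand-rolled memcpy for small, fixed-size values.
unsafe static void CopyBytes(void* source, byte[] target, int index, int count)
{
    byte* src = (byte*)source;
    fixed (byte* dst = &target[index])
    {
        for (int i = 0; i < count; i++)
        {
            dst[i] = src[i];
        }
    }
}

// Example: write a ulong in native (little-endian on x86) byte order.
unsafe static void SetUInt64(byte[] target, int index, ulong value)
{
    CopyBytes(&value, target, index, sizeof(ulong));
}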

wj32
Then you clearly didn't write your code properly. Do you really think using bit-shifting (around 8 instructions for each Int32 at *least*) is faster than using a simple mov instruction? And I'm talking about pinning the buffer outside of the loop.
wj32
Can you give me an example of this in C#? You're talking about the second Set method, aren't you? I just need the plain value types: Int16, UInt16, Int32, UInt32, Int64, UInt64.
haze4real
`fixed (byte* b = array) { for (...) { *(int*)(b + offset) = value; } }`. Exactly what you have in your Get method. Look at the Disassembly window if you don't believe that this is in fact the fastest method.
wj32
Ah, right; I had the wrong end of the stick. But the requirement is for both-endian. So you'd have to have a fallback for the other, and some mechanism of abstracting the two. I'd keep it simple and write the shifts. It also raises the question "Do I bother checking/supporting big-endian hardware" etc.
Marc Gravell
It shouldn't need a `for` loop, though (from comment)... unless I missed something? Just get the b + offset, swap to `int*` and assign?
Marc Gravell
Sadly C# doesn't offer a way to utilise the bswap instruction for big/little-endian compatibility, so in that case your solution would be the fastest. As for the for loop, it was meant to indicate that the pinning would be done outside of any loops.
wj32
Cool; I've added an update, but (as you note) endianness is still the problem here. I'm out of votes already today, but a virtual +1
Marc Gravell
Yes, big endian is needed a lot of the time; most of the devices we communicate with use big-endian order.
haze4real
+4  A: 

Using Marc Gravell's Set1 to Set4 and the Set5 below, I get the following numbers on my machine:

Set1: 197ms
Set2: 102ms
Set3: 604ms
Set4: 68ms
Set5: 55ms <==== pointer magic ;-p

Code:

unsafe static void Set5(byte[] target, int index, int value)
{
    fixed (byte* p = &target[index])
    {
        *((int*)p) = value;                
    }
}

Of course, it gets much faster when the byte array isn't pinned on each iteration but only once:

Set6: 10ms (little endian)
Set7: 85ms (big endian)

Code:

if (!BitConverter.IsLittleEndian)
{
    throw new NotSupportedException();
}

watch = Stopwatch.StartNew();
fixed (byte* p = buffer)
{
    for (int i = 0; i < LOOP; i++)
    {
        *((int*)(p + INDEX)) = VALUE;
    }
}
watch.Stop();
Console.WriteLine("Set6: " + watch.ElapsedMilliseconds + "ms");

watch = Stopwatch.StartNew();
fixed (byte* p = buffer)
{
    for (int i = 0; i < LOOP; i++)
    {
        *((int*)(p + INDEX)) = System.Net.IPAddress.HostToNetworkOrder(VALUE);
    }
}
watch.Stop();
Console.WriteLine("Set7: " + watch.ElapsedMilliseconds + "ms");
dtb
The problem here would be the overhead generated by endian swapping.
haze4real
Cool; I've added an update, but endianness is still a problem here. I'm out of votes already today, but a virtual +1
Marc Gravell
@haze4real: the overhead produced by endian swapping is actually not that huge. Example updated.
dtb
:/ The overhead is huge; it's about 8 times slower in your test. You shouldn't pin the byte array outside of the loop in your test, because it's the function that needs to be tested... I can't pin it between these function calls.
haze4real
True, but Set7 is still only a few milliseconds slower than Set4. My recommendation would be Marc's bit-shifting solution.
dtb
Yes, but only because you pinned the array outside the loop, which just simulates a number of function calls and isn't part of the solution, because the function itself gets called at different places in the code.
haze4real
Right, you can only pin the array outside the loop if you want to modify the same array multiple times in a row. If this isn't the case in your application, you can't do this. But if you can, the performance gain is obvious.
dtb
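
For completeness, if the same buffer really can stay pinned across many calls (for example, for the lifetime of a connection), something like the following sketch would avoid re-pinning on every Set/Get. The GCHandle-based wrapper is purely illustrative and not from the answers above; it does no bounds checking and uses native byte order only.

using System;
using System.Runtime.InteropServices;

// Pins the buffer once and keeps a raw pointer; must be disposed to
// release the pin.
unsafe sealed class PinnedBuffer : IDisposable
{
    private GCHandle handle;
    private byte* ptr;

    public PinnedBuffer(byte[] buffer)
    {
        handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
        ptr = (byte*)handle.AddrOfPinnedObject();
    }

    public void SetInt32(int index, int value)
    {
        *(int*)(ptr + index) = value;
    }

    public int GetInt32(int index)
    {
        return *(int*)(ptr + index);
    }

    public void Dispose()
    {
        handle.Free();
    }
}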