I just fired up Reflector and peered into MulticastDelegate.CombineImpl, and saw some very lengthy code. Every time two delegates are combined (read: every time you attach more than one event handler to an event), the following code gets run, which seems extremely inefficient for such a fundamental, performance-critical feature. Does anyone know why it's been written this way?

[SecuritySafeCritical]
protected sealed override Delegate CombineImpl(Delegate follow)
{
    object[] objArray;
    int num2;
    if (follow == null)
    {
        return this;
    }
    if (!Delegate.InternalEqualTypes(this, follow))
    {
        throw new ArgumentException(Environment.GetResourceString("Arg_DlgtTypeMis"));
    }
    MulticastDelegate o = (MulticastDelegate) follow;
    int num = 1;
    object[] objArray2 = o._invocationList as object[];
    if (objArray2 != null)
    {
        num = (int) o._invocationCount;
    }
    object[] objArray3 = this._invocationList as object[];
    if (objArray3 == null)
    {
        num2 = 1 + num;
        objArray = new object[num2];
        objArray[0] = this;
        if (objArray2 == null)
        {
            objArray[1] = o;
        }
        else
        {
            for (int i = 0; i < num; i++)
            {
                objArray[1 + i] = objArray2[i];
            }
        }
        return this.NewMulticastDelegate(objArray, num2);
    }
    int index = (int) this._invocationCount;
    num2 = index + num;
    objArray = null;
    if (num2 <= objArray3.Length)
    {
        objArray = objArray3;
        if (objArray2 == null)
        {
            if (!this.TrySetSlot(objArray, index, o))
            {
                objArray = null;
            }
        }
        else
        {
            for (int j = 0; j < num; j++)
            {
                if (!this.TrySetSlot(objArray, index + j, objArray2[j]))
                {
                    objArray = null;
                    break;
                }
            }
        }
    }
    if (objArray == null)
    {
        int length = objArray3.Length;
        while (length < num2)
        {
            length *= 2;
        }
        objArray = new object[length];
        for (int k = 0; k < index; k++)
        {
            objArray[k] = objArray3[k];
        }
        if (objArray2 == null)
        {
            objArray[index] = o;
        }
        else
        {
            for (int m = 0; m < num; m++)
            {
                objArray[index + m] = objArray2[m];
            }
        }
    }
    return this.NewMulticastDelegate(objArray, num2, true);
}

Wouldn't a simple linked-list pattern have sufficed?
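To be concrete, by "a simple linked-list pattern" I mean something naive like the following - purely my own illustration of the idea, nothing to do with the actual BCL source, with ordering and thread safety ignored for brevity:

using System;

sealed class ActionNode
{
    public readonly Action Target;   // the handler held by this node
    public readonly ActionNode Next; // the rest of the chain (null at the end)

    public ActionNode(Action target, ActionNode next)
    {
        Target = target;
        Next = next;
    }

    // Combining is O(1): allocate one node and share the existing chain.
    public static ActionNode Combine(ActionNode head, Action follow)
    {
        return follow == null ? head : new ActionNode(follow, head);
    }

    // Invoking walks the chain, one pointer dereference per handler.
    public static void InvokeAll(ActionNode head)
    {
        for (ActionNode node = head; node != null; node = node.Next)
        {
            node.Target();
        }
    }
}

Combining would then be one small allocation per handler, at the cost of a pointer chase per handler at invocation time.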


EDIT: My original question presupposed that the implementation is inefficient. With admirable tenacity, Hans and Jon (both well-respected contributors here on SO) pointed out several facts about the suitability of the above implementation.

Hans points out that the use of arrays with TrySetSlot ends up being cache-friendly on multi-core CPUs (thus improving performance), and Jon kindly put together a microbenchmark which reveals that the above implementation yields very acceptable performance characteristics.

+1  A: 

which seems extremely inefficient for such a fundamental, performance-critical feature

Just how often do you think delegates are attached to events (or combined at other times)?

For example, in a Windows Forms app this is likely to happen pretty rarely - basically when setting up a form... at which point there are far heavier things going on than what's in MulticastDelegate.CombineImpl.

What does happen very often is that delegates are invoked... for example, for every item in every projection or predicate (etc) in a LINQ query. That's the really performance-critical bit, IMO.
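For instance, something like this (an illustrative snippet of my own) invokes the predicate delegate once per element and the projection delegate once per passing element:

using System;
using System.Linq;

class InvocationExample
{
    static void Main()
    {
        int[] numbers = Enumerable.Range(0, 1000).ToArray();

        // Where invokes the predicate delegate 1000 times; Select invokes the
        // projection delegate 500 times - invocation dwarfs combination here.
        int[] squaresOfEvens = numbers.Where(n => n % 2 == 0)
                                      .Select(n => n * n)
                                      .ToArray();

        Console.WriteLine(squaresOfEvens.Length); // 500
    }
}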

I'm also not convinced that this code is as inefficient as you think it is. It takes the same approach as ArrayList: create a larger array than is immediately needed and fill it as required. Would a linked list be more efficient? Possibly in some respects - but it would also be less efficient in terms of locality and levels of indirection. (Each node would need to be a new object which itself contained a reference to the delegate, so navigating the list could end up bringing more pages into memory than an array of references would.)
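In sketch form, the growth step in the CombineImpl code above boils down to the familiar amortized doubling - my paraphrase of it, not the real source:

using System;

static class GrowthSketch
{
    // Paraphrase of the growth step in CombineImpl - not the actual framework code.
    // Double the capacity until the combined invocation list fits, so repeated
    // combines cost amortized O(1) per appended delegate, much like List<T>.
    internal static object[] Grow(object[] current, int required)
    {
        int length = current.Length;
        while (length < required)
        {
            length *= 2;
        }

        object[] larger = new object[length];
        Array.Copy(current, larger, current.Length);
        return larger;
    }
}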

EDIT: Just as a quick microbenchmark (with all the usual caveats), here's some code to perform a given number of iterations of combining a given number of delegates:

using System;
using System.Diagnostics;

class Test
{
    const int Iterations = 10000000;
    const int Combinations = 3;

    static void Main()
    {
        // Make sure all paths are JITted
        Stopwatch sw = Stopwatch.StartNew();
        sw.Stop();
        Action tmp = null;
        for (int j = 0; j < Combinations; j++)
        {
            tmp += Foo;
        }

        sw = Stopwatch.StartNew();
        for (int i = 0; i < Iterations; i++)
        {
            Action action = null;
            for (int j = 0; j < Combinations; j++)
            {
                action += Foo;
            }
        }
        sw.Stop();
        Console.WriteLine(sw.ElapsedMilliseconds);
    }

    static void Foo()
    {
    }
}

Some results on my machine, all with 10,000,000 iterations:

5 delegates: about 5.8 seconds
4 delegates: about 4.3 seconds
3 delegates: about 3.2 seconds
2 delegates: about 1.4 seconds
1 delegate: about 160ms

(All tests run multiple times; the above are just samples which seemed reasonably representative. I haven't taken the average or anything.)

Given the above results, I suspect that any paths, even in combination-heavy WPF, which only attach a single delegate will be blazingly fast. They'll slow down significantly going from one delegate to two, then degrade gradually after that (with a much smaller proportional difference than the 1-to-2 case).

Jon Skeet
Great answer, Jon. When looking at technologies like WPF data binding, I'm unconvinced that event attach/detach is something which happens *rarely*. If you profile a grid, you'll see WPF data binding goes ballistic with attaching to every object down every property path... a common scenario where the attach/detach is more performance-critical than the actual invoke.
Mark
@Mark: Your comment wasn't exactly clear to me in terms of what you mean by "every property path" - could you give more details? For example, in a 20x30 grid, how many attach/detach occurrences will occur - and will they be one-off, or happening continuously over program execution? Have you timed how long it takes to execute that many combine/remove operations? (I'm just going to time some now, and will edit my answer with the results.)
Jon Skeet
+1  A: 

It's the exact opposite: this code was optimized by not using List<> or a similar collection object. Everything that List does is inlined here. The added advantage is that the locking is cheap (TrySetSlot uses Interlocked.CompareExchange), saving the cost of dragging around a locking object. Explicitly inlining code like this rather than leaving it up to the JIT compiler isn't that common in the .NET Framework, but exceptions are made for low-level primitives like this.
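The idea behind TrySetSlot is roughly this - a reconstruction of the pattern from memory, not the verbatim source, and it leaves out the check for a slot that already holds an equal delegate:

using System.Threading;

static class SlotSketch
{
    // Reconstruction of the TrySetSlot idea - not the verbatim framework source.
    // Claim an empty slot in the shared invocation array with a single CAS;
    // if another thread won the race, the caller falls back to copying the array.
    internal static bool TrySetSlot(object[] invocationList, int index, object o)
    {
        return invocationList[index] == null &&
               Interlocked.CompareExchange(ref invocationList[index], o, null) == null;
    }
}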

Hans Passant
Good feedback - thanks. Do you know why the internals are handled using arrays instead of linked lists? Given how frequently technologies like WPF data grids attach and detach events, I'd have thought that a linked list would be more performance-friendly, no?
Mark
@Mark - Linked lists perform *very* poorly on modern CPU cores due to their poor cache locality.
Hans Passant
@Hans: +10000 (if I could) for that comment... I had myopically overlooked the fact that CPU cache and locality play a part in performance. Please strike my short-sightedness from the record :)
Mark
@Hans: Jon's little benchmark shows an interesting jump as more delegates are added to an event... combining 5 delegates takes 36 times longer than attaching just 1. Would linked lists have performed worse than this?
Mark
@Mark - no, it is roughly linear. The case of *one* target is treated specially. Check the _methodPtr and _target fields in the Delegate class, the base class for MulticastDelegate. They store the single target; _invocationList isn't used at all. Yet another optimization.
Hans Passant