I just fired up Reflector and peered into MultiCastDelegate.CombineImpl and saw some very lengthy code... every time two delegates are combined (read: every time you attached more than one event handler to an event), the following code gets run, which seems inefficient for such a fundamental performance critical feature. Does anyone know why it's been written this way?
[SecuritySafeCritical]
protected sealed override Delegate CombineImpl(Delegate follow)
{
object[] objArray;
int num2;
if (follow == null)
{
return this;
}
if (!Delegate.InternalEqualTypes(this, follow))
{
throw new ArgumentException(Environment.GetResourceString("Arg_DlgtTypeMis"));
}
MulticastDelegate o = (MulticastDelegate) follow;
int num = 1;
object[] objArray2 = o._invocationList as object[];
if (objArray2 != null)
{
num = (int) o._invocationCount;
}
object[] objArray3 = this._invocationList as object[];
if (objArray3 == null)
{
num2 = 1 + num;
objArray = new object[num2];
objArray[0] = this;
if (objArray2 == null)
{
objArray[1] = o;
}
else
{
for (int i = 0; i < num; i++)
{
objArray[1 + i] = objArray2[i];
}
}
return this.NewMulticastDelegate(objArray, num2);
}
int index = (int) this._invocationCount;
num2 = index + num;
objArray = null;
if (num2 <= objArray3.Length)
{
objArray = objArray3;
if (objArray2 == null)
{
if (!this.TrySetSlot(objArray, index, o))
{
objArray = null;
}
}
else
{
for (int j = 0; j < num; j++)
{
if (!this.TrySetSlot(objArray, index + j, objArray2[j]))
{
objArray = null;
break;
}
}
}
}
if (objArray == null)
{
int length = objArray3.Length;
while (length < num2)
{
length *= 2;
}
objArray = new object[length];
for (int k = 0; k < index; k++)
{
objArray[k] = objArray3[k];
}
if (objArray2 == null)
{
objArray[index] = o;
}
else
{
for (int m = 0; m < num; m++)
{
objArray[index + m] = objArray2[m];
}
}
}
return this.NewMulticastDelegate(objArray, num2, true);
}
Wouldn't a simple linked-list pattern have sufficed?
EDIT: My original question presupposes that the implementation is inefficient. With admirable tenacity, Hans and Jon (both well respected contributors here on SO), pointed out several facts about the suitability of the above implementation.
Hans points out that the use of arrays with TrySetSlot ends up being cache-friendly on multi-core CPUs (thus improving performance), and Jon kindly put together a microbenchmark which reveals that the above implementation yields very acceptable performance characteristics.