views:

234

answers:

3

I'm writing different implementations of immutable binary trees in C#, and I wanted my trees to inherit some common methods from a base class.

Unfortunately, classes which derive from the base class are abysmally slow. Non-derived classes perform adequately. Here are two nearly identical implementations of an AVL tree to demonstrate:

The two trees have the exact same code, but I've moved the DerivedAvlTree.Insert method in base class. Here's a test app:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Juliet.Collections.Immutable;

namespace ConsoleApplication1
{
    class Program
    {
        const int VALUE_COUNT = 5000;

        static void Main(string[] args)
        {
            var avlTreeTimes = TimeIt(TestAvlTree);
            var derivedAvlTreeTimes = TimeIt(TestDerivedAvlTree);

            Console.WriteLine("avlTreeTimes: {0}, derivedAvlTreeTimes: {1}", avlTreeTimes, derivedAvlTreeTimes);
        }

        static double TimeIt(Func<int, int> f)
        {
            var seeds = new int[] { 314159265, 271828183, 231406926, 141421356, 161803399, 266514414, 15485867, 122949829, 198491329, 42 };

            var times = new List<double>();

            foreach (int seed in seeds)
            {
                var sw = Stopwatch.StartNew();
                f(seed);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }

            // throwing away top and bottom results
            times.Sort();
            times.RemoveAt(0);
            times.RemoveAt(times.Count - 1);
            return times.Average();
        }

        static int TestAvlTree(int seed)
        {
            var rnd = new System.Random(seed);

            var avlTree = AvlTree<double>.Create((x, y) => x.CompareTo(y));
            for (int i = 0; i < VALUE_COUNT; i++)
            {
                avlTree = avlTree.Insert(rnd.NextDouble());
            }

            return avlTree.Count;
        }

        static int TestDerivedAvlTree(int seed)
        {
            var rnd = new System.Random(seed);

            var avlTree2 = DerivedAvlTree<double>.Create((x, y) => x.CompareTo(y));
            for (int i = 0; i < VALUE_COUNT; i++)
            {
                avlTree2 = avlTree2.Insert(rnd.NextDouble());
            }

            return avlTree2.Count;
        }
    }
}
  • AvlTree: inserts 5000 items in 121 ms
  • DerivedAvlTree: inserts 5000 items in 2182 ms

My profiler indicates that the program spends an inordinate amount of time in BaseBinaryTree.Insert. Anyone whose interested can see the EQATEC log file I've created with the code above (you'll need EQATEC profiler to make sense of file).

I really want to use a common base class for all of my binary trees, but I can't do that if performance will suffer.

What causes my DerivedAvlTree to perform so badly, and what can I do to fix it?

A: 

It's not anything to do with the derived class calling the original implementation and then also calling Balance, is it?

I think you'll probably need to look at the generated machine code to see what's different. All I can see from the source code is that you've changed a lot of static methods into virtual methods called polymorphically... in the first case the JIT knows exactly what method will be called and can do a direct call instruction, possibly even inline. But with a polymorphic call it has no choice but to do a v-table lookup and indirect call. That lookup represents a significant fraction of the work being done.

Life might get a little better if you call ((TreeType)this).Method() instead of this.Method(), but likely you can't remove the polymorphic call unless you also declare the overriding methods as sealed. And even then, you might pay the penalty of a runtime check on the this instance.

Putting your reusable code into generic static methods in the base class might help somewhat as well, but I think you're still going to be paying for polymorphic calls. C# generics just don't optimize as well as C++ templates.

Ben Voigt
"It's not anything to do with the derived class calling the original implementation and then also calling Balance, is it?" - according to my profiler, bother AvlTree and DerivedAvlTree make the same number of calls to Balance and spend about the same amount of time in the Balance method, so that's probably negligible. When I look at the output of both classes in Reflector, they are *very* similar, but subtly different at the same time: http://pastebin.com/8YpNmbW2 . Its hard to see anything in that code which screams "its slow over here!"
Juliet
Reflector doesn't show you the machine code, but the Visual Studio debugger will (you might have to turn on native debugging, then right-click and choose "Go To Assembly" when stopped at a breakpoint inside the Insert method). Most optimization in C# is done by the JIT during the MSIL -> machine code conversion.
Ben Voigt
+15  A: 

Note - there's now a "clean" solution here, so skip to the final edit if you only want a version that runs fast and don't care about all of the detective work.

It doesn't seem to be the difference between direct and virtual calls that's causing the slowdown. It's something to do with those delegates; I can't quite explain specifically what it is, but a look at the generated IL is showing a lot of cached delegates which I think might not be getting used in the base class version. But the IL itself doesn't seem to be significantly different between the two versions, which leads me to believe that the jitter itself is partly responsible.

Take a look at this refactoring, which cuts the running time by about 60%:

public virtual TreeType Insert(T value)
{
    Func<TreeType, T, TreeType, TreeType> nodeFunc = (l, x, r) =>
    {
        int compare = this.Comparer(value, x);
        if (compare < 0) { return CreateNode(l.Insert(value), x, r); }
        else if (compare > 0) { return CreateNode(l, x, r.Insert(value)); }
        return Self();
    };
    return Insert<TreeType>(value, nodeFunc);
}

private TreeType Insert<U>(T value, 
    Func<TreeType, T, TreeType, TreeType> nodeFunc)
{
    return this.Match<TreeType>(
        () => CreateNode(Self(), value, Self()),
        nodeFunc);
}

This should (and apparently does) ensure that the insertion delegate is only being created once per insert - it's not getting created on each recursion. On my machine it cuts the running time from 350 ms to 120 ms (by contrast, the single-class version runs in about 30 ms, so this is still nowhere near where it should be).

But here's where it gets even weirder - after trying the above refactoring, I figured, hmm, maybe it's still slow because I only did half the work. So I tried materializing the first delegate as well:

public virtual TreeType Insert(T value)
{
    Func<TreeType> nilFunc = () => CreateNode(Self(), value, Self());
    Func<TreeType, T, TreeType, TreeType> nodeFunc = (l, x, r) =>
    {
        int compare = this.Comparer(value, x);
        if (compare < 0) { return CreateNode(l.Insert(value), x, r); }
        else if (compare > 0) { return CreateNode(l, x, r.Insert(value)); }
        return Self();
    };
    return Insert<TreeType>(value, nilFunc, nodeFunc);
}

private TreeType Insert<U>(T value, Func<TreeType> nilFunc,
    Func<TreeType, T, TreeType, TreeType> nodeFunc)
{
    return this.Match<TreeType>(nilFunc, nodeFunc);
}

And guess what... this made it slower again! With this version, on my machine, it took a little over 250 ms on this run.

This defies all logical explanations that might relate the issue to the compiled bytecode, which is why I suspect that the jitter is in on this conspiracy. I think the first "optimization" above might be (WARNING - speculation ahead) allowing that insertion delegate to be inlined - it's a known fact that the jitter can't inline virtual calls - but there's still something else that's not being inlined and that's where I'm presently stumped.

My next step would be to selectively disable inlining on certain methods via the MethodImplAttribute and see what effect that has on the runtime - that would help to prove or disprove this theory.

I know this isn't a complete answer but hopefully it at least gives you something to work with, and maybe some further experimentation with this decomposition can produce results that are close in performance to the original version.


Edit: Hah, right after I submitted this I stumbled on another optimization. If you add this method to the base class:

private TreeType CreateNilNode(T value)
{
    return CreateNode(Self(), value, Self());
}

Now the running time drops to 38 ms here, just barely above the original version. This blows my mind, because nothing actually references this method! The private Insert<U> method is still identical to the very first code block in my answer. I was going to change the first argument to reference the CreateNilNode method, but I didn't have to. Either the jitter is seeing that the anonymous delegate is the same as the CreateNilNode method and sharing the body (probably inlining again), or... or, I don't know. This is the first instance I've ever witnessed where adding a private method and never calling it can speed up a program by a factor of 4.

You'll have to check this to make sure I haven't accidentally introduced any logic errors - pretty sure I haven't, the code is almost the same - but if it all checks out, then here you are, this runs almost as fast as the non-derived AvlTree.


FURTHER UPDATE

I was able to come up with a version of the base/derived combination that actually runs slightly faster than the single-class version. Took some coaxing, but it works!

What we need to do is create a dedicated inserter that can create all of the delegates just once, without needing to do any variable capturing. Instead, all of the state is stored in member fields. Put this inside the BaseBinaryTree class:

protected class Inserter
{
    private TreeType tree;
    private Func<TreeType> nilFunc;
    private Func<TreeType, T, TreeType, TreeType> nodeFunc;
    private T value;

    public Inserter(T value)
    {
        this.nilFunc = () => CreateNode();
        this.nodeFunc = (l, x, r) => PerformMatch(l, x, r);
        this.value = value;
    }

    public TreeType Insert(TreeType parent)
    {
        this.tree = parent;
        return tree.Match<TreeType>(nilFunc, nodeFunc);
    }

    private TreeType CreateNode()
    {
        return tree.CreateNode(tree, value, tree);
    }

    private TreeType PerformMatch(TreeType l, T x, TreeType r)
    {
        int compare = tree.Comparer(value, x);
        if (compare < 0) { return tree.CreateNode(l.Insert(value, this), x, r); }
        else if (compare > 0) { return tree.CreateNode(l, x, r.Insert(value, this)); }
        return tree;
    }
}

Yes, yes, I know, it's very un-functional using that mutable internal tree state, but remember that this isn't the tree itself, it's just a throwaway "runnable" instance. Nobody ever said that perf-opt was pretty! This is the only way to avoid creating a new Inserter for each recursive call, which would otherwise slow this down on account of all the new allocations of the Inserter and its internal delegates.

Now replace the insertion methods of the base class to this:

public TreeType Insert(T value)
{
    return Insert(value, null);
}

protected virtual TreeType Insert(T value, Inserter inserter)
{
    if (inserter == null)
    {
        inserter = new Inserter(value);
    }
    return inserter.Insert(Self());
}

I've made the public Insert method non-virtual; all of the real work is delegated to a protected method that takes (or creates its own) Inserter instance. Altering the derived class is simple enough, just replace the overridden Insert method with this:

protected override DerivedAvlTree<T> Insert(T value, Inserter inserter)
{
    return base.Insert(value, inserter).Balance();
}

That's it. Now run this. It will take almost the exact same amount of time as the AvlTree, usually a few milliseconds less in a release build.

The slowdown is clearly due to some specific combination of virtual methods, anonymous methods and variable capturing that's somehow preventing the jitter from making an important optimization. I'm not so sure that it's inlining anymore, it might just be caching the delegates, but I think the only people who could really elaborate are the jitter folks themselves.

Aaronaught
By the way, you might notice that the `U` type parameter in the private `Insert` method doesn't seem to be doing anything... just *try* taking it out and watch the whole thing run dog slow again. Why does adding a redundant generic type parameter to a private method make it run faster? Your guess is as good as mine!
Aaronaught
That's pretty nuts. :p
Tanzelax
For what its worth, this is a pretty fun puzzle in generics, inheritance, and functional programming -- but I just don't like having to work around Microsoft's voodoo optimizers, and I especially don't like how the workaround isn't transparent to derived classes. I think the only solution here is to ditch the delegates and pattern matching constructs -- C# was just never intended for that style of programming. But thank you anyway, I really appreciate it! :)
Juliet
@Juliet: Is there a reason why you haven't awarded Aaronaught the "answer" check yet? It seems like he went above and beyond the call of duty here. :)
Pretzel
@Juliet: It's definitely straining the limits of a primarily object-oriented language, especially considering there's already an OO solution to the problem (visitor). But then again, these are language-agnostic issues you deal with when you have jitted code that's CPU-bound. You don't have to totally ditch the functional model, just do what I did a while back and hack up an `AnonymousVisitor`, which works much the same as our `Inserter` here but takes its delegates in the constructor.
Aaronaught
It's also completely possible to write the code above without leaking the `Inserter` to the derived tree, just a little trickier to ensure thread-safety.
Aaronaught
+answer: your sleuthing is much appreciated, Aaronaught :)
Juliet
A: 

You're running under the VS IDE, right? It's taking about 20 times longer, right?

Wrap a loop around it to iterate it 10 times, so the long version takes 20 seconds. Then while it is running, hit the "pause" button, and look at the call stack. You will see exactly what the problem is with 95% certainty. If you don't believe what you see, try it a few more times. Why does it work? Here's the long explanation, and here's the short one.

Mike Dunlavey