views:

70

answers:

1

I am profiling a C# application, and it looks like two threads, each calling Dictionary<>.ContainsKey() 5000 times on separate but identical dictionaries (with only two items), are twice as slow as one thread calling Dictionary<>.ContainsKey() 10000 times on a single dictionary.

I am measuring the "thread time" using a tool called JetBrains dotTrace. I am explicitly using copies of the same data, and I am not using any synchronization primitives. Is it possible that .NET is doing some synchronization behind the scenes?

I have a dual-core machine, and there are three threads running: one is blocked in Semaphore.WaitAll() while the work is done on two new threads whose priority is set to ThreadPriority.Highest.

Obvious culprits, like not actually running the code in parallel or not using a release build, have been ruled out.

EDIT:

People want the code. Alright then:

    private int ReduceArrayIteration(VM vm, HeronValue[] input, int begin, int cnt)
    {
        if (cnt <= 1)
            return cnt;

        int cur = begin;

        for (int i=0; i < cnt - 1; i += 2)
        {
            // The next two calls are effectively dominated by a call 
            // to dictionary ContainsKey
            vm.SetVar(a, input[begin + i]);
            vm.SetVar(b, input[begin + i + 1]);
            input[cur++] = vm.Eval(expr);
        }

        if (cnt % 2 == 1)
        {
            input[cur++] = input[begin + cnt - 1];
        }

        int r = cur - begin;
        Debug.Assert(r >= 1);
        Debug.Assert(r < cnt);
        return r;
    }
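
    // Note: a, b, and expr above are presumably fields of the enclosing class.
    // Presumed driver for the method above (an assumption, not shown in the
    // original): reduce the range pairwise until a single element, the final
    // reduction result, is left at input[begin].
    private HeronValue ReduceArray(VM vm, HeronValue[] input, int begin, int cnt)
    {
        while (cnt > 1)
            cnt = ReduceArrayIteration(vm, input, begin, cnt);
        return input[begin];
    }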

    // From VM
    public void SetVar(string s, HeronValue o)
    {
        Debug.Assert(o != null);
        frames.Peek().SetVar(s, o);
    }

    // From Frame
    public bool SetVar(string s, HeronValue o)
    {
        for (int i = scopes.Count; i > 0; --i)
        {
            // Scope is a derived class of Dictionary
            Scope tbl = scopes[i - 1];
            if (tbl.HasName(s))
            {
                tbl[s] = o;
                return true;
            }
        }
        return false;
    }
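
For context, a minimal sketch of what Scope presumably looks like, assuming HasName is a thin wrapper over ContainsKey (the call the profiler is flagging):

    // Presumed shape of Scope: a Dictionary keyed by variable name,
    // where HasName simply forwards to ContainsKey (the hot spot).
    public class Scope : Dictionary<string, HeronValue>
    {
        public bool HasName(string s)
        {
            return ContainsKey(s);
        }
    }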

Now here is the thread-spawning code, which might be naive:

    public static class WorkSplitter
    {
        static WaitHandle[] signals;

        public static void ThreadStarter(Object o)
        {
            Task task = o as Task;
            task.Run();
        }

        public static void SplitWork(List<Task> tasks)
        {
            signals = new WaitHandle[tasks.Count];
            for (int i = 0; i < tasks.Count; ++i)
                signals[i] = tasks[i].done;
            for (int i = 0; i < tasks.Count; ++i)
            {
                Thread t = new Thread(ThreadStarter);
                t.Priority = ThreadPriority.Highest;
                t.Start(tasks[i]);
            }
            Semaphore.WaitAll(signals);
        }
    }
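
Note that Task here is presumably a user-defined class, not a framework type; a minimal sketch of the shape WorkSplitter assumes (a done wait handle that Run() signals when the work finishes) might be:

    // Presumed shape of the Task class used above (an assumption): Run()
    // executes the work and signals 'done' so WaitAll can unblock the
    // spawning thread.
    public class Task
    {
        public readonly EventWaitHandle done =
            new EventWaitHandle(false, EventResetMode.ManualReset);

        private readonly Action work;

        public Task(Action work) { this.work = work; }

        public void Run()
        {
            try { work(); }
            finally { done.Set(); }
        }
    }
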
+3  A: 

Even if there were any locking in Dictionary (there isn't), it could not affect your measurements, since each thread is using a separate one. Running this test 10,000 times is not enough to get reliable timing data; ContainsKey() only takes about 20 nanoseconds. You'll need at least several million iterations to avoid scheduling artifacts.
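
For a sense of scale, a minimal Stopwatch sketch along these lines (the key and the iteration count are placeholders) gives a much more stable per-call figure:

    // Rough sketch: time ContainsKey alone over millions of iterations
    // so that thread start-up and scheduling noise average out.
    var dict = new Dictionary<string, int> { { "a", 1 }, { "b", 2 } };
    const int iterations = 10000000;

    var sw = System.Diagnostics.Stopwatch.StartNew();
    bool found = false;
    for (int i = 0; i < iterations; ++i)
        found |= dict.ContainsKey("a");   // keep the result observable
    sw.Stop();

    Console.WriteLine("{0} ns per call (found = {1})",
        sw.Elapsed.TotalMilliseconds * 1000000.0 / iterations, found);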

Hans Passant
The problem is that I need this to be fast on 10,000 iterations. Getting timings on more iterations would be optimizing for a case that I don't care about. I am just having a devil of a time trying to fully utilize the second core for operations that I know should, in theory, be parallelizable.
cdiggins
I don't see how you could expect parallel operation when the thread only executes for 200 microseconds. You'd have to be lucky to get both threads started at exactly the same time.
Hans Passant
The threads run much longer than that; the whole test takes about 100 msec of wall-clock time without instrumentation. My profiler says that the second-hottest hot spot is FindEntry, which is inside Dictionary. It reports 296 msec spent there in the one-thread test and 577 msec in the two-thread test. This is consistent across dozens of trials, and it blows my mind.
cdiggins
In case there is any confusion: when I say "msec" I mean milliseconds, not microseconds.
cdiggins
I doubt you'll get a good answer until you provide a code sample that's recognizable and runnable.
Hans Passant
Have you tried running the same test over many more iterations on the same threads? For example, starting up the single thread and running your 10k queries 100 times, and then starting up the two threads and running their 5k queries 100 times. This would give you a better idea of the throughput post-initialization. Also, is this actually a multi-core machine? If it's not, then I would fully expect this result.
spurserh
Well, it turns out you are right. I had convinced myself it wouldn't change anything, but running the tight code in an external loop 10 times showed only a 50% increase in overall time. Thanks for everyone's help!
cdiggins
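
For reference, the change that resolved this amounts to repeating the whole parallel measurement in an outer loop, roughly like the following sketch (BuildTasks() is a hypothetical helper that recreates the two tasks; 10 is the repeat count mentioned above):

    // Sketch of the outer-loop measurement: run the same parallel workload
    // repeatedly so thread start-up costs are amortized and the timing
    // reflects steady-state throughput rather than spawn overhead.
    var sw = System.Diagnostics.Stopwatch.StartNew();
    for (int rep = 0; rep < 10; ++rep)
    {
        List<Task> tasks = BuildTasks();   // hypothetical: rebuilds the two tasks
        WorkSplitter.SplitWork(tasks);
    }
    sw.Stop();
    Console.WriteLine("10 repetitions: {0} ms total", sw.ElapsedMilliseconds);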