tags:
views: 316
answers: 4

I am chaining together 15 async operations through ports and receivers. This has left me very concerned with the inter-thread messaging time, specifically the time between one task posting data to a port and a new task beginning to process that same data on a different thread. Assuming the best-case situation where each thread is idle at the start, I have written a test that uses the Stopwatch class to measure the time across two different dispatchers, each operating at highest priority with a single thread.

What I found surprised me. My development rig is a Q6600 quad-core 2.4 GHz machine running Windows 7 x64, and the average context-switch time from my test was 5.66 microseconds, with a standard deviation of 5.738 microseconds and a maximum of nearly 1.58 milliseconds (a factor of 282!). The Stopwatch resolution is 427.7 nanoseconds, so I am still well above the timer's noise floor.

What I would like to do is reduce the inter-thread messaging time as much as possible and, equally important, reduce its standard deviation. I realize Windows is not a real-time OS and there are no guarantees, but the Windows scheduler is a fair, round-robin, priority-based scheduler, and the two threads in this test are both at the highest priority (the only threads that should be that high), so there should not be any context switches away from these threads (evidenced by the 1.58 ms largest time... I believe the Windows quantum is 15.65 ms?). The only thing I can think of is variation in the timing of the OS calls to the locking mechanisms the CCR uses to pass messages between threads.

Please let me know if anyone else out there has measured inter-thread messaging time and has any suggestions on how to improve it.

Here is the source code from my tests:

using System;
using System.Collections.Generic;
using System.IO;
using System.Threading;
using Microsoft.Ccr.Core;

using System.Diagnostics;

namespace Test.CCR.TestConsole
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Starting Timer");
            var sw = new Stopwatch();
            sw.Start();

            var dispatcher = new Dispatcher(1, ThreadPriority.Highest, true, "My Thread Pool");
            var dispQueue = new DispatcherQueue("Disp Queue", dispatcher);

            var sDispatcher = new Dispatcher(1, ThreadPriority.Highest, true, "Second Dispatcher");
            var sDispQueue = new DispatcherQueue("Second Queue", sDispatcher);

            var legAPort = new Port<EmptyValue>();
            var legBPort = new Port<TimeSpan>();

            var distances = new List<double>();

            long totalTicks = 0;

            // Busy-wait ~5 seconds so the threads and timer settle before measuring.
            while (sw.Elapsed.TotalMilliseconds < 5000) ;

            int runCnt = 100000; // total round trips to measure
            int offset = 1000;   // warm-up samples excluded from the average

            // Leg A: stamp the current time and forward it to leg B's port.
            Arbiter.Activate(dispQueue, Arbiter.Receive(true, legAPort, i =>
            {
                TimeSpan sTime = sw.Elapsed;
                legBPort.Post(sTime);
            }));

            // Leg B: compute the one-way latency, record it, and loop back to leg A.
            Arbiter.Activate(sDispQueue, Arbiter.Receive(true, legBPort, i =>
            {
                TimeSpan eTime = sw.Elapsed;
                TimeSpan dt = eTime.Subtract(i);
                //if (distances.Count == 0 || Math.Abs(distances[distances.Count - 1] - dt.TotalMilliseconds) / distances[distances.Count - 1] > 0.1)
                distances.Add(dt.TotalMilliseconds);

                // Skip the first 'offset' samples so warm-up doesn't skew the average.
                if (distances.Count > offset)
                    Interlocked.Add(ref totalTicks, dt.Ticks);
                if (distances.Count < runCnt)
                    legAPort.Post(EmptyValue.SharedInstance);
            }));


            //Thread.Sleep(100);
            legAPort.Post(EmptyValue.SharedInstance);

            Thread.Sleep(500);

            while (distances.Count < runCnt)
                Thread.Sleep(25);

            TimeSpan exTime = TimeSpan.FromTicks(totalTicks);
            double exMS = exTime.TotalMilliseconds / (runCnt - offset);

            Console.WriteLine("Exchange Time: {0} ms  Stopwatch Frequency: {1} ticks/sec", exMS, Stopwatch.Frequency);

            using(var stw = new StreamWriter("test.csv"))
            {
                for(int ix=0; ix < distances.Count; ix++)
                {
                    stw.WriteLine("{0},{1}", ix, distances[ix]);
                }
                stw.Flush();
            }

            Console.ReadKey();
        }
    }
}
A: 

ThreadPriority.Highest doesn't mean nothing else can run at a higher priority. The Win32 API has a more granular set of thread priorities, with several levels above Highest (IIRC, Highest is typically the highest priority non-admin code can run at; admins can schedule higher, as can any hardware drivers / kernel-mode code), so there is no guarantee your threads will not be pre-empted.
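For illustration, here is a minimal sketch (my own, not from the question) of reaching one of those Win32 levels above Highest via P/Invoke to SetThreadPriority; the constant values come from the Windows headers:

```csharp
using System;
using System.Runtime.InteropServices;

class PrioritySketch
{
    [DllImport("kernel32.dll", SetLastError = true)]
    static extern IntPtr GetCurrentThread();

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool SetThreadPriority(IntPtr hThread, int nPriority);

    // THREAD_PRIORITY_HIGHEST is 2; TIME_CRITICAL (15) sits above it.
    const int THREAD_PRIORITY_TIME_CRITICAL = 15;

    static void Main()
    {
        // Raise the current thread beyond what ThreadPriority.Highest maps to.
        if (!SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL))
            Console.WriteLine("SetThreadPriority failed: " + Marshal.GetLastWin32Error());
    }
}
```

Even at TIME_CRITICAL, kernel-mode code can still pre-empt you, so this narrows the gap rather than closing it.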

Even if a thread is running at the highest priority, Windows can boost other threads above their base priority if they are holding resource locks that higher-priority threads require, which is another reason you may be suffering context switches.

Even then, as you say, Windows isn't a real time OS and it isn't guaranteed to honour thread priorities anyway.

PlausiblyDamp
A: 

To attack this problem a different way: do you need so many decoupled asynchronous operations? It may be useful to think about vertically partitioning the work (asynchronously processing numCores chunks of data end to end) instead of horizontally partitioning it (as now, with each chunk of data processed in 15 decoupled stages), or about synchronously coupling some of your 15 stages to reduce the total to a smaller number.

The overhead of inter-thread communication will always be non-trivial. If some of your 15 operations are doing only a small chunk of work, the context switches will bite you.
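Vertical partitioning can be sketched like this (a sketch of mine, not from the answer; Stage1/Stage2 are hypothetical stand-ins for the real operations):

```csharp
using System;
using System.Threading;

class VerticalPartitionSketch
{
    // Hypothetical stand-ins for the real pipeline stages.
    static int Stage1(int x) { return x + 1; }
    static int Stage2(int x) { return x * 2; }

    static void Main()
    {
        int[] data = new int[100000];
        int numCores = Environment.ProcessorCount;
        Thread[] workers = new Thread[numCores];
        int chunk = data.Length / numCores;

        for (int t = 0; t < numCores; t++)
        {
            int start = t * chunk;
            int end = (t == numCores - 1) ? data.Length : start + chunk;
            workers[t] = new Thread(delegate()
            {
                // Each worker runs every stage end to end on its own slice:
                // no cross-thread hop between stages.
                for (int i = start; i < end; i++)
                    data[i] = Stage2(Stage1(data[i]));
            });
            workers[t].Start();
        }

        foreach (Thread w in workers) w.Join();
    }
}
```

Here the only synchronization is one Join per core, instead of 14 port hops per item.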

Alex
+1  A: 

Windows is not a real-time OS, but you knew that already. What is killing you is the context-switch time, not necessarily the message time. You didn't really specify HOW your inter-process communication works. If you're really just running multiple threads, you'll find some gains by not using Windows messages as a communication protocol; instead, try rolling your own IPC using application-hosted message queues.
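An application-hosted message queue can be as small as this (an illustrative sketch, not from the answer; a production version would also need shutdown handling):

```csharp
using System.Collections.Generic;
using System.Threading;

// A minimal in-process message queue: one lock, one condition variable.
class MessageQueue<T>
{
    private readonly Queue<T> items = new Queue<T>();
    private readonly object gate = new object();

    public void Post(T item)
    {
        lock (gate)
        {
            items.Enqueue(item);
            Monitor.Pulse(gate); // wake one waiting consumer
        }
    }

    public T Take()
    {
        lock (gate)
        {
            while (items.Count == 0)
                Monitor.Wait(gate); // releases the lock while blocked
            return items.Dequeue();
        }
    }
}
```

The consumer thread loops on Take(); the only kernel involvement is the Monitor wait/pulse when the queue runs dry.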

The best average you can hope for is about 1 ms on any version of Windows when context switches occur. You're probably seeing the ~1 ms times when your application has to yield to the kernel. This is by design for user-mode (Ring 3) applications. If it's absolutely critical that you get below 1 ms, you'll need to move some of your application into Ring 0, which means writing a device driver.

Device drivers don't suffer the same context-switch costs that user applications do, and they have access to high-resolution timers and sleep calls as well. If you do need to go this route, the DDK (Driver Development Kit) is freely available from Microsoft, but I would HIGHLY recommend investing in a third-party development kit. They usually have really good samples and lots of wizards to set things up correctly, which would otherwise take you months of reading DDK documentation to discover. You'll also want something like SoftICE, because the normal Visual Studio debugger isn't going to help you debug device drivers.

keick
+2  A: 

Do the 15 asynchronous operations have to be asynchronous? i.e. are you forced to operate this way by a limitation of some library, or do you have the choice to make synchronous calls?

If you have the choice, you should structure your application so that the use of asynchronicity is controlled by configuration parameters. The difference between asynchronous operations that return on a different thread and synchronous operations that return on the same thread should be transparent in the code; that way you can tune it without changing the code structure.
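One way to make that choice transparent, sketched with a plain delegate (my illustration; in the real app the async branch would be a CCR port.Post with a receiver, not the thread pool):

```csharp
using System;
using System.Threading;

class StageRunner
{
    // Hypothetical switch; in practice this would be read from configuration.
    public static bool UseAsync = false;

    // Hand a value to the next stage either on the current thread
    // or on a different one, without the caller knowing which.
    public static void RunNext(Action<int> nextStage, int value)
    {
        if (UseAsync)
            ThreadPool.QueueUserWorkItem(delegate(object state) { nextStage(value); });
        else
            nextStage(value); // synchronous: no context switch at this boundary
    }
}
```

Flipping UseAsync then lets you measure exactly what each extra thread boundary costs.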

The phrase "embarrassingly parallel" describes an algorithm in which the majority of the work being done is obviously independent and so can be done in any order, making it easy to parallelise.

But you are "chaining together 15 async operations through ports and receivers". This could be described as "embarrassingly sequential". In other words, the same program could logically be written on a single thread. But then you'd lose any parallelism for the CPU-bound work occurring between the async operations (assuming there is any of significance).

If you write a simple test to cut out any significant CPU-bound work and just measure the context switching time, then guess what, you're going to be measuring the variation in the context switching time, as you've discovered.

The only reason for running multiple threads is because you have significant amounts of work for CPUs to do, and so you'd like to share it out between several CPUs. If the individual chunks of computation are short-lived enough, then context switching will be a significant overhead on any OS. By breaking your computation down into 15 stages, each very short, you are essentially asking the OS to do a lot of unnecessary context switching.

Daniel Earwicker