I am building latency measurement into a communication middleware I am writing. The way I have it working is that I periodically send a probe message from my publishing apps. Subscribing apps receive this probe, cache it, and send an echo back at a time of their choosing, noting how much time the message was kept "on hold". The publishing app receives these echoes and calculates latency as (now() - time_sent - time_on_hold) / 2.
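For concreteness, the arithmetic on the publishing side is roughly this (names simplified for illustration; my actual timing code is in my answer below):

   // Illustration only -- simplified names, all values in seconds.
   double roundTrip = now() - echo.time_sent;                 // full round trip as seen by the publisher
   double oneWay    = (roundTrip - echo.time_on_hold) / 2.0;  // subtract the hold time, halve the rest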

This kind of works, but the numbers are vastly different (3x) when the "time on hold" is greater than 0. I.e., if I echo the message back immediately I get around 50us on my dev environment, but if I wait and then send it back, the time jumps to 150us (even though I discount whatever time the message was on hold). I use QueryPerformanceCounter for all measurements.

This is all inside a single Windows 7 box. What am I missing here?

TIA.

A: 

OK, I have edited my answer to reflect your follow-up: sorry for the delay, but I didn't notice that you had elaborated on the question by posting an answer.

It seems that, functionally, you are doing nothing wrong.

I think that when you distribute your application outside of localhost conditions, the additional 100us (if it is indeed roughly constant) will pale into insignificance compared to the average latency of a functioning network.

For the purposes of answering your question, I am thinking that there is a thread/interrupt scheduling issue on the server side that needs to be investigated, as you do not seem to be doing anything on the client that is not accounted for.

Try the following test scenario:

  • Send two probes, one to Client A and one to Client B (all on localhost).
  • Send the probe to 'Client B' one second (or X/2 seconds) after you send the probe to Client A.
  • Ensure that 'Client A' holds its echo for two seconds (or X seconds) and 'Client B' holds its echo for one second (or X/2 seconds).

The idea is that, hopefully, both clients will send back their probe answers at roughly the same time, and both after a sleep/wait (the action that exposes the problem). The objective is to get one of the clients' responses to 'wake up' the publisher, to see if the next client's answer is then processed immediately.
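Something like the following, where sendProbeWithHold() is a hypothetical helper standing in for your publish call (it asks the client to hold its echo for the given number of milliseconds):

   // Hypothetical sketch of the staggered test; error handling and the real
   // middleware API are omitted -- the point is only the relative timing.
   const int X = 2000;                 // ms

   sendProbeWithHold(clientA, X);      // Client A will echo at roughly t = X
   Sleep(X / 2);
   sendProbeWithHold(clientB, X / 2);  // Client B will also echo at roughly t = X

   // Both echoes should arrive at the publisher close together, the second
   // one landing while the publisher thread is already awake.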

If one of these returned probes does not show the anomaly (most likely the second response), it could point to the fact that the publisher thread is waking from a sleep cycle on receiving the first response and is then immediately available to process the second response.

Again, if it turns out that the 100us delay is roughly constant, it amounts to about 10% of 1ms, which is the timeframe appropriate for real-world network conditions.

Adrian Regan
I am not sure I follow your reasoning. Even if the responding thread is put to sleep, I measure the time between first getting the probe message and sending the reply, and subtract that time from the total. I am using UDP, so there is no connection setup/teardown.
samwise
How are you calculating 'now()' on the publishing side?
Adrian Regan
A: 

A bit more information. I am using the following to measure time:

   // Ticks-per-second of the high-resolution counter, captured once at startup.
   static long long timeFreq;

   static struct Init
   {
      Init()
      {
         QueryPerformanceFrequency((LARGE_INTEGER*) &timeFreq);
      }
   } init;

   // Current timestamp, in QueryPerformanceCounter ticks.
   long long OS::now()
   {
      long long result;
      QueryPerformanceCounter((LARGE_INTEGER*)&result);
      return result;
   }

   // Difference between two tick counts, converted to seconds.
   double OS::secondsDiff(long long ts1, long long ts2)
   {
      return (double) (ts1 - ts2) / timeFreq;
   }

On the publish side I do something like:

   Probe p;
   p.sentTimeStamp = OS::now();
   send(p);
   Response r = recv();
   double latency = OS::secondsDiff(OS::now(), r.sentTimeStamp) - r.secondsOnHoldOnReceiver;

And on the receiver side:

   Probe p = recv();
   long long received = OS::now();
   sleep();   // the deliberate "on hold" period
   Response r;
   r.sentTimeStamp = p.sentTimeStamp;
   r.secondsOnHoldOnReceiver = OS::secondsDiff(OS::now(), received);
   send(r);
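
If it helps, here is a self-contained sketch of the same exchange over loopback UDP in a single process (Winsock, one receiver thread, a Sleep() standing in for the hold period). The socket plumbing, port number, and struct names are illustrative, not the middleware's actual API, and error checking is omitted:

   #include <winsock2.h>
   #include <windows.h>
   #include <cstdio>

   #pragma comment(lib, "ws2_32.lib")

   struct Probe    { long long sentTimeStamp; };
   struct Response { long long sentTimeStamp; double secondsOnHoldOnReceiver; };

   static long long timeFreq;

   long long now()
   {
      long long t;
      QueryPerformanceCounter((LARGE_INTEGER*)&t);
      return t;
   }

   double secondsDiff(long long ts1, long long ts2)
   {
      return (double)(ts1 - ts2) / timeFreq;
   }

   // Receiver: wait for a probe, hold it for a while, echo it back with the hold time.
   DWORD WINAPI receiverThread(LPVOID)
   {
      SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
      sockaddr_in addr = {};
      addr.sin_family = AF_INET;
      addr.sin_port = htons(5001);
      addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
      bind(s, (sockaddr*)&addr, sizeof(addr));

      Probe p;
      sockaddr_in from = {};
      int fromLen = sizeof(from);
      recvfrom(s, (char*)&p, sizeof(p), 0, (sockaddr*)&from, &fromLen);
      long long received = now();

      Sleep(100);   // the deliberate "on hold" period under test

      Response r;
      r.sentTimeStamp = p.sentTimeStamp;
      r.secondsOnHoldOnReceiver = secondsDiff(now(), received);
      sendto(s, (const char*)&r, sizeof(r), 0, (sockaddr*)&from, fromLen);
      closesocket(s);
      return 0;
   }

   // Publisher: timestamp a probe, send it, and estimate one-way latency from the echo.
   int main()
   {
      WSADATA wsa;
      WSAStartup(MAKEWORD(2, 2), &wsa);
      QueryPerformanceFrequency((LARGE_INTEGER*)&timeFreq);

      HANDLE h = CreateThread(NULL, 0, receiverThread, NULL, 0, NULL);
      Sleep(50);   // crude: give the receiver time to bind before we send

      SOCKET s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
      sockaddr_in to = {};
      to.sin_family = AF_INET;
      to.sin_port = htons(5001);
      to.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

      Probe p;
      p.sentTimeStamp = now();
      sendto(s, (const char*)&p, sizeof(p), 0, (sockaddr*)&to, sizeof(to));

      Response r;
      sockaddr_in from = {};
      int fromLen = sizeof(from);
      recvfrom(s, (char*)&r, sizeof(r), 0, (sockaddr*)&from, &fromLen);

      double roundTrip = secondsDiff(now(), r.sentTimeStamp);
      double oneWay = (roundTrip - r.secondsOnHoldOnReceiver) / 2.0;
      printf("one-way latency estimate: %.1f us\n", oneWay * 1e6);

      WaitForSingleObject(h, INFINITE);
      closesocket(s);
      WSACleanup();
      return 0;
   }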
samwise
See my answer below.
Adrian Regan