views:

112

answers:

3

I've written a 'server' program that writes to shared memory, and a client program that reads from the memory. The server has different 'channels' that it can be writing to, which are just different linked lists that it's appending items too. The client is interested in some of the linked lists, and wants to read every node that's added to those lists as it comes in, with the minimum latency possible.

I have 2 approaches for the client:

  1. For each linked list, the client keeps a 'bookmark' pointer to keep its place within the linked list. It round robins the linked lists, iterating through all of them over and over (it loops forever), moving each bookmark one node forward each time if it can. Whether it can is determined by the value of a 'next' member of the node. If it's non-null, then jumping to the next node is safe (the server switches it from null to non-null atomically). This approach works OK, but if there are a lot of lists to iterate over, and only a few of them are receiving updates, the latency gets bad.

  2. The server gives each list a unique ID. Each time the server appends an item to a list, it also appends the ID number of the list to a master 'update list'. The client only keeps one bookmark, a bookmark into the update list. It endlessly checks if the bookmark's next pointer is non-null ( while(node->next_ == NULL) {} ), if so moves ahead, reads the ID given, and then processes the new node on the linked list that has that ID. This, in theory, should handle large numbers of lists much better, because the client doesn't have to iterate over all of them each time.

When I benchmarked the latency of both approaches (using gettimeofday), to my surprise #2 was terrible. The first approach, for a small number of linked lists, would often be under 20us of latency. The second approach would have small spats of low latencies but often be between 4,000-7,000us!

Through inserting gettimeofday's here and there, I've determined that all of the added latency in approach #2 is spent in the loop repeatedly checking if the next pointer is non-null. This is puzzling to me; it's as if the change in one process is taking longer to 'publish' to the second process with the second approach. I assume there's some sort of cache interaction going on I don't understand. What's going on?

Update: Originally, approach #2 used a condition variable, so that if node->next_ == NULL it would wait on the condition, and the server would notify on the condition everytime it issued an update. The latency was the same, and in trying to figure out why I reduced the code down to the approach above. I'm running on a multicore machine, so one process spinlocking shouldn't affect the other.

Update 2: node->next_ is volatile.

A: 

You are doing a Spin Lock in #2, which is generally not such a great idea, and is chewing up cycles.

MadKeithV
Spinlocks outperform signals if the expected wait-time is very low and locking happens often.
Space_C0wb0y
I'm running on a multicore machine, so the server and client aren't stealing cycles from one another.
Joseph Garvin
A: 

Have you tried adding a yield after each failed polling-attempt in your second approach? Just a guess, but it may reduce the power-looping.

With Boost.Thread this would look like this:

while(node->next_ == NULL) {
    boost::this_thread::yield( );
}
Space_C0wb0y
I could, but how would this help? The problem is latency, not CPU usage. I'm running on a multicore machine, so the server and client aren't stealing cycles from one another.
Joseph Garvin
+1  A: 

Since it sounds like reads and writes are occurring on separate CPUs, perhaps a memory barrier would help? Your writes may not be occurring when you expect them to be.

zdan
Memory barriers will control the *order* of the writes, which can make them take longer, but I don't think they ever make a write appear to another CPU faster. I've tried putting GCC's __sync_synchronize on both the writing and reading side to no effect.
Joseph Garvin