I have a 'server' program that updates many linked lists in shared memory in response to external events. I want client programs to notice an update on any of the lists as quickly as possible (lowest latency). The server marks a node's state_ as FILLED once its data has been filled in and its next pointer has been set to a valid location; until then, its state_ is NOT_FILLED_YET. I am using memory barriers to make sure clients don't see state_ as FILLED before the data within is actually ready (and it seems to work; I never see corrupt data). Also, state_ is volatile to be sure the compiler doesn't hoist the client's check of it out of its loops.
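
For reference, the publishing side would look roughly like this -- a sketch pieced together from the description above, not the actual server code (allocate_next_node() is a hypothetical helper, and 'now' is assumed to hold the event's timestamp):

// Server-side publish protocol (illustrative). The release barrier makes the
// payload and the next pointer globally visible before state_ flips to FILLED.
node->tv_sec_  = now.tv_sec;            // fill in the node's data
node->tv_usec_ = now.tv_usec;
node->next_    = allocate_next_node();  // hypothetical helper: point at a fresh NOT_FILLED_YET node
RELEASE_MEMORY_BARRIER;                 // everything above must be visible first
node->state_   = FILLED;                // only now may clients consume this node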

Keeping the server code exactly the same, I've come up with 3 different methods for the client to scan the linked lists for changes. The question is: Why is the 3rd method fastest?

Method 1: Round robin over all the linked lists (called 'channels') continuously, looking to see if any nodes have changed to 'FILLED':

void method_one()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {   
            Data* current_item = channel_cursors[i];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            log_latency(current_item->tv_sec_, current_item->tv_usec_);

            channel_cursors[i] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}

Method 1 gave very low latency when the number of channels was small. But when the number of channels grew (250K+) it became very slow because of looping over all the channels. So I tried...

Method 2: Give each linked list an ID. Keep a separate 'update list' to the side. Every time one of the linked lists is updated, push its ID on to the update list. Now we just need to monitor the single update list, and check the IDs we get from it.
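
For context, the server presumably publishes update-list entries with the same release-then-FILLED protocol as the data nodes. A rough sketch (id and next_update_slot are illustrative names, not from the real code):

// After publishing a node on channel id, the server publishes an update entry.
update_tail->list_id_ = id;               // which channel just changed
update_tail->next_    = next_update_slot; // hypothetical: a fresh NOT_FILLED_YET entry
RELEASE_MEMORY_BARRIER;                   // fields become visible before the flag
update_tail->state_   = FILLED;           // clients watching the update list can now see it
update_tail = next_update_slot;

With that in place, the client only has to watch the single update list: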

void method_two()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {   
        ACQUIRE_MEMORY_BARRIER;
        if(update_cursor->state_ == NOT_FILLED_YET) {
            continue;
        }

        ::uint32_t update_id = update_cursor->list_id_;

        Data* current_item = channel_cursors[update_id];

        if(current_item->state_ == NOT_FILLED_YET) {
            std::cerr << "This should never print." << std::endl; // it doesn't
            continue;
        }

        log_latency(current_item->tv_sec_, current_item->tv_usec_);

        channel_cursors[update_id] = static_cast<Data*>(current_item->next_.get(segment));
        update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
    }   
}

Method 2 gave TERRIBLE latency. Whereas Method 1 might give under 10us latency, Method 2 would inexplicably often give 8ms latency! Using gettimeofday, it appears that the change to update_cursor->state_ was very slow to propagate from the server's view to the client's (I'm on a multicore box, so I assume the delay is due to cache). So I tried a hybrid approach...

Method 3: Keep the update list. But loop over all the channels continuously, and within each iteration check if the update list has updated. If it has, go with the number pushed onto it. If it hasn't, check the channel we've currently iterated to.

void method_three()
{
    std::vector<Data*> channel_cursors;
    for(ChannelList::iterator i = channel_list.begin(); i != channel_list.end(); ++i)
    {
        Data* current_item = static_cast<Data*>(i->get(segment)->tail_.get(segment));
        channel_cursors.push_back(current_item);
    }

    UpdateID* update_cursor = static_cast<UpdateID*>(update_channel.tail_.get(segment));

    while(true)
    {
        for(std::size_t i = 0; i < channel_list.size(); ++i)
        {
            std::size_t idx = i;

            ACQUIRE_MEMORY_BARRIER;
            if(update_cursor->state_ != NOT_FILLED_YET) {
                //std::cerr << "Found via update" << std::endl;
                i--; // stay on this channel index next pass; this iteration handles the update instead
                idx = update_cursor->list_id_;
                update_cursor = static_cast<UpdateID*>(update_cursor->next_.get(segment));
            }

            Data* current_item = channel_cursors[idx];

            ACQUIRE_MEMORY_BARRIER;
            if(current_item->state_ == NOT_FILLED_YET) {
                continue;
            }

            found_an_update = true;

            log_latency(current_item->tv_sec_, current_item->tv_usec_);
            channel_cursors[idx] = static_cast<Data*>(current_item->next_.get(segment));
        }
    }
}

The latency of this method was as good as Method 1, but scaled to large numbers of channels. The problem is, I have no clue why. Just to throw a wrench in things: if I uncomment the 'found via update' part, it prints between EVERY latency log message, which means things are only ever being found via the update list! So I don't understand how this method can be faster than Method 2.

The full, compilable code (requires GCC and boost-1.41) that generates random strings as test data is at: http://pastebin.com/0kuzm3Uf

Update: All 3 methods are effectively busy-waiting (spinning) until an update occurs. The difference is in how long it takes them to notice that the update has occurred. They all continuously tax the processor, so that doesn't explain the speed difference. I'm testing on a 4-core machine with nothing else running, so the server and the client have nothing to compete with. I've even made a version of the code where updates signal a condition and have clients wait on the condition -- it didn't help the latency of any of the methods.

Update 2: Despite there being 3 methods, I've only tried 1 at a time, so only 1 server and 1 client are competing for the state_ member.

A: 

I've noticed in both method 1 and method 3 you have a line, ACQUIRE_MEMORY_BARRIER, which I assume has something to do with multi-threading/race conditions?

Either way, method 2 doesn't have any sleeps, which means the following code...

while(true)
{   
    if(update_cursor->state_ == NOT_FILLED_YET) {
        continue;
    }

is going to hammer the processor. The typical way to do this kind of producer/consumer task is to use some kind of semaphore to signal to the reader that the update list has changed. A search for producer/consumer multithreading should give you a large number of examples. The main idea is that this lets the client thread go to sleep while it's waiting for update_cursor->state_ to change, which keeps it from stealing all the CPU cycles.
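
For example, here's a minimal sketch of that idea using POSIX named semaphores (none of this is in the original code; note the question's update reports that a blocking wait didn't change the measured latency):

#include <semaphore.h>
#include <fcntl.h>

// Server: open the semaphore once, then post after each publish.
sem_t* updates = sem_open("/updates", O_CREAT, 0600, 0);
// ... after setting a node's state_ to FILLED ...
sem_post(updates);                  // wake a sleeping client

// Client: block instead of spinning.
sem_t* updates_c = sem_open("/updates", 0);
for(;;)
{
    sem_wait(updates_c);            // sleeps until the server posts
    // ... walk the update list / channels exactly as before ...
}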

Pace
Yes, it's for avoiding race conditions. It was supposed to be in method 2 as well; I've edited my post to fix that (timings are the same). Modern processors (not just the compiler) are free to reorder stores and loads, so one thread might set A then B, but another thread may 'see' that B was set before it sees that A was set. The server fills the node's data, does a RELEASE_MEMORY_BARRIER, then sets state_ to FILLED. Combined with the client doing an ACQUIRE_MEMORY_BARRIER, this ensures that the client never perceives state_ as FILLED before the node's data is filled in.
Joseph Garvin
I should add that this is my first time using memory barriers, and that's how I *think* they're supposed to work ;) If you follow the pastebin link, the first file, list.H, defines RELEASE_MEMORY_BARRIER and ACQUIRE_MEMORY_BARRIER in terms of the primitives documented here: http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
Joseph Garvin
Sorry to spam ;p I've edited my post (see the bottom) to address the hammering of the processor.
Joseph Garvin
+1  A: 

Hypothesis: Method 2 is somehow blocking the update from getting written by the server.

One of the things you can hammer, besides the processor cores themselves, is your coherent cache. When you read a value on a given core, the L1 cache on that core has to acquire read access to that cache line, which means it needs to invalidate the write access to that line that any other cache has. And vice versa to write a value. So this means that you're continually ping-ponging the cache line back and forth between a "write" state (on the server-core's cache) and a "read" state (in the caches of all the client cores).

The intricacies of x86 cache performance are not something I am entirely familiar with, but it seems entirely plausible (at least in theory) that having three different threads hammer this one memory location as hard as they can with read-access requests is approximately a denial-of-service attack on the server, preventing it from writing to that cache line for a few milliseconds on occasion.

You may be able to do an experiment to detect this by looking at how long it takes for the server to actually write the value into the update list, and see if there's a delay there corresponding to the latency.
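
A sketch of that instrumentation (hypothetical -- it just wraps the server's publishing store in gettimeofday calls):

#include <sys/time.h>

struct timeval before, after;
gettimeofday(&before, 0);
RELEASE_MEMORY_BARRIER;
node->state_ = FILLED;              // the store whose cost we want to observe
gettimeofday(&after, 0);
long delta_us = (after.tv_sec - before.tv_sec) * 1000000L
              + (after.tv_usec - before.tv_usec);
// Caveat: this measures when the store retires locally, not when the cache
// line actually becomes visible to the other core.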

You might also be able to try an experiment of removing cache from the equation, by running everything on a single core so the client and server threads are pulling things out of the same L1 cache.
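
One way to do that on Linux is to pin both processes to the same core (a sketch; sched_setaffinity is not in the original code, and running both programs under taskset -c 0 achieves the same thing):

#include <sched.h>     // requires -D_GNU_SOURCE for cpu_set_t / CPU_SET
#include <stdio.h>

// Pin the calling process to core 0; do this in both the server and the client.
cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set);
if(sched_setaffinity(0, sizeof(set), &set) != 0)
    perror("sched_setaffinity");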

Brooks Moses
I put gettimeofday calls around the write in the server, and it always takes 0 or 1 microseconds. But is it possible that the write finishes faster from the server's perspective than it becomes visible to the client, for the reason you mentioned? i.e. is it plausible the processor is queueing up the write and then 'returning' immediately? I'll try running it all on one core tomorrow. Also, I'm only trying 1 method at a time, so there is only the 1 server and 1 client competing.
Joseph Garvin
That seems possible; I don't know enough about x86 caches to say for certain one way or the other -- or, for that matter, instruction reordering within the processor core. I'm handwaving here, a bit. (Also, somewhere I had the idea that you had multiple clients running in parallel for one server, I guess because you mentioned 4 cores. That doesn't really change things much, though.)
Brooks Moses
I think Herb Sutter has written something about those multi-core cache systems. Is it possible for you to specify which core each thread executes on?
Matthieu M.
+1  A: 

I don't know if you have ever read the Concurrency columns from Herb Sutter. They are quite interesting, especially when you get into the cache issues.

Indeed, Method 2 seems better here because the ID, being smaller than the data in general, means that you don't have to do round trips to main memory too often (which is taxing).

However, what can actually happen is that you end up with a cache line like this:

Line of cache = [ID1, ID2, ID3, ID4, ...]
                  ^         ^
            client          server

This creates contention: the client's reads and the server's writes hit the same cache line even though they touch different IDs (false sharing).

Here is Herb Sutter's article: Eliminate False Sharing. The basic idea is simply to pad each ID in the list so that it occupies an entire cache line by itself.
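
A sketch of that padding (64 bytes is a typical x86 cache-line size; the exact size and the wrapper name are assumptions):

// Give each update entry its own cache line, so the server writing one entry
// no longer invalidates the line a client is busy polling.
union PaddedUpdateID
{
    UpdateID entry;
    char     pad[64];   // assumes sizeof(UpdateID) <= 64 on this machine
};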

Check out the other articles in the series while you're at it. Perhaps you'll get some ideas. There's a nice lock-free circular buffer, I think, that could help with your update list :)

Matthieu M.
Interesting. I thought of the cache issue before, but figured I'd be somewhat automatically protected from it by virtue of using a linked list as opposed to an array. I haven't actually tried comparing the addresses of the UpdateID objects, though; it's possible the allocator is giving me relatively contiguous chunks.
Joseph Garvin
A: 

The answer was tricky to figure out and, to be fair, would be hard to guess from the information I presented, though anyone who actually compiled the source code I provided would have had a fighting chance ;) I said that "found via update list" was printed after every latency log message, but this wasn't actually true -- it was only true for as far back as I could scroll in my terminal. At the very beginning there was a slew of updates found without using the update list.

The issue is that between the time when I set my starting point in the update list and my starting point in each of the data lists, there is going to be some lag because these operations take time. Remember, the lists are growing the whole time this is going on. Consider the simplest case where I have 2 data lists, A and B. When I set my starting point in the update list there happen to be 60 elements in it, due to 30 updates on list A and 30 updates on list B. Say they've alternated:

A
B
A
B
A // and I start looking at the list here
B

But then, after I set my update-list cursor there, there's a slew of updates to B and no updates to A. Then I set my starting places in each of the data lists. My starting points for the data lists end up after that surge of updates, but my starting point in the update list is before the surge, so now I'm going to check for a bunch of updates without finding them. The hybrid approach above works best because, by iterating over all the channels when it can't find an update, it quickly closes the temporal gap between where the update list is and where the data lists are.

Joseph Garvin