This is a concurrent queue I wrote, which I plan to use in a thread pool I'm also writing. I'm wondering if there are any performance improvements I could make. atomic_counter is pasted below in case you're curious!

#ifndef NS_CONCURRENT_QUEUE_HPP_INCLUDED
#define NS_CONCURRENT_QUEUE_HPP_INCLUDED

#include <ns/atomic_counter.hpp>
#include <boost/noncopyable.hpp>
#include <boost/smart_ptr/detail/spinlock.hpp>
#include <cassert>
#include <cstddef>

namespace ns {
    template<typename T,
             typename mutex_type = boost::detail::spinlock,
             typename scoped_lock_type = typename mutex_type::scoped_lock>
    class concurrent_queue : boost::noncopyable {
        struct node {
            node * link;
            T const value;
            explicit node(T const & source) : link(0), value(source) { }
        };
        node * m_front;
        node * m_back;
        atomic_counter m_counter;
        mutex_type m_mutex;
    public:
        // types
        typedef T value_type;

        // construction
        concurrent_queue() : m_front(0), m_back(0), m_mutex() { }
        ~concurrent_queue() { clear(); }

        // capacity
        std::size_t size() const { return m_counter; }
        bool empty() const { return (m_counter == 0); }

        // modifiers
        void push(T const & source);
        bool try_pop(T & destination);
        void clear();
    };

    template<typename T, typename mutex_type, typename scoped_lock_type>
    void concurrent_queue<T, mutex_type, scoped_lock_type>::push(T const & source) {
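        // Allocate before taking the lock so the (potentially slow and
        // contended) allocator call stays outside the critical section.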
        node * hold = new node(source);
        scoped_lock_type lock(m_mutex);
        if (empty())
            m_front = hold;
        else
            m_back->link = hold;
        m_back = hold;
        ++m_counter;
    }

    template<typename T, typename mutex_type, typename scoped_lock_type>
    bool concurrent_queue<T, mutex_type, scoped_lock_type>::try_pop(T & destination) {
        node const * hold;
        {
            scoped_lock_type lock(m_mutex);
            if (empty())
                return false;
            hold = m_front;
            if (m_front == m_back)
                m_front = m_back = 0;
            else
                m_front = m_front->link;
            --m_counter;
        }
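        // Copy the value and free the node after releasing the lock,
        // keeping the critical section as short as possible.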
        destination = hold->value;
        delete hold;
        return true;
    }

    template<typename T, typename mutex_type, typename scoped_lock_type>
    void concurrent_queue<T, mutex_type, scoped_lock_type>::clear() {
        node * hold;
        {
            scoped_lock_type lock(m_mutex);
            hold = m_front;
            m_front = 0;
            m_back = 0;
            m_counter = 0;
        }
        while (hold != 0) {
            node * it = hold;
            hold = hold->link;
            delete it;
        }
    }
}

#endif

atomic_counter.hpp

#ifndef NS_ATOMIC_COUNTER_HPP_INCLUDED
#define NS_ATOMIC_COUNTER_HPP_INCLUDED

#include <boost/cstdint.hpp>
#include <boost/interprocess/detail/atomic.hpp>
#include <boost/noncopyable.hpp>

namespace ns {
    class atomic_counter : boost::noncopyable {
        volatile boost::uint32_t m_count;
    public:
        explicit atomic_counter(boost::uint32_t value = 0) : m_count(value) { }

        operator boost::uint32_t() const {
            return boost::interprocess::detail::atomic_read32(const_cast<volatile boost::uint32_t *>(&m_count));
        }

        void operator=(boost::uint32_t value) {
            boost::interprocess::detail::atomic_write32(&m_count, value);
        }

        void operator++() {
            boost::interprocess::detail::atomic_inc32(&m_count);
        }

        void operator--() {
            boost::interprocess::detail::atomic_dec32(&m_count);
        }
    };
}

#endif
+1  A: 

I think you will run into performance problems with a linked list in this case because you call new for each node. And it isn't just that the dynamic memory allocator is slow: calling it frequently adds a lot of concurrency overhead, because the free store has to be kept consistent in a multi-threaded environment.

I would use a vector that you resize larger whenever it becomes too small to hold the queue. I would never resize it smaller.

I would arrange the front and back indices so the vector acts as a ring buffer. This does require moving elements when you resize, but that should be a fairly rare event, and it can be mitigated to some extent by accepting a suggested vector size at construction.
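
Something like this is what I have in mind (just a sketch with made-up names, and locking is omitted: you would wrap push and try_pop in the same scoped_lock scheme you already have):

#include <vector>
#include <cstddef>

template<typename T>
class ring_queue {
    std::vector<T> m_buffer;
    std::size_t m_front;  // index of the oldest element
    std::size_t m_size;   // number of stored elements
public:
    explicit ring_queue(std::size_t size_hint = 16)
        : m_buffer(size_hint ? size_hint : 1), m_front(0), m_size(0) { }

    void push(T const & source) {
        if (m_size == m_buffer.size())
            grow();
        m_buffer[(m_front + m_size) % m_buffer.size()] = source;
        ++m_size;
    }

    bool try_pop(T & destination) {
        if (m_size == 0)
            return false;
        destination = m_buffer[m_front];
        m_front = (m_front + 1) % m_buffer.size();
        --m_size;
        return true;
    }

private:
    // Doubling un-wraps the ring into a fresh vector. This is the rare,
    // expensive move that a good size_hint at construction mitigates.
    void grow() {
        std::vector<T> larger(m_buffer.size() * 2);
        for (std::size_t i = 0; i < m_size; ++i)
            larger[i] = m_buffer[(m_front + i) % m_buffer.size()];
        m_buffer.swap(larger);
        m_front = 0;
    }
};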

Alternatively, you could keep the linked-list structure but never destroy a node; just add it to a queue of free nodes. Unfortunately, the queue of free nodes is going to require locking to manage properly, and I'm not sure that leaves you in a better place than calling delete and new all the time.
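
If you did try the free-node route, the pool could look something like this (again just a sketch with made-up names; the caller is assumed to hold whatever lock also guards the queue):

#include <new>
#include <cstddef>

template<typename Node>
class node_pool {
    // Free blocks are chained through their own storage, overwriting the
    // destroyed node's bytes (the usual pool-allocator trick). This needs
    // sizeof(Node) >= sizeof(free_block), which holds for any linked node.
    struct free_block { free_block * next; };
    free_block * m_free;
public:
    node_pool() : m_free(0) { }

    ~node_pool() {
        while (m_free != 0) {
            free_block * b = m_free;
            m_free = b->next;
            operator delete(b);  // storage only; objects died in release()
        }
    }

    // Hand back raw storage for one node, recycled if any is parked.
    void * allocate() {
        if (m_free == 0)
            return operator new(sizeof(Node));
        free_block * b = m_free;
        m_free = b->next;
        return b;
    }

    // Destroy the node but park its storage for reuse.
    void release(Node * n) {
        n->~Node();
        free_block * b = reinterpret_cast<free_block *>(n);
        b->next = m_free;
        m_free = b;
    }
};

push would then build nodes with placement new, e.g. new (pool.allocate()) node(source), and try_pop would hand the spent node to pool.release() instead of calling delete.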

You will also get better locality of reference with a vector, but I'm not positive how that will interact with the cache lines having to shuttle back and forth between CPUs.

Some others suggest a ::std::deque, and I don't think that's a bad idea, but I suspect the ring-buffer vector is a better idea.

Omnifarious
+1  A: 

Herb Sutter proposed an implementation of a lock-free queue that would surely outperform yours :)

The main idea is to use a ring buffer, forgoing dynamic memory allocation altogether while the queue is running. This means that the queue can be full (and thus you may have to wait to put an element in), which may not be acceptable in your case.
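
To give a flavor of the technique, here is a single-producer/single-consumer ring in that spirit. This is my own paraphrase using C++0x/C++11 std::atomic, not Sutter's actual code, and it sacrifices one slot to distinguish full from empty:

#include <atomic>
#include <cstddef>

template<typename T, std::size_t Capacity>
class spsc_ring {
    T m_slots[Capacity];
    std::atomic<std::size_t> m_head;  // next slot to read; moved only by the consumer
    std::atomic<std::size_t> m_tail;  // next slot to write; moved only by the producer
public:
    spsc_ring() : m_head(0), m_tail(0) { }

    bool try_push(T const & source) {
        std::size_t tail = m_tail.load(std::memory_order_relaxed);
        std::size_t next = (tail + 1) % Capacity;
        if (next == m_head.load(std::memory_order_acquire))
            return false;                                   // full: caller must retry
        m_slots[tail] = source;
        m_tail.store(next, std::memory_order_release);      // publish the element
        return true;
    }

    bool try_pop(T & destination) {
        std::size_t head = m_head.load(std::memory_order_relaxed);
        if (head == m_tail.load(std::memory_order_acquire))
            return false;                                   // empty
        destination = m_slots[head];
        m_head.store((head + 1) % Capacity, std::memory_order_release);
        return true;
    }
};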

As Omnifarious noted, it would be better not to use a linked list (for cache locality), unless you allocate the nodes from a pool. I would try using a std::deque as the backend: it is much more memory friendly, and it guarantees that existing elements are never moved as long as you only push and pop at the ends, which is normally the case for a queue.
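
A minimal sketch of that swap, reusing the spinlock from the question (the class name is mine):

#include <deque>
#include <boost/smart_ptr/detail/spinlock.hpp>

template<typename T>
class deque_queue {
    std::deque<T> m_items;
    boost::detail::spinlock m_mutex;
public:
    deque_queue() : m_items(), m_mutex() { }

    void push(T const & source) {
        boost::detail::spinlock::scoped_lock lock(m_mutex);
        m_items.push_back(source);  // chunked allocation, no per-node new
    }

    bool try_pop(T & destination) {
        boost::detail::spinlock::scoped_lock lock(m_mutex);
        if (m_items.empty())
            return false;
        destination = m_items.front();
        m_items.pop_front();
        return true;
    }
};

The one trade-off is that the copy into destination now happens while the lock is held, whereas the original try_pop deliberately does it after unlocking.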

Matthieu M.
I did not know of Herb Sutter's idea. :-) I would have to look at the details, but I bet you could implement the same kind of expanding (but never contracting) buffer that I suggest on top of it.
Omnifarious
Probably. See this Google Talk by Dr. Cliff Click on implementing a lock-free hash table for Java, where the core idea is, of course, a dynamic array: http://video.google.com/videoplay?docid=2139967204534450862#. What's nice is that the cost of reallocation is spread across multiple writes, so you don't freeze one thread when the need to reallocate kicks in. However, you still have to ask for memory, which always takes a bit of time.
Matthieu M.