I have a project with a single producer thread that writes events into a buffer and a single consumer thread that takes events from that buffer. My goal is to optimize this for a single dual-core machine to achieve maximum throughput.
Currently, I am using a simple lock-free ring buffer (lock-free is possible because there is only one producer and one consumer thread, so each position index is updated by exactly one thread).
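#include <stdlib.h> /* malloc */
#define BUF_SIZE 32768
typedef struct buf_t {
    volatile int writepos;            /* written only by the producer */
    void * volatile buffer[BUF_SIZE]; /* the slots are volatile, not the events they point to */
    volatile int readpos;             /* written only by the consumer */
} buf_t;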
void produce (buf_t *b, void * e) {
int next = (b->writepos+1) % BUF_SIZE;
while (b->readpos == next); // queue is full. wait
b->buffer[b->writepos] = e; b->writepos = next;
}
void * consume (buf_t *b) {
while (b->readpos == b->writepos); // nothing to consume. wait
int next = (b->readpos+1) % BUF_SIZE;
void * res = b->buffer[b->readpos]; b->readpos = next;
return res;
}
buf_t *alloc () {
buf_t *b = (buf_t *)malloc(sizeof(buf_t));
b->writepos = 0; b->readpos = 0; return b;
}
However, this implementation is not yet fast enough and needs to be optimized further. I've tried different BUF_SIZE values and got some speed-up. Additionally, I've moved writepos before the buffer and readpos after the buffer to ensure that the two variables are on different cache lines, which also gave some speed-up.
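To make the layout idea concrete, here is a minimal sketch of the same separation done with explicit alignment instead of relying on the array sitting between the two indices. It assumes a C11 compiler and 64-byte cache lines (common on x86, but not verified for the target machine). The names padded_buf_t, CACHE_LINE and alloc_padded are hypothetical and only used for illustration.

```c
#include <assert.h>   /* static_assert */
#include <stdalign.h> /* alignas */
#include <stddef.h>   /* offsetof */
#include <stdlib.h>   /* aligned_alloc, free */

#define BUF_SIZE 32768
#define CACHE_LINE 64 /* assumption: 64-byte cache lines */

typedef struct padded_buf_t {
    /* Each index starts its own cache line, so the producer's stores to
       writepos do not invalidate the line that holds readpos, and vice versa. */
    alignas(CACHE_LINE) volatile int writepos; /* written only by the producer */
    alignas(CACHE_LINE) volatile int readpos;  /* written only by the consumer */
    alignas(CACHE_LINE) void * volatile buffer[BUF_SIZE];
} padded_buf_t;

/* Compile-time checks that the layout is what the comments claim. */
static_assert(offsetof(padded_buf_t, readpos) - offsetof(padded_buf_t, writepos)
                  >= CACHE_LINE, "indices share a cache line");
static_assert(offsetof(padded_buf_t, buffer) % CACHE_LINE == 0,
              "buffer does not start on a cache line boundary");

padded_buf_t *alloc_padded(void) {
    /* malloc only guarantees fundamental alignment, so request the cache
       line alignment explicitly; sizeof(padded_buf_t) is a multiple of it. */
    padded_buf_t *b = aligned_alloc(CACHE_LINE, sizeof(padded_buf_t));
    if (b != NULL) {
        b->writepos = 0;
        b->readpos = 0;
    }
    return b;
}
```

This only changes where the fields live. The produce and consume logic would stay the same, and whether it is any faster than the current field ordering would have to be measured.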
What I need is a speed-up of about 400%. Do you have any ideas how I could achieve this, for example with techniques such as padding?