Disclaimer: please do not bother reading this long post unless you are an embedded programmer, a kernel developer, or a high-performance system-software programmer. It is about memory allocators and might not be of any interest to you otherwise.
I am building a library for high-volume, high-performance same-machine IPC, using OS-specific pipes (named pipes on Windows and UNIX domain sockets on UNIX). The library receives messages and injects them into a processing engine; this engine could live in the same process, in C++ or something like Erlang or Lua, or perhaps be moved off to a central machine.
The messages at the moment come from within a web server and are part of the resolution pipeline of an HTTP request, so the library should scale with the web server -- i.e., I cannot write an inefficient system, because a web server's worker threads depend on the speed with which they can offload these messages. If a web server is capable of handling 10k requests for static pages, it should not be massively hampered by this IPC mechanism.
The data comes in as typed messages, and I believe they will have their own size statistics which could be leveraged. Doug Lea's allocator creates linked lists of buckets by size, if I remember correctly; perhaps this is the way I should go -- it is the standard for a general allocator.
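To make that concrete, here is a minimal sketch of the size-class bucketing I have in mind; the names and the power-of-two classes are my own assumptions, not dlmalloc's actual bin layout:

#include <stddef.h>

#define NUM_CLASSES 8

typedef struct free_block
{
    struct free_block * next;             /* next free block in the same size class */
} free_block;

static free_block * buckets[NUM_CLASSES]; /* one free list per size class */

/* Map a request size to a bucket index: 64, 128, 256, ... bytes. */
static size_t size_class(size_t n)
{
    size_t c = 0, cap = 64;
    while (cap < n && c < NUM_CLASSES - 1)
    {
        cap <<= 1;
        ++c;
    }
    return c;
}

Messages of a given type would then mostly hit the same bucket, which is where the per-type size statistics could be leveraged.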
I want zero-copy semantics, so until the message is done processing it should not be copied around. On a side note: this perhaps takes Google's protocol buffers out of the picture (can't be sure at the moment), as the wire format is different from the in-memory object format. I will investigate further -- my current implementation uses my own IPC format; perhaps someone has some opinions on how protocol buffers does things -- does it use zero-copy internally?
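For clarity, this is roughly what I mean by zero-copy: the consumer reads the payload where it was received, by overlaying a header struct on the receive buffer. The header layout here is a made-up placeholder for my own IPC format, and it assumes the buffer is suitably aligned for the header type:

#include <stdint.h>

typedef struct msg_header
{
    uint32_t type;                        /* message type tag */
    uint32_t length;                      /* payload length in bytes */
} msg_header;

/* The payload is never copied: we return a pointer into the buffer itself. */
static const char * msg_payload(const void * recv_buffer)
{
    const msg_header * h = (const msg_header *)recv_buffer;
    return (const char *)(h + 1);         /* payload starts right after the header */
}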
I want to use this allocator in all areas of the application. My study of network-stack software is urging me towards a processor-page-aligned, pool-based allocator; my application certainly has the same access semantics. I would copy all messages into a page and only recycle the page once all elements allocated from it are inactive.
To slightly formalize this and help you understand what I am thinking about, consider the following data structure:
typedef struct _page
{
    int active_elements;             // page is returned to the empty pool at zero
    int free_bytes;                  // used to sort the pages
    void * next_free_contiguous;     // pointer to where the next allocation may occur
    struct _page * next_page;        // linked-list pointer to the next page
} page;

struct pageAllocator
{
    page * ll_empty;                 // completely empty pages
    page * ll_active;                // sorted by highest size available, could be a binary tree
                                     // continue reading for rationale.
};
Some points are:
- The messages won't linger around, so pages will become free fairly soon.
- I need to write bullet-proof code to free the pages.
- I do not know if page->next_free_contiguous needs to be word-aligned (see the alignment sketch after this list).
- I assume that under pressure the number of pages allocated will be stable.
- Should I be returning pages to the OS, or should I just keep them? Once maximum pressure has been established, the allocator would only hold as many pages as it actually needs.
- Should I allocate messages from the first element in the size-sorted list (first being the one with the most free space), or should I find the best-matching page? In that case I would have to use a binary tree of some sort. Am I re-implementing Doug's algorithm here?
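To show what I mean by the alignment and first-fit questions, here is a sketch of allocating from the head of the size-sorted list, conservatively rounding the cursor up to word alignment; page_alloc is my own placeholder name and the fallback paths are elided:

#include <stddef.h>
#include <stdint.h>

/* Sketch: bump-allocate n bytes from the first (largest-free) active page. */
static void * page_alloc(struct pageAllocator * a, size_t n)
{
    page * p = a->ll_active;         /* head holds the most free space */
    if (!p)
        return NULL;                 /* caller would pull a page from ll_empty */

    uintptr_t cur = (uintptr_t)p->next_free_contiguous;
    uintptr_t aligned = (cur + sizeof(void *) - 1) & ~(uintptr_t)(sizeof(void *) - 1);
    size_t need = n + (size_t)(aligned - cur);

    if ((size_t)p->free_bytes < need)
        return NULL;                 /* or walk further down the list */

    p->next_free_contiguous = (void *)(aligned + n);
    p->free_bytes -= (int)need;
    p->active_elements++;            /* page is recycled when this drops to zero */
    return (void *)aligned;
}

Best-fit would replace the ll_active head lookup with a tree search, which is exactly where it starts to smell like re-implementing Doug's bins.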
I want suggestions from people who have personally dealt with such things; alternate allocator suggestions would be welcome. Off-the-shelf libraries would be great (they have to be ANSI C or a C-preprocessor meta-code implementation). The APR is a popular library with a hierarchical pool-based strategy, but I am not dealing with sessions at this low level, so I don't need a hierarchy.
I don't think I am pre-optimizing, because I am convinced I need it. My current implementation of this system jumps to 30% CPU during web-server CPU saturation, and I know this is because of my naive way of programming the application, using new and a lot of copy construction in C++. I will of course benchmark my application before I proceed (I am too busy at the moment to re-build a test environment), but any discussion here will be useful regardless of the outcome.
--- edit 1 ---
This implementation was inspired by some comments from Markr; check the answer below for some discussion. A bit more thinking led to the following.
OK, I am thinking about a circular buffer now (say 1 MB^x), composed of equal-sized elements (perhaps some multiple of 128 bytes). Allocations happen incrementally towards the end, with an allocate-head (A) marking where the next allocation may occur, and a free-marker (F) marking the contiguous free regions (0) being returned to the pool; a rough code sketch follows the list of specifics below. Consider the following ASCII illustration:
|000000F101110001111A00000000|
0 = free, 1 = occupied.
Some specifics are:
- The free-marker delimits the contiguous free region. In the illustration, F will only move forward (by two) once the chunk immediately following it is returned.
- Once the allocation reaches the end, or there isn't enough space at the end to perform a contiguous allocation, the allocate-head moves to the beginning; most of the buffer should be free by this point.
- Once a message is moved to its destination, its chunks are immediately returned to the buffer.
- For same-machine delivery the messages will be delivered in short order (with some blocking), or immediately if it just requires a thread-free function call.
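For concreteness, here is a minimal single-threaded sketch of that circular-buffer bookkeeping. The 1 MB / 128-byte numbers, the names, and the used[] occupancy array are all my own assumptions standing in for the real bookkeeping:

#include <stddef.h>

#define CHUNK   128
#define NCHUNKS 8192                 /* 8192 * 128 bytes = a 1 MB ring */

static unsigned char ring[NCHUNKS * CHUNK];
static unsigned char used[NCHUNKS];  /* 1 = occupied, 0 = free */
static size_t A = 0;                 /* allocate-head (chunk index) */
static size_t F = 0;                 /* free-marker (chunk index) */

/* Allocate n contiguous chunks, wrapping to the start when the tail is
   too short; returns NULL if the run would overlap occupied chunks. */
static void * ring_alloc(size_t n)
{
    size_t start = A, i;
    if (start + n > NCHUNKS)
        start = 0;                   /* wrap: most of the buffer should be free */
    for (i = 0; i < n; ++i)
        if (used[start + i])
            return NULL;             /* allocation has caught up with F */
    for (i = 0; i < n; ++i)
        used[start + i] = 1;
    A = (start + n) % NCHUNKS;
    return &ring[start * CHUNK];
}

/* Return n chunks, then advance F across the now-contiguous free run. */
static void ring_free(void * ptr, size_t n)
{
    size_t idx = (size_t)((unsigned char *)ptr - ring) / CHUNK, i;
    for (i = 0; i < n; ++i)
        used[idx + i] = 0;
    while (F != A && !used[F])
        F = (F + 1) % NCHUNKS;       /* F only moves over freed chunks */
}

Given the delivery pattern above, F should rarely be far behind A, so the wrap at the end should almost always land on free space.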
Comments would be nice, and I would still prefer some self-contained, portable library if anyone knows of any that does something similar. The GLib slice allocator looked nice, but I believe the above implementation is simpler and probably faster, and active regions of the buffer will always have good locality.