Disclaimer: please do not bother reading this long post unless you are an embedded programmer, a kernel developer, or a high-performance system-software programmer. It is about memory allocators and might not be of any interest to you otherwise.
I am building a library for high-volume, high-performance same-machine IPC, using OS-specific pipes (named pipes on Windows and UNIX domain sockets on UNIX). The library receives messages and injects them into a processing engine; this engine could live in the same process, in C++ or something like Erlang or Lua, or perhaps be moved off to a central machine.
The messages at the moment come from within a web server and are part of the resolution pipeline of an HTTP request, so the library should scale with the web server -- i.e., I cannot write an inefficient system, because a web server's worker threads depend on the speed with which they can offload these messages. If a web server is capable of handling 10k requests for static pages, it should not be massively hampered by this IPC mechanism.
The data comes in as typed messages, and I believe they will have their own size statistics which could be leveraged. Doug Lea's allocator creates linked lists of buckets by size, if I remember correctly; perhaps this is the way I should go -- it is the standard for a general allocator.
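To make that concrete, here is a minimal sketch of the size-class bucketing I have in mind; the names and the power-of-two classes are my own assumptions, not dlmalloc's actual bin layout:

#include <stddef.h>

#define NUM_CLASSES 8

typedef struct free_block
{
    struct free_block * next;             /* next free block in the same size class */
} free_block;

static free_block * buckets[NUM_CLASSES]; /* one free list per size class */

/* Map a request size to a bucket index: 64, 128, 256, ... bytes. */
static size_t size_class(size_t n)
{
    size_t c = 0, cap = 64;
    while (cap < n && c < NUM_CLASSES - 1)
    {
        cap <<= 1;
        ++c;
    }
    return c;
}

Messages of a given type would then mostly hit the same bucket, which is where the per-type size statistics could be leveraged.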
I want zero-copy semantics, so until the message is done processing it should not be copied around. On a side note: this perhaps takes Google's protocol buffers out of the picture (can't be sure at the moment), as the wire format is different from the in-memory object format. I will investigate further -- my current implementation uses my own IPC format; perhaps someone has some opinions on how protocol buffers does things -- does it use zero-copy internally?
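For clarity, this is roughly what I mean by zero-copy: the consumer reads the payload where it was received, by overlaying a header struct on the receive buffer. The header layout here is a made-up placeholder for my own IPC format, and it assumes the buffer is suitably aligned for the header type:

#include <stdint.h>

typedef struct msg_header
{
    uint32_t type;                        /* message type tag */
    uint32_t length;                      /* payload length in bytes */
} msg_header;

/* The payload is never copied: we return a pointer into the buffer itself. */
static const char * msg_payload(const void * recv_buffer)
{
    const msg_header * h = (const msg_header *)recv_buffer;
    return (const char *)(h + 1);         /* payload starts right after the header */
}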
I want to use this allocator in all areas of the application. My study of network-stack software is urging me towards a processor-page-aligned, pool-based allocator; my application certainly has the same access semantics. I would copy all messages into a page and only recycle the page once all elements allocated from it are inactive.
To slightly formalize this and help you understand what I am thinking about, consider the following data structure:
typedef struct _page
{
    int active_elements;             // page is returned to the empty pool at zero
    int free_bytes;                  // used to sort the pages
    void * next_free_contiguous;     // pointer to where the next allocation may occur
    struct _page * next_page;        // linked-list pointer to the next page
} page;

struct pageAllocator
{
    page * ll_empty;                 // completely empty pages
    page * ll_active;                // sorted by highest size available, could be a binary tree
                                     // continue reading for rationale.
};
Some points are:
- The messages won't linger around, so pages will become free fairly soon.
- I need to write bullet-proof code to free the pages.
- I do not know if page->next_free_contiguous needs to be word-aligned (see the alignment sketch after this list).
- I assume that under pressure the number of pages allocated will be stable.
- Should I be returning pages to the OS, or should I just keep them? Once maximum pressure has been established, the allocator would only hold as many pages as it actually needs.
- Should I allocate messages from the first element in the size-sorted list (first being the one with the most free space), or should I find the best-matching page? In that case I would have to use a binary tree of some sort. Am I re-implementing Doug's algorithm here?
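To show what I mean by the alignment and first-fit questions, here is a sketch of allocating from the head of the size-sorted list, conservatively rounding the cursor up to word alignment; page_alloc is my own placeholder name and the fallback paths are elided:

#include <stddef.h>
#include <stdint.h>

/* Sketch: bump-allocate n bytes from the first (largest-free) active page. */
static void * page_alloc(struct pageAllocator * a, size_t n)
{
    page * p = a->ll_active;         /* head holds the most free space */
    if (!p)
        return NULL;                 /* caller would pull a page from ll_empty */

    uintptr_t cur = (uintptr_t)p->next_free_contiguous;
    uintptr_t aligned = (cur + sizeof(void *) - 1) & ~(uintptr_t)(sizeof(void *) - 1);
    size_t need = n + (size_t)(aligned - cur);

    if ((size_t)p->free_bytes < need)
        return NULL;                 /* or walk further down the list */

    p->next_free_contiguous = (void *)(aligned + n);
    p->free_bytes -= (int)need;
    p->active_elements++;            /* page is recycled when this drops to zero */
    return (void *)aligned;
}

Best-fit would replace the ll_active head lookup with a tree search, which is exactly where it starts to smell like re-implementing Doug's bins.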
I want suggestions from people who have personally dealt with such things; alternate allocator suggestions would be welcome. Off-the-shelf libraries would be great (they have to be ANSI C or a C-preprocessor meta-code implementation). The APR is a popular library with a hierarchical pool-based strategy, but I am not dealing with sessions at this low level, so I don't need a hierarchy.
I don't think I am pre-optimizing, because I am convinced I need it. My current implementation of this system jumps to 30% CPU during web-server CPU saturation, and I know this is because of my naive way of programming the application, using new and a lot of copy construction in C++. I will of course benchmark my application before I proceed (I am too busy at the moment to re-build a test environment), but any discussion here will be useful regardless of the outcome.
--- edit 1 ---
This implementation was inspired by some comments from Markr; check the answer below for some discussion. A bit more thinking led to the following.
OK, I am thinking about a circular buffer now (say 1 MB^x), composed of equal-sized elements (perhaps some multiple of 128 bytes). Allocations happen incrementally towards the end, with an allocate-head (A) marking where the next allocation may occur, and a free-marker (F) marking the contiguous free regions (0) being returned to the pool; a rough code sketch follows the list of specifics below. Consider the following ASCII illustration:
|000000F101110001111A00000000|
0 = free, 1 = occupied.
Some specifics are:
- The free-marker delimits the contiguous free region. In the illustration, F will only move forward (by two) once the chunk immediately following it is returned.
- Once the allocation reaches the end, or there isn't enough space at the end to perform a contiguous allocation, the allocate-head moves to the beginning; most of the buffer should be free by this point.
- Once a message is moved to its destination, its chunks are immediately returned to the buffer.
- For same-machine delivery the messages will be delivered in short order (with some blocking), or immediately if it just requires a thread-free function call.
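For concreteness, here is a minimal single-threaded sketch of that circular-buffer bookkeeping. The 1 MB / 128-byte numbers, the names, and the used[] occupancy array are all my own assumptions standing in for the real bookkeeping:

#include <stddef.h>

#define CHUNK   128
#define NCHUNKS 8192                 /* 8192 * 128 bytes = a 1 MB ring */

static unsigned char ring[NCHUNKS * CHUNK];
static unsigned char used[NCHUNKS];  /* 1 = occupied, 0 = free */
static size_t A = 0;                 /* allocate-head (chunk index) */
static size_t F = 0;                 /* free-marker (chunk index) */

/* Allocate n contiguous chunks, wrapping to the start when the tail is
   too short; returns NULL if the run would overlap occupied chunks. */
static void * ring_alloc(size_t n)
{
    size_t start = A, i;
    if (start + n > NCHUNKS)
        start = 0;                   /* wrap: most of the buffer should be free */
    for (i = 0; i < n; ++i)
        if (used[start + i])
            return NULL;             /* allocation has caught up with F */
    for (i = 0; i < n; ++i)
        used[start + i] = 1;
    A = (start + n) % NCHUNKS;
    return &ring[start * CHUNK];
}

/* Return n chunks, then advance F across the now-contiguous free run. */
static void ring_free(void * ptr, size_t n)
{
    size_t idx = (size_t)((unsigned char *)ptr - ring) / CHUNK, i;
    for (i = 0; i < n; ++i)
        used[idx + i] = 0;
    while (F != A && !used[F])
        F = (F + 1) % NCHUNKS;       /* F only moves over freed chunks */
}

Given the delivery pattern above, F should rarely be far behind A, so the wrap at the end should almost always land on free space.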
Comments would be nice, and I would still prefer some self-contained, portable library if anyone knows of any that does something similar. The GLib slice allocator looked nice, but I believe the above implementation is simpler and probably faster, and active regions of the buffer will always have good locality.