I use a mostly O(1) Small Block Memory Manager (SBMM). Basically it works this way:
1) It allocates larger SuperBlocks from the OS and tracks the Start+End Addresses as a range. The size of the SuperBlock is adjustable but 1MB makes a pretty good size.
2) The SuperBlocks are broken into Blocks (also adjustable in size... 4K-64K is good depending on your app). Each of these Blocks handles allocations of a specific size and stores all the items in the Block as a singly linked list. When you allocate a SuperBlock, you make a linked list of Free Blocks.
3) Allocating an Item means A) Checking to see if there is a Block with Free Items handling that size - and if not, allocating a new Block from the SuperBlocks. B) Removing the Item from the Block's Free List.
4) Freeing an Item by address means A) Finding SuperBlock containing address(*) B) Finding Block in SuperBlock (substract SuperBlock start address and divide by Block size) C) Pushing Item back on the Block's Free Item list.
As I stated, this SBMM is very fast as it runs with O(1) performance(*). In the version I have implemented, I use an AtomicSList (similar to SLIST in Windows) so that it is not only O(1) performance but also ThreadSafe and LockFree in the implementation. You could actually implement the algorithm using Win32 SLIST if you wanted to.
Interestingly, the algorithm for allocating Blocks from the SuperBlocks or Items from the Blocks result in nearly identicaly code (they're both O(1) allocations off a Free List).
(*) The SuperBlocks are arranged in a rangemap with O(1) average performance (but a potential O(Lg N) for worstcase where N is the number of SuperBlocks). The width of the rangemap depends on knowing roughly how much memory you're going to need in order to get the O(1) performance. If you overshoot, you'll waste a bit of memory but still get O(1) performance. If you undershoot, you'll approach O(Lg N) performance but the N is for the SuperBlock count -- not the item count. Since the SuperBlock count is very low compared to the Item count (by about 20 binary orders of magnitude in my code), it is not as critical as the rest of the allocator.