When using malloc to allocate memory, is it generally quicker to do multiple mallocs of smaller chunks of data or fewer mallocs of larger chunks of data?

For example, say you are working with an image file that has black pixels and white pixels. You are iterating through the pixels and want to save the x and y position of each black pixel in a new structure that also has pointers to the next and previous pixels' x and y values. Would it be faster to iterate through the pixels, allocating a new structure for each black pixel's x and y values along with the pointers? Or would it be faster to get a count of the number of black pixels by iterating through once, then allocating one large chunk of memory for an array of structures containing just the x and y values (no pointers), then iterating through again and saving the x and y values into that array?

I'm assuming certain platforms might differ from others as to which is faster, but what does everyone think would generally be faster?
Generally speaking, allocating larger chunks of memory fewer times will be faster. There's overhead involved each time a call to malloc() is made.
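As a rough sketch of the two approaches from the question (the struct and function names here are made up for illustration), the per-pixel version pays the malloc overhead once per black pixel, while the array version pays it once in total:

#include <cstdlib>

// Hypothetical node used by the "one malloc per black pixel" approach
struct PixelNode {
    int x, y;
    PixelNode *prev, *next;
};

// Hypothetical plain record used by the "one big malloc" approach
struct Pixel {
    int x, y;
};

// Approach A: one malloc call per black pixel, plus pointer overhead per node
PixelNode *add_node(PixelNode *tail, int x, int y)
{
    PixelNode *n = (PixelNode*)malloc(sizeof(PixelNode));
    n->x = x; n->y = y;
    n->prev = tail; n->next = NULL;
    if (tail) tail->next = n;
    return n;
}

// Approach B: count the black pixels first, then allocate the whole array in one call
Pixel *alloc_all(size_t black_count)
{
    return (Pixel*)malloc(black_count * sizeof(Pixel));
}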
It depends:
- Multiple small allocations mean multiple calls into malloc, which is slower
- There may be a special/fast implementation for small allocations.
If I cared, I'd measure it! If I really cared a lot, and couldn't guess, then I might implement both, and measure at run-time on the target machine, and adapt accordingly.
In general I'd assume that fewer is better: but there are allocation sizes and run-time library implementations such that a (sufficiently) large allocation will be delegated to the (relatively slow) O/S, whereas a (sufficiently) small allocation will be served from a (relatively quick) already-allocated heap.
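If you do want to measure it, something like the following rough sketch would do (the sizes and counts are arbitrary, and the result will vary by platform and allocator):

#include <cstdlib>
#include <cstdio>
#include <ctime>

int main()
{
    const size_t count = 1000000;

    // Many small allocations
    clock_t t0 = clock();
    void **small = (void**)malloc(count * sizeof(void*));
    for (size_t i = 0; i < count; ++i)
        small[i] = malloc(8);
    clock_t t1 = clock();

    // One large allocation of the same total size
    void *big = malloc(count * 8);
    clock_t t2 = clock();

    printf("many small: %ld ticks, one big: %ld ticks\n",
           (long)(t1 - t0), (long)(t2 - t1));

    for (size_t i = 0; i < count; ++i)
        free(small[i]);
    free(small);
    free(big);
    return 0;
}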
Allocating large blocks is more efficient; additionally, since you are using larger contiguous blocks, you have greater locality of reference, and traversing your in-memory structure once you've generated it should also be more efficient! Further, allocating large blocks should help to reduce memory fragmentation.
As already mentioned, malloc is costly, so fewer calls will probably be faster. Also, working with the pixels in one contiguous block will cause fewer cache misses on most platforms and will be faster. However, there is no guarantee on every platform.
Allocating memory is work. The amount of work done when allocating a block of memory is typically independent of the size of the block. You work it out from here.
In general malloc is expensive. It has to find an appropriate memory chunk from which to allocate memory and keep track of non-contiguous memory blocks. In several libraries you will find small memory allocators that try to minimize the impact by allocating a large block and managing the memory in the allocator.
Alexandrescu deals with the problem in 'Modern C++ Design' and in the Loki library if you want to take a look at one such library.
Besides speed issues, there is also the memory fragmentation problem.
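As a sketch of that idea (the names here are invented, and real small-object allocators like Loki's are far more complete), a fixed-size pool can allocate one big block with a single malloc and then hand out slots from a free list:

#include <cstdlib>

// Minimal fixed-size pool: one big malloc up front, slots handed out from a free list.
class FixedPool {
    union Slot { Slot *next; char payload[16]; };  // 16-byte objects, for example
    Slot *block;
    Slot *freeList;
public:
    explicit FixedPool(size_t slots) {
        block = (Slot*)malloc(slots * sizeof(Slot));
        freeList = block;
        for (size_t i = 0; i + 1 < slots; ++i)
            block[i].next = &block[i + 1];
        if (slots) block[slots - 1].next = NULL;
    }
    ~FixedPool() { free(block); }
    void *allocate() {                  // O(1), no call into malloc
        if (!freeList) return NULL;
        Slot *s = freeList;
        freeList = s->next;
        return s;
    }
    void deallocate(void *p) {          // return a slot to the free list
        Slot *s = (Slot*)p;
        s->next = freeList;
        freeList = s;
    }
};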
This question is one of pragmatism, I'm afraid; that is to say, it depends.
If you have a LOT of pixels, only a few of which are black, then counting them might be the highest cost.
If you're using C++, which your tags suggest you are, I would strongly suggest using STL, something like std::vector.
The implementation of vector, if I remember correctly, uses a pragmatic approach to allocation. There are a few heuristics for allocation strategies; an informative one is this:
#include <cstdlib>

class SampleVector {
    int N, used;
    int *data;
public:
    SampleVector() { N = 1; used = 0; data = (int*)malloc(N * sizeof(int)); }
    void push_back(int i)
    {
        if (used >= N)
        {
            // out of space: double the capacity and reallocate
            N *= 2;
            data = (int*)realloc(data, N * sizeof(int));
        }
        data[used++] = i;
    }
};
In this case, you DOUBLE the amount of memory allocated every time you realloc. This means that reallocations progressively halve in frequency.
Your STL implementation will have been well-tuned, so if you can use that, do!
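In the black-pixel case that might look something like this (the raw image buffer and the "0 means black" convention are just placeholders for whatever representation you actually have):

#include <vector>

struct Pixel { int x, y; };

// 'pixels' is assumed to be a row-major buffer where 0 means black.
std::vector<Pixel> collectBlackPixels(const unsigned char *pixels, int width, int height)
{
    std::vector<Pixel> black;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (pixels[y * width + x] == 0) {
                Pixel p = { x, y };
                black.push_back(p);   // amortized O(1): the vector grows its capacity geometrically
            }
    return black;
}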
Next to the allocation overhead itself, allocating multiple small chunks may result in lots of cache misses, whereas iterating through one contiguous block gives the cache and prefetcher a much better chance.
The scenario you describe asks for preallocation of a large block, imho.
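A rough illustration of the difference (the struct names are made up): traversing a contiguous array touches memory sequentially, while walking a linked list of individually malloc'd nodes jumps around the heap:

#include <cstddef>

struct Pixel { int x, y; };
struct PixelNode { int x, y; PixelNode *next; };

// Contiguous: the hardware prefetcher can stream through this.
long sumArray(const Pixel *p, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += p[i].x + p[i].y;
    return sum;
}

// Pointer-chasing: each node may live anywhere, so cache misses dominate.
long sumList(const PixelNode *n)
{
    long sum = 0;
    for (; n; n = n->next)
        sum += n->x + n->y;
    return sum;
}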
Although allocating large blocks is faster per byte of allocated memory, it will probably not be faster if you artificially increase the allocation size only to chop it up yourself. You are just duplicating the memory management.
It's faster not to allocate in performance-sensitive code at all. Allocate the memory you're going to need once in advance, and then use and reuse that as much as you like.
Memory allocation is a relatively slow operation in general, so don't do it more often than necessary.
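For example (the sizes here are made up), a scratch buffer allocated once and reused for every image keeps malloc out of the hot path entirely:

#include <cstdlib>

struct Pixel { int x, y; };

int main()
{
    const size_t maxPixels = 1920 * 1080;            // worst case for one frame
    Pixel *scratch = (Pixel*)malloc(maxPixels * sizeof(Pixel));

    for (int frame = 0; frame < 100; ++frame) {
        size_t used = 0;
        // ... fill scratch[used++] with the black pixels of this frame ...
        // no allocation happens here; the buffer is simply reused
        (void)used;
    }

    free(scratch);
    return 0;
}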
"I can allocate-it-all" (really, I can!)
We can philosophize about special implementations that speed up small allocations considerably ... yes! But in general this holds:
malloc must be general. It must handle all different kinds of allocations, and that is the reason it is considerably slow. It might be that you use some special super-duper library that speeds things up, but even those cannot do wonders, since they have to implement malloc in its full spectrum.
The rule is: when you have more specialized allocation code, you are always faster than the broad "I can allocate-it-all" routine malloc.
So when you are able to allocate the memory in bigger blocks in your code (and it does not cost you too much), you can speed things up considerably. Also - as mentioned by others - you will get a lot less memory fragmentation, which also speeds things up and can cost less memory. You must also see that malloc needs additional bookkeeping memory for every chunk it returns to you (yes, special routines can reduce this ... but you don't know what they really do unless you implemented them yourself or bought some wonder-library).
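A sketch of such specialized allocation code (names invented, error handling and alignment ignored for brevity): grab one big block up front and hand out pieces by bumping an offset, so each "allocation" is just an addition:

#include <cstdlib>

// One big malloc up front; each allocation just advances an offset.
// No per-chunk bookkeeping, no free list; everything is released at once.
class Arena {
    char *base;
    size_t size, offset;
public:
    explicit Arena(size_t bytes) : size(bytes), offset(0) {
        base = (char*)malloc(bytes);
    }
    ~Arena() { free(base); }
    void *allocate(size_t bytes) {
        if (offset + bytes > size) return NULL;   // out of arena space
        void *p = base + offset;
        offset += bytes;
        return p;
    }
    void reset() { offset = 0; }   // "free" everything in one go
};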
Another point to consider is how this interacts with threading. Using malloc many times in a threaded concurrent application is a major drag on performance. In that environment you are better off with a scalable allocator like the one used in Intel's Thread Building Blocks or Hoard. The major limitation of many malloc implementations is a single global lock that all the threads contend for. It can be so bad that adding another thread dramatically slows down your application.
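If TBB is an option, its scalable allocator plugs into standard containers as a drop-in allocator; a minimal sketch, assuming the tbb/scalable_allocator.h header from Intel's Threading Building Blocks is available on your system:

#include <vector>
#include <tbb/scalable_allocator.h>

struct Pixel { int x, y; };

// Allocations are served from per-thread pools instead of contending
// on one global malloc lock.
typedef std::vector<Pixel, tbb::scalable_allocator<Pixel> > PixelList;

int main()
{
    PixelList black;
    Pixel p = { 10, 20 };
    black.push_back(p);
    return 0;
}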
Do an iteration over the pixels to count the number of them to be stored. Then allocate an array for the exact number of items. This is the most efficient solution.
You can use std::vector for easier memory management (see the std::vector::reserve member function). Note: reserve will probably allocate a little (possibly up to 2 times) more memory than necessary.
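Something along these lines (the raw image buffer and the "0 means black" test are placeholders):

#include <vector>

struct Pixel { int x, y; };

std::vector<Pixel> collect(const unsigned char *img, int w, int h)
{
    // First pass: count the black pixels.
    size_t count = 0;
    for (int i = 0; i < w * h; ++i)
        if (img[i] == 0) ++count;

    // Single allocation of exactly the required capacity.
    std::vector<Pixel> black;
    black.reserve(count);

    // Second pass: store the coordinates; no reallocation happens here.
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            if (img[y * w + x] == 0) {
                Pixel p = { x, y };
                black.push_back(p);
            }
    return black;
}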