This is probably due to your large allocation size. You might want to read up on how virtual memory works and OS theory.
When you allocate a large enough region of memory (the threshold is often around 1 MiB, if memory serves), most allocators will get a new region of memory from the kernel using "mmap" just for that allocation. However, when "mmap" gives you new pages of memory (with MAP_ANONYMOUS), they have to be initialized to zero. If they weren't, they'd be filled with all sorts of junk left over from other applications, and that is a serious security vulnerability: what if root was editing /etc/shadow with those pages? The same applies when "malloc" runs out of memory for small allocations and calls "sbrk" to get more.
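Here is a minimal sketch, assuming a POSIX-ish system with MAP_ANONYMOUS, that asks the kernel for anonymous pages directly and checks that they come back zero-filled (the 1 MiB size is just an illustration):

/* Minimal sketch: request anonymous pages straight from the kernel and
 * verify they arrive zero-filled. Assumes MAP_ANONYMOUS is available;
 * error handling is bare-bones. */
#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 1 << 20;  /* 1 MiB, typically at or above the mmap threshold */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {  /* never fires: anonymous pages come back zeroed */
            printf("found junk at offset %zu\n", i);
            return 1;
        }
    }
    puts("every byte is zero");
    munmap(p, len);
    return 0;
}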
But it would take too long to zero all that memory, so the kernel cheats: it keeps a single page of physical RAM, already zeroed, set aside. Every page in the new allocation is mapped to that one page, which is shared among all processes on the system, so the allocation doesn't actually use any memory yet. The mapping is marked read-only. As soon as you write to it, the processor raises an exception; the kernel's exception handler then grabs a page of RAM (possibly swapping something else out), fills it with zeroes, and maps it into your process's address space. The "calloc" function can exploit this.
(Actually, the kernel can go a step further, and have "mmap" do nothing to your process's memory until you read from it.)
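To watch this happen, here is a rough sketch. It is Linux-specific because it reads the resident-set size from /proc/self/statm, and the 256 MiB size and the resident_pages helper are my own choices for illustration: the resident set stays tiny after the mapping is created and after reads (which only hit the shared zero page), and only balloons once the pages are written.

/* Rough sketch, Linux-only: watch the resident-set size as pages are
 * faulted in. /proc/self/statm reports sizes in pages; the second
 * field is the resident set. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

static long resident_pages(void)
{
    long size = 0, resident = 0;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f) {
        fscanf(f, "%ld %ld", &size, &resident);
        fclose(f);
    }
    return resident;
}

int main(void)
{
    size_t len = 256 * 1024 * 1024;  /* 256 MiB */
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("after mmap:   %ld resident pages\n", resident_pages());
    /* Reading only hits the shared zero page, so the resident set stays small. */
    volatile unsigned char sink = p[0] + p[len - 1];
    (void)sink;
    printf("after reads:  %ld resident pages\n", resident_pages());
    /* Writing forces the kernel to hand out real, zeroed pages. */
    memset(p, 1, len);
    printf("after memset: %ld resident pages\n", resident_pages());

    munmap(p, len);
    return 0;
}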
The "memset" implementation touches every page in the allocation, resulting in much higher memory usage — it forces the kernel to allocate those pages now, instead of waiting until you actually use them.
The "calloc" implementation just changes a few page tables, consumes very little actual memory, writes to very little memory, and returns. On most systems you can even allocate more memory than your system can support (more than RAM + swap) without problems, as long as you don't write to all of it. (This feature is slightly controversial on the operating systems that allow this.)
Some systems do not support virtual memory: very old ones (think 80286) and some embedded systems. On these systems, the speeds might be much closer.
A few other answers guess that "memset" is slower than "calloc" because "memset" can't assume the memory is aligned. Here is how a typical "memset" implementation works:
function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((uintptr_t)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1
In a 256 MiB chunk, those first and last loops are going to be negligible, and the middle loop is the same as the hypothetical "calloc" loop. Some compilers inline "memset", and can even infer that the result of "malloc" is an aligned block. And a typical "calloc" implementation just calls "memset" anyway; "calloc" is usually written in portable C as part of the C library.
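For what it's worth, a naive "calloc" built on top of "malloc" and "memset" looks something like the sketch below (my_calloc is just an illustrative name). Real C libraries additionally notice when a block came fresh from the kernel, already zeroed, and skip the "memset" entirely, which is exactly the trick described above:

/* Sketch of a naive calloc built on malloc + memset. Real C libraries
 * also track whether the block came straight from the kernel (already
 * zeroed) and skip the memset in that case. */
#include <stdlib.h>
#include <string.h>

void *my_calloc(size_t nmemb, size_t size)
{
    /* Guard against nmemb * size overflowing size_t. */
    if (size != 0 && nmemb > (size_t)-1 / size)
        return NULL;

    size_t total = nmemb * size;
    void *p = malloc(total);
    if (p != NULL)
        memset(p, 0, total);
    return p;
}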
The other guess I saw was that "malloc" already initializes the memory, so the "memset" version initializes it twice. That is technically true here, but it would only account for about a factor of two in speed, and the "calloc" version is anywhere from ten to fifteen hundred times faster. The numbers do not support that conclusion.
Footnote: Just for giggles, I timed the two programs on two of my computers. On my OS X / PowerPC box, the memset version was over 1500x slower (8 s versus 5 ms). On my Linux / x86 box, the memset version ran 35x as long before segfaulting (expected, since that computer has less RAM; note, though, that the calloc version didn't crash).