It's known that calloc differs from malloc in that it initializes the allocated memory: with calloc, the memory is set to zero, while with malloc it is left uninitialized.
So in everyday work, I regard calloc as malloc+memset.
Incidentally, for fun, I wrote the following code as a benchmark. The results are confusing.

Code 1:

#include <stdio.h>
#include <stdlib.h>
#define BLOCK_SIZE (1024*1024*256)
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)calloc(1,BLOCK_SIZE);
                i++;
        }
        return 0;
}

time ./a.out
real 0m0.287s
user 0m0.095s
sys 0m0.192s

Code 2:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCK_SIZE (1024*1024*256)
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)malloc(BLOCK_SIZE);
                memset(buf[i],'\0',BLOCK_SIZE);
                i++;
        }
        return 0;
}

time ./a.out
real 0m2.693s
user 0m0.973s
sys 0m1.721s

Replacing memset with bzero(buf[i], BLOCK_SIZE) in Code 2 produces a similar result.
My question is: why is malloc+memset so much slower than calloc? How does calloc manage that? Thanks!

A: 

Optimisation behind the scenes??


The memset() call would be slowing down your numbers.

Try profiling a "code 3" containing only the following, to get the idea:

    while(i<10)
    {
            buf[i] = (char*)malloc(BLOCK_SIZE);
            i++;
    }

While memset() can be used to set memory to zero, zero-filling is just one special case of what memset() does, and it is not a case memset() is specifically optimised for.

calloc() is designed to do exactly that: allocate memory initialized to 0x00.


So, what does memset do inside?
Here is a snippet from MS c-run-time codes:

void * __cdecl memset (void *dst, int val, size_t count)
{
    void *start = dst;

    /* On some targets the CRT delegates to an OS routine... */
    extern void RtlFillMemory( void *, size_t count, char );
    RtlFillMemory( dst, count, (char)val );

    /* ...on others it falls back to a byte-at-a-time loop.
       (In the original source these two paths are selected by
       preprocessor conditionals; as pasted here, both would run.) */
    while (count--)
    {
        *(char *)dst = (char)val;
        dst = (char *)dst + 1;
    }

    return(start);
}
  • In conclusion, memset initializes a block of memory byte by byte.
CVS-2600Hertz
That's very unusual, as far as memset implementations go. The few implementations I've seen are heavily optimized assembly.
Dietrich Epp
@Dietrich: hmmm... there's a detailed discussion here: http://www.velocityreviews.com/forums/t486007-memset-is-faster-than-simple-loop.html Just followed it...
CVS-2600Hertz
+1  A: 

Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.

Chris Lutz
Are you sure? Which systems do this? I thought that most OSs just shut down the processor when they were idle, and zeroed memory on demand for the processes that allocated as soon as they write to that memory (but not when they allocate it).
Dietrich Epp
@Dietrich - Not sure. I heard it once and it seemed like a reasonable (and reasonably simple) way to make `calloc()` more efficient.
Chris Lutz
I think you're probably wrong, but i'm not sure.
Pierreten
@Pierreten - I can't find any good info on `calloc()`-specific optimizations and I don't feel like interpreting libc source code for the OP. Can you look up anything to show that this optimization doesn't exist / doesn't work?
Chris Lutz
@Chris: The code you're looking for is not in the libc code but in the kernel. Libc just calls sbrk or mmap, and it knows that the kernel only hands out zeroed memory from those two syscalls, so it doesn't zero that memory again.
Dietrich Epp
A: 

On some platforms, in some modes, malloc initialises the memory to some typically non-zero value before returning it, so the second version could well initialize the memory twice.

Stewart
+19  A: 

This is probably due to your large allocation size. You might want to read up on how virtual memory works and OS theory.

When you allocate a large enough region of memory (the threshold is often 1 MiB if memory serves), most allocators will get a new region of memory from the kernel using "mmap" just for that region. However, when "mmap" gives you new pages of memory, they have to be initialized to zero (when using MAP_ANONYMOUS). If they weren't, they'd be filled with all sorts of junk from other applications — and this is a serious security vulnerability. What if root was editing /etc/shadow with those pages? The same also applies if "malloc" runs out of memory for small allocations and calls "sbrk" to get more.

But it would take too long to zero all that memory. The kernel cheats. There is a page of memory already zeroed set aside. All pages in the new allocation point at this one page of physical ram, which is shared among all processes on the system, so it doesn't actually use any memory. It's marked as read-only. As soon as you write to it, the processor raises an exception. The kernel's exception handler then grabs a page of RAM (possibly swapping something else out), fills it with zeroes, and maps it into your process's address space. The "calloc" function can exploit this.

(Actually, the kernel can go a step further, and have "mmap" do nothing to your process's memory until you read from it.)

The "memset" implementation touches every page in the allocation, resulting in much higher memory usage — it forces the kernel to allocate those pages now, instead of waiting until you actually use them.

The "calloc" implementation just changes a few page tables, consumes very little actual memory, writes to very little memory, and returns. On most systems you can even allocate more memory than your system can support (more than RAM + swap) without problems, as long as you don't write to all of it. (This feature is slightly controversial on the operating systems that allow this.)

Some systems do not support virtual memory: very old ones (think 80286) and some embedded systems. On these systems, the speeds might be much closer.


There are a few guesses in other answers that hypothesize that "memset" is slower than "calloc" because "memset" can't assume the memory is aligned. Here is how a typical "memset" implementation works:

function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((unsigned int)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1

In a 256 MiB chunk, that first and last loop are going to be negligible, and the middle loop is the same as the hypothetical calloc loop. Some compilers inline "memset", and can even infer that the result of "malloc" is an aligned block. And a typical "calloc" implementation just calls "memset" anyway — "calloc" is usually written in C, and it's often portable across operating systems.
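
A minimal portable calloc of the kind described (written in terms of malloc and memset; "my_calloc" is a hypothetical name, and real implementations on virtual-memory systems skip the memset for memory that arrives from the kernel already zeroed) might look like:

```c
/* Sketch: a portable calloc built from malloc + memset. */
#include <stdlib.h>
#include <string.h>

void *my_calloc(size_t nmemb, size_t size)
{
    /* Guard against nmemb * size overflowing -- a classic calloc duty
       that plain malloc(nmemb * size) would silently get wrong. */
    if (size != 0 && nmemb > (size_t)-1 / size)
        return NULL;

    void *p = malloc(nmemb * size);
    if (p != NULL)
        memset(p, 0, nmemb * size);
    return p;
}
```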

The other guess I saw was that "malloc" already initializes the memory, so "memset" initializes it twice. This is technically true in this case. However, it would only account for about a factor of two in speed. The "calloc" version is ten to fifteen hundred times faster. The numbers do not support the conclusion.


Footnote: Just for giggles, I timed the two programs on two of my computers. On my OS X / PowerPC box, the memset version was over 1500x slower (8 s versus 5 ms). On my Linux / x86 box, the memset version ran for 35x as long before segfaulting (expected, that computer has less RAM — note though that the calloc version didn't crash).

Dietrich Epp
@Dietrich: the virtual-memory explanation about the OS mapping the same zero-filled page many times for calloc is easy to check. Just add a loop that writes junk data into every allocated memory page (writing one byte every 500 bytes should be enough). The overall results should then become much closer, as the system would be forced to really allocate different pages in both cases.
kriss
@kriss: indeed, although one byte every 4096 is sufficient on the vast majority of systems
Dietrich Epp
Good to see someone with real OS knowledge answering this question.
rjh
+2  A: 

Probably you will find that calloc() knows the space it allocates is a multiple of 8 bytes, aligned on an 8-byte boundary, so it can use a suitably wide integer store to zero many bytes at a time, whereas memset() is more constrained: it can be called to set odd numbers of bytes at odd offsets, and so on.

Jonathan Leffler
Incorrect. I've looked at memset implementations, and I can assure you that they do not suffer from performance penalties when writing to non-aligned regions: they check for alignment and handle the beginning/end of the region specially. For a 256 meg block, the difference would be so small you couldn't measure it.
Dietrich Epp
@Dietrich - Assuming `memset()` is implemented well. Which should be a reasonable assumption.
Chris Lutz