It's known that calloc differs from malloc in that it initializes the allocated memory: with calloc, the memory is set to zero, while with malloc it is left uninitialized.
So in everyday work, I regard calloc as malloc+memset.
Incidentally, for fun, I wrote the following code as a benchmark. The results are confusing.

Code 1:

#include <stdio.h>
#include <stdlib.h>
#define BLOCK_SIZE (1024*1024*256)
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)calloc(1,BLOCK_SIZE);
                i++;
        }
        return 0;
}

time ./a.out
real 0m0.287s
user 0m0.095s
sys 0m0.192s

Code 2:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define BLOCK_SIZE (1024*1024*256)
int main()
{
        int i=0;
        char *buf[10];
        while(i<10)
        {
                buf[i] = (char*)malloc(BLOCK_SIZE);
                memset(buf[i],'\0',BLOCK_SIZE);
                i++;
        }
        return 0;
}

time ./a.out
real 0m2.693s
user 0m0.973s
sys 0m1.721s

Replacing memset with bzero(buf[i], BLOCK_SIZE) in Code 2 produces a similar result.
My question is: why is malloc+memset so much slower than calloc? How does calloc manage that? Thanks!

A: 

Optimisation behind the scenes??


The memset() call would be slowing down your numbers.

Try profiling a "code 3" containing only the following, to get the idea:

    while(i<10)
    {
            buf[i] = (char*)malloc(BLOCK_SIZE);
            i++;
    }

While memset() can be used to set memory to zero, zero-filling is just one special case of what memset() does, and it is not a case memset() is specifically optimised for.

calloc() is designed to do exactly that: allocate memory initialized to 0x00.


So, what does memset do inside?
Here is a snippet from MS c-run-time codes:

void * __cdecl memset (void *dst, int val, size_t count)
{
    void *start = dst;

    /* On some targets the CRT delegates to an OS routine... */
    extern void RtlFillMemory( void *, size_t count, char );
    RtlFillMemory( dst, count, (char)val );

    /* ...on others it falls back to a byte-at-a-time loop.
       (In the original source these two paths are selected by
       preprocessor conditionals; as pasted here, both would run.) */
    while (count--)
    {
        *(char *)dst = (char)val;
        dst = (char *)dst + 1;
    }

    return(start);
}
  • In conclusion, memset initializes a block of memory byte by byte.
CVS-2600Hertz
That's very unusual, as far as memset implementations go. The few implementations I've seen are heavily optimized assembly.
Dietrich Epp
@Dietrich: hmmm... there's a detailed discussion here: http://www.velocityreviews.com/forums/t486007-memset-is-faster-than-simple-loop.html Just followed it...
CVS-2600Hertz
+1  A: 

Because on many systems, in spare processing time, the OS goes around setting free memory to zero on its own and marking it safe for calloc(), so when you call calloc(), it may already have free, zeroed memory to give you.

Chris Lutz
Are you sure? Which systems do this? I thought that most OSs just shut down the processor when they were idle, and zeroed memory on demand for the processes that allocated as soon as they write to that memory (but not when they allocate it).
Dietrich Epp
@Dietrich - Not sure. I heard it once and it seemed like a reasonable (and reasonably simple) way to make `calloc()` more efficient.
Chris Lutz
I think you're probably wrong, but i'm not sure.
Pierreten
@Pierreten - I can't find any good info on `calloc()`-specific optimizations and I don't feel like interpreting libc source code for the OP. Can you look up anything to show that this optimization doesn't exist / doesn't work?
Chris Lutz
@Chris: The code you're looking for is not in the libc code but in the kernel. Libc just calls sbrk or mmap, and it knows that the kernel only hands out zeroed memory from those two syscalls, so it doesn't zero that memory again.
Dietrich Epp
A: 

On some platforms, in some modes, malloc initialises the memory to some typically non-zero value before returning it, so the second version could well initialize the memory twice.

Stewart
+19  A: 

This is probably due to your large allocation size. You might want to read up on how virtual memory works and OS theory.

When you allocate a large enough region of memory (the threshold is often 1 MiB if memory serves), most allocators will get a new region of memory from the kernel using "mmap" just for that region. However, when "mmap" gives you new pages of memory, they have to be initialized to zero (when using MAP_ANONYMOUS). If they weren't, they'd be filled with all sorts of junk from other applications — and this is a serious security vulnerability. What if root was editing /etc/shadow with those pages? The same also applies if "malloc" runs out of memory for small allocations and calls "sbrk" to get more.

But it would take too long to zero all that memory. The kernel cheats. There is a page of memory already zeroed set aside. All pages in the new allocation point at this one page of physical ram, which is shared among all processes on the system, so it doesn't actually use any memory. It's marked as read-only. As soon as you write to it, the processor raises an exception. The kernel's exception handler then grabs a page of RAM (possibly swapping something else out), fills it with zeroes, and maps it into your process's address space. The "calloc" function can exploit this.

(Actually, the kernel can go a step further, and have "mmap" do nothing to your process's memory until you read from it.)

The "memset" implementation touches every page in the allocation, resulting in much higher memory usage — it forces the kernel to allocate those pages now, instead of waiting until you actually use them.

The "calloc" implementation just changes a few page tables, consumes very little actual memory, writes to very little memory, and returns. On most systems you can even allocate more memory than your system can support (more than RAM + swap) without problems, as long as you don't write to all of it. (This feature is slightly controversial on the operating systems that allow this.)

Some systems do not support virtual memory: very old ones (think 80286) and some embedded systems. On these systems, the speeds might be much closer.


There are a few guesses in other answers that hypothesize that "memset" is slower than "calloc" because "memset" can't assume the memory is aligned. Here is how a typical "memset" implementation works:

function memset(dest, c, len)
    // one byte at a time, until the dest is aligned...
    while (len > 0 && ((unsigned int)dest & 15))
        *dest++ = c
        len -= 1
    // now write big chunks at a time (processor-specific)...
    // block size might not be 16, it's just pseudocode
    while (len >= 16)
        // some optimized vector code goes here
        // glibc uses SSE2 when available
        dest += 16
        len -= 16
    // the end is not aligned, so one byte at a time
    while (len > 0)
        *dest++ = c
        len -= 1

In a 256 MiB chunk, that first and last loop are going to be negligible, and the middle loop is the same as the hypothetical calloc loop. Some compilers inline "memset", and can even infer that the result of "malloc" is an aligned block. And a typical "calloc" implementation just calls "memset" anyway — "calloc" is usually written in C, and it's often portable across operating systems.
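
A minimal portable calloc of the kind described (written in terms of malloc and memset; "my_calloc" is a hypothetical name, and real implementations on virtual-memory systems skip the memset for memory that arrives from the kernel already zeroed) might look like:

```c
/* Sketch: a portable calloc built from malloc + memset. */
#include <stdlib.h>
#include <string.h>

void *my_calloc(size_t nmemb, size_t size)
{
    /* Guard against nmemb * size overflowing -- a classic calloc duty
       that plain malloc(nmemb * size) would silently get wrong. */
    if (size != 0 && nmemb > (size_t)-1 / size)
        return NULL;

    void *p = malloc(nmemb * size);
    if (p != NULL)
        memset(p, 0, nmemb * size);
    return p;
}
```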

The other guess I saw was that "malloc" already initializes the memory, so "memset" initializes it twice. This is technically true in this case. However, it would only account for about a factor of two in speed. The "calloc" version is ten to fifteen hundred times faster. The numbers do not support the conclusion.


Footnote: Just for giggles, I timed the two programs on two of my computers. On my OS X / PowerPC box, the memset version was over 1500x slower (8 s versus 5 ms). On my Linux / x86 box, the memset version ran for 35x as long before segfaulting (expected, that computer has less RAM — note though that the calloc version didn't crash).

Dietrich Epp
@Dietrich: the virtual-memory explanation about the OS mapping the same zero-filled page many times for calloc is easy to check. Just add a loop that writes junk data into every allocated memory page (writing one byte every 500 bytes should be enough). The overall results should then become much closer, as the system would be forced to really allocate different pages in both cases.
kriss
@kriss: indeed, although one byte every 4096 is sufficient on the vast majority of systems
Dietrich Epp
Good to see someone with real OS knowledge answering this question.
rjh
+2  A: 

Probably you will find that calloc() knows the space it allocates is a multiple of 8 bytes, aligned on an 8-byte boundary, so it can use a suitably wide integer store to zero many bytes at a time, whereas memset() is more constrained: it can be called to set odd numbers of bytes at odd offsets, and so on.

Jonathan Leffler
Incorrect. I've looked at memset implementations, and I can assure you that they do not suffer from performance penalties when writing to non-aligned regions: they check for alignment and handle the beginning/end of the region specially. For a 256 meg block, the difference would be so small you couldn't measure it.
Dietrich Epp
@Dietrich - Assuming `memset()` is implemented well. Which should be a reasonable assumption.
Chris Lutz