Aren't misaligned pointers supposed to slow down performance in the best case, and crash your program in the worst case (assuming the compiler was nice enough to compile your invalid C program)?

Well, the following code doesn't seem to show any performance difference between the aligned and misaligned versions. Why is that?

/* brutality.c */

#ifdef BRUTALITY
    xs = (unsigned long *) ((unsigned char *) xs + 1);
#endif


/* main.c */

#include <stdio.h>
#include <stdlib.h>

#define size_t_max ((size_t)-1)
#define max_count(var) (size_t_max / (sizeof var))

int main(int argc, char *argv[]) {

    unsigned long sum, *xs, *itr, *xs_end;
    size_t element_count = max_count(*xs) >> 4;

    xs = malloc(element_count * (sizeof *xs));
    if(!xs) exit(1);

    xs_end = xs + element_count - 1; sum = 0;

    for(itr = xs; itr < xs_end; itr++)
        *itr = 0;

#include "brutality.c"

    itr = xs;
    while(itr < xs_end)
        sum += *itr++;

    printf("%lu\n", sum);

    /* we could free the malloc-ed memory here */
    /* but we are almost done                  */

    return 0;
}

Compiled and tested on two separate machines using

gcc -pedantic -Wall -O0 -std=c99 main.c
for i in {0..9}; do time ./a.out; done

x86 or x64?

Misaligned pointers were a killer on x86, whereas 64-bit architectures are not nearly as prone to the crash, or even to slow performance at all.

Tested on "AMD Athlon(tm) XP 2700+" and on an "Intel(R) Xeon(R) CPU 5150 @ 2.66GHz" ... but really if you have something different feel free to post your results.
Elite Mx
+2  A: 

The x86 architecture has always been able to handle misaligned accesses, so you'll never get a crash. Other processors might not be as lucky.

You're probably not seeing any time difference because the loop is memory-bound; it can only run as fast as data can be fetched from RAM. You might think that the misalignment will cause the RAM to be accessed twice, but the first access puts it into cache, and the second access can be overlapped with getting the next value from RAM.

Mark Ransom
This seems like the most sensible answer. In that case, which processors don't cache the overlapping memory addresses, and when would there be any relevant performance difference? Either that or my test sucks.
Elite Mx
@EliteMx: "when would there be any relevant performance difference?" If your application accesses a really large amount of memory, the redundantly fetched data (and code too!) simply decreases the effective cache size. Frankly, you had better read the manual for your CPU. I presume you are on Intel - IA-32 or EM64T - and Intel has rather excellent documentation on that.

It is probably because malloc of that many bytes is returning NULL. At least that's what it does for me.

Geoff Reedy
My code gets to the printf statement (i.e. malloc succeeds) ... you could try shifting by 5, 6 or 7 bits instead of 4.
Elite Mx

You never defined BRUTALITY in your posted code. Are you sure you are testing in 'brutal' mode?

R Samuel Klatchko
Well, of course. If I had defined it you would say ... but you never tested without BRUTALITY mode :P
Elite Mx

Maybe in order to malloc such a huge buffer, the system is paging memory to and from disk. That could swamp small differences. Try a much smaller buffer with a large in-program loop around it.

I made the mods I've suggested here and in the comments and tested on my system (a tired, 4-year-old, 32-bit laptop). Code shown below. I do get a measurable difference, but only around 3%. I maintain my changes are a success because your question indicates you got no difference at all, correct?

Sorry, I am using Windows and used the Windows-specific GetTickCount() API, which I am familiar with because I often do timing tests and enjoy the simplicity of that misnamed API (it actually returns milliseconds since system start).

/* main.cpp */

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>


int main(int argc, char *argv[]) {
    unsigned long i, begin, end;
    unsigned long sum, *xs, *itr, *xs_begin, *xs_end;
    size_t element_count = 100000;

    xs = (unsigned long *)malloc(element_count * (sizeof *xs));
    if(!xs) exit(1);
    xs_end = xs + element_count - 1;
    #ifdef BRUTALITY
    xs_begin = (unsigned long *) ((unsigned char *) xs + 1);
    #else
    xs_begin = xs;
    #endif

    begin = GetTickCount();
    for( i=0; i<50000; i++ ) {
        for(itr = xs_begin; itr < xs_end; itr++)
            *itr = 0;

        sum = 0;
        itr = xs_begin;
        while(itr < xs_end)
            sum += *itr++;
    }
    end = GetTickCount();

    printf("sum=%lu elapsed time=%lumS\n", sum, end-begin );
    return 0;
}

Bill Forster
+1  A: 

I tested this some time in the past on Win32 machines and did not notice much of a penalty on 32-bit machines. On 64-bit, though, it was significantly slower. For example, I ran the following bit of code. On a 32-bit machine, the times printed were hardly changed. But on a 64-bit machine, the times for the misaligned accesses were nearly twice as long. The times follow the code.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

int main(int argc, char *argv[])
{
   LARGE_INTEGER startCount, endCount, freq;
   int i;
   int offset;
   int iters = atoi(argv[1]);
   char *p = (char*)malloc(16);
   double *d;

   QueryPerformanceFrequency(&freq);
   for ( offset = 0; offset < 9; offset++ )
   {
      d = (double*)( p + offset );
      printf( "Address alignment = %u\n", (unsigned int)d % 8 );
      *d = 0;
      QueryPerformanceCounter(&startCount);
      for(i = 0; i < iters; ++i)
         *d = *d + 1.234;
      QueryPerformanceCounter(&endCount);

      printf( "Time:  %lf\n",
             (double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART );
   }
   return 0;
}
Here are the results on a 64-bit machine. I compiled the code as a 32-bit application.

[P:\t]pointeralignment.exe 100000000
Address alignment = 0
Time:  0.484156
Address alignment = 1
Time:  0.861444
Address alignment = 2
Time:  0.859656
Address alignment = 3
Time:  0.861639
Address alignment = 4
Time:  0.860234
Address alignment = 5
Time:  0.861539
Address alignment = 6
Time:  0.860555
Address alignment = 7
Time:  0.859800
Address alignment = 0
Time:  0.484898
Mark Wilkins
Brilliant. This inspires me to test out more machines, including virtual hosts and clients of different architectures.
Elite Mx
+1  A: 

You're assuming either x86 or x64 architectures. On MIPS, for example, your code may result in a SIGBUS (bus error) signal being raised. On other architectures, non-aligned accesses will typically be slower than aligned accesses, although it is very much architecture dependent.

Nathan Ernst
Nope. I'm not assuming anything. Read the first sentence I wrote. In fact seeing it in action (the bus fault, the slow access) would make me very happy. I am also going to test out the results on an ARM CPU [N900].
Elite Mx