Aren't misaligned pointers supposed to slow down performance in the best case, and crash your program in the worst case (assuming the compiler was nice enough to compile your invalid C program)?

Well, the following code doesn't seem to show any performance difference between the aligned and misaligned versions. Why is that?

/* brutality.c */

#ifdef BRUTALITY
    xs = (unsigned long *) ((unsigned char *) xs + 1);
#endif


/* main.c */

#include <stdio.h>
#include <stdlib.h>

#define size_t_max ((size_t)-1)
#define max_count(var) (size_t_max / (sizeof var))

int main(int argc, char *argv[]) {

    unsigned long sum, *xs, *itr, *xs_end;
    size_t element_count = max_count(*xs) >> 4;

    xs = malloc(element_count * (sizeof *xs));
    if(!xs) exit(1);

    xs_end = xs + element_count - 1; sum = 0;

    for(itr = xs; itr < xs_end; itr++)
        *itr = 0;

#include "brutality.c"

    itr = xs;
    while(itr < xs_end)
        sum += *itr++;

    printf("%lu\n", sum);

    /* we could free the malloc-ed memory here */
    /* but we are almost done                  */

    return 0;
}

Compiled and tested on two separate machines using

gcc -pedantic -Wall -O0 -std=c99 main.c
for i in {0..9}; do time ./a.out; done

x86 or x64?

Misaligned pointers were a killer on x86, whereas 64-bit architectures are not nearly as prone to the crash, or even to slow performance at all.

Tested on "AMD Athlon(tm) XP 2700+" and on an "Intel(R) Xeon(R) CPU 5150 @ 2.66GHz" ... but really if you have something different feel free to post your results.
Elite Mx
+2  A: 

The x86 architecture has always been able to handle misaligned accesses, so you'll never get a crash. Other processors might not be as lucky.

You're probably not seeing any time difference because the loop is memory-bound; it can only run as fast as data can be fetched from RAM. You might think that the misalignment will cause the RAM to be accessed twice, but the first access puts it into cache, and the second access can be overlapped with getting the next value from RAM.

Mark Ransom
This seems like the most sensible answer. In that case, which processors don't cache the overlapping memory addresses, and when would there be any relevant performance difference? Either that or my test sucks.
Elite Mx
@EliteMx: "when would there be any relevant performance difference?" If your application accesses a really large amount of memory, the redundantly fetched data (and code too!) simply decreases the effective cache size. Frankly, you had better read the manual for your CPU. I presume you are on Intel - IA-32 or EM64T - and Intel has rather excellent documentation on that.

It is probably because malloc of that many bytes is returning NULL. At least that's what it does for me.

Geoff Reedy
My code gets to the printf statement (i.e. malloc succeeds) ... you could try shifting by 5, 6 or 7 bits instead of 4.
Elite Mx

You never defined BRUTALITY in your posted code. Are you sure you are testing in 'brutal' mode?

R Samuel Klatchko
Well, of course. If I had defined it you would say ... but you never tested without BRUTALITY mode :P
Elite Mx

Maybe in order to malloc such a huge buffer, the system is paging memory to and from disk. That could swamp small differences. Try a much smaller buffer with a large in-program loop around it.

I made the mods I've suggested here and in the comments and tested on my system (a tired, 4-year-old, 32-bit laptop). Code shown below. I do get a measurable difference, but only around 3%. I maintain my changes are a success because your question indicates you got no difference at all, correct?

Sorry, I am using Windows and used the Windows-specific GetTickCount() API, which I am familiar with because I often do timing tests and enjoy the simplicity of that misnamed API (it actually returns milliseconds since system start).

/* main.cpp */

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>


int main(int argc, char *argv[]) {
    unsigned long i, begin, end;
    unsigned long sum, *xs, *itr, *xs_begin, *xs_end;
    size_t element_count = 100000;

    xs = (unsigned long *)malloc(element_count * (sizeof *xs));
    if(!xs) exit(1);
    xs_end = xs + element_count - 1;
    #ifdef BRUTALITY
    xs_begin = (unsigned long *) ((unsigned char *) xs + 1);
    #else
    xs_begin = xs;
    #endif

    begin = GetTickCount();
    for( i=0; i<50000; i++ ) {
        for(itr = xs_begin; itr < xs_end; itr++)
            *itr = 0;

        sum = 0;
        itr = xs_begin;
        while(itr < xs_end)
            sum += *itr++;
    }
    end = GetTickCount();

    printf("sum=%lu elapsed time=%lumS\n", sum, end-begin );
    return 0;
}

Bill Forster
+1  A: 

I tested this some time in the past on Win32 machines and did not notice much of a penalty on 32-bit machines. On 64-bit, though, it was significantly slower. For example, I ran the following bit of code. On a 32-bit machine, the times printed were hardly changed. But on a 64-bit machine, the times for the misaligned accesses were nearly twice as long. The times follow the code.

#include <stdio.h>
#include <stdlib.h>
#include <windows.h>

#ifdef _WIN64
#define UINT unsigned __int64
#define ENDPART QuadPart
#else
#define UINT unsigned int
#define ENDPART LowPart
#endif

int main(int argc, char *argv[])
{
   LARGE_INTEGER startCount, endCount, freq;
   int i;
   int offset;
   int iters = atoi(argv[1]);
   char *p = (char*)malloc(16);
   double *d;

   QueryPerformanceFrequency(&freq);
   for ( offset = 0; offset < 9; offset++ )
   {
      d = (double*)( p + offset );
      printf( "Address alignment = %u\n", (unsigned int)d % 8 );
      *d = 0;
      QueryPerformanceCounter(&startCount);
      for(i = 0; i < iters; ++i)
         *d = *d + 1.234;
      QueryPerformanceCounter(&endCount);

      printf( "Time:  %lf\n",
             (double)(endCount.ENDPART-startCount.ENDPART)/freq.ENDPART );
   }
   return 0;
}
Here are the results on a 64-bit machine. I compiled the code as a 32-bit application.

[P:\t]pointeralignment.exe 100000000
Address alignment = 0
Time:  0.484156
Address alignment = 1
Time:  0.861444
Address alignment = 2
Time:  0.859656
Address alignment = 3
Time:  0.861639
Address alignment = 4
Time:  0.860234
Address alignment = 5
Time:  0.861539
Address alignment = 6
Time:  0.860555
Address alignment = 7
Time:  0.859800
Address alignment = 0
Time:  0.484898
Mark Wilkins
Brilliant. This inspires me to test out more machines, including virtual hosts and clients of different architectures.
Elite Mx
+1  A: 

You're assuming either x86 or x64 architectures. On MIPS, for example, your code may result in a SIGBUS (bus error) signal being raised. On other architectures, non-aligned accesses will typically be slower than aligned accesses, although it is very much architecture dependent.

Nathan Ernst
Nope. I'm not assuming anything. Read the first sentence I wrote. In fact seeing it in action (the bus fault, the slow access) would make me very happy. I am also going to test out the results on an ARM CPU [N900].
Elite Mx