views: 179
answers: 4

Hi All,

I am trying to figure out how many clock cycles or total instructions it takes to access a value through a pointer in C. I don't think I know how to work it out for, say, p->x = d->a + f->b.

I would assume two loads per pointer: a load for the pointer itself, and a load for the value. So in this operation, the pointer resolution would be a much larger factor than the actual addition, as far as trying to speed this code up goes, right?

This may depend on the compiler and the target architecture, but am I on the right track?

I have seen some code where each value used in, say, 3 additions came from a

 f2->sum = p1->p2->p3->x + p1->p2->p3->a + p1->p2->p3->m

type of structure, and I am trying to quantify how bad this is.
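
A minimal sketch of the rewrite in question, assuming p1->p2->p3 points at a struct with those members (the struct name below is invented): hoist the common chain into a local so it is chased only once.

struct leaf *q = p1->p2->p3;    /* resolve the chain of pointers once */
f2->sum = q->x + q->a + q->m;   /* then three plain loads off q */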

+7  A: 

This depends on the architecture at hand.

Some architectures can dereference memory as part of an instruction without first loading the address into a register; others can't. Some architectures have no addressing modes that compute offsets for you, and will make you load the memory address, add your offset to it, and only then dereference the resulting location. I'm sure there are more variations chip-to-chip.

Once you get past these differences, each instruction takes a varying amount of time depending on the architecture as well. To be honest, though, it's an overhead that is very, very minimal.

For your immediate question of dereferencing a chain of items, the slowness will come from the fact that locality of reference likely gets poorer the farther you go down a dereferencing chain. That means more cache misses, which means more trips to main memory (or to disk!) to get the data. Main memory is very slow compared to the CPU.
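
A rough way to see the locality point, though not a rigorous benchmark (allocator behaviour varies, so the gap may be small or large): sum a million ints stored contiguously, then the same values stored in separately malloc'd linked nodes, and compare the times. Error handling is omitted for brevity.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

struct node { struct node *next; int value; };

int main(void)
{
    enum { N = 1000000 };
    int *arr = malloc(N * sizeof *arr);      /* one contiguous block: good locality */
    struct node *head = NULL;
    for (int i = 0; i < N; i++) {
        arr[i] = i;
        struct node *n = malloc(sizeof *n);  /* one allocation per node: potentially scattered */
        n->value = i;
        n->next = head;
        head = n;
    }

    clock_t t0 = clock();
    long asum = 0;
    for (int i = 0; i < N; i++)
        asum += arr[i];                      /* sequential walk */

    clock_t t1 = clock();
    long lsum = 0;
    for (struct node *n = head; n != NULL; n = n->next)
        lsum += n->value;                    /* pointer-chasing walk */

    clock_t t2 = clock();
    printf("array sum %ld: %ld ticks; list sum %ld: %ld ticks\n",
           asum, (long)(t1 - t0), lsum, (long)(t2 - t1));
    return 0;
}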

San Jacinto
+1 for mentioning cache implications
Robert Fraser
I don't think it is minimal. In optimizing code like the above, I have seen 3-8x speedups from getting rid of the pointers and using normal array access. The problem is even worse if what the pointers point to are themselves structures.
Derek
@Derek: Well, first of all, it's only a potentially bad overhead if the code is executed constantly, in which case, unless you are thrashing the cache, repeated memory lookups should be covered by the DTLB (in the case of x86). It's still nice to use registers when possible, which is what the compiler _does_. The example in my answer shows that there can be pointer access even when assigning local variables to each other.
Longpoke
@Derek: That'll be from improving the cache locality.
Donal Fellows
@Derek: I don't buy it. On a level playing field it's literally another instruction or two, and the two are in fact usually very similar. Usually, you'll load the starting address of the array (or the value of the pointer), and from there on you're simply computing offsets with something like an index register. This happens both for pointers and for arrays accessed through pointers. If you're talking about a linked list or a tree, then that's a locality problem.
San Jacinto
@San Jacinto: I wrote a very simple program with a chain of three structs. Compiled with no optimizations, using the MIPS compiler's -S option, the assembly for p.something->nextthing->value came out as three loads, every time it was accessed. The compiler did not keep nextthing in a register and access the values from there.
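
A hedged reconstruction of that test (the struct names below are invented; the member names come from the comment):

struct inner  { int value; };
struct middle { struct inner *nextthing; };
struct outer  { struct middle *something; };

int get(struct outer p)
{
    /* unoptimized: load p.something, load ->nextthing, load ->value,
       i.e. three dependent loads on every access */
    return p.something->nextthing->value;
}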
Derek
@Derek: First, you've got to use the optimizations or the test is completely worthless. Secondly, your test tells us nothing. If you simply new'd or malloc'd memory for a multi-dimensional array piece by piece, then yes, of course it's going to be slower. That is a cache-locality problem: the memory at the 3rd (or 2nd, for that matter) level is probably located nowhere near the prior one, whereas an array implementation has all of the memory in the same locality (or else it wouldn't be an array!).
San Jacinto
@Derek: Furthermore, I don't know the MIPS architecture, but I'm TELLING you, the mechanisms for array access and pointer access are essentially the same after the dereference. If the memory you're trying to access is already in the cache, you're not going to see a big difference in execution time. The converse is that memory I/O becomes a huge bottleneck for this test if you're using the pointers wrong (i.e., continually dereferencing discontiguous blocks of memory). That isn't slowness in pointer dereferencing; it's slowness in the programming.
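
As a small illustration of that claim, most modern compilers emit near-identical code for these two functions (a sketch; the exact output depends on the compiler and flags):

int sum_indexed(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];     /* base address plus scaled index */
    return s;
}

int sum_walked(const int *a, int n)
{
    int s = 0;
    for (const int *p = a; p < a + n; p++)
        s += *p;       /* incrementing pointer */
    return s;
}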
San Jacinto
@San Jacinto: Just because something is 1 instruction does not mean it is as fast as any other instruction. A load instruction, even from L1 cache, takes several clock cycles on just about any modern architecture (compared to just 1 cycle for a simple arithmetic operation on a few registers). -1 for calling that "very, very minimal": when comparing these loads to an add, the difference is huge.
SoapBox
@SoapBox: ...you'd be hitting the cache whether you used the pointer method or the array method. I'm not following you.
San Jacinto
You seem to be implying that because a memory dereference is only 1 instruction, there is no penalty when it hits the L1 cache. That is not the case. A load, even from L1 cache, takes several cycles on most modern CPUs, including x86 and MIPS.
SoapBox
@SoapBox: I'm implying no such thing. I think my answer says the opposite, actually. What I AM saying is that access to L1 is the common denominator: either method of array access has the same limitation, and THAT IS YOUR BOTTLENECK, not an extra 5 cycles per iteration of a loop (http://en.wikipedia.org/wiki/Clock_cycle). A load on each iteration for a dereference (and you shouldn't have to dereference on every iteration of the loop) at 5 extra cycles apiece comes to 5 x 10^8 cycles per 100 million dereferences, or about 12.5 seconds on a 40 MHz MIPS. Not what I call astounding overhead.
San Jacinto
@SoapBox: Obviously, that is only the extra overhead of the load itself. It doesn't factor in the time to compute the offset and perform the actual memory access after the computation, but that part is the same for both methods.
San Jacinto
@Derek: The comment two above this one may give you a better idea of what kind of overhead a dereference adds.
San Jacinto
Sorry this is so much later. I should add some context: I am running software on supercomputers here, not a workstation. We now have a benchmark of a simple triangle-area calculation, going from raw array data to fully pointerized virtual-function classes, and the slowdown is significant. Just moving the data values into a simple structure caused a 16% slowdown when calculating 1M triangle areas. That is not what I'd call insignificant. See below:
Derek
Here is the loop:

for (i = 0; i < 1000000; i++) {
    a = triangles[i].p1[0] - triangles[i].p2[0];
    b = triangles[i].p1[1] - triangles[i].p2[1];
    c = triangles[i].p1[2] - triangles[i].p2[2];
    d = triangles[i].p1[0] - triangles[i].p3[0];
    e = triangles[i].p1[1] - triangles[i].p3[1];
    f = triangles[i].p1[2] - triangles[i].p3[2];
    crossX = b * f - c * e;
    crossY = c * d - a * f;
    crossZ = a * e - d * b;
    areas[i] = 0.5 * std::sqrt(crossX * crossX + crossY * crossY + crossZ * crossZ);
}
Derek
I do realize the difference between a dereference and a cache miss, but my point is: without explicit programming to avoid locality or dereferencing issues, you cannot broadly state that there is no impact, or that the compiler will take care of it.
Derek
@Derek: That context is indeed helpful. Part of the issue is the "fully pointerized virtual function" classes: every time you call a virtual function, you force a lookup in the vtable, which is much more overhead than a simple dereference. I hate to be a stickler, but there isn't much you've shown or said that indicates why dereferencing a pointer as such should be any slower. I don't doubt that in your case there was a speedup from switching. I doubt that you are comparing apples to apples here (as you hint in your last comment). Nonetheless, this is an interesting discussion. Thanks.
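
In C terms, a virtual call is roughly an indirect call through a table of function pointers, i.e. extra dependent loads before the call even happens (an analogy only, not the exact C++ mechanism):

struct shape;
struct shape_ops { double (*area)(const struct shape *); };
struct shape { const struct shape_ops *ops; double w, h; };

double shape_area(const struct shape *s)
{
    /* load s->ops, load ops->area, then make an indirect call */
    return s->ops->area(s);
}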
San Jacinto
+1  A: 

It depends on what you are doing. A trivial pointer dereference y = *z, where

int x = 1;
int* z = &x;
int y;

might assemble to something like this on the x86:

mov eax, [z]
mov eax, [eax]
mov [y], eax

and y = x would still involve memory accesses:

mov eax, [x]
mov [y], eax

Mov instructions that touch memory take about 2-4 cycles each, IIRC.

Although, if you are loading memory from completely random locations, you will cause a lot of cache and TLB misses (and, in the worst case, page faults), wasting hundreds of clock cycles or more.

Longpoke
+2  A: 

Some IDEs, such as Visual Studio, allow you to view the generated assembly alongside the source code.

http://stackoverflow.com/questions/1020498/how-to-view-the-assembly-behind-the-code-msvc-if-relevent

Then you can see what it looks like for your exact architecture and implementation.
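
If you are using gcc, compiling with -S writes the generated assembly to a .s file, and -fverbose-asm annotates it with variable names (foo.c is just a placeholder here):

gcc -S -O1 -fverbose-asm foo.c    # produces foo.s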

If you are using GDB (Linux, Mac), use the disassemble command:

(gdb) disas 0x32c4 0x32e4
Dump of assembler code from 0x32c4 to 0x32e4:
0x32c4 <main+204>:      addil 0,dp
0x32c8 <main+208>:      ldw 0x22c(sr0,r1),r26
0x32cc <main+212>:      ldil 0x3000,r31
0x32d0 <main+216>:      ble 0x3f8(sr4,r31)
0x32d4 <main+220>:      ldo 0(r31),rp
0x32d8 <main+224>:      addil -0x800,dp
0x32dc <main+228>:      ldo 0x588(r1),r26
0x32e0 <main+232>:      ldil 0x3000,r31
End of assembler dump.
Robert
I compiled with the -S option, and I found something very similar to what others have described here.
Derek
+1  A: 

Where it can, the compiler will remove that overhead for you by keeping repeatedly used base locations in a register (e.g. p1->p2->p3 in your example).

However, sometimes the compiler can't determine which pointers might alias other pointers used within your function, which means it has to fall back to a very conservative position and reload values through pointers frequently.

This is where C99's restrict keyword can help. It lets you tell the compiler that certain pointers are never aliased by other pointers in the scope of the function, which can consequently improve the optimisation.


For example, take this function:

struct xyz {
    int val1;
    int val2;
    int val3;
};

struct abc {
    struct xyz *p2;
};

int foo(struct abc *p1)
{
    int sum;

    sum = p1->p2->val1 + p1->p2->val2 + p1->p2->val3;

    return sum;
}

Under gcc 4.3.2 with optimisation level -O1, it compiles to this x86 code:

foo:
    pushl   %ebp
    movl    %esp, %ebp
    movl    8(%ebp), %eax
    movl    (%eax), %edx
    movl    4(%edx), %eax
    addl    (%edx), %eax
    addl    8(%edx), %eax
    popl    %ebp
    ret

As you can see, it only dereferences p1 once: it keeps the value of p1->p2 in the %edx register and uses it three times to fetch the three values from that structure.
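
When aliasing is what blocks this, the restrict qualifier mentioned above can restore the optimisation. A minimal sketch, with a made-up function:

void add_to_all(int *restrict dst, const int *restrict src, int n)
{
    /* restrict promises that dst and src never overlap, so the
       compiler may load *src once and keep it in a register,
       rather than re-loading it after every store to dst[i] */
    for (int i = 0; i < n; i++)
        dst[i] += *src;
}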

caf
I actually wrote a test program and compiled it with the -S option, and I found that even for a simple case such as p1.p2->p3->value it reloaded from p1 every time. Very conservative with no optimizations.
Derek
@Derek: What optimisation level did you use? With `-O1` or higher, it should optimise the simple cases quite well (see the example I've added to my answer).
caf
Yes, it will, but it loses some of that ability as the program gets more complex. This is kind of my point.
Derek