I keep reading that, in C, using pointer arithmetic is generally faster than subscripting for array access. Is this true even with modern (supposedly-optimizing) compilers?

If so, is this still the case as I begin to move away from learning C into Objective-C and Cocoa on Macs? (That is the primary goal of my picking up C -- to learn enough that I can be effective in Objective-C.)

Which is the preferred coding style for array access, in both C and Objective-C? Which is considered (by professionals of their respective languages) more legible, more "correct" (for lack of a better term)?

+42  A: 

You need to understand the reason behind this claim. Have you ever asked yourself why it would be faster? Let's compare some code:

int i;
int a[20];

// Init all values to zero
memset(a, 0, sizeof(a));
for (i = 0; i < 20; i++) {
    printf("Value of %d is %d\n", i, a[i]);
}

They are all zero, what a surprise :-P The question is: what does a[i] actually mean in low-level machine code? It means:

  1. Take the address of a in memory.

  2. Add i times the size of a single item of a to that address (an int is usually 4 bytes).

  3. Fetch the value from that address.

So each time you fetch a value from a, the base address of a is added to the result of multiplying i by 4. If you just dereference a pointer, steps 1 and 2 don't need to be performed, only step 3.
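
Here's a minimal sketch of those three steps spelled out by hand (illustration only; by definition, a[i] is the same as *(a + i)):

#include <assert.h>

int main(void) {
    int a[20] = {0};
    int i = 5;

    /* Steps 1 and 2: take the base address of a and add
       i times the size of a single element to it. */
    int *addr = (int *)((char *)a + i * sizeof a[0]);

    /* Step 3: fetch the value from that address. */
    assert(*addr == a[i]);
    return 0;
}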

Consider the code below:

int i;
int a[20];
int * b;

memset(a, 0, sizeof(a));
b = a;
for (i = 0; i < 20; i++) {
    printf("Value of %d is %d\n", i, *b);
    b++;
}

This code might be faster... but even if it is, the difference is tiny. Why might it be faster? "*b" is the same as step 3 above. However, "b++" is not the same as steps 1 and 2: "b++" will increase the pointer by 4.

(important for newbies: applying ++ to a pointer will not increase the pointer by one byte in memory! It will increase it by as many bytes as the size of the data it points to. Here it points to an int, and an int is 4 bytes on my machine, so b++ increases b by four!)
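
A quick sketch of that scaling behavior (the printed value assumes a 4-byte int):

#include <stdio.h>

int main(void) {
    int a[2] = {0, 0};
    int *b = a;

    b++;  /* moves forward by one int, i.e. sizeof(int) bytes */
    printf("b advanced by %d bytes\n", (int)((char *)b - (char *)a));
    return 0;
}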

Okay, but why might it be faster? Because adding 4 to a pointer is faster than multiplying i by 4 and adding that to a pointer. You have an addition in either case, but in the second one, you have no multiplication (you avoid the CPU time needed for the multiplication). Considering the speed of modern CPUs, though, even if the array had a million elements, I wonder if you could really benchmark a difference.

Whether a modern compiler can optimize either one to be equally fast is something you can check by looking at the assembly output it produces. You do so by passing the "-S" option (capital S) to gcc, e.g. gcc -Os -S main.c.

Here's the assembly for the first C snippet (optimization level -Os was used, which means optimize for code size and speed, but avoid speed optimizations that will increase code size noticeably, unlike -O2 and much unlike -O3):

_main:
    pushl %ebp
    movl %esp, %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    subl $108, %esp
    call ___i686.get_pc_thunk.bx
"L00000000001$pb":
    leal -104(%ebp), %eax
    movl $80, 8(%esp)
    movl $0, 4(%esp)
    movl %eax, (%esp)
    call L_memset$stub
    xorl %esi, %esi
    leal LC0-"L00000000001$pb"(%ebx), %edi
L2:
    movl -104(%ebp,%esi,4), %eax
    movl %eax, 8(%esp)
    movl %esi, 4(%esp)
    movl %edi, (%esp)
    call L_printf$stub
    addl $1, %esi
    cmpl $20, %esi
    jne L2
    addl $108, %esp
    popl %ebx
    popl %esi
    popl %edi
    popl %ebp
    ret

Same with the second code:

_main:
    pushl %ebp
    movl %esp, %ebp
    pushl %edi
    pushl %esi
    pushl %ebx
    subl $124, %esp
    call ___i686.get_pc_thunk.bx
"L00000000001$pb":
    leal -104(%ebp), %eax
    movl %eax, -108(%ebp)
    movl $80, 8(%esp)
    movl $0, 4(%esp)
    movl %eax, (%esp)
    call L_memset$stub
    xorl %esi, %esi
    leal LC0-"L00000000001$pb"(%ebx), %edi
L2:
    movl -108(%ebp), %edx
    movl (%edx,%esi,4), %eax
    movl %eax, 8(%esp)
    movl %esi, 4(%esp)
    movl %edi, (%esp)
    call L_printf$stub
    addl $1, %esi
    cmpl $20, %esi
    jne L2
    addl $124, %esp
    popl %ebx
    popl %esi
    popl %edi
    popl %ebp
    ret

Well, it's different, that's for sure. The difference between the 104 and 108 comes from the variable b (in the first code there was one variable less on the stack; now we have one more, which changes the stack addresses). The real code difference in the for loop is

movl    -104(%ebp,%esi,4), %eax

compared to

movl    -108(%ebp), %edx
movl    (%edx,%esi,4), %eax

Actually, to me it rather looks like the first approach is faster(!), since it issues a single machine instruction to perform all the work (the CPU does it all for us) instead of two. On the other hand, the two assembly instructions below might have a lower total runtime than the single one above.

As a closing word, I'd say that depending on your compiler and the CPU's capabilities (which instructions it offers to access memory in which ways), the result could go either way. Either one might be faster or slower. You cannot say for sure unless you limit yourself to exactly one compiler (meaning also one version) and one specific CPU. As CPUs can do more and more in a single assembly instruction (ages ago, a compiler really had to manually fetch the address, multiply i by four, and add the two together before fetching the value), statements that used to be an absolute truth ages ago are nowadays more and more questionable. Also, who knows how CPUs work internally? Above, I compared one assembly instruction against two other ones.

I can see that the number of instructions is different, and the time such an instruction needs can be different as well. Also, how much memory these instructions need in their machine representation (they have to be transferred from memory to the CPU cache, after all) is different. However, modern CPUs don't execute instructions the way you feed them. They split big instructions (often referred to as CISC) into small sub-instructions (often referred to as RISC), which also allows them to better optimize program flow for speed internally. In fact, the first, single instruction and the two instructions below might result in the same set of sub-instructions, in which case there is no measurable speed difference whatsoever.

Regarding Obj-C, it is just C with extensions, so everything that holds true for C holds true for Obj-C as well in terms of pointers and arrays. If you use objects, on the other hand (e.g. an NSArray or NSMutableArray), that is a completely different beast. In that case, however, you must access these arrays with methods anyway; there is no pointer/array access to choose from.
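
For illustration, a minimal sketch of that method-based access (the array contents are just placeholders):

#import <Foundation/Foundation.h>

int main(void) {
    NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

    NSArray *names = [NSArray arrayWithObjects:@"foo", @"bar", nil];

    /* Element access is a method call (dynamic dispatch),
       not pointer arithmetic. */
    NSString *second = [names objectAtIndex:1];
    NSLog(@"%@", second);

    [pool drain];
    return 0;
}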

Mecki
Memset() works fine on the array variable, too. It decays to a pointer after all.
Bob Somers
Yay! The most comprehensive answer I've read lately.
tafa
Yeah, I've learned a lot from all the answers to this question ... But this one, I think, takes the taco.
John Rudy
Very nice answer indeed!
EFrank
+2  A: 

If you're dealing with array-type data, I'd say using subscripts makes the code more readable. On today's machines (especially for something simple like this), readable code is more important.

Now if you're dealing explicitly with a chunk of data you malloc()'d and you want to get a pointer inside that data, say 20 bytes inside an audio file header, then I think address arithmetic more clearly expresses what you're trying to do.
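
For instance (a sketch only; the 20-byte header and the sample format are invented for illustration):

#include <stdlib.h>

/* Hypothetical layout: a 20-byte header followed by raw 16-bit samples. */
#define HEADER_SIZE 20

int main(void) {
    unsigned char *file_data = malloc(HEADER_SIZE + 1024);
    if (file_data == NULL)
        return 1;

    /* The pointer arithmetic says exactly what we mean:
       "the samples start 20 bytes into the block". */
    short *samples = (short *)(file_data + HEADER_SIZE);
    samples[0] = 0;

    free(file_data);
    return 0;
}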

I'm not sure about compiler optimizations in this regard, but even if subscripting is slower, it's only slower by a few clock cycles at most. That's hardly anything compared to what you gain from the clarity of your train of thought.

EDIT: According to some of the other responses, subscripting is just a syntactic element and has no effect on performance, as I figured. In that case, definitely go with whichever form expresses the context: subscripting for array-type data, pointer arithmetic when you're reaching into a block of data through a pointer.

Bob Somers
+6  A: 

"using pointer arithmetic is generally faster than subscripting for array access"

Nah. It's the same operation either way. Subscripting is syntactic sugar for adding (element size * index) to the array's start address.

That said, when iterating over the elements in an array, taking a pointer to the first element and increasing it each time through the loop will usually be slightly faster than calculating the current element's position from the loop variable each time. (Though it is unusual for this to matter much in a real-life application. Examine your algorithm first; premature optimisation is the root of all evil, etc. etc.)
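
A minimal sketch of the two idioms (illustrative only; both compute the same sum, and a decent compiler may well emit identical code for them):

#include <stddef.h>

#define N 20

int sum_indexed(const int a[N]) {
    int sum = 0;
    for (size_t i = 0; i < N; i++)
        sum += a[i];   /* a[i] is sugar for *(a + i) */
    return sum;
}

int sum_pointer(const int a[N]) {
    int sum = 0;
    for (const int *p = a, *end = a + N; p != end; p++)
        sum += *p;     /* the pointer walks the array directly */
    return sum;
}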

moonshadow
That's what I'd suspected; thanks!
John Rudy
Except that the compiler will often turn for (i = 0; i < 20; i++) a[i]++; into something that steps through the array with a pointer anyway. So generally, don't bother unless you think the compiler can't see what you're doing and won't do the transformation for you. Also, sometimes indexing is faster...
Spudd86
A: 

It's not true. It's exactly as fast as using subscript operators. In Objective-C you can use arrays C-style, or in an object-oriented style; the object-oriented style is a lot slower, because every access performs extra work due to the dynamic nature of method calls.

JtR
The biggest performance optimizations can in most cases be found in better algorithms, not in language or low-level optimizations.
JtR
A: 

It's unlikely that there will be any difference in speed.

Using the array operator [] is probably preferred, as in C++ you can use the same syntax with other containers (e.g. vector).

MrZebra
Or you can use the same pointer syntax with iterators :)
quinmars
+1  A: 
char p1[ ] = "12345";
const char *p2 = "12345";

char ch = p1[ 3 ]; /* '4' */
ch = *(p2 + 3);    /* '4' */

The C standard doesn't say which is faster. Only the observable behavior is fixed, and it is up to the compiler to implement it in any way it wants. More often than not, it won't even read memory at all.

In general, you have no way to say which is "faster" unless you specify a compiler, version, architecture, and compile options. Even then, optimization will depend on the surrounding context.

So the general advice is to use whatever gives clearer and simpler code. Using array[ i ] gives some tools the ability to try to find index-out-of-bounds conditions, so if you are using arrays, it's better to just treat them as such.

If it is critical, look at the assembly your compiler generates. But keep in mind that it may change as you change the code that surrounds it.

n-alexander
Neither the C nor the C++ standard says which is faster, because standards are not implementations. A particular implementation of C or C++ is responsible for the performance of the final program. In your example, the indexing operator performs nothing more than pointer arithmetic (http://stackoverflow.com/questions/2124935/c-strings-confusion/2125035#2125035), and most if not all compilers today generate the same code for both.
mloskot
+4  A: 

This might be a bit off topic (sorry), because it doesn't answer your question regarding execution speed, but you should consider that premature optimization is the root of all evil (Knuth). In my opinion, especially when still (re)learning the language, by all means write it the way it is easiest to read first. Then, if your program runs correctly, consider optimizing for speed. Most of the time your code will be fast enough anyway.

TheMarko
In practice, I agree with you. However, what I'm looking for is more sage advice from others who are pros in a given language, why I might care to do something one way over another, and what downstream ramifications there might be ...
John Rudy
Not only that, but the compiler knows more about the CPU than you do, so it can make a better decision. Frequently, if you refer to multiple elements of the array in each loop iteration, it may be faster to compute addresses on the fly, especially if you use the index for more than just indexing. (And on x86 you have instructions that do the indexing and the load at the same time...) Basically, leave it up to the compiler. If it's a performance-critical loop (and you profiled, so you KNOW it's the right one), try it both ways and profile again to see which is faster.
Spudd86
+2  A: 

Please keep in mind that execution speed is hard to predict, even when looking at the machine code, on superscalar CPUs and the like, which feature

  • out-of-order execution
  • pipelining
  • branch prediction
  • hyperthreading
  • ...

It's not just a matter of counting machine instructions, and not even only of counting clock cycles. It seems easier to just measure in the cases where it's really necessary. It's not impossible to calculate the correct cycle count for a given program (we had to do it in university), but it's hardly fun and hard to get right. Side note: measuring correctly is also hard in multithreaded / multi-processor environments.

TheMarko
+4  A: 

Mecki has a great explanation. From my experience, one of the things that often matters with indexing vs. pointers is what other code sits in the loop. Example:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>  /* for int64_t */
#include <time.h>
#include <iostream>

using namespace std;

typedef int64_t int64;

// Wall-clock time in nanoseconds (clock_gettime may require
// linking with -lrt on older glibc).
static int64 nsTime() {
  struct timespec tp;
  clock_gettime(CLOCK_REALTIME, &tp);
  return tp.tv_sec*(int64)1000000000 + tp.tv_nsec;
}

typedef int T;
size_t const N = 1024*1024*128;
T data[N];

int main(int, char**) {
  cout << "starting\n";

  {
    int64 const a = nsTime();
    int sum = 0;
    for (size_t i=0; i<N; i++) {
      sum += data[i];
    }
    int64 const b = nsTime();
    cout << "Simple loop (indexed): " << (b-a)/1e9 << "\n";
  }

  {
    int64 const a = nsTime();
    int sum = 0;
    T *d = data;
    for (size_t i=0; i<N; i++) {
      sum += *d++;
    }
    int64 const b = nsTime();
    cout << "Simple loop (pointer): " << (b-a)/1e9 << "\n";
  }

  {
    int64 const a = nsTime();
    int sum = 0;
    for (size_t i=0; i<N; i++) {
      int a = sum+3;
      int b = 4-sum;
      int c = sum+5;
      sum += data[i] + a - b + c;
    }
    int64 const b = nsTime();
    cout << "Loop that uses more ALUs (indexed): " << (b-a)/1e9 << "\n";
  }

  {
    int64 const a = nsTime();
    int sum = 0;
    T *d = data;
    for (size_t i=0; i<N; i++) {
      int a = sum+3;
      int b = 4-sum;
      int c = sum+5;
      sum += *d++ + a - b + c;
    }
    int64 const b = nsTime();
    cout << "Loop that uses more ALUs (pointer): " << (b-a)/1e9 << "\n";
  }
}

On a fast Core 2-based system (g++ 4.1.2, x64), here's the timing:

    Simple loop (indexed): 0.400842
    Simple loop (pointer): 0.380633
    Loop that uses more ALUs (indexed): 0.768398
    Loop that uses more ALUs (pointer): 0.777886

Sometimes indexing is faster, sometimes pointer arithmetic is. It depends on how the CPU and compiler are able to pipeline the loop execution.

Mr Fooz
+1  A: 

No, using pointer arithmetic is not faster, and it is most probably slower, because an optimizing compiler may utilize instructions like LEA (Load Effective Address) on Intel processors, or similar instructions on other processors, for the address computation. LEA is faster than add or add/mul: it has the advantage of doing several things at once while NOT affecting the flags, and it takes one cycle to compute. BTW, the excerpt below is from the GCC manual, so -Os does not optimize primarily for speed.

I also completely agree with TheMarko. First try to write clean/readable/reusable code, then think about optimization and use some profiling tools to find the bottleneck. Most of the time the performance problem is I/O-related, or it's some bad algorithm or some bug that you have to hunt down. Knuth is the man ;-)

It just occurred to me: what will you do with an array of structures? If you want to do pointer arithmetic, you would have to do it for each member of the struct. Does that sound like overkill? Yes, of course it is overkill, and it also opens a wide door to obscure bugs.
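
A sketch of what that overkill looks like (the struct and field names are invented for illustration):

#include <stddef.h>  /* for offsetof */

struct point { int x, y; };

/* Clear: let the compiler do the address math. */
int get_x(const struct point pts[], int i) {
    return pts[i].x;
}

/* The "overkill" pointer-arithmetic version of the same access --
   easy to get wrong and hard to read. */
int get_x_by_hand(const struct point pts[], int i) {
    const char *base = (const char *)pts;
    return *(const int *)(base + i * sizeof(struct point)
                               + offsetof(struct point, x));
}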

-Os: Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.

Malkocoglu