I'm looking to see what a programmer can do in C that determines the performance and/or the size of the generated object file.

For example:

1. Declaring simple get/set functions as inline may increase performance (at the cost of a larger footprint).
2. In loops that do not use the value of the loop variable itself, count down to zero instead of counting up to a certain value.
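To make the two points concrete, here is a minimal sketch (the struct and function names are made up purely for illustration):

    struct counter { int value; };

    /* 1. A trivial accessor marked inline. */
    static inline int get_count(const struct counter *c)
    {
        return c->value;
    }

    /* 2. Counting down to zero: on many targets a "decrement and branch
       if non-zero" sequence is cheaper than comparing against an upper
       bound on every iteration. */
    void clear_buffer(unsigned char *buf, unsigned int n)
    {
        while (n--)
            buf[n] = 0;
    }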

It looks like compilers have now advanced to a level where "simple" tricks (like the two points above) are not required at all. Appropriate options during compilation do the job anyway. Heck, I also saw posts here on how compilers handle recursion - that was very interesting! So what are we left to do at the C level, then? :)

My specific environment is: GCC 4.3.3 retargeted for the ARM architecture (v4). But responses on other compilers/processors are also welcome and will be munched upon.

PS: This approach of mine goes against the usual "code first, then benchmark, and finally optimize" approach.

Edit: Just like it so happens, I found a similar post after posting the question: http://stackoverflow.com/questions/763656/should-we-still-be-optimizing-in-the-small

+5  A: 

One thing I can think of that a compiler probably won't optimize is "cache-friendliness": If you're iterating over a two-dimensional array in row-major order, say, make sure your inner loop runs across the column index to avoid cache thrashing. Having the inner loop run over the wrong index can cause a huge performance hit.

This applies to all programming languages, but if you're programming in C, performance is probably critical to you, so it's especially relevant.
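A minimal sketch of the difference (the array name and sizes are arbitrary): both functions touch every element of a row-major array, but the first walks memory sequentially while the second jumps a whole row per access.

    #define ROWS 1024
    #define COLS 1024
    static double a[ROWS][COLS];

    double sum_cache_friendly(void)
    {
        double s = 0.0;
        int i, j;
        for (i = 0; i < ROWS; i++)          /* rows in the outer loop    */
            for (j = 0; j < COLS; j++)      /* columns in the inner loop */
                s += a[i][j];               /* contiguous accesses       */
        return s;
    }

    double sum_cache_hostile(void)
    {
        double s = 0.0;
        int i, j;
        for (j = 0; j < COLS; j++)
            for (i = 0; i < ROWS; i++)
                s += a[i][j];               /* stride of one full row    */
        return s;
    }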

Martin B
AFAIK ARM v4 processors don't have a cache.
starblue
Depends... some members of the family do, some don't. (http://en.wikipedia.org/wiki/ARM_architecture)
Martin B
+3  A: 

"Always" know the time and space complexity of your algorithms. The compiler will never be able to do that job as well as you can. :)

280Z28
+1  A: 

Precompute where possible (sorry, but it's not always possible; I did extensive precomputation in my chess engine). Store those results in memory, keeping the cache in mind: the bigger the precomputed data in memory, the lower the chance of a cache hit. Since most recent hardware is multicore, you can also design your application to take advantage of that.

If you are using several big arrays, make sure you group them close to each other where they will be used, to boost cache hits.
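A minimal sketch of the precomputation idea, using a bit-count table as a stand-in example (the answer's chess-engine tables would follow the same pattern); a 256-byte table also stays comfortably inside the cache:

    #include <stdint.h>

    static uint8_t popcount_table[256];

    /* Fill the table once at startup. */
    void init_popcount_table(void)
    {
        int i, v, bits;
        for (i = 0; i < 256; i++) {
            bits = 0;
            for (v = i; v; v >>= 1)
                bits += v & 1;
            popcount_table[i] = (uint8_t)bits;
        }
    }

    /* Four table lookups instead of looping over 32 bits at run time. */
    int popcount32(uint32_t x)
    {
        return popcount_table[x & 0xff]
             + popcount_table[(x >> 8) & 0xff]
             + popcount_table[(x >> 16) & 0xff]
             + popcount_table[x >> 24];
    }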

Umair Ahmed
+2  A: 

Compilers these days still aren't very good at vectorizing your code, so you'll still want to do the SIMD implementation of most algorithms yourself.
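As an illustration only, this is the shape of a hand-written SIMD loop using x86 SSE intrinsics (the asker's ARM v4 target has no SIMD unit, so this is just to show the kind of work the compiler usually won't do for you):

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* Add two float arrays four elements at a time; n is assumed to be a
       multiple of 4 to keep the sketch short. */
    void add_arrays(float *dst, const float *a, const float *b, unsigned n)
    {
        unsigned i;
        for (i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&dst[i], _mm_add_ps(va, vb));
        }
    }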

Choosing the right data structures for your exact problem can dramatically increase performance (I've seen a case where moving from a Kd-tree to a BVH did exactly that).

Compilers might pad some structs/variables to fit into the cache, but other cache optimizations, such as the locality of your data, are still up to you.

Compilers still don't automatically make your code multithreaded, and using OpenMP, in my experience, doesn't really help much. (You really have to understand OpenMP anyway to dramatically increase performance.) So currently, you're on your own for multithreading.
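For reference, the minimal OpenMP form looks like the sketch below (compile with -fopenmp in GCC); as the answer says, the pragma alone rarely buys much without further thought about scheduling and data layout:

    void scale(float *a, int n, float k)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] *= k;
    }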

Jasper Bekkers
+1  A: 

To add to what Martin says above about cache-friendliness:

  • Reordering your structures so that fields which are commonly accessed together sit in the same cache line can help (for instance by loading just one cache line rather than two); you are essentially increasing the density of useful data in your data cache. There is a Linux tool which can help you do this: dwarves (see http://www.linuxinsight.com/files/ols2007/melo-reprint.pdf). A sketch follows after this list.

  • You can use a similar strategy to increase the density of your code. In GCC you can mark hot and cold branches using likely/unlikely tags, which lets GCC keep the cold branches separate and helps increase icache density.

And now for something completely different:

  • For fields that might be accessed (read and written) across CPUs, the opposite strategy makes sense. The trouble is that, for coherence purposes, only one CPU can be allowed to write to the same address (in reality, the same cache line) at a time. This can lead to a condition called cache-line ping-pong, which is pretty bad and could be worse if that cache line contains other unrelated data. Here, padding this contended data out to a cache-line length makes sense.

Note: these clearly are micro-optimizations, to be done only at later stages when you are trying to wring the last bits of performance from your code.
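A minimal sketch of the three points above (the field names, the 64-byte cache-line size, and the likely/unlikely macro spellings are assumptions for illustration):

    #include <stdint.h>

    /* 1. Keep commonly-accessed fields together; push cold fields to the end. */
    struct particle {
        float    x, y, z;          /* hot: touched every iteration */
        float    vx, vy, vz;
        char     name[32];         /* cold: touched rarely         */
        uint64_t creation_time;
    };

    /* 2. likely/unlikely on GCC are built on __builtin_expect. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    int process(int value)
    {
        if (unlikely(value < 0))   /* cold branch kept out of the hot path */
            return -1;
        return value * 2;
    }

    /* 3. Pad per-CPU data out to a full cache line so two CPUs never
       write to the same line (avoids cache-line ping-pong). */
    struct per_cpu_counter {
        volatile long count;
        char pad[64 - sizeof(long)];
    };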

terminus
A: 

Many people are not aware of this: define an inline macro (the spelling varies by compiler) that means inline in its original intent - many compilers treat the keyword in an entirely different way from its original meaning. There are also ways to increase the inline size limits before the compiler starts leaving even trivial functions out of line. Human-directed inlining can produce much faster code (compilers are often conservative, or do not account for enough of the program), but you need to learn to use it correctly, because it can (easily) be counterproductive. And yes, this absolutely applies to code size as well as speed.
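A minimal sketch of the idea, assuming GCC (the answer doesn't name a compiler; always_inline is the GCC-specific mechanism, and the macro name is made up):

    struct point { int x, y; };

    /* Force inlining regardless of the compiler's own heuristics. */
    #if defined(__GNUC__)
    #  define FORCE_INLINE static inline __attribute__((always_inline))
    #else
    #  define FORCE_INLINE static inline
    #endif

    FORCE_INLINE int point_get_x(const struct point *p)
    {
        return p->x;
    }

In GCC, the inline size limits mentioned above can be adjusted with -finline-limit=n, which changes the point at which the compiler stops inlining larger functions.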

Justin