I'm working on a fluid dynamics Navier-Stokes solver that should run in real time. Hence, performance is important.

Right now, I'm looking at a number of tight loops that each account for a significant fraction of the execution time: there is no single bottleneck. Most of these loops do some floating-point arithmetic, but there's a lot of branching in between.

The floating-point operations are mostly limited to additions, subtractions, multiplications, divisions and comparisons. All this is done using 32-bit floats. My target platform is x86 with at least SSE1 instructions. (I've verified in the assembler output that the compiler indeed generates SSE instructions.)

Most of the floating-point values that I'm working with have a reasonably small upper bound, and precision for near-zero values isn't very important. So the thought occurred to me: maybe switching to fixed-point arithmetic could speed things up? I know the only way to be really sure is to measure it, but that might take days, so I'd like to know the odds of success beforehand.

Fixed-point was all the rage back in the days of Doom, but I'm not sure where it stands anno 2010. Considering how much silicon is nowadays pumped into floating-point performance, is there a chance that fixed-point arithmetic will still give me a significant speed boost? Does anyone have any real-world experience that may apply to my situation?

+5  A: 

Stick with floating point. Fixed point is really only useful if you can work within 8 bits or 16 bits and use SIMD (image processing and audio are typical use cases for this).
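
To make that concrete, here's a minimal sketch (my own illustration, not tested against your code) of the kind of case where fixed point still pays: eight 16-bit Q15 multiplies per instruction. Note that it assumes SSE2's integer instructions, a step beyond your SSE1 baseline, and that Q15 is an arbitrary format choice:

```cpp
#include <emmintrin.h>  // SSE2 integer intrinsics
#include <cstdint>
#include <cstddef>

// out[i] = (a[i] * b[i]) >> 15 in Q15 fixed point (int16_t with 15
// fractional bits), eight lanes per instruction. Results outside the
// Q15 range wrap, as fixed point always does without saturation.
void mul_q15(const int16_t* a, const int16_t* b, int16_t* out, size_t n)
{
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i hi = _mm_mulhi_epi16(va, vb);   // (a*b) >> 16 per lane
        __m128i lo = _mm_mullo_epi16(va, vb);   // (a*b) & 0xFFFF per lane
        // Stitch bits 15..30 of the 32-bit product back together (>> 15):
        __m128i r  = _mm_or_si128(_mm_slli_epi16(hi, 1),
                                  _mm_srli_epi16(lo, 15));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), r);
    }
    // Scalar tail for the last n % 8 elements omitted for brevity.
}
```

With 32-bit floats you only get four lanes per instruction, so the fixed-point version has to win on lane count; at 32-bit fixed point even that advantage disappears.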

Modern CPUs typically have two FPUs, and you can issue up to two FP instructions per clock cycle. You also have the possibility of optimisation using 4-way FP SIMD (SSE).
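
For comparison, the 4-way float path needs no format bookkeeping at all. A minimal sketch of a multiply-add loop using only SSE1 intrinsics (again my own illustration, with invented array arguments):

```cpp
#include <xmmintrin.h>  // SSE1
#include <cstddef>

// out[i] = a[i] * b[i] + c[i], four floats per iteration.
void mul_add_ps(const float* a, const float* b, const float* c,
                float* out, size_t n)
{
    for (size_t i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
    // Scalar tail for the last n % 4 elements omitted for brevity.
}
```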

If you're still struggling to get good performance then try using a better compiler, such as Intel's ICC. Also, 64-bit Intel executables tend to be somewhat faster than their 32-bit counterparts due to the increased number of registers in the 64-bit model, so build for 64-bit if you can.

And of course you should profile your code too, so that you know for certain where the hotspots are. You don't say what OS you're using, but VTune on Windows, Zoom on Linux or Shark on Mac OS X will all help you to quickly and easily find your performance bottlenecks.

Paul R
A: 

Your machine is pretty well optimized for floating point, so you probably won't save much by going to fixed-point fractions.

You say there's no single bottleneck, but there may be several. If you manage to shave any one of them, the others will take up larger percentages of the remaining time, drawing your attention to them so you can shave them too.

You've probably done this, but I would make sure not only that the time-consuming functions are as quick as possible, but also that they are called no more often than necessary.
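
A contrived sketch of the pattern to look for (the two helper functions are invented stand-ins, just to make the example self-contained):

```cpp
#include <cmath>

// Hypothetical stand-ins for whatever your profiler shows is expensive:
inline float compute_viscosity(float t) { return 0.01f * std::exp(-t); }
inline float laplacian(const float* u, int i) {
    return u[i - 1] - 2.0f * u[i] + u[i + 1];
}

// Before: the loop-invariant call is re-evaluated n times per step.
void step_before(float* u, int n, float dt, float t) {
    for (int i = 1; i < n - 1; ++i)
        u[i] += dt * compute_viscosity(t) * laplacian(u, i);
}

// After: hoisted out of the loop, computed once per step.
void step_after(float* u, int n, float dt, float t) {
    const float nu = compute_viscosity(t);
    for (int i = 1; i < n - 1; ++i)
        u[i] += dt * nu * laplacian(u, i);
}
```

A good optimizer will often hoist such calls itself, but only when it can prove the call has no side effects; across translation units it usually can't.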

Mike Dunlavey
"... you probably won't save much by going to fixed-point fractions." Do you have any references or first-hand experience to back this up?
Thomas
@Thomas: Personal experience. I used fixed-point fractions extensively in graphics before FP processors were ubiquitous. I was at the Instrumentation Lab during Apollo, when the entire navigation system was done with fixed-point fractions. Now with on-chip FP hardware, where all the normalizing, adding, multiplying, and NaN detection is done as much as possible in combinatorial logic at full clock speed, I'm guessing, but I suspect multiplication and division will be in the same ball park, floating vs. fixed point + shifting.
Mike Dunlavey
A: 

As other people have said, if you're already using floating-point SIMD, I doubt you'll get much improvement with fixed point.

You said that the compiler is emitting SSE instructions, but it doesn't sound like you've tried writing your own vectorized SSE code. I don't know how good compilers usually are at that, but it's something to investigate.

Two other areas to look at are:

  1. Memory access - if all your computations are done in SSE, then cache misses might be taking up more time than the actual math.

    1. You can prefetch data with e.g. _mm_prefetch or __builtin_prefetch (depending on your compiler/platform); there's a sketch after this list.
    2. Check your expensive functions for aliasing between inputs and outputs, which can force extra memory reads and writes (also shown in the sketch below).
    3. Consider storing your data differently - if your fluid solver solves for x coordinates independently of the y's, it might be more cache-friendly to store them in separate arrays. If they're solved for together, consider interleaving (e.g. x y x y...)
  2. Unrolling - you should be able to get a performance benefit from unrolling your inner loops. The goal is not (as many people think) to reduce the number of loop-termination checks. The main benefit is to allow independent instructions to be interleaved, to hide the instruction latency; see the second sketch after this list. There's a presentation here entitled VMX Optimization: Taking it up a Level which might help a bit; it's focused on Altivec instructions on Xbox360, but some of the unrolling advice might help on SSE as well.
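
Here's a minimal sketch of points 1.1 and 1.2 together (my own illustration; the loop body and prefetch distance are placeholders to tune, not a recipe). `__restrict` promises the compiler that the two pointers don't alias, so it can keep values in registers, and `_mm_prefetch` hints a line into cache before it's needed:

```cpp
#include <xmmintrin.h>  // SSE1, which also provides _mm_prefetch
#include <cstddef>

// out[i] = in[i] * k, with a software prefetch one cache line ahead.
void scale(const float* __restrict in, float* __restrict out,
           float k, size_t n)
{
    const __m128 vk = _mm_set1_ps(k);
    for (size_t i = 0; i + 4 <= n; i += 4) {
        // Hint the line 16 floats (64 bytes) ahead into L1; the right
        // distance is workload-dependent and worth measuring.
        _mm_prefetch(reinterpret_cast<const char*>(in + i + 16), _MM_HINT_T0);
        _mm_storeu_ps(out + i, _mm_mul_ps(_mm_loadu_ps(in + i), vk));
    }
    for (size_t i = n & ~size_t(3); i < n; ++i)  // scalar tail
        out[i] = in[i] * k;
}
```

And a sketch of point 2: a dot product unrolled four ways with independent accumulators, so the adds from different iterations can be in flight at once instead of each one waiting on the previous add's latency. The unroll factor of 4 is a guess to be measured, not a magic number:

```cpp
#include <xmmintrin.h>  // SSE1
#include <cstddef>

float dot(const float* a, const float* b, size_t n)
{
    // Four independent accumulators hide the add latency.
    __m128 s0 = _mm_setzero_ps(), s1 = _mm_setzero_ps();
    __m128 s2 = _mm_setzero_ps(), s3 = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 16 <= n; i += 16) {
        s0 = _mm_add_ps(s0, _mm_mul_ps(_mm_loadu_ps(a + i),      _mm_loadu_ps(b + i)));
        s1 = _mm_add_ps(s1, _mm_mul_ps(_mm_loadu_ps(a + i + 4),  _mm_loadu_ps(b + i + 4)));
        s2 = _mm_add_ps(s2, _mm_mul_ps(_mm_loadu_ps(a + i + 8),  _mm_loadu_ps(b + i + 8)));
        s3 = _mm_add_ps(s3, _mm_mul_ps(_mm_loadu_ps(a + i + 12), _mm_loadu_ps(b + i + 12)));
    }
    __m128 s = _mm_add_ps(_mm_add_ps(s0, s1), _mm_add_ps(s2, s3));
    float t[4];
    _mm_storeu_ps(t, s);
    float r = t[0] + t[1] + t[2] + t[3];
    for (; i < n; ++i)  // scalar tail
        r += a[i] * b[i];
    return r;
}
```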

As other people have mentioned, profile, profile, profile. And then let us know what's still slow :)

PS - on one of your other posts here, I convinced you to use SOR instead of Gauss-Seidel in your matrix solver. Now that I think about it, is there a reason that you're not using a tri-diagonal solver?
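
In case it helps, a minimal sketch of the Thomas algorithm, which solves a tri-diagonal system directly in O(n). This is my generic version, assuming the usual a/b/c naming for the sub-, main- and super-diagonals, and no pivoting, which is fine when the matrix is diagonally dominant, as 1D diffusion stencils typically are:

```cpp
#include <vector>
#include <cstddef>

// a = sub-diagonal (a[0] unused), b = main diagonal,
// c = super-diagonal (c[n-1] unused), d = right-hand side.
// Arguments are taken by value because the sweep overwrites them.
std::vector<float> solve_tridiagonal(std::vector<float> a,
                                     std::vector<float> b,
                                     std::vector<float> c,
                                     std::vector<float> d)
{
    const std::size_t n = b.size();
    // Forward sweep: eliminate the sub-diagonal.
    for (std::size_t i = 1; i < n; ++i) {
        const float m = a[i] / b[i - 1];
        b[i] -= m * c[i - 1];
        d[i] -= m * d[i - 1];
    }
    // Back substitution.
    std::vector<float> x(n);
    x[n - 1] = d[n - 1] / b[n - 1];
    for (std::size_t i = n - 1; i-- > 0; )
        x[i] = (d[i] - c[i] * x[i + 1]) / b[i];
    return x;
}
```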

celion
I tried my hand at a little bit of SSE assembly, but the compiler is better at it than I am. I had to do too much shuffling to get the values in the right places before performing 4 simultaneous multiplications. However, this was my first x86 assembly code ever, so maybe more performance can be squeezed out.
Thomas
My system is small enough to fit entirely into L2 cache; less than 400 kB, all together. Would it still be worthwhile to consider the L1 cache beyond that?
Thomas
Won't the compiler unroll loops at `-O3`? Won't branch prediction make it moot anyway? I always thought loop unrolling was a thing of the past. I did manage to remove a write/read dependency in the SOR solver, by the way, by using a red/black scheme. That made a huge difference. (When that presentation talked about The Language Feature Which Must Not Be Named, I thought: "What good is `goto` going to do here? I'd use a template." Next slide: Template Metaprogramming.) Anyway, I'll give "manual" (i.e. template) unrolling a try.
Thomas
It's hard to say about loop unrolling without seeing your code and the assembly. And to be honest, I'm more familiar with Altivec (Xbox360 and PS3) assembly than SSE. Branch prediction doesn't matter - the goal is to have several instructions "in flight" at once. If all the instructions in the loop depend on each other, you have gaps between them due to latency. Unrolling means you'll have multiple sets of independent instructions, so some can be issued "in between" the others.
celion