views:

2026

answers:

16

I would like to compute both the sine and co-sine of a value together (for example to create a rotation matrix). Of course I could compute them separately one after another like a = cos(x); b = sin(x);, but I wonder if there is a faster way when needing both values.

Edit: To summarize the answers so far:

  • Vlad said, that there is the asm command FSINCOS computing both of them (in almost the same time as a call to FSIN alone)

  • Like Chi noticed, this optimization is sometimes already done by the compiler (when using optimization flags).

  • caf pointed out, that functions sincos and sincosf are probably available and can be called directly by just including math.h

  • tanascius approach of using a look-up table is discussed controversial. (However on my computer and in a benchmark scenario it runs 3x faster than sincos with almost the same accuracy for 32-bit floating points.)

  • Joel Goodwin linked to an interesting approach of an extremly fast approximation technique with quite good accuray (for me, this is even faster then the table look-up)

+9  A: 

You could compute either and then use the identity:

cos(x)2 = 1 - sin(x)2

but as @tanascius says, a precomputed table is the way to go.

Mitch Wheat
And be aware that using this method involves computing a power and a square root, so if performance is important, make sure to verify that this is actually faster than computing the other trig function directly.
Tyler McHenry
`sqrt()` is often optimized in hardware, so it may very well be faster then `sin()` or `cos()`. The power is just self multiplication, so don't use `pow()`. There are some tricks to get reasonably accurate square-roots very quickly without hardware support. Lastly, be sure to profile before doing any of this.
caspin
This is an interesting approach, though the additional square root and checks for the right sign, could reduce the performance. Any concrete advice here?
Danvil
Note that √(1 - cos^2 x) is less accurate than computing sin x directly, in particular when x ~ 0.
KennyTM
For small x, the Taylor series for y=sqrt(1-x*x) is very nice. You can get good accuracy with the first 3 terms and it only requires a few multiplies and one shift. I've used it in fixed point code.
phkahler
And don't forget find right sign of operation.
Kirill V. Lyadvinsky
@phkahler: Your Taylor series doesn't apply because when x ~ 0, cos x ~ 1.
KennyTM
@Kenny - I use it for sin values up to about 0.4 where cos is not really 1.
phkahler
+11  A: 

When you need performance, you could use a precalculated sin/cos table (one table will do, stored as a Dictionary). Well, it depends on the accuracy you need (maybe the table would be to big), but it should be really fast.

tanascius
Then the input value needs to be mapped to [0,2*pi] (or smaller with additional checks) and this call to fmod eats away performance. In my (propably suboptimal) implementation I could not gain performance with the look-up table. Would you have any advice here?
Danvil
A precomputed table will almost certainly be slower than just calling `sin` because the precomputed table will trash the cache.
Andreas Brinck
It depends how big the table is. A 256-entry table is often quite accurate enough and uses only 1Kb... if you use it a lot wouldn't it get stuck in the cache without adversely affecting the rest of the app's performance?
John
@Danvil: Here is an example of a sine lookup table http://en.wikipedia.org/wiki/Lookup_table#Computing_sines. However it assumes that you already mapped your input to [0;2pi], too.
tanascius
+4  A: 

When performance is critical for this kind of thing it is not unusual to introduce a lookup table.

Tom Cabanski
A: 

Have you thought of declaring lookup tables for the two functions? You'd still have to "calculate" sin(x) and cos(x), but it'd be decidedly faster, if you don't need a high degree of accuracy.

Frank Shearar
+28  A: 

Modern Intel/AMD processors have instruction FSINCOS for calculating sine and cosine functions simultaneously. If you need strong optimization, perhaps you should use it.

Here is a small example: http://home.broadpark.no/~alein/fsincos.html

Here is another example (for MSVC): http://www.codeguru.com/forum/showthread.php?t=328669

Here is yet another example (with gcc): http://www.allegro.cc/forums/thread/588470

Hope one of them helps. (I didn't use this instruction myself, sorry.)

As they are supported on processor level, I expect them to be way much faster than table lookups.

Edit:
Wikipedia suggests that FSINCOS was added at 387 processors, so you can hardly find a processor which doesn't support it.

Edit:
Intel's documentation states that FSINCOS is just about 5 times slower than FDIV (i.e., floating point division).

Edit:
Please note that not all modern compilers optimize calculation of sine and cosine into a call to FSINCOS. In particular, my VS 2008 didn't do it that way.

Vlad
Do you have any reference for this or a snippet how to call it in C or C#?
Danvil
Are compilers smart enough to use this instruction in case you write: x=cos(a);y=sin(a); That would be great.
phkahler
@phkahler: That would be great. Don't know if such optimization is used by the modern compilers.
Vlad
Do note that if you don't need a full 80 bit double extended result (just double or single precision, for example), it is possible to compute `cos` and `sin` much faster in software than with the `fsincos` hardware instruction. On the other hand, it's also (much) more work to implement that way.
Stephen Canon
@Stephen: which way is it faster to do in software? As Intel's manual states, the FSINCOS instruction itself is quite fast.
Vlad
@phkahler: I am glad I was wrong, see the example of @Chi.
Vlad
The `fsincos` instruction is *not* "quite fast". Intel's own optimization manual quotes it as requiring between 119 and 250 cycles on recent micro-architectures. Intel's math library (distributed with ICC), by comparison, can *separately* compute `sin` and `cos` in less than 100 cycles, using a software implementation that uses SSE instead of the x87 unit. A similar software implementation that computed both simultaneously could be faster still.
Stephen Canon
@Stephen: could you please post the assembler code? I suppose the code uses some version of built-in sin computation (perhaps SSE-specific), and that sincos is available as well.
Vlad
@Stephen: and ~200 cycles is really fast: `FDIV` requires 40.
Vlad
@Vlad: The ICC math libraries are not open-source, and I don't have a license to redistribute them, so I can't post the assembly. I can tell you that there is no built-in `sin` computation for them to take advantage of, however; they use the same SSE instructions as everyone else. To your second comment, the speed relative to `fdiv` is immaterial; if there are two ways to do something and one is twice as fast as the other, it doesn't make sense to call the slower one "fast", regardless of how long it takes relative to some completely unrelated task.
Stephen Canon
@Stephen: I've looked through the Intel's manual; they mention the precision of the functions separately. Without the assembly code I would assume that they achieve speed by sacrificing the accuracy of computation. Could you confirm or disprove this assumption?
Vlad
The software `sin` function in their library delivers full double-precision accuracy. The `fsincos` instruction delivers somewhat more accuracy (double extended), but that extra accuracy gets thrown away in most programs that call the `sin` function, as its result is usually rounded to double precision by later arithmetic operations or a store to memory. In most situations, they deliver the same accuracy for practical use.
Stephen Canon
Note also that `fsincos` is not a complete implementation on its own; you need an additional range reduction step to put the argument into the valid input range for the `fsincos` instruction. The library `sin` and `cos` functions include this reduction as well as the core computation, so they are even faster (by comparison) than the cycle timings that I listed might indicate.
Stephen Canon
+9  A: 

Technically, you’d achieve this by using complex numbers and Euler’s Formula. Thus, something like (C++)

complex<double> res = exp(complex<double>(0, x));
// or equivalent
complex<double> res = polar<double>(1, x);
double sin_x = res.imag();
double cos_x = res.real();

should give you sine and cosine in one step. How this is done internally is a question of the compiler and library being used. It could (and might) well take longer to do it this way (just because Euler’s Formula is mostly used to compute the complex exp using sin and cos – and not the other way round) but there might be some theoretical optimisation possible.


Edit

The headers in <complex> for GNU C++ 4.2 are using explicit calculations of sin and cos inside polar, so it doesn’t look too good for optimisations there unless the compiler does some magic (see the -ffast-math and -mfpmath switches as written in Chi’s answer).

Debilski
+1. You were faster :)
Draco Ater
+5  A: 

I don't believe that lookup tables are necessarily a good idea for this problem. Unless your accuracy requirements are very low the table needs to be very large. And modern CPUs can do a lot of computation while a value is fetched from main memory. This is not one of those questions which can be properly answered by argument (not even mine), test and measure and consider the data.

But I'd look to the fast implementations of SinCos that you find in libraries such as AMD's ACML and Intel's MKL.

High Performance Mark
+2  A: 

There is very interesting stuff on this forum page, which is focused on finding good approximations that are fast: http://www.devmaster.net/forums/showthread.php?t=5784

Disclaimer: Not used any of this stuff myself.

Joel Goodwin
I tried this one as well, and it gave me quite good performance. But sin and cos are computed independently.
Danvil
My feeling is this sine/cosine calculation will be faster than getting sine and using a square root approximation to get cosine, but a test will verify that. The primary relationship between sine and cosine is one of phase; is it possible to code so you could re-use the sine values you calculate for the phase-shifted cosine calls by taking this into account? (This may be a stretch, but had to ask)
Joel Goodwin
Not directly (despite the question asking exactly this). I need sin and cos of a value x and there is no way to know if at some other place I coincidentally computed x+pi/2 ...
Danvil
+2  A: 

For a creative approach, how about expanding the Taylor series? Since they have similar terms, you could do something like the following pseudo:

numerator = x
denominator = 1
sine = x
cosine = 1
op = -1
fact = 1

while (not enough precision) {
    fact++
    denominator *= fact
    numerator *= x

    cosine += op * numerator / denominator

    fact++
    denominator *= fact
    numerator *= x

    sine += op * numerator / denominator

    op *= -1
}

This means you do something like this: starting at x and 1 for sin and cosine, follow the pattern - subtract x^2 / 2! from cosine, subtract x^3 / 3! from sine, add x^4 / 4! to cosine, add x^5 / 5! to sine...

I have no idea whether this would be performant. If you need less precision than the built in sin() and cos() give you, it may be an option.

Tesserex
Actually the i-the sine extension factor is x/i times the i-the cosine extension factor. But I would doubt that using the Taylor series is really fast ...
Danvil
+25  A: 

Modern x86 processors have a fsincos instruction which will do exactly what you're asking - calculate sin and cos at the same time. A good optimizing compiler should detect code which calculates sin and cos for the same value and use the fsincos command to execute this.

It took some twiddling of compiler flags for this to work, but:

$ gcc --version
i686-apple-darwin9-gcc-4.0.1 (GCC) 4.0.1 (Apple Inc. build 5488)
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

$ cat main.c
#include <math.h> 

struct Sin_cos {double sin; double cos;};

struct Sin_cos fsincos(double val) {
  struct Sin_cos r;
  r.sin = sin(val);
  r.cos = cos(val);
  return r;
}

$ gcc -c -S -O3 -ffast-math -mfpmath=387 main.c -o main.s

$ cat main.s
    .text
    .align 4,0x90
.globl _fsincos
_fsincos:
    pushl   %ebp
    movl    %esp, %ebp
    fldl    12(%ebp)
    fsincos
    movl    8(%ebp), %eax
    fstpl   8(%eax)
    fstpl   (%eax)
    leave
    ret $4
    .subsections_via_symbols

Tada, it uses the fsincos instruction!

Chi
This is cool! Could you explain what -mfpmath=387 is doing? And does it also work with MSVC?
Danvil
Note that `-ffast-math` and `-mfpmath` lead to different results in some cases.
Debilski
mfpmath=387 will force gcc to use x87 instructions instead of SSE instructions. I suspect MSVC has similar optimizations and flags, but i don't have MSVC handy to be sure. Using x87 instructions will likely be a detriment to performance in other code though, you should also look at my other answer, to use Intel's MKL.
Chi
My old gcc 3.4.4 from cygwin produces 2 separate calls to `fsin` and `fcos`. :-(
Vlad
Tried with Visual Studio 2008 with highest optimizations enabled. It calls 2 library functions `__CIsin` and `__CIcos`.
Vlad
+2  A: 

If you are willing to use a commercial product, and are calculating a number of sin/cos calculations at the same time (so you can use vectored functions), you should check out Intel's Math Kernel Library.

It has a sincos function

According to that documentation, it averages 13.08 clocks/element on core 2 duo in high accuracy mode, which i think will be even faster than fsincos.

Chi
Similarly, on OSX one can use `vvsincos` or `vvsincosf` from the Accelerate.framework. I believe that AMD has similar functions in their vector library as well.
Stephen Canon
+2  A: 

If you use the GNU C library, then you can do:

#define _GNU_SOURCE
#include <math.h>

and you will get declarations of the sincos(), sincosf() and sincosl() functions that calculate both values together - presumably in the fastest way for your target architecture.

caf
Using g++ 4.4.1 this worked for me even without _GNU_SOURCE.
Danvil
A: 

In old assembly days, I simply store the sin/cos in two 256 arrays.

This is the fastest way if you don't care about the precision too much.

forgot
+1  A: 

Many C math libraries, as caf indicates, already have sincos(). The notable exception is MSVC.

  • Sun has had sincos() since at least 1987 (twenty-three years; I have a hard-copy man page)
  • HPUX 11 had it in 1997 (but isn't in HPUX 10.20)
  • Added to glibc in version 2.1 (Feb 1999)
  • Became a built-in in gcc 3.4 (2004), __builtin_sincos().

And regarding look-up, Eric S. Raymond in the Art of Unix Programming (2004) (Chapter 12) says explicitly this a Bad Idea (at the present moment in time):

"Another example is precomputing small tables--for example, a table of sin(x) by degree for optimizing rotations in a 3D graphics engine will take 365 × 4 bytes on a modern machine. Before processors got enough faster than memory to demand caching, this was an obvious speed optimization. Nowadays it may be faster to recompute each time rather than pay for the percentage of additional cache misses caused by the table.

"But in the future, this might turn around again as caches grow larger. More generally, many optimizations are temporary and can easily turn into pessimizations as cost ratios change. The only way to know is to measure and see." (from the Art of Unix Programming)

But, judging from the discussion above, not everyone agrees.

Joseph Quinsey
+1  A: 

I have posted a solution involving inline ARM assembly capable of computing both the sine and cosine of two angles at a time here: Fast sine/cosine for ARMv7+NEON

jcayzac
+1  A: 

This article shows how to construct a parabolic algorithm that generates both the sine and the cosine:

DSP Trick: Simultaneous Parabolic Approximation of Sin and Cos

http://www.dspguru.com/dsp/tricks/parabolic-approximation-of-sin-and-cos

Probes