I need to call Atan on millions of values per second. Is there a good library that performs this operation in batch, very fast? For example, a library that vectorizes the low-level logic using something like SSE?
I have profiled the application, and I know that this call to Atan is a bottleneck.
I know that there is support for this in OpenCL, but I would prefer to do this operation on the CPU. The target machine might not support OpenCL.
I also looked into using OpenCV, but its accuracy for Atan angles is only ~0.3 degrees. I need accurate results.