views: 3945
answers: 21
I've been profiling an application all day long and, having optimized a couple of bits of code, I'm left with this on my todo list. It's the activation function for a neural network, which gets called over 100 million times. According to dotTrace, it amounts to about 60% of the overall function time.

How would you optimize this?

    public static float Sigmoid(double value) {
        return (float) (1.0 / (1.0 + Math.Pow(Math.E, -value)));
    }
+2  A: 

Idea: Perhaps you can make a (large) lookup table with the values pre-calculated?
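As a sketch of what that might look like in C# (the Scale and Resolution constants here are illustrative assumptions, not from the question; real values would be tuned to the accuracy you need), a clamped, symmetric table can stay as small as a few kilobytes:

```csharp
using System;

// Illustrative lookup-table sigmoid: inputs outside [-Max, Max] are clamped
// to 0 or 1, and the symmetry sigmoid(-x) = 1 - sigmoid(x) halves the table.
// 2048 floats = 8 KB, so no titanic proportions.
public static class SigmoidLut
{
    const float Scale = 320.0f;
    const int Resolution = 2047;
    const float Max = Resolution / Scale;

    static readonly float[] Table = BuildTable();

    static float[] BuildTable()
    {
        var t = new float[Resolution + 1];
        for (int i = 0; i <= Resolution; i++)
            t[i] = (float)(1.0 / (1.0 + Math.Exp(-i / (double)Scale)));
        return t;
    }

    public static float Compute(float value)
    {
        if (value <= -Max) return 0.0f;
        if (value >= Max) return 1.0f;
        return value >= 0.0f
            ? Table[(int)(value * Scale + 0.5f)]
            : 1.0f - Table[(int)(-value * Scale + 0.5f)];
    }
}
```

Because of the symmetry trick, the table only needs to cover non-negative inputs.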

Vilx-
I'll give that a try. Hopefully the table won't grow to titanic proportions.
hb
hb: this could backfire very badly. If you're not sure about the maximum size you'd have to implement a structure with a limited size (kind of like a cache) and this is not a trivial task.
Andrew from NZSG
I know - gave it a quick test, got OutOfMemoryException. soprano's function helped heaps
hb
Lookup time will be faster than a Math.Exp time?
FerranB
Perhaps a table combined with interpolation, using fixed-point math (i.e. scaled integers), would work. In the old days, the 88000 DSP processor supported that in machine code.
Bdoserror
Large lookup has lesser chance of fitting CPU cache
Rinat Abdullin
+39  A: 

Try:

public static float Sigmoid(double value) {
    return 1.0f / (1.0f + (float) Math.Exp(-value));
}

EDIT: I did a quick benchmark. On my machine, the above code is about 43% faster than your method, and this mathematically-equivalent code is the teeniest bit faster (46% faster than the original):

public static float Sigmoid(double value) {
    float k = (float) Math.Exp(value);
    return k / (1.0f + k);
}

EDIT 2: I'm not sure how much overhead C# functions have, but if you can drop down to C/C++ and #include &lt;math.h&gt; in your source code, you should be able to use this, which calls expf, the single-precision exp. It might be a little faster.

float sigmoid(double value) {
    float k = expf((float) value);
    return k / (1.0f + k);
}

Also, if you're doing millions of calls, the function-call overhead might be a problem. Try making an inline function and see if that helps.
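C# has no inline keyword, so this is only a hint to the runtime; one hedged option on .NET 4.5 or later is MethodImplOptions.AggressiveInlining (the JIT often inlines small methods like this on its own anyway):

```csharp
using System;
using System.Runtime.CompilerServices;

public static class Activation
{
    // AggressiveInlining merely asks the JIT to inline this call site;
    // small, non-virtual methods are frequently inlined even without it.
    // Measure either way.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static float Sigmoid(double value)
    {
        return 1.0f / (1.0f + (float)Math.Exp(-value));
    }
}
```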

Ben Alpert
That actually shaved off about 800ms; thanks
hb
Do you know the range that the value parameter could be in? If so, consider generating a lookup table.
Jeremy
You changed the sign of value, my math is rusty but I don't think it's the same thing... you should have Math.Exp(-value) according to the initial code.
Marcel Popescu
@Marcel: Nope, he changed e^-value to 1/(e^value), then added with the 1.0 and swapped numerator/denominator.
lacop
+8  A: 

At 100 million calls, I'd start to wonder whether profiler overhead isn't skewing your results. Replace the calculation with a no-op and see if it is still reported to consume 60% of the execution time...

Or better yet, create some test data and use a stopwatch timer to profile a million or so calls.
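A minimal sketch of that harness (the delegate signature and input sweep are placeholders): time the function over a fixed number of calls with Stopwatch, then run an empty delegate through the same loop and subtract, so profiler instrumentation and call overhead drop out of the comparison.

```csharp
using System;
using System.Diagnostics;

public static class SigmoidBench
{
    // Times f over n calls with synthetic inputs. Comparing against a
    // delegate that just returns 0 isolates the actual sigmoid cost.
    public static double TimeMs(Func<double, float> f, int n)
    {
        float sink = 0.0f; // accumulate so the JIT can't elide the calls
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            sink += f(i * 1e-6 - 0.5);
        sw.Stop();
        GC.KeepAlive(sink);
        return sw.Elapsed.TotalMilliseconds;
    }
}
```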

Shog9
A: 

Doing a Google search, I found an alternative implementation of the Sigmoid function.

public double Sigmoid(double x)
{
   return 2 / (1 + Math.Exp(-2 * x)) - 1;
}

Is that correct for your needs? Is it faster?

http://dynamicnotions.blogspot.com/2008/09/sigmoid-function-in-c.html

Haacked
Giving this a try atm, I'll post back in a minute
hb
That was actually slower
hb
+19  A: 

If it's for an activation function, does it matter terribly much if the calculation of e^x is completely accurate?

For example, if you use the approximation (1+x/256)^256, on my Pentium testing in Java (I'm assuming C# essentially compiles to the same processor instructions) this is about 7-8 times faster than e^x (Math.exp()), and is accurate to 2 decimal places for x up to about +/-1.5, and within the correct order of magnitude across the range you stated. (Obviously, to raise to the 256th power, you actually square the number 8 times -- don't use Math.Pow for this!) In Java:

double eapprox = (1d + x / 256d);
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;

Keep doubling or halving 256 (and adding/removing a multiplication) depending on how accurate you want the approximation to be. Even with n=4, it still gives about 1.5 decimal places of accuracy for values of x between -0.5 and 0.5 (and appears a good 15 times faster than Math.exp()).

P.S. I forgot to mention -- you should obviously not really divide by 256: multiply by a constant 1/256. Java's JIT compiler makes this optimisation automatically (at least, Hotspot does), and I was assuming that C# must do so too.
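For reference, a hedged C# port of the same trick, approximating e^-x directly so it drops straight into the sigmoid (the class name is made up for illustration):

```csharp
using System;

public static class SigmoidApprox256
{
    // Approximates e^-x as (1 - x/256)^256 by squaring 8 times, then feeds
    // the result into 1 / (1 + e^-x). Multiplying by the constant 1/256
    // replaces the division, as the P.S. above suggests. Accuracy degrades
    // as |x| grows, matching the original answer's caveats.
    public static float Sigmoid(float x)
    {
        double e = 1.0 - x * (1.0 / 256.0);
        e *= e; e *= e; e *= e; e *= e;   // (1 - x/256)^16
        e *= e; e *= e; e *= e; e *= e;   // (1 - x/256)^256
        return (float)(1.0 / (1.0 + e));
    }
}
```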

Neil Coffey
Whoa. This lowered it even further!
hb
+4  A: 

First thought: How about some stats on the value variable?

  • Are the values of "value" typically small, i.e. -10 <= value <= 10?

If not, you can probably get a boost by testing for out-of-bounds values:

if(value < -10)  return 0;
if(value > 10)  return 1;
  • Are the values repeated often?

If so, you can probably get some benefit from Memoization (probably not, but it doesn't hurt to check....)

if(sigmoidCache.containsKey(value)) return sigmoidCache.get(value);

If neither of these can be applied, then as some others have suggested, maybe you can get away with lowering the accuracy of your sigmoid...
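A hedged C# rendering of the memoization bullet, using a Dictionary (as the comments below point out, with continuous inputs the hit rate may never pay for the hashing, so measure first):

```csharp
using System;
using System.Collections.Generic;

public static class SigmoidMemo
{
    static readonly Dictionary<double, float> cache =
        new Dictionary<double, float>();

    // Only worthwhile if identical inputs actually recur. Note the cache
    // is unbounded here; a real version would need to cap its size.
    public static float Sigmoid(double value)
    {
        float result;
        if (cache.TryGetValue(value, out result))
            return result;
        result = (float)(1.0 / (1.0 + Math.Exp(-value)));
        cache[value] = result;
        return result;
    }
}
```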

Stobor
Memoization in this case is probably pretty expensive.
Henrik Gustafsson
Henrik: Quite possibly, yes. It may still be worthwhile depending on how often the same value is being passed to the function. I'm not sure how the rest of the algorithm uses this function, but it might re-call it many times, unnecessarily.
Stobor
We're dealing with floats and neural networks, I assume the values will be all over the place :)
Henrik Gustafsson
+1  A: 
Henrik Gustafsson
What you are doing is fixed-point math. I prefer to use a scale factor that is 2^n, like 256, 1024 or 65536. The ideal ones are 2^8 or 2^16; then you can just grab bytes to get at the integer part. You can probably get an even bigger boost if tables are used with fixed point throughout the entire app.
Rex Logan
Well, I intentionally did not use fixed point math. I could, of course, but I would rather avoid that additional source of aliasing.
Henrik Gustafsson
Also I did not use any of the more magic bit-twiddling optimizations since the code is supposed to be 1. readable, 2. easy to translate to C# (where most of it would be useless anyway).
Henrik Gustafsson
Of course, using 32 bit ints everywhere would probably be fastest, but given the nature of the sigmoid function fixed point math might not be the best choice. (I don't like the comment limit :)
Henrik Gustafsson
Another method I have used is to generate a table that is an array of floats and then interpolate between the points, to get rid of the quantization effects. Depending on how you build it, it can still be fast. It might work well for a function like this that is squeezing the range down.
Rex Logan
Interpolation would certainly work very well here, but I don't think it's worth the extra complexity. Using just the 2k data points there was an error of <0.25% within the range of the table. Just outside it jumps up to almost 0.5%, but that could be fixed by adjusting the table coverage.
Henrik Gustafsson
+4  A: 

Soprano had some nice optimizations to your call:

public static float Sigmoid(double value) 
{
    float k = (float) Math.Exp(value);
    return k / (1.0f + k);
}

If you try a lookup table and find it uses too much memory, you could always look at the value of your parameter across successive calls and employ some caching technique.

For example, try caching the last value and result. If the next call has the same value as the previous one, you don't need to calculate it, as you'd have cached the last result. If the current call was the same as the previous call even 1 out of 100 times, you could potentially save yourself 1 million calculations.

Or, you may find that within 10 successive calls, the value parameter is on average the same 2 times, so you could then try caching the last 10 values/answers.
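The single-entry version of that idea might look like this in C# (a last-10 cache would replace the two fields with a small ring buffer):

```csharp
using System;

public static class SigmoidLastValueCache
{
    static double lastValue = double.NaN; // NaN never compares equal, so
    static float lastResult;              // the first call always computes

    // One comparison per hit; only pays off if consecutive calls repeat
    // the same input often enough.
    public static float Sigmoid(double value)
    {
        if (value == lastValue)
            return lastResult;
        lastValue = value;
        lastResult = (float)(1.0 / (1.0 + Math.Exp(-value)));
        return lastResult;
    }
}
```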

Jeremy
A: 

1) Do you call this from only one place? If so, you may gain a small amount of performance by moving the code out of that function and putting it right where you would normally have called the Sigmoid function. I don't like this idea in terms of code readability and organization, but when you need to get every last performance gain, this might help, because I think function calls require a push/pop of registers on the stack, which could be avoided if your code were all inline.

2) I have no idea if this might help, but try making your function parameter a ref parameter. See if it's faster. I would have suggested making it const (which would have been an optimization if this were C++), but C# doesn't support const parameters.

Jeremy
+1  A: 

You might also consider experimenting with alternative activation functions which are cheaper to evaluate. For example:

f(x) = (3x - x**3)/2

(which could be factored as

f(x) = x*(3 - x*x)/2

for one less multiplication). This function has odd symmetry, and its derivative is trivial. Using it for a neural network requires normalizing the sum-of-inputs by dividing by the total number of inputs (limiting the domain to [-1..1], which is also its range).
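A hedged C# sketch of this activation and its derivative (the class name is made up); the cheap derivative is what makes it attractive during backpropagation:

```csharp
public static class CubicActivation
{
    // f(x) = x * (3 - x^2) / 2, intended for x in [-1, 1]; odd-symmetric,
    // with f(-1) = -1, f(0) = 0, f(1) = 1. No exp() anywhere.
    public static float Activate(float x)
    {
        return x * (3.0f - x * x) / 2.0f;
    }

    // f'(x) = (3 - 3x^2) / 2 = 1.5 * (1 - x^2), equally cheap to evaluate.
    public static float Derivative(float x)
    {
        return 1.5f * (1.0f - x * x);
    }
}
```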

joel.neely
+11  A: 
  1. Remember that any changes to this activation function come at the cost of different behavior. This even includes switching to float (and thus lowering the precision) or using activation substitutes. Only experimenting with your use case will show the right way.
  2. In addition to the simple code optimizations, I would also recommend considering parallelization of the computations (i.e., leveraging multiple cores of your machine or even machines in the Windows Azure cloud) and improving the training algorithms.

UPDATE: Post on lookup tables for ANN activation functions

UPDATE2: I removed the point on LUTs since I confused these with complete hashing. Thanks go to Henrik Gustafsson for putting me back on track. So memory is not an issue, although the search space still gets messed up by local extrema a bit.

Rinat Abdullin
I'm already parallelizing parts of the code, some using Parallel.For and some others using PLINQ. So far, it's worked *great*
hb
Yes, I would suspect so. But the whole fun starts when the training algo (I'm using an evolutionary one that allows picking the network structure) is being run on multiple machines))
Rinat Abdullin
Using a LUT for every possible value of the double range would eat all your memory, but if floats are used and some loss of precision is acceptable, a table could fit into less than 4k of memory. Your statement 1 is false.
Henrik Gustafsson
What loss of precision do you have in mind?
hb
If an error of ~0.1% in the activation function can be accepted, the C-example I posted below would work if you fix it up a little (first two points in the TODO should do it :)
Henrik Gustafsson
I'll give it a try. I'm also trying to minimize memory usage (not related to your post) - apparently .NET thinks it's a nice idea to pass a huge around array... blah..
hb
I wouldn't say 4k is a lot either way, as it is only once per process :) Why I tried for 4k is of course to keep within a page boundary. That is easy to guarantee in C, but harder in C# I suppose; YMMV on that.
Henrik Gustafsson
4k is a rather mild understatement)) http://rabdullin.com/journal/2009/1/5/caching-activation-function-is-not-worth-it.html
Rinat Abdullin
You're not seriously using a hash-table? That version has already been discussed and rejected several times here already. Did you even read my code? Oh, and read this while you're at it: http://en.wikipedia.org/wiki/Lookup_table
Henrik Gustafsson
Aliasing might just break stuff badly, but there might be ways of fixing that, and the gains of a LUT are potentially big and usually worth at least looking into. As it would be used often, it would almost certainly be hot in the cache.
Henrik Gustafsson
Hm... mea culpa. I initially confused LUT with complete hashing. But still, using this kind of approach just did not work well in my scenarios and resulted in worse training results.
Rinat Abdullin
"Worse training results" == it takes more attempts to get a reasonably good training result.
Rinat Abdullin
I posted a new answer with a C# implementation, and I also included some links to papers dealing with the errors introduced. I haven't had time to read them all, but in general they seem to say that quantization has good and bad features, and most are manageable using appropriate techniques.
Henrik Gustafsson
+4  A: 

If you're able to interop with C++, you could consider storing all the values in an array and loop over them using SSE like this:

void sigmoid_sse(float *a_Values, float *a_Output, size_t a_Size){
 __m128* l_Output = (__m128*)a_Output;
 __m128* l_Start  = (__m128*)a_Values;
 __m128* l_End    = (__m128*)(a_Values + a_Size);

 const __m128 l_One        = _mm_set_ps1(1.f);
 const __m128 l_Half       = _mm_set_ps1(1.f / 2.f);
 const __m128 l_OneOver6   = _mm_set_ps1(1.f / 6.f);
 const __m128 l_OneOver24  = _mm_set_ps1(1.f / 24.f);
 const __m128 l_OneOver120 = _mm_set_ps1(1.f / 120.f);
 const __m128 l_OneOver720 = _mm_set_ps1(1.f / 720.f);
 const __m128 l_MinOne     = _mm_set_ps1(-1.f);

 for(__m128 *i = l_Start; i < l_End; i++){
  // 1.0 / (1.0 + Math.Pow(Math.E, -value))
  // 1.0 / (1.0 + Math.Exp(-value))

  // value = *i so we need -value
  __m128 value = _mm_mul_ps(l_MinOne, *i);

  // exp expressed as infinite series: 1 + x + (x ^ 2 / 2!) + (x ^ 3 / 3!) ...
  __m128 x = value;

  // result in l_Exp
  __m128 l_Exp = l_One; // = 1

  l_Exp = _mm_add_ps(l_Exp, x); // += x

  x = _mm_mul_ps(x, x); // = x ^ 2
  l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_Half, x)); // += (x ^ 2 * (1 / 2))

  x = _mm_mul_ps(value, x); // = x ^ 3
  l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver6, x)); // += (x ^ 3 * (1 / 6))

  x = _mm_mul_ps(value, x); // = x ^ 4
  l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver24, x)); // += (x ^ 4 * (1 / 24))

#ifdef MORE_ACCURATE

  x = _mm_mul_ps(value, x); // = x ^ 5
  l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver120, x)); // += (x ^ 5 * (1 / 120))

  x = _mm_mul_ps(value, x); // = x ^ 6
  l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver720, x)); // += (x ^ 6 * (1 / 720))

#endif

  // we've calculated exp of -i
  // now we only need to do the '1.0 / (1.0 + ...' part
  *l_Output++ = _mm_rcp_ps(_mm_add_ps(l_One,  l_Exp));
 }
}

However, remember that the arrays you'll be using should be allocated using _aligned_malloc(some_size * sizeof(float), 16) because SSE requires memory aligned to a boundary.

Using SSE, I can calculate the result for all 100 million elements in around half a second. However, allocating that much memory at a time will cost you nearly two-thirds of a gigabyte, so I'd suggest processing more but smaller arrays at a time. You might even want to consider a double-buffering approach with 100K elements or more.

Also, if the number of elements starts to grow considerably you might want to choose to process these things on the GPU (just create a 1D float4 texture and run a very trivial fragment shader).

Jasper Bekkers
+4  A: 

Off the top of my head, this paper explains a way of approximating the exponential by abusing floating point (click the link in the top right for the PDF), but I don't know if it'll be of much use to you in .NET.

Also, another point: for the purpose of training large networks quickly, the logistic sigmoid you're using is pretty terrible. See section 4.4 of Efficient BackProp by LeCun et al. and use something zero-centered (actually, read that whole paper; it's immensely useful).

dwf
+1  A: 

A mild variation on Soprano's theme:

public static float Sigmoid(double value) {
    float v = (float) value;
    float k = (float) Math.Exp(v);
    return k / (1.0f + k);
}

Since you're only after a single-precision result, why make the Math.Exp function calculate a double? Any exponent calculator that uses an iterative summation (see the expansion of e^x) will take longer for more precision, each and every time. And double is twice the work of single! This way, you convert to single first, then do your exponential.

But the expf function should be faster still. I don't see the need for soprano's (float) cast in passing to expf though, unless C# doesn't do implicit float-double conversion.

Otherwise, just use a real language, like FORTRAN...

Phil H
When did Math.Exp take floats?
Jimmy
+12  A: 

Have a look at this post. It has an approximation for e^x written in Java; this should be the C# code for it (untested):

public static double Exp(double val) {  
    long tmp = (long) (1512775 * val + (1072693248 - 60801));  
    return BitConverter.Int64BitsToDouble(tmp << 32);  
}

In my benchmarks this is more than 5 times faster than Math.exp() (in Java). The approximation is based on the paper "A Fast, Compact Approximation of the Exponential Function", which was developed exactly for use in neural nets. It is basically the same as a lookup table of 2048 entries with linear interpolation between them, but all done with IEEE floating-point tricks.
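Plugging that Exp into the sigmoid itself might look like this (a hedged sketch using the folded constant 1072632447, i.e. 1072693248 - 60801; expect the same few-percent error as the underlying approximation):

```csharp
using System;

public static class FastSigmoid
{
    // Uses the bit-manipulation Exp above with the sign flipped, so this
    // computes an approximate 1 / (1 + e^-value). The error of the
    // approximation carries straight through to the activation output.
    public static float Sigmoid(double value)
    {
        long tmp = (long)(1512775 * -value + 1072632447);
        double e = BitConverter.Int64BitsToDouble(tmp << 32);
        return (float)(1.0 / (1.0 + e));
    }
}
```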

EDIT: According to hb this speeds up Henrik's Sigmoid1/Sigmoid2 by ~50% (900ms vs 500ms). Thanks!

martinus
Curious: Can you simplify (1072693248 - 60801) as 1072632447? Also, can you take it outside the long cast, so it's not added to a double, for speed? Does this affect accuracy and/or performance?
strager
(Hindsight: I realize the subtraction is probably optimized by the compiler, but worth a shot regardless.)
strager
@strager right, this is surely optimized by the compiler. I left it like this because it was partly how the formula was developed, but you can replace it with 1072632447.
martinus
@martinus, Have you tried the other technique? Rewrite as: long tmp = (long)(1512775 * val) + 1072632447;
strager
@strager Nope I have not tried this
martinus
Using this in Henrik's Sigmoid1/Sigmoid2 post offers a ~50% improvement (900ms vs 500ms). Haven't checked accuracy though.
hb
A: 

If you need a giant speed boost, you could probably look into parallelizing the function using the (Ge)Force. In other words, use DirectX to get the graphics card to do it for you. I have no idea how to do this, but I've seen people use graphics cards for all kinds of calculations.

erikkallen
or CUDA perhaps.
Jimmy
+5  A: 
Henrik Gustafsson
Thanks for the update. Note that you might want to pick a different error measurement, since otherwise changing -6.0f; x < 6.0f to 7f kicks the error up to 100%.
Rinat Abdullin
Additionally compiling your code in release mode and running it without debugger attached yielded results: 1195ms vs 41ms. That's more than 10 times faster))
Rinat Abdullin
But fixing Sigmoid1 brought the speed advantage down to 10x. Additionally, it is possible to improve Sigmoid2 by 2ms by saving an intermediate value. See: http://rabdullin.com/journal/2009/1/5/caching-activation-function-is-not-worth-it.html
Rinat Abdullin
I didn't bother getting smart there. I thought it better to keep that one real simple and close to http://en.wikipedia.org/wiki/Approximation_error. It would take something completely different to handle v1=0 well, but that is an inherent and well-known weakness of that measurement.
Henrik Gustafsson
About the 'local extrema': one way we used to get rid of the effects of that in a project (completely unrelated to machine learning) was to add some noise to the signal. I think adding that touch of nondeterminism would help the learning algorithm avoid getting stuck there.
Henrik Gustafsson
Re noise in training - yeah, that's a known trick. I've heard of it being used in training hardware ANNs (real-time environment analysis and reaction) with some decent results.
Rinat Abdullin
PS: F# implementation posted. It's even faster.
Rinat Abdullin
+4  A: 

F# has better performance than C# in .NET math algorithms, so rewriting the neural network in F# might improve the overall performance.

If we re-implement the LUT benchmarking snippet (I've been using a slightly tweaked version) in F#, then the resulting code:

  • executes the sigmoid1 benchmark in 588.8ms instead of 3899.2ms
  • executes the sigmoid2 (LUT) benchmark in 156.6ms instead of 411.4ms

More details can be found in the blog post. Here's the F# snippet JIC:

#light

let Scale = 320.0f;
let Resolution = 2047;

let Min = -single(Resolution)/Scale;
let Max = single(Resolution)/Scale;

let range step a b =
  let count = int((b-a)/step);
  seq { for i in 0 .. count -> single(i)*step + a };

let lut = [| 
  for x in 0 .. Resolution ->
    single(1.0/(1.0 +  exp(-double(x)/double(Scale))))
  |]

let sigmoid1 value = 1.0f/(1.0f + exp(-value));

let sigmoid2 v = 
  if (v <= Min) then 0.0f;
  elif (v>= Max) then 1.0f;
  else
    let f = v * Scale;
    if (v>0.0f) then lut.[int (f + 0.5f)]
    else 1.0f - lut.[int(0.5f - f)];

let getError f = 
  let test = range 0.00001f -10.0f 10.0f;
  let errors = seq { 
    for v in test -> 
      abs(sigmoid1(single(v)) - f(single(v)))
  }
  Seq.max errors;

open System.Diagnostics;

let test f = 
  let sw = Stopwatch.StartNew(); 
  let mutable m = 0.0f;
  let result = 
    for t in 1 .. 10 do
      for x in 1 .. 1000000 do
        m <- f(single(x)/100000.0f-5.0f);
  sw.Elapsed.TotalMilliseconds;

printf "Max deviation is %f\n" (getError sigmoid2)
printf "10^7 iterations using sigmoid1: %f ms\n" (test sigmoid1)
printf "10^7 iterations using sigmoid2: %f ms\n" (test sigmoid2)

let c = System.Console.ReadKey(true);

And the output (Release compilation against F# 1.9.6.2 CTP with no debugger):

Max deviation is 0.001664
10^7 iterations using sigmoid1: 588.843700 ms
10^7 iterations using sigmoid2: 156.626700 ms

UPDATE: updated the benchmarking to use 10^7 iterations to make the results comparable with C

UPDATE2: here are the performance results of the C implementation from the same machine to compare with:

Max deviation is 0.001664
10^7 iterations using sigmoid1: 628 ms
10^7 iterations using sigmoid2: 157 ms
Rinat Abdullin
I'll update the C code with time-keeping, but I can't do F#, and our machines differ so much that I think it's better if you ran the test. Stay tuned.
Henrik Gustafsson
C code updated...have a go :)
Henrik Gustafsson
Hey, you're doing 10^7 iterations, are you not?
Henrik Gustafsson
Duh, yes. I've fixed this typo in the blog post and forgot about this snippet. Thanks. Re the C snippet, I'll run it on my machine a bit later. Just need to get a C compiler.
Rinat Abdullin
hm... 630ms; 158ms for your code.
Rinat Abdullin
Basically the same numbers as for the managed code, sweet!
Henrik Gustafsson
Note to self, brush up on F#
hb
These were my numbers in mono: 10^7 iterations using sigmoid1: 1661.244000 ms; 10^7 iterations using sigmoid2: 732.762000 ms
Henrik Gustafsson
+2  A: 

This is slightly off topic, but just out of curiosity, I wrote the same implementation as the ones in C, C# and F# in Java. I'll just leave this here in case someone else is curious.

Result:

$ javac LUTTest.java && java LUTTest
Max deviation is 0.001664
10^7 iterations using sigmoid1() took 1398 ms
10^7 iterations using sigmoid2() took 177 ms

I suppose the improvement over C# in my case is due to Java being better optimized than Mono for OS X. On a comparable MS .NET implementation (vs. Java 6, if someone wants to post comparative numbers), I suppose the results would be different.

Code:

public class LUTTest {
    private static final float SCALE = 320.0f;
    private static final  int RESOLUTION = 2047;
    private static final  float MIN = -RESOLUTION / SCALE;
    private static final  float MAX = RESOLUTION / SCALE;

    private static final float[] lut = initLUT();

    private static float[] initLUT() {
        float[] lut = new float[RESOLUTION + 1];

        for (int i = 0; i < RESOLUTION + 1; i++) {
            lut[i] = (float)(1.0 / (1.0 + Math.exp(-i / SCALE)));
        }
        return lut;
    }

    public static float sigmoid1(double value) {
        return (float) (1.0 / (1.0 + Math.exp(-value)));
    }

    public static float sigmoid2(float value) {
        if (value <= MIN) return 0.0f;
        if (value >= MAX) return 1.0f;
        if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
        return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
    }

    public static float error(float v0, float v1) {
        return Math.abs(v1 - v0);
    }

    public static float testError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x+= 0.00001f) {
            float v0 = sigmoid1(x);
            float v1 = sigmoid2(x);
            float e = error(v0, v1);
            if (e > emax) emax = e;
        }
        return emax;
    }

    public static long sigmoid1Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                y = sigmoid1(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("",y); // reference y so the JIT can't eliminate the loop
        return t1 - t0;
    }    

    public static long sigmoid2Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                y = sigmoid2(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("",y); // reference y so the JIT can't eliminate the loop
        return t1 - t0;
    }    

    public static void main(String[] args) {

        System.out.printf("Max deviation is %f\n", testError());
        System.out.printf("10^7 iterations using sigmoid1() took %d ms\n", sigmoid1Perf());
        System.out.printf("10^7 iterations using sigmoid2() took %d ms\n", sigmoid2Perf());
    }
}
Henrik Gustafsson
You just do not give up))
Rinat Abdullin
Just want it to be complete :)
Henrik Gustafsson
+3  A: 

FWIW, here are my C# benchmarks for the answers already posted. (Empty is a function that just returns 0, to measure the function-call overhead.)

Empty Function:       79ms   0
Original:             1576ms 0.7202294
Simplified: (soprano) 681ms  0.7202294
Approximate: (Neil)   441ms  0.7198783
Bit Manip: (martinus) 836ms  0.72318
Taylor: (Rex Logan)   261ms  0.7202305
Lookup: (Henrik)      182ms  0.7204863

public static object[] Time(Func<double, float> f) {
    var testvalue = 0.9456;
    var sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 1e7; i++)
        f(testvalue);
    return new object[] { sw.ElapsedMilliseconds, f(testvalue) };
}
public static void Main(string[] args) {
    Console.WriteLine("Empty:       {0,10}ms {1}", Time(Empty));
    Console.WriteLine("Original:    {0,10}ms {1}", Time(Original));
    Console.WriteLine("Simplified:  {0,10}ms {1}", Time(Simplified));
    Console.WriteLine("Approximate: {0,10}ms {1}", Time(ExpApproximation));
    Console.WriteLine("Bit Manip:   {0,10}ms {1}", Time(BitBashing));
    Console.WriteLine("Taylor:      {0,10}ms {1}", Time(TaylorExpansion));
    Console.WriteLine("Lookup:      {0,10}ms {1}", Time(LUT));
}
Jimmy
Good stuff! Since F# is .NET, do you think it would be possible to include that as well? http://research.microsoft.com/en-us/downloads/6f48a466-4294-4973-9e15-25e0ddff422f/
Henrik Gustafsson
A: 

There are a lot of good answers here. I would suggest running it through this technique, just to make sure that:

  • You're not calling it any more times than you need to.
    (Sometimes functions get called more than necessary, just because they are so easy to call.)
  • You're not calling it repeatedly with the same arguments
    (where you could use memoization)

BTW, the function you have is the inverse logit function, or the inverse of the log-odds function log(f/(1-f)).

Mike Dunlavey