views: 2897
answers: 8

I'm trying to do some parameter estimation and want to choose parameter estimates that minimize the square error in a predicted equation over about 30 variables. If the equation were linear, I would just compute the 30 partial derivatives, set them all to zero, and use a linear-equation solver. But unfortunately the equation is nonlinear and so are its derivatives.

If the equation were over a single variable, I would just use Newton's method (also known as Newton-Raphson). The Web is rich in examples and code to implement Newton's method for functions of a single variable.

Given that I have about 30 variables, how can I program a numeric solution to this problem using Newton's method? I have the equation in closed form and can compute the first and second derivatives, but I don't know quite how to proceed from there. I have found a large number of treatments on the web, but they quickly get into heavy matrix notation. I've found something moderately helpful on Wikipedia, but I'm having trouble translating it into code.

Where I'm worried about breaking down is in the matrix algebra and matrix inversions. I can invert a matrix with a linear-equation solver but I'm worried about getting the right rows and columns, avoiding transposition errors, and so on.

To be quite concrete:

  • I want to work with tables mapping variables to their values. I can write a function that takes such a table and returns the square error. I can also create functions that return the partial derivative with respect to any given variable.

  • I have a reasonable starting estimate for the values in the table, so I'm not worried about convergence.

  • I'm not sure how to write the loop that takes an estimate (a table of values, one per variable), the function, and a table of partial-derivative functions, and produces a new estimate.

That last is what I'd like help with. Any direct help or pointers to good sources will be warmly appreciated.


Edit: Since I have the first and second derivatives in closed form, I would like to take advantage of them and avoid more slowly converging methods like simplex searches.

+2  A: 

You might be able to find what you need at the Numerical Recipes in C web page. There is a free version available online. Here (PDF) is the chapter containing the Newton-Raphson method implemented in C. You may also want to look at what is available at Netlib (LINPACK, et al.).

tvanfosson
I'm sure you mean well, but this chapter describes only how to implement Newton-Raphson over functions of one variable, which I already know how to do. My problem is multiple variables.
Norman Ramsey
Addendum: Section 9.6 does discuss multiple variables, but it is exactly the sort of thing that has given me trouble elsewhere: a bunch of linear algebra without enough hints about how to translate a real problem into matrices.
Norman Ramsey
+2  A: 

As an alternative to using Newton's method, the Simplex Method of Nelder-Mead is ideally suited to this problem and is referenced in Numerical Recipes in C.

Rob

RobS
The Nelder-Mead method is *not* well suited because (a) it requires me to conjure an initial simplex from whole cloth and (b) it fails to exploit my lovely first and second derivatives.
Norman Ramsey
Since you say you have reasonable starting values for the parameters, "conjuring" an initial simplex shouldn't be too hard, should it? As for (b), well, that's true; it depends how wedded you are to those derivatives.
RobS
Conjuring an initial simplex is just a matter of generating points like (x+1, y), (x, y+1). However, Nelder-Mead can be slow. If your function has nice smooth gradients and few false minima, Newton should work pretty well. Watch out for a singular Hessian, though.
Mike Dunlavey
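To make the initial-simplex point above concrete, here is a minimal sketch in plain Lua; the function name initial_simplex and the perturbation size h are invented for illustration. It builds the n+1 starting vertices for Nelder-Mead by perturbing one coordinate of the starting guess per vertex, as the comment above suggests.

local function initial_simplex(x0, h)
  h = h or 1.0                       -- perturbation size; tune to the problem's scale
  local simplex = { x0 }             -- the starting guess is the first vertex
  for i = 1, #x0 do
    local v = {}
    for j = 1, #x0 do v[j] = x0[j] end
    v[i] = v[i] + h                  -- vertex i+1: bump coordinate i
    simplex[#simplex + 1] = v
  end
  return simplex                     -- n+1 vertices in n dimensions
end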
+1  A: 

Since you already have the partial derivatives, how about a general gradient-descent approach?

doppelfish
I would strongly discourage this; it converges far slower than using Newton or conjugate gradients. Unless the "bowl" of the local minimum is spherical (i.e. has the same curvature along all 30 dimensions), the direction of the local gradient is likely to be very different from the direction from the current parameters to the minimum. Furthermore, there's a whole lot more guesswork in terms of stepsize when doing simple gradient descent.
SuperElectric
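For comparison, here is a minimal sketch of one plain gradient-descent step in Lua; the names gradient_step, grad, and rate are hypothetical. 'point' and 'grad' are tables keyed by variable name, as in the question, and 'rate' is a hand-tuned step size. The comment above explains why this tends to converge much more slowly than Newton's method.

local function gradient_step(point, grad, rate)
  -- grad[x] is the partial-derivative function with respect to variable x
  local next_point = {}
  for x, dfdx in pairs(grad) do
    next_point[x] = point[x] - rate * dfdx(point)   -- move downhill along each coordinate
  end
  return next_point
end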
+4  A: 

The Numerical Recipes link was most helpful. I wound up symbolically differentiating my error estimate to produce 30 partial derivatives, then used Newton's method to set them all to zero. Here are the highlights of the code:

__doc.findzero = [[function(functions, partials, point, [epsilon, steps]) returns table, boolean
Where
  point       is a table mapping variable names to real numbers 
              (a point in N-dimensional space)
  functions   is a list of functions, each of which takes a table like
              point as an argument
  partials    is a list of tables; partials[i].x is the partial derivative
              of functions[i] with respect to 'x'
  epsilon     is a number that says how close to zero we're trying to get
  steps       is max number of steps to take (defaults to infinity)
  result      is a table like 'point', plus a boolean that says 'converged'
]]

-- See Numerical Recipes in C, Section 9.6 [http://www.nrbook.com/a/bookcpdf.php]

function findzero(functions, partials, point, epsilon, steps)
  epsilon = epsilon or 1.0e-6
  steps = steps or 1/0
  assert(#functions > 0)
  assert(table.numpairs(partials[1]) == #functions,
         'number of functions not equal to number of variables')
  local equations = { }
  repeat
    if Linf(functions, point) <= epsilon then
      return point, true
    end
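    -- Newton step: linearize each function about the current point,
    --   f_i(point + delta) ~ f_i(point) + sum over x of df_i/dx(point) * delta_x = 0,
    -- building one linear equation per function; the lineq variables stand for the delta_x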
    for i = 1, #functions do
      local F = functions[i](point)
      local zero = F
      for x, partial in pairs(partials[i]) do
        zero = zero + lineq.var(x) * partial(point)
      end
      equations[i] = lineq.eqn(zero, 0)
    end
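    -- solve the linear system for the Newton step and move the point by that step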
    local delta = table.map(lineq.tonumber, lineq.solve(equations, {}).answers)
    point = table.map(function(v, x) return v + delta[x] end, point)
    steps = steps - 1
  until steps <= 0
  return point, false
end


function Linf(functions, point)
  -- distance using L-infinity norm
  assert(#functions > 0)
  local max = 0
  for i = 1, #functions do
    local z = functions[i](point)
    max = math.max(max, math.abs(z))
  end
  return max
end
Norman Ramsey
If your original problem is a minimization, you're probably better off solving the minimization problem than finding the zeros of the partial derivatives. The minimization is probably more stable.
John D. Cook
Yes, agreed completely. Suggest anything simple to code?
Norman Ramsey
A: 

Maybe you think you have a good-enough solution, but for me, the easiest way to think about this is to understand it in the 1-variable case first, and then extend it to the matrix case.

In the 1-variable case, if you divide the first derivative by the second derivative, you get the (negated) step to your next trial point, i.e. the step is -V/A.

In the N-variable case, the first derivative is a vector and the second derivative is a matrix (the Hessian). You multiply the derivative vector by the inverse of the second-derivative matrix, and the result is the (negated) step vector to your next trial point, i.e. the step is -V*(1/A).

I assume you can get the 2nd-derivative Hessian matrix. You will need a routine to invert it. There are plenty of these around in various linear algebra packages, and they are quite fast.

(For readers who are not familiar with this idea, suppose the two variables are x and y, and the surface is v(x,y). Then the first derivative is the vector:

 V = [ dv/dx, dv/dy ]

and the second derivative is the matrix:

 A = [ dV/dx ]
     [ dV/dy ]

or:

 A = [ d(dv/dx)/dx, d(dv/dy)/dx ]
     [ d(dv/dx)/dy, d(dv/dy)/dy ]

or:

 A = [ d^2v/dx^2,  d^2v/dydx ]
     [ d^2v/dxdy,  d^2v/dy^2 ]

which is symmetric.)

If the surface is parabolic (constant 2nd derivative), it will get to the answer in 1 step. On the other hand, if the 2nd derivative is far from constant, you could encounter oscillation. Cutting each step in half (or by some fraction) should make it stable.

If N == 1, you'll see that it does the same thing as in the 1-variable case.

Good luck.

Added: You wanted code:

double X[N];
// Set X to initial estimate
int done = 0;
while (!done) {
    double V[N];    // 1st derivative "velocity" vector
    double A[N*N];  // 2nd derivative "acceleration" matrix
    double A1[N*N]; // inverse of A
    double S[N];    // step vector
    CalculateFirstDerivative(V, X);
    CalculateSecondDerivative(A, X);
    // A1 = 1/A
    GetMatrixInverse(A, A1);
    // S = V*(1/A)
    VectorTimesMatrix(V, A1, S);
    // if S is small enough, set done = 1 and stop
    // X -= S
    VectorMinusVector(X, S, X);
}
Mike Dunlavey
Why invert the matrix when the linear equations can be solved instead?
Dmitriy Matveev
@Dmitriy: Usually nonlinear optimization problems are solved by a series of steps, each assuming the problem is locally linear. The vector to the trial solution is the 1st derivative divided by the 2nd derivative; that's why the inverse appears.
Mike Dunlavey
@Mike: I believe Dmitry is saying (quite correctly) that you should use a linear solver instead of your numerically-fragile explicit matrix inversion.
Jon Harrop
@Jon: OK, thanks.
Mike Dunlavey
A: 

Multivariate minimization is not a trivial task in the general case.

J.F. Sebastian
my 0x100 answer :)
J.F. Sebastian
If Knuth paid you a penny for every answer, you'd have one hexadecimal dollar! http://en.wikipedia.org/wiki/Knuth_reward_check
A. Rex
Note that this is a sum of squares problem and, therefore, has a unique global extremum. That simplifies things to the point of being trivial.
Jon Harrop
@Jon: If the model equations were linear, I would agree with you. In pharmacometrics where I work, they are usually not linear (as in this case). There may be a unique global maximum of the log-likelihood, but don't you still need an iterative (or monte-carlo) method to find it?
Mike Dunlavey
@Mike: You do still need an iterative solver, yes, but the general case is only complicated by things like restricted domains or first- or second-order discontinuities that lead to noninvertible Hessians and so forth. This particular problem cannot suffer from any of those complications so it really is trivial to solve. Moreover, the complexity of this case (only 30 variables) is so trivial that computational complexity is irrelevant: any minimizer should solve this problem accurately in the blink of an eye.
Jon Harrop
For example, Newton's method x <- x - f''^-1 * f' will solve this in a few iterations.
Jon Harrop
@Jon: Yeah. Usually we are dealing with mixed-effect models, non-Gaussian observations, sparse data, and users specifying more parameters than the data can estimate. For single-subject rich data, with purely Gaussian observations, the only really tricky issue is getting the right weighting for the observations.
Mike Dunlavey
A: 

My suggestion is to use a stochastic optimizer, e.g., a Particle Swarm method.

Paul Nathan
A: 

You are asking for a function-minimization algorithm. There are two main classes: local and global. Your problem is least squares, so both local and global minimization algorithms should converge to the same unique solution. Local minimization is far more efficient than global, so select that.

There are many local minimization algorithms, but one particularly well suited to least-squares problems is Levenberg-Marquardt. If you don't have such a solver to hand (e.g. from MINPACK), then you can probably get away with Newton's method:

x <- x - (hessian x)^-1 * grad x

where you compute the inverse matrix multiplied by a vector using a linear solver (rather than forming the inverse explicitly).

Jon Harrop
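
To make the update above concrete, here is a minimal self-contained sketch in plain Lua; the names solve, newton_min, grad, and hess are invented for illustration, and the parameters are plain arrays rather than the named tables used in the accepted answer. It computes the Newton step by solving H * step = grad with Gaussian elimination and partial pivoting, rather than forming H^-1 explicitly, as the comments above recommend.

-- solve(A, b): solve the n-by-n linear system A*x = b by Gaussian
-- elimination with partial pivoting (A is an array of rows, b an array)
local function solve(A, b)
  local n = #b
  local M, y = {}, {}
  for i = 1, n do                    -- work on copies so the caller's data survive
    M[i] = {}
    for j = 1, n do M[i][j] = A[i][j] end
    y[i] = b[i]
  end
  for k = 1, n do
    local p = k                      -- choose the largest pivot in column k
    for i = k + 1, n do
      if math.abs(M[i][k]) > math.abs(M[p][k]) then p = i end
    end
    M[k], M[p] = M[p], M[k]
    y[k], y[p] = y[p], y[k]
    for i = k + 1, n do              -- eliminate column k below the pivot
      local f = M[i][k] / M[k][k]
      for j = k, n do M[i][j] = M[i][j] - f * M[k][j] end
      y[i] = y[i] - f * y[k]
    end
  end
  local x = {}
  for i = n, 1, -1 do                -- back substitution
    local s = y[i]
    for j = i + 1, n do s = s - M[i][j] * x[j] end
    x[i] = s / M[i][i]
  end
  return x
end

-- newton_min(grad, hess, x): iterate x <- x - H(x)^-1 * grad(x), where the
-- step comes from the linear solve above instead of a matrix inverse
local function newton_min(grad, hess, x, epsilon, steps)
  epsilon = epsilon or 1.0e-6
  steps = steps or 100
  for _ = 1, steps do
    local g = grad(x)
    local maxg = 0
    for i = 1, #g do maxg = math.max(maxg, math.abs(g[i])) end
    if maxg <= epsilon then return x, true end
    local step = solve(hess(x), g)   -- Newton step: H * step = grad
    for i = 1, #x do x[i] = x[i] - step[i] end
  end
  return x, false
end

-- Example: minimize f(x, y) = (x - 1)^2 + 10 * (y - 2)^2
local xmin, ok = newton_min(
  function(x) return { 2 * (x[1] - 1), 20 * (x[2] - 2) } end,  -- gradient
  function(x) return { { 2, 0 }, { 0, 20 } } end,              -- Hessian
  { 0, 0 })
print(xmin[1], xmin[2], ok)          --> 1  2  true

Because the example objective is quadratic, the loop converges in a single Newton step, which is a handy sanity check. For a genuinely nonlinear least-squares problem, damping the step (as suggested above) or switching to Levenberg-Marquardt makes the iteration more robust.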