103

7
+1  Q:

## How to run gradient descent algorithm when parameter space is constrained?

I would like to maximize a function with one parameter.

So I run gradient descent (or, ascent actually): I start with an initial parameter and keep adding the gradient (times some learning rate factor that gets smaller and smaller), re-evaluate the gradient given the new parameter, and so on until convergence.

But there is one problem: My parameter must stay positive, so it is not supposed to become <= 0 because my function will be undefined. My gradient search will sometimes go into such regions though (when it was positive, the gradient told it to go a bit lower, and it overshoots).

And to make things worse, the gradient at such a point might be negative, driving the search toward even more negative parameter values. (The reason is that the objective function contains logs, but the gradient doesn't.)

What are some good (simple) algorithms that deal with this constrained optimization problem? I'm hoping for just a simple fix to my algorithm. Or maybe ignore the gradient and do some kind of line search for the optimal parameter?

+2  A:

Without knowing more about your problem, it's hard to give specific advice. Your gradient ascent algorithm may not be particularly suitable for your function space. However, given that's what you've got, here's one tweak that would help.

You're following what you believe is an ascending gradient. But when you move forwards in the direction of the gradient, you discover you have fallen into a pit of negative value. This implies that there was a nearby local maximum, but also a very sharp negative gradient cliff. The obvious fix is to backtrack to your previous position, and take a smaller step (e.g half the size). If you still fall in, repeat with a still smaller step. This will iterate until you find the local maximum at the edge of the cliff.

The problem is, there is no guarantee that your local maximum is actually global (unless you know more about your function than you are sharing). This is the main limitation of naive gradient ascent - it always fixes on the first local maximum and converges to it. If you don't want to switch to a more robust algorithm, one simple approach that could help is to run n iterations of your code, starting each time from random positions in the function space, and keeping the best maximum you find overall. This Monte Carlo approach increases the odds that you will stumble on the global maximum, at the cost of increasing your run time by a factor n. How effective this is will depend on the 'bumpiness' of your objective function.

+2  A:

A simple trick to restrict a parameter to be positive is to re-parametrize the problem in terms of its logarithm (make sure to change the gradient appropriately). Of course it is possible that the optimum moves to -infty with this transformation, and the search does not converge.

+3  A:
1. Each time you update your parameter, check to see if it's negative, and if it is, clamp it to zero.
2. If clamping to zero is not acceptable, try adding a "log-barrier" (Google it). Basically, it adds a smooth "soft" wall to your objective function (and modifying your gradient) to keep it away from regions you don't want it to go to. You then repeatedly run the optimization by hardening up the wall to make it more infinitely vertical, but starting with the previously found solution. In the limit (in practice only a few iterations are needed), the problem you are solving is identical to the original problem with a hard constraint.
+1 for the log-penalty method
+1  A:

At each step, constrain the parameter to be positive. This is (in short) the projected gradient method you may want to google about.

+1  A:

I have three suggestions, in order of how much thinking and work you want to do.

First, in gradient descent/ascent, you move each time by the gradient times some factor, which you refer to as a "learning rate factor." If, as you describe, this move causes x to become negative, there are two natural interpretations: Either the gradient was too big, or the learning rate factor was too big. Since you can't control the gradient, take the second interpretation. Check whether your move will cause x to become negative, and if so, cut the learning rate factor in half and try again.

Second, to elaborate on Aniko's answer, let x be your parameter, and f(x) be your function. Then define a new function g(x) = f(e^x), and note that although the domain of f is (0,infinity), the domain of g is (-infinity, infinity). So g cannot suffer the problems that f suffers. Use gradient descent to find the value x_0 that maximizes g. Then e^(x_0), which is positive, maximizes f. To apply gradient descent on g, you need its derivative, which is f'(e^x)*e^x, by the chain rule.

Third, it sounds like you're trying maximize just one function, not write a general maximization routine. You could consider shelving gradient descent, and tailoring the method of optimization to the idiosyncrasies of your specific function. We would have to know a lot more about the expected behavior of f to help you do that.

A:

You're getting good answers here. Reparameterizing is what I would recommend. Also, have you considered another search method, like Metropolis-Hastings? It's actually quite simple once you bull through the scary-looking math, and it gives you standard errors as well as an optimum.

metropolis hastings is lightyears away from the original problem.
@Alexandre: The first sentence said the goal was to maximize a function. MH can be easily constrained to avoid a forbidden region by constraining the proposal distribution. Gradients can be noisy and problematic, especially if they are calculated by finite difference or if the function is nearly flat.
MCMC methods (and related stochastic gradient methods) are used in cases where everything else fails. There is no indication that the original problems needs the poor convergence of nondeterministic methods.
+1  A:

Just use Brent's method for minimization. It is stable and fast and the right thing to do if you have only one parameter. It's what the `R` function `optimize` uses. The link also contains a simple C++ implementation. And yes, you can give it MIN and MAX parameter value limits.