+3  A: 

You answered your own question when you said you need to have your learning rate change as the network learns. There are a lot of different ways you can do it.

The simplest way is to reduce the learning rate linearly with the number of iterations: every 25 iterations (or some other arbitrary number), subtract a fixed portion from the rate until it reaches a sensible minimum.
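A minimal sketch of that step-wise linear decay; the function name, the step size, and the floor value are all illustrative choices, not from any particular library:

```python
def linear_step_decay(initial_rate, step, iteration, every=25, min_rate=1e-4):
    """Subtract `step` from the rate once per `every` iterations,
    never letting it fall below `min_rate`."""
    rate = initial_rate - step * (iteration // every)
    return max(rate, min_rate)
```

For example, with `initial_rate=0.5` and `step=0.1`, the rate stays at 0.5 for the first 25 iterations, drops to 0.4, and so on down to the floor.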

You can also do it nonlinearly with the number of iterations. For example, multiply the learning rate by 0.99 every iteration, again until it reaches a sensible minimum.
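The multiplicative version looks like this (again, names and the 0.99 factor are just the example values from above):

```python
def exponential_decay(initial_rate, iteration, factor=0.99, min_rate=1e-4):
    """Multiply the rate by `factor` once per iteration,
    never letting it fall below `min_rate`."""
    return max(initial_rate * factor ** iteration, min_rate)
```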

Or you can get more crafty: use the network's own results to determine its next learning rate. The better it is doing by its fitness metric, the smaller you make the learning rate. That way it converges quickly while the error is still large, then slowly as it approaches a minimum. This is probably the best way, but it's more costly to evaluate than the simple number-of-iterations approaches.
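One way to sketch that idea: scale the rate by how much error remains relative to where training started. The specific mapping (a simple linear scaling) is just one illustrative choice; any monotone function of the fitness metric would fit the description above.

```python
def error_scaled_rate(base_rate, current_error, initial_error, min_rate=1e-4):
    """Shrink the learning rate as the network improves: large steps
    while the error is near its starting value, small steps as it drops."""
    fraction_remaining = current_error / initial_error
    return max(base_rate * fraction_remaining, min_rate)
```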

Welbog
A: 

Perhaps code a negative-feedback loop into the learning algorithm, keyed to the rate: learning-rate values that start to swing too wide hit the moderating part of the feedback loop, which pushes them back the other way, at which point the opposing moderating force kicks in.

The state vector will eventually settle into an equilibrium that strikes a balance between "too much" and "too little". That's how many systems in biology work.
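One concrete instance of such a feedback rule is the "bold driver" heuristic (my naming, not the answer's): gently grow the rate while the error keeps falling, and cut it sharply when the error rises. The growth/shrink factors and bounds below are illustrative.

```python
def feedback_adjust(rate, prev_error, curr_error,
                    grow=1.05, shrink=0.5, min_rate=1e-5, max_rate=1.0):
    """One feedback step: mild positive feedback while improving,
    strong negative feedback when the error overshoots. The clamp
    keeps the rate from swinging outside [min_rate, max_rate]."""
    if curr_error < prev_error:
        rate *= grow
    else:
        rate *= shrink
    return min(max(rate, min_rate), max_rate)
```

Applied repeatedly, the rate oscillates toward an equilibrium: too large a rate makes the error rise, which halves the rate; too small a rate lets the error keep falling, which slowly grows it again.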

Josh E
+5  A: 

Sometimes the process of decreasing the learning rate over time is called "annealing" the learning rate.

There are many possible "annealing schedules", such as having the learning rate decay as the inverse of time:

u(t) = c / t

...where c is some constant. Or there is the "search-then-converge" schedule:

u(t) = u(0) / (1 + t/T)

...which keeps the learning rate nearly constant for the first T iterations, and then decays it. Of course, for both of these approaches you have to tune the parameter c or T, but hopefully introducing them will help more than it hurts. :)
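The two schedules above translate directly into code (function names are mine; the formulas are the ones given):

```python
def inverse_time_schedule(c, t):
    """u(t) = c / t: the rate decays as the inverse of the
    iteration number t (t >= 1)."""
    return c / t

def search_then_converge(u0, t, T):
    """u(t) = u(0) / (1 + t/T): nearly constant for t << T
    (the "search" phase), then decaying roughly like 1/t for
    t >> T (the "converge" phase)."""
    return u0 / (1.0 + t / T)
```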

Some references:

  • Learning Rate Schedules for Faster Stochastic Gradient Search, Christian Darken, Joseph Chang and John Moody, Neural Networks for Signal Processing 2 --- Proceedings of the 1992 IEEE Workshop, IEEE Press, Piscataway, NJ, 1992.
  • A Stochastic Approximation Method, Herbert Robbins and Sutton Monro, Annals of Mathematical Statistics 22, #3 (September 1951), pp. 400–407.
  • Neural Networks and Learning Machines (section 3.13 in particular), Simon S. Haykin, 3rd edition (2008), ISBN 0131471392, 9780131471399
  • Here is a page that briefly discusses learning rate adaptation.
Nate Kohl