9.1 Adaptive Learning Rate Methods
Many of the methods listed here are adaptive learning rate
schemes. As noted in section 6.1, the often recommended learning rate of η = 0.1
is a somewhat arbitrary value that may be completely inappropriate for a given
problem. For one thing, the magnitude of the gradient depends on how the targets
are scaled; for example, the average error will tend to be higher in a network
with linear output nodes and targets in a (-1000,1000) range than in a network
with sigmoid output nodes and targets in (0,1). Also, when sum-of-squares error
is used rather than mean squared error, the size of the error and thus the best
learning rate may depend on the size of the training set [114]. The effective learning rate is amplified by
redundancies such as near duplication of training patterns and correlation
between different elements of the same pattern, and by internal redundancies
such as correlations between hidden unit activities. The latter depend in part
on the size and configuration of the network but change as the network learns so
different learning rates may be appropriate in different parts of the network
and the best values may change as learning progresses.
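To see the dependence on training set size, consider the following sketch, using notation introduced here for illustration: N training patterns indexed by p, output nodes indexed by k, targets t_{pk}, outputs y_{pk}, and mean squared error defined as the sum-of-squares error divided by N. Then
\[
E_{\mathrm{SSE}}(\mathbf{w}) \;=\; \sum_{p=1}^{N} \sum_{k} \bigl(t_{pk} - y_{pk}\bigr)^{2}
\;=\; N\,E_{\mathrm{MSE}}(\mathbf{w}),
\qquad
\nabla E_{\mathrm{SSE}} \;=\; N\,\nabla E_{\mathrm{MSE}},
\]
so the gradient, and therefore the size of a step η∇E, grows linearly with N under sum-of-squares error unless η is reduced accordingly.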
Given the difficulty of choosing a good learning rate a priori, it
makes sense to start with a "safe" value (i.e., small) and adjust it depending
on system behavior. Some methods adjust a single global learning rate, while others assign a separate learning rate to each unit or each weight. Methods
vary, but the general idea is to increase the step size when the error is
decreasing consistently and decrease it when significant error increases occur
(small increases may be tolerated).
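As a concrete illustration, here is a minimal Python sketch of one such global adjustment rule; the growth factor, shrink factor, and tolerance for small error increases are hypothetical choices for illustration, not values given in the text:

    def adapt_global_lr(eta, prev_error, error,
                        up=1.05, down=0.7, tolerance=1.02):
        """Adjust a single global learning rate based on the change in error.

        Heuristic sketch: grow eta gently while the error keeps falling,
        cut it sharply after a significant increase, and tolerate small
        increases (up to tolerance * prev_error) without changing eta.
        """
        if error < prev_error:
            return eta * up        # error decreasing consistently: speed up
        if error > tolerance * prev_error:
            return eta * down      # significant increase: slow down
        return eta                 # small increase: tolerated

    # Hypothetical use inside a training loop, where train_epoch(eta)
    # performs one pass over the data and returns the new error:
    #
    #     error = train_epoch(eta)
    #     eta = adapt_global_lr(eta, prev_error, error)
    #     prev_error = error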
In general, some care is needed to avoid instability. The
best step size depends on the problem and local characteristics of the E(w) surface (Chapter 8). Values that work well for some problems and some
regions of the error space may not work well for others. It has been noted that
neural networks often have error surfaces with many flat areas separated by
steep cliffs. This is especially true for classification problems with small
numbers of samples. As in driving a car, different speeds are reasonable in
different conditions. A large step size is desirable to accelerate progress
across the smooth, flat regions of the error surface, while a small step size is necessary to avoid loss of control at the cliffs. If the step size is not
reduced quickly when the system enters a sensitive region, the result could be a
huge weight change that throws the network into a completely different region
basically at random. Besides causing problems such as paralysis due to
saturation of the sigmoid nonlinearities, this has the undesirable effect of
essentially discarding previous learning and starting over somewhere
else.
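To make the saturation problem concrete, the following Python sketch (with purely illustrative numbers not taken from the text) shows how an oversized weight update drives a single sigmoid unit into its flat tail, where the derivative, and hence any further gradient signal, all but vanishes:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    x = 1.0                        # input to a single sigmoid unit
    for w in (0.5, 50.0):          # modest weight vs. weight after a huge step
        net = w * x
        y = sigmoid(net)
        dy_dnet = y * (1.0 - y)    # derivative of the sigmoid at this point
        print(f"w={w:6.1f}  output={y:.6f}  derivative={dy_dnet:.2e}")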