5.4.1 Momentum
A common modification of the basic weight update rule is the addition of a momentum term. The idea is to stabilize the weight trajectory by making the weight change a combination of the gradient-descent term in equation 5.23 plus a fraction of the previous weight change. The modified weight change formula is
\[
\Delta w(t) = -\eta \, \frac{\partial E}{\partial w}(t) + \alpha \, \Delta w(t-1)
\tag{5.25}
\]
That is, the weight change Δw(t) is a combination of a step down the gradient plus a fraction 0 ≤ α < 1 of the previous weight change. Typical values are 0 ≤ α < 0.9.
This gives the system a certain amount of inertia, since the weight vector will tend to continue moving in the same direction unless opposed by the gradient term. Effects of momentum are considered in more detail in section 6.2. Briefly, momentum tends to damp oscillations in the weight trajectory and to accelerate learning in regions where ∂E/∂w is small.
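The following is a minimal sketch of the momentum update of equation 5.25 in Python with NumPy. The names grad_E, eta, and alpha are illustrative assumptions, not from the text; grad_E stands for whatever routine computes ∂E/∂w for the network.

    import numpy as np

    def momentum_step(w, delta_w_prev, grad_E, eta=0.1, alpha=0.9):
        # One weight update with momentum (equation 5.25):
        # Δw(t) = -η ∂E/∂w(t) + α Δw(t-1)
        delta_w = -eta * grad_E(w) + alpha * delta_w_prev
        return w + delta_w, delta_w

    # Hypothetical usage with the quadratic error E(w) = 0.5 ||w||^2,
    # for which ∂E/∂w = w:
    w = np.array([1.0, -2.0])
    delta_w = np.zeros_like(w)
    for _ in range(100):
        w, delta_w = momentum_step(w, delta_w, grad_E=lambda v: v)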
5.4.2 Weight Decay
Another common modification of the weight update rule is the
addition of a weight decay term. Weight decay
is sometimes used to help adjust the complexity of the network to the difficulty
of the problem. The idea is that if the network is overly complex, then it
should be possible to delete many weights without increasing the error
significantly. One way to do this is to give the weights a tendency to drift to
zero by reducing their magnitudes slightly at each iteration. The update rule
with weight decay is then
\[
w_i(t+1) = (1 - \rho) \, w_i(t) - \eta \, \frac{\partial E}{\partial w_i}
\tag{5.26}
\]
where 0 ≤ ρ ≪ 1 is the weight decay parameter. If ∂E/∂w_i = 0 for some w_i, then w_i will decay to zero exponentially. Otherwise, if the weight really is necessary, then ∂E/∂w_i will be nonzero and the two terms will balance at some point, preventing the weight from decaying to zero. Weight decay is considered in more detail in sections 6.2.4 and 16.5 and in chapter 13.
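The weight decay update of equation 5.26 admits an equally small sketch, under the same assumptions (grad_E, eta, and rho are illustrative names, not from the text):

    import numpy as np

    def weight_decay_step(w, grad_E, eta=0.1, rho=1e-4):
        # One weight update with weight decay (equation 5.26):
        # w_i(t+1) = (1 - ρ) w_i(t) - η ∂E/∂w_i
        # Weights with zero gradient shrink by the factor (1 - ρ)
        # each iteration and so decay to zero exponentially.
        return (1.0 - rho) * w - eta * grad_E(w)

    # Hypothetical usage: with ∂E/∂w = 0 everywhere, the weights
    # follow w(t) = (1 - ρ)^t w(0).
    w = np.array([1.0, 0.5])
    for _ in range(1000):
        w = weight_decay_step(w, grad_E=lambda v: np.zeros_like(v))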