Like all local search techniques, back-propagation may converge to a local minimum of the error rather than the global optimum. This is a fundamental characteristic of local search methods, not a defect of the implementation.
Random Restarts One of the simplest ways to deal with local minima is to train many different networks with different initial weights. Training times can be long, so it may be useful to train a number of networks for a small number of iterations and then choose a fraction of the best for more extensive training. Although there are no guarantees, this tends to weed out bad starting points, favoring those already close to good solutions. Because many of the initial networks may converge toward equivalent solutions, a clustering procedure may be useful to choose a nonredundant subset.
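A minimal sketch of this restart-and-prune procedure is shown below. The `make_net`, `train`, and `evaluate` helpers are hypothetical placeholders for whatever network constructor, training routine, and error measure are in use; only the two-phase selection logic is illustrated.

```python
# Random-restart sketch (make_net/train/evaluate are hypothetical helpers, not from the text).
# Many networks are trained briefly from different random initial weights; only the best
# fraction is trained more extensively, and the best surviving network is returned.

import numpy as np

def random_restarts(make_net, train, evaluate, n_starts=20, keep_frac=0.25,
                    short_epochs=50, long_epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    # Phase 1: short training runs from many random starting points.
    candidates = []
    for _ in range(n_starts):
        net = make_net(rng)                      # fresh random initial weights
        train(net, epochs=short_epochs)
        candidates.append((evaluate(net), net))  # (error after short training, network)
    # Phase 2: train only the best fraction to convergence.
    candidates.sort(key=lambda pair: pair[0])
    survivors = [net for _, net in candidates[:max(1, int(keep_frac * n_starts))]]
    for net in survivors:
        train(net, epochs=long_epochs)
    return min(survivors, key=evaluate)
```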
If training is viewed as a dynamic process, each minimum (local or global) can be considered an attractor with a basin of attraction consisting of all the starting points that converge to it during training. With gradient descent, an initial weight vector will basically "roll downhill" until it reaches a minimum; the minimum is the attractor and all points that flow into it form the basin of attraction. In figure 8.6, the dotted line separates the basins of attraction of the local and global minima. Because the probability that a randomly selected starting point converges to a particular minimum is equal to the probability that it starts in its basin of attraction, the success of the random restart approach depends on the relative sizes of the basins of attraction.
In theory, if we knew the relative sizes of the basins of attraction (and the distribution of the initial weights), we could calculate how many random restarts would be needed to achieve a given probability of landing in the basin of a global minimum. Easy problems would have large global basins of attraction and hard problems would have small ones. In general, however, it is not practical to map the basins of attraction because evaluating each point requires training a network to convergence. These ideas are, therefore, more useful as an explanation of why a problem is hard than as a tool to improve training.
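As a hedged illustration of the calculation just described, suppose a random starting point lands in the basin of a global minimum with probability p; then n independent restarts succeed with probability 1 - (1 - p)^n, and the required number of restarts follows directly. The value p = 0.10 below is an arbitrary example, not a figure from the text.

```python
# Worked example: number of random restarts needed so that at least one lands
# in the basin of a global minimum with probability target_prob, given that a
# single restart lands there with probability p.

import math

def restarts_needed(p, target_prob):
    """Smallest n with 1 - (1 - p)**n >= target_prob."""
    return math.ceil(math.log(1.0 - target_prob) / math.log(1.0 - p))

print(restarts_needed(p=0.10, target_prob=0.95))   # 29 restarts
```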
A further complication is that the basins of attraction depend on the weight-change method as well as the static E(w) function so different learning algorithms may have different attractors for the same E(w) function. With gradient descent, for example, several local minima may have large regions of attraction. In principle, these shrink to infinitesimal points with simulated annealing and almost all points belong to the basin of attraction of the global minimum. Simple parameter changes in things like the learning rate and momentum also affect the boundaries. In gradient descent with a small step size, the basins are generally contiguous and have smooth boundaries. When the learning rate is too high, however, the process can become chaotic and the basins of attraction may become disjoint with very complex, possibly fractal, boundaries [216], [217].
With stochastic algorithms the discrete basins of attraction are replaced by probability density functions. For every ending point z there is a density function over x measuring the probability that starting point x ends at z. Loosely speaking, the regions of high probability can be thought of as the basin of attraction of z.
The standard solution to the problem of local minima is to improve the optimization algorithm. In the random restart approach, the optimization routine is augmented with an outer loop searching over initial starting points—simple random search in this case. This will not be effective if the global attractor basins are too small to find by random sampling. It is possible to consider more sophisticated outer search techniques, but use of such techniques is not widespread. Another approach is to use stochastic search methods by introducing randomness at a lower algorithmic level. Simple examples include online training, training with added input noise (jitter), or adding noise to the weights during updates. If these do not work, more sophisticated global search techniques such as simulated annealing or genetic algorithms may be needed.
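The lower-level randomization mentioned above might be sketched as follows. The gradient routine `grad_fn` is a hypothetical placeholder for back-propagation on a particular network, and the noise levels are illustrative rather than recommended values.

```python
# Sketch of two low-level randomization tricks mentioned above: input jitter and
# noise added to the weights during updates. grad_fn is a hypothetical placeholder
# standing in for back-propagation on a particular network.

import numpy as np

def noisy_update(w, x_batch, t_batch, grad_fn, lr=0.05,
                 jitter_sigma=0.05, weight_noise_sigma=0.01, rng=None):
    rng = rng or np.random.default_rng()
    # Jitter: add small Gaussian noise to the inputs of each training batch.
    x_noisy = x_batch + rng.normal(0.0, jitter_sigma, size=x_batch.shape)
    # Ordinary gradient step on the jittered batch ...
    g = grad_fn(w, x_noisy, t_batch)
    w = w - lr * g
    # ... plus a small random perturbation of the weights themselves,
    # which may help the search escape shallow local minima.
    w = w + rng.normal(0.0, weight_noise_sigma, size=w.shape)
    return w
```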
Instead of improving the optimization algorithm, an alternative approach is to avoid creating local minima in the first place. If we knew more about their causes, we might be able to design networks and training algorithms without poor local minima. Unfortunately, relatively little is known. The following paragraphs outline a few results.
Single-layer networks with linear outputs implement linear functions, and minimum-MSE linear regression has a quadratic error surface with a single minimum, so one might guess that single-layer networks do not have local minima. This intuition ignores the effects of the node nonlinearity.
Sontag and Sussmann [352], [353] (see also [408]) give conditions where this intuition is true. They show that if (1) the cost function does not penalize overclassification and (2) the data are linearly separable, then there are no local minima that are not global minima and gradient descent converges (to within a tolerance) globally from any starting point in a finite number of steps. Condition 1 means that the error is taken to be zero when the output "goes beyond what the target asks." With tanh nodes, for example, the error is taken to be zero when y ≥ 0.9 for t = 0.9 or when y ≤ -0.9 for t = -0.9, for target t and output y. Similar relations apply for sigmoids and other nonlinearities. This has been called the LMS-threshold cost function. Use of unobtainable targets, for example, 0 and 1 for the sigmoid, might be considered a special case.
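A possible implementation of the LMS-threshold cost described above is sketched below, assuming tanh-style targets of ±0.9: squared error is charged only when the output falls short of the target on the correct side, and overclassification costs nothing.

```python
# Sketch of the LMS-threshold cost described above, for tanh outputs with
# targets t = +0.9 / -0.9: the error is zero once the output "goes beyond
# what the target asks," and is the squared error otherwise.

import numpy as np

def lms_threshold_error(y, t):
    y, t = np.asarray(y, float), np.asarray(t, float)
    # Overclassification: output is at or beyond the target on the correct side.
    satisfied = np.where(t > 0, y >= t, y <= t)
    err = np.where(satisfied, 0.0, (y - t) ** 2)
    return err.sum()

# Example: y = 0.95 for t = 0.9 contributes no error; y = 0.5 contributes (0.5 - 0.9)^2.
print(lms_threshold_error([0.95, 0.5], [0.9, 0.9]))  # 0.16
```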
There can be local minima when either of the two conditions fails. Dips like those in figure 8.3 can occur when condition 1 is not satisfied, and troughs as in figure 8.4 may exist because the data are not linearly separable.
Auer, Herbster, and Warmuth [13] show that local minima can exist in a network consisting of a single neuron whenever the composition of the loss function (error function) with the node activation function is continuous and bounded. This result favors the entropic error function: the composition of the sigmoid with the MSE error function meets the criterion, while its composition with the entropic error function does not. Reference [13] also shows that artificial data sets can be constructed in which the number of minima grows exponentially with the input dimension.
It has been pointed out [58], [59] that one can find linearly separable data sets for which the MSE cost function has minima at weight vectors that do not separate all patterns correctly. The perceptron algorithm would classify these data correctly whereas back-propagation would not. This does not necessarily mean that the MSE function has local minima, though; it just shows that it differs from the function that counts the number of classification errors.
The results just discussed show that gradient descent will separate linearly separable data sets if condition 1 is satisfied (i.e., if overclassification is not penalized).
Similar results for one-hidden-layer networks have been shown in [145], [127]. If the data are linearly separable, the network has a pyramidal structure after the inputs with a full-rank weight matrix, and the LMS-threshold cost function is used, then no local minima exist and back-propagation will separate the data. A distinction is made between spurious local minima, which can be eliminated by the use of the LMS-threshold error function, and more serious structural local minima.
Together, these results show that there need not be local minima for linearly separable data sets. That is, there may be local minima in some set-ups, but steps can be taken to eliminate them. An important condition appears to be the use of the LMS-threshold error or the use of unobtainable targets, for example, 0 and 1 for the sigmoid, which eliminate the spurious local minima.
Of course, it is well-known that the perceptron learning algorithm will always find a set of weights that correctly classifies a linearly-separable data set. Optimal convergence of on-line back-propagation for linearly separable data, akin to the perceptron convergence property, is shown in [144] for a single-hidden-layer network using the LMS-threshold error function. The demonstration is similar to the perceptron convergence theorem. A side point of the paper is that on-line learning can be qualitatively different from batch-mode learning when the learning rate is not small and it is not necessarily a crude approximation of gradient descent.
These results are intriguing but, unfortunately, most interesting problems are not linearly separable. It is worth noting that a data set can sometimes be made linearly separable by changing the way variables are represented, but there are many other issues to consider.
Auer, Herbster, and Warmuth [13] show that a single-layer net has no poor local minima when the transfer function and loss function are monotonic and the data are realizable (E(w) = 0 for some weights w). This is not restricted to binary targets; linearly separable data sets with binary targets are realizable, but there are also realizable data sets with continuous targets.
It has been suggested that (nearly) flat areas in the error surface are sometimes mistaken for minima and that true local minima might be relatively rare. That is, the error may decrease so slowly as the weight vector creeps across a flat spot that it appears as if learning has stopped because the system has reached a minimum. Internally, it might be possible to observe the weights moving in a consistent direction, but from the outside it may be impossible to tell the difference. The fact that these are not minima is sometimes revealed when the error resumes its decrease (sometimes sharply) if training is continued long enough for the system to cross the flat area. (Figures 6.4 and 6.12 show examples of this.)
Example It is easy to illustrate local minima in small networks. Figure 8.4 shows one case. An example of a very simple network having local minima is described by McClelland and Rumelhart [261]. The task is to do an identity mapping with a 1/1/1 network. The network has one input, one hidden unit, and one output. No bias weights are used so there are only two weights. The nonlinearities are sigmoids.
Given 0 or 1 as an input, the net should reproduce the value at the output. Figure 8.7 shows the error as a function of the weights w1 and w2. There is a global minimum in the lower left quadrant, a poor local minimum in the upper right quadrant, and a saddle point at the origin separating the two basins of attraction. The saddle point is visible in figure 8.7b. Although the poor minimum appears to be narrower in the contour plot, both have basins of attraction of approximately equal size and a random weight vector has a roughly equal chance of landing in either.
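The error surface in figure 8.7 can be reproduced with a few lines of code. This is a minimal sketch assuming sigmoid units, no biases, the two patterns 0→0 and 1→1, and a sum-of-squared-errors measure; plotting E over a grid of (w1, w2) shows the two basins and the saddle near the origin.

```python
# Error surface of the 1/1/1 example above: one sigmoid hidden unit and one
# sigmoid output, no biases, patterns 0->0 and 1->1, sum-of-squared errors.
# Evaluating E over a grid of (w1, w2) reproduces the shape of figure 8.7.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_surface(w1, w2):
    """Sum-of-squares error of the 1/1/1 net on the two patterns (0,0) and (1,1)."""
    E = 0.0
    for x, t in [(0.0, 0.0), (1.0, 1.0)]:
        h = sigmoid(w1 * x)        # hidden unit (no bias)
        y = sigmoid(w2 * h)        # output unit (no bias)
        E += (y - t) ** 2
    return E

# Evaluate E on a grid of weight values, e.g. for a contour plot.
w = np.linspace(-10, 10, 201)
E = np.array([[error_surface(w1, w2) for w1 in w] for w2 in w])
print(E.min(), E.max())
```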
This example is contrived, of course. When adaptable bias weights are added, the local minimum disappears leaving two global minima separated by a saddle point. Use of tanh nodes and {-1, +1} inputs would also remove the asymmetry causing the problem.
The existence of real local minima in nontrivial networks was demonstrated by McInerney et al. [263], [262]. (These are an abstract and an unpublished technical report; Hecht-Nielsen [162] reports some of the details.) After extensive simulations using closed-form expressions for the gradients and second derivatives, they were able to find a point on the error surface where the error was higher than at other points, where all gradients were zero, and where the Hessian was strongly positive-definite.
Aside from the fact that the architecture is poorly matched to the problem, the 1/1/1 network in the previous example is very small. The common wisdom is that if there are enough weights and/or hidden units then local minima do not exist or are not a major problem. Indeed, when bias weights are added to the simple network above, the local minimum disappears. McClelland and Rumelhart [261], [132] state, "[i]n problems with many hidden units, local minima seem quite rare." The extra degrees of freedom presumably give the search more ways to flow around potential local minima into lower areas and eventually reach a global minimum. Of course, there are no guarantees, as it is always possible to add extra degrees of freedom in ways that do not fix the problem.
A series of papers [301], [413], [150], [414], [415] has shown that if there are as many hidden nodes as there are training patterns then, with probability 1, there are no suboptimal local minima. According to Yu [413], a sufficient condition is that the network be able to fit every training pattern exactly, giving Emin = 0. Assuming consistent training data, this is always possible in a single-hidden-layer net when the number H of hidden nodes is as large as the number M of unique training patterns (actually H ≥ M - 1 will do). Because most data sets contain regularities, fewer nodes will suffice in most cases; however, M - 1 hidden nodes in a single hidden layer will be necessary when the patterns are collinear with alternating target classes [280].
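The sufficiency argument can be illustrated with a small sketch: with (almost any) fixed random hidden weights, the M × H matrix of hidden-unit activations is generically nonsingular when H = M, so output weights fitting all M targets exactly can be obtained by a linear solve. Tanh hidden units, a linear output, and random data are assumed here purely for illustration.

```python
# Sketch of why a single hidden layer with as many hidden nodes as training
# patterns can drive the error to zero: for (almost any) fixed random hidden
# weights, the M x M matrix of hidden activations is nonsingular with
# probability 1, so output weights that fit all M targets exactly exist.
# Tanh hidden units and a linear output are assumed for simplicity.

import numpy as np

rng = np.random.default_rng(0)
M, d = 6, 3                                  # 6 training patterns, 3 inputs
X = rng.normal(size=(M, d))
t = rng.normal(size=M)                       # arbitrary (consistent) targets
H = M                                        # as many hidden nodes as patterns

W_hidden = rng.normal(size=(d, H))           # random input-to-hidden weights
b_hidden = rng.normal(size=H)
A = np.tanh(X @ W_hidden + b_hidden)         # M x H hidden activations
w_out = np.linalg.solve(A, t)                # exact output weights (A is M x M)

y = A @ w_out
print(np.max(np.abs(y - t)))                 # ~0: every pattern fit exactly
```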
According to these results, the error surface will have no nonglobal minima if there are enough hidden nodes. This helps explain empirical observations that it is often easier to train large networks than small ones; larger networks seem to be less sensitive to parameters and less likely to become stuck during training. Of course, the requirement that the network be able to fit the data exactly allows it to "memorize" the data and conflicts with heuristic rules for obtaining good generalization, so steps such as early stopping or pruning will be needed to avoid overfitting.