xxx xxxxxxx xxxxxxxxx xx xxx xxxxxxx xx xxxxxxxx xxxxxxxxxxxx xxxxxxxx xx xxxxxxxxxxx xxx xxxxxxxxxxx xx xxx xxxxx xx xxxxx xxxxxxxx xxx xxxxxxx xxxxxxxxx xxxx xxxxxx xxx xxxxx xxxxxxxx xx xxxx xxxxx xxxxxxxxxxxx xxxxxxxxxxx xxxxxx xxx xxxxxxx xx xxxxxxx xxxxxxx xx xxxx xxxxxx xxxxxxxxx xxx xxxxxxx xxx xx xxxxxxx xxxx xxxx xxxxxxxx xxxxx x xxxxxxx xxxxxxxxxx xxxx xx xxxx xxx xxx xxxxxxxx xxxxxxxx xxx xxxxxxx xxxxx xxxx xxxxxxxx xxxx x xxxxxxx xxxxxxx
As an aside, it should be noted that deleting small weights is only a heuristic. Obviously, if a weight is exactly zero, it can be deleted without affecting the response. In many cases it may also be safe to delete small weights because they are unlikely to have a large effect on the output. This is not guaranteed, however, so some sort of check should be done before actually removing the weight. Input weights may be small because they are connected to an input with a large range; weights to output nodes may be small because the targets have a small range. These objections are less important when inputs and outputs are normalized to unit ranges, but there are cases where small weights are necessary; a small nonzero bias weight may be useful, for example, to put a boundary near, but not exactly at, the origin. A more cautious heuristic is to use the weight magnitude only to choose the order in which weights are evaluated for deletion.
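The following is a minimal sketch of that more cautious procedure (the helper `validation_error` and the tolerance `max_increase` are illustrative assumptions, not part of the original text): weights are examined in order of increasing magnitude, but a weight is actually removed only if a check shows the error does not rise appreciably.

```python
import numpy as np

def prune_by_magnitude(weights, validation_error, max_increase=1e-3):
    """Evaluate weights for deletion in order of increasing magnitude.

    weights          -- 1-D array of network weights (a flattened view)
    validation_error -- callable returning the network's error for a
                        given weight vector (assumed helper)
    max_increase     -- largest acceptable rise in error caused by
                        deleting a single weight
    """
    w = weights.copy()
    base_error = validation_error(w)
    # Small magnitude only orders the candidates; it is not the final
    # deletion criterion.
    for i in np.argsort(np.abs(w)):
        if w[i] == 0.0:
            continue                      # already pruned
        saved = w[i]
        w[i] = 0.0                        # tentatively delete the weight
        new_error = validation_error(w)
        if new_error - base_error > max_increase:
            w[i] = saved                  # deletion hurt; restore it
        else:
            base_error = new_error        # accept the deletion
    return w
```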
xxxxxx xxxx xxxxx xxx xxxxxx xx xxxxx xxx xx xxxxxxxx xx x xxxxxx xxxxxx xxx xxxxxx xxxxxxxxx xxx x xxxxxxx xxxxxxx xxxxxxx xx xxx xxxxxx xxxxxxxxxxx xxxxxxx xxx xxxxxxxx xxxxx x xx xx xxxx xx xxxxx xxx xxxxx xxxxx xxxx xxxxxxx xx xxx xxxxxxxxx xxxxxxx xxx xxxxxxx xxxxxxx xxxx xx xxxx xxxxxxx xxxxxxx xx xxx xxxxx xxxx xxx xxxxxx xx xxxxxxxx xxx xx xxxx xxxxx xxxxx xxxxx xxxxxxx xxxxxxx xxxx xxxx xxxxxx xx xxx xxxxx xxxx xxxxxx xxxxxxxx xxx xxxxx xxxxx xxx xxxxxxxx xxxxxxxxxxxxx xxxxxx xxxxxxx xxx xxxxxxxxxx xxxxx xxxxxxxx xxxxxx xxxxxxxxxxxxxxx xxxxxxxx xxxxx xxx xxxxxxxxxxx xxxxx xxxxxxxx xxxxxx xxxxxxxxxxxxxxxx xxxxxxxx xx xxxx xxxxxxxxx xxxxxxx xxxxx xx xxx xxxxxx xxx xx xxxxxx xxxx xxxxxxx xxxxx xx xxx xxxxx xxxxxxx xxx xxxxxxxxxxxxxxxxx xxxxxx xx xxxxxxxxxxxxxxxx xxxxxxx xxxxxxxxxxxx xxxx xxxxxxxxxxxxxx xxxxxx xxxxx xxxxxxx xx xxxx xxxxxxx
Chauvin [70] uses the cost function
xxxxx x xx x xxxxxxxx xxxxxxxxx xxxxxxxxx xxx xxxx xxx xxxx xxx xxx xx xxxxxx xxxxx xx xxx xxx xx xxxxxx xxxxx xx xxx xxx xxx xx xxxxxxxx xx xxx xxxxx xxxx xx xxx xxxxxx xxx xx xxxxxxx xxxxxxx xxx xxxxxx xxxx xxxxxxxx xxx xxxxxxx xxxxxxxx xxxxxxxx xx xxx xxxxxx xxxxxx xxx xxxxxxxxxx xxxxxxxx xxx xxxxxxxx xxxxxxx xxx xxx xxxxxx xxx xxxxxx xxxxxxxx xx x xxxxxxxxxxxxxx xxxx xxx xxxxxxxx xxxxxx xxxx xxx xxxxxxxx xxxxxxxxxxxxxxxxx xxxxx xx xx xxxxxxxxxx xx xxx xxxxxxxxxxx
If the unit activity has a wide range of variation, the unit probably encodes significant information; if the activity does not change much, the unit probably does not carry much information.
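As a rough illustration of this idea, the sketch below ranks hidden units by the variance of their activations over the training set; the variance-based measure is an assumption used only to convey the intuition, not the exact relevance term defined in the equations of this section.

```python
import numpy as np

def rank_units_by_activity(hidden_activations):
    """hidden_activations: array of shape (n_patterns, n_hidden_units).

    Returns unit indices ordered from least to most variable activity.
    Units at the front of the list change little over the training set
    and are the first candidates for removal.
    """
    variances = hidden_activations.var(axis=0)   # per-unit activity variance
    return np.argsort(variances), variances
```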
xxxxxxxxxxxxx xxxxxxxxx xxxxxxxxx xxx xxxx xxxxxxxxx xx xxx xxxx xx xx xxxxxxx xxxxxxxxx xxx xxxxxxxx xxxx xxxx xxx xxxxxxxxxx
where n is an integer. For n = 0, e is linear, so high and low energy units receive equal differential penalties. For n = 1, e is logarithmic, so low energy units are penalized more than high energy units. For n = 2, the penalty approaches an asymptote as the energy increases, so high energy units are not penalized much more than medium energy units. Other effects of the form of the function are discussed by Hanson and Pratt [154].
x xxxxxxxxxxxxxxxxxxxx xxxx xxx xxxx xx xxxxx xx xxx xxxx xxxxxxxxx xxxxxx
Since the derivative of the third term with respect to w_ij is 2μ_w w_ij, this effectively introduces a weight decay term into the back-propagation equations. Weights that are not essential to the solution decay to zero and can be removed.
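A sketch of a cost with this overall structure is shown below, using the coefficient values reported later in the section. The specific penalty family `energy_penalty` is an assumption chosen only to match the qualitative description of the n = 0, 1, 2 cases; Chauvin's exact form is given in the equations above. The third term is μ_w Σ w², whose derivative 2μ_w w supplies the weight decay.

```python
import numpy as np

def energy_penalty(o_sq, n):
    """Illustrative positive, monotonic penalty of the squared activity.

    n = 0: linear, equal differential penalty at all energies
    n = 1: logarithmic, low energy units penalized relatively more
    n = 2: saturating, high energy units penalized little extra
    (These particular forms are assumptions matching the description.)
    """
    if n == 0:
        return o_sq
    if n == 1:
        return np.log(1.0 + o_sq)
    return o_sq / (1.0 + o_sq)

def chauvin_style_cost(targets, outputs, hidden, weights,
                       mu_er=0.1, mu_en=0.1, mu_w=0.001, n=2):
    """Error term + hidden-unit energy term + weight decay term."""
    error_term  = mu_er * np.sum((targets - outputs) ** 2)
    energy_term = mu_en * np.sum(energy_penalty(hidden ** 2, n))
    decay_term  = mu_w  * np.sum(weights ** 2)   # d/dw gives 2*mu_w*w
    return error_term + energy_term + decay_term
```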
xxxxxxxxxxx xxxxxxxxx xx xxxxx xxxx xxx xxx xxxx xxxxxxxx
No overtraining effect was observed despite long training times (with μ_er = 0.1, μ_en = 0.1, μ_w = 0.001). Analysis showed that the network was reduced to an optimal number of hidden units independent of the starting size.
xxxxxxxx xxxxxxxxxx xxx xxxxxxxx xxxxxx xxxxxx xxxxx xxxxxxxx xxx xxxxxxxxx xxxx xxxxxxxx
where T is the set of training patterns and C is the set of connection indices. The second term (plotted in figure 13.4) represents the complexity of the network as a function of the weight magnitudes relative to the constant w_0. At |w_i| << w_0, the term behaves like the normal weight decay term and weights are penalized in proportion to their squared magnitude. At |w_i| >> w_0, however, the cost saturates so large weights do not incur extra penalties once they grow past a certain value. This allows large weights that have shown their value to survive while still encouraging the decay of small weights.
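The limiting behavior is easy to check with the standard weight-elimination form of this penalty, λ Σ_i (w_i/w_0)² / (1 + (w_i/w_0)²), which matches the description above; a minimal sketch (the names are illustrative):

```python
import numpy as np

def weight_elimination_penalty(w, w0, lam):
    """Complexity term: lam * sum_i (w_i/w0)^2 / (1 + (w_i/w0)^2).

    For |w_i| << w0 this reduces to lam * (w_i/w0)^2, ordinary weight
    decay; for |w_i| >> w0 it saturates near lam per weight, so large
    weights incur no further penalty.
    """
    r = (w / w0) ** 2
    return lam * np.sum(r / (1.0 + r))

def weight_elimination_gradient(w, w0, lam):
    """Derivative of the penalty w.r.t. each weight, used in training."""
    r = (w / w0) ** 2
    return lam * (2.0 * w / w0 ** 2) / (1.0 + r) ** 2
```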
xxx xxxxx xx xxxxxx xxxxxxxx xxxx xxxxxx xxx xxxxxxx xx xxx xxxxxxxx xx xx xx xxx xxxxxx xx xxxx xxx xxxx xxx xxxxxxxxxxx xxxxxxx xx xx xx xxx xxxxxx xxx xxx xxxxxxx xxxx xx xxxxxx xx xxxxx xxxxxxxxxx xxx xxxxxxxxx xxxxxx xxxxxxxxxxx xxx xxxxxxxxxx
Ji, Snapp, and Psaltis [195] modify the error function to minimize the number of hidden nodes and the magnitudes of the weights. They consider a single-hidden-layer network with one input and one linear output node. Beginning with a network having more hidden units than necessary, the output is
xxxxx xx xxx xx xxxx xxxxxxxxxxxxx xxx xxxxx xxx xxxxxx xxxxxxx xx xxxxxx xxxx xx xxxxxxx xx xxx xxxxxxxxxx xxx xxxxxx xx xxx xxxxxxx xxxxxxxxx
The significance of a hidden unit is computed as a function of its input and output weights
xxxxx xxxxxxxxx x xxxxx x xxxx xxxx xx xxxxxxx xx xxxxx xx xxx xxxxxxx xxxxxxxxxx xxxxxxxxxx
The error is defined as the sum of ε_0, the normal sum of squared errors, and ε_1, a term measuring node significances.
xxxxx xxxxxx xxxxxxx xxx xxxxxxxx xxxxxxxxx xxxxxxx xxx xxxxxxx xxx xxx xxxxx xxx xxxxxxx xxxxxx xxx xxxxxxx xxxxxxx xxx xxxxxx xxx xxxxxx xxx xxxxxxxx xxxx xxxxxxxxxxx xxx xxxxxxxxxx xxxx xxxxx xxx xxxxxxxxx xxxxx xxxxxxxxx xxxx xxxxx xxxxxxxxxxx xxxxxx xxxxxx
Conflict between the two error terms may cause local minima, so it is suggested the second term be added only after the network has learned the training set sufficiently well.
xxxxxxxxxxxxxx xxxxxx xxx xx xxxx x xxxxxxxx xx xxxxxxx xxxx xx
When ε_0 is large, λ will be small, and vice versa.
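Purely to illustrate the structure of this scheme (the product-of-magnitudes significance and the 1/(1 + ε_0) schedule below are assumptions, not the forms used in [195], which are given in the equations above), the combined error and the modulated λ might be organized as follows:

```python
import numpy as np

def node_significance(w_in, w_out):
    """Illustrative significance of each hidden unit, built from the
    magnitudes of its input and output weights (assumed form)."""
    return np.abs(w_in) * np.abs(w_out)

def combined_error(targets, outputs, w_in, w_out, lam0=0.1):
    eps0 = np.sum((targets - outputs) ** 2)        # ordinary squared error
    eps1 = np.sum(node_significance(w_in, w_out))  # node-significance term
    lam  = lam0 / (1.0 + eps0)   # assumed schedule: lam small when eps0 large
    return eps0 + lam * eps1
```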
x xxxxxx xxxxxxxxxxxx xx xxx xxxxxx xxxxxx xxxx xxxxxxxxxx xxxxxx xxxxx xxxxxxx
The new tanh(·) term is modulated by μ:
xxxx xxxxxxx xxxxxx xxxxxxxxx xxx xxxxx xx xx xx xxxx xxxx xxx xxxxxxxxxxxx xxxxxxxxx xxxxxxx xx xxx xxxxx xxxxxxxx xxxxxx xx xxxxxxx
Once an acceptable level of performance is achieved, small-magnitude weights can be removed and training resumed. It was noted that the modified error functions increase the training time.
xxxx xx xxx xxxxxxxxxxxx xxxxxxx xxxxxxx xxxxx xxxx xxxxxxxxxxx xxxxxxxxx xxxxxx xxxxx xxxx xxx xxxxxxxx xxxxxxxx xxxxxxxx xxx xxxxxxx xxxxx xxxxxx xxxxx xx x xxxxxxxx xxxxx xxx
third term in equation 13.17, for example, adds a -2μ_w w_ij term to the update rule for w_ij. This is a simple way to obtain some of the benefits of pruning without complicating the learning algorithm much. A weight decay rule of this form was proposed by Plaut, Nowlan, and Hinton [299].
xxxxxxxx xxxxx xxxxxxxx xxxxxxx xxxxxx xxxx xxxxxxxx
The second term adds -λ sgn(w_ij) to the weight update rule. Positive weights are decremented by λ, and negative weights are incremented by λ.
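Both decay rules drop straight into the weight update. A minimal sketch, assuming the gradient and the learning rate η come from ordinary back-propagation:

```python
import numpy as np

def update_with_l2_decay(w, grad, eta, mu_w):
    """Gradient step plus the -2*mu_w*w decay arising from a
    mu_w * sum(w^2) penalty term."""
    return w - eta * grad - 2.0 * mu_w * w

def update_with_l1_decay(w, grad, eta, lam):
    """Gradient step plus the -lam*sgn(w) decay: positive weights are
    decremented by lam, negative weights incremented by lam."""
    return w - eta * grad - lam * np.sign(w)
```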
x xxxxxxxx xx xxx xxxxxxx xxx xxxxxxx xxxx xx xxxx xx xxxxx xx xxxxx xxxxxx xxxxxxx xxxx xxxx xxxxx xxxxxxxxxx xxxx xxxxxxx xxxx x xxxxxx xxxxx xxxxxxxxxx xxxx xxxx xxxx xx xx xxxxxxxxx xxxxxxx xxx xxxxxx xxxxxxxxxxx xxxx xx xxxxxxxx xxxxx xxxxxxxxx xxxx xx xxxxxx xxx xxxxxxx xxxxxxxx xxxx x xxxxxxx xxxxxx xxxxxx xxx xxxxxx xxxxx xxxxxxxx xxxx xxxxxx xxxxxxxx xxxxx xxxx x xxxx xxxxxxx xxxxxxx xxxx xxxxxxxx xxx xxxxx xxxxxxxxxxx xxxxxxxxxxxx xx xxx xxxxxxx xx x xxxxxxx xx xxxxxxxxxx xxxxxxxx xxxx xx xxxxxxx xxxxx xxxx xxxxxxxxxxxx xxxx x xxxxxx xxxxx xx xxx xxxxxxx xxxx xx xxxxxxx xx xxx xxxxxxxxxxxx xxxxxx xxxxxxxxx xx xxx xxxxxxxxxxxx xxxxxxxx xx xxx xxxxxxxxxx xxx xxxxxxxx xxx xxxxxx xxx xxx xxxxxx xxxx xxxxxxxx xx xxxx xxxx xxxxxxxxxxxxx xxxxx xxxxxx xxxxxxxxxxxx xxxx xxx xxxxxx xxxxxxxx xxxxxx x xxxxxx xxxxx xxxxxxxxxx xxx xxxxx xxxxxxx xx xxxxx xxxxxx xxxxxxxx xxxxxxxx xxx xxxx xxxxxxxxxx xx xxx xxxxxx xxxxxxxx xxx xx xxxx xxxx xxx xxxxxx xxxxx xx xxx xxxxx xxxxxxxxx xx xxxxxxxxx xxxx xxxx xxx xxxxxxxxx xxx xxxx xxx xxxxx xxxxxxx xxx xxxxxxx xxx xxxx xxxxxxx xx xxxxxxxx xxx xxxx xxxxxxxx xxxxxx
**Bias Weights**  Some authors suggest that bias weights should not be subject to weight decay. In a linear regression y = w^T x + θ, for example, the bias weight θ compensates for the difference between the mean value of the output target and the average weighted input. There is no reason to prefer small offsets, so θ should have exactly the value needed to remove the mean error.
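In practice this only requires masking the bias entries out of the decay term. A minimal sketch, assuming weights and biases are stored in separate arrays:

```python
import numpy as np

def decayed_update(weights, biases, grad_w, grad_b, eta, mu_w):
    """Apply weight decay to ordinary weights but leave the biases free
    to take whatever value removes the mean error."""
    new_w = weights - eta * grad_w - 2.0 * mu_w * weights  # decayed
    new_b = biases  - eta * grad_b                         # no decay on biases
    return new_w, new_b
```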
xx x xxxxxxxxx xxxx xxxxxxxx x x xxxxx x xxxxxxxx xxx xxxx xxxxxx xxx xxx xxxxxx xx xxxxxxxx xxx xxxxxxxx xxxxxxxx xx xxx xxxxx xxxxxx xx xxxx xxx xxxxxx xxxxxxx xxx xxx xxxx xxx xxxxxx xx xxx xxxx xxxxxx xxxxxxx xxxx xxx xxxxxxxx xx xxx xxxxxxxx xxxxxxx xxxxx xxxxx xxxxx xx xxx xxxxxxx xxxxxxxxxx xxxxxxx xx xxx xxxx xxxxxx xx xxx xxxxxxx xx xxxxx xxxxxx xxx xxxxxxxx xxxxxx xx xxx xxxxxxx xxxxxxx xxx xxxxxxxx xx xxx xxxxxxxx xxxx xxx xxxxxx xx xxxxxxxxx x xxx xxxxx x xxxxxxxx xxx xxxx xxxxxxx xxxxxxxxx xx xx x xx xx xxxxxx xxxxx xxxxx xxxxx xxx xxxxxxxx xxxx xxxx xxx xxxxxx xx xxx xxxx xxxx xx xxxxxxx xxx xxxxxx xxxx xxxxxxxx xxxx xxxx xxxxxxx xxxxxx xx xxxxxxx xx xxxxxx xx xxxxx xxx xxxxxxxxx xxxxxx xxxxxx