13.1 Pruning Algorithms
A brute-force pruning method is: for each weight, set it to zero and evaluate the change in the error; if the error increases too much, restore the weight; otherwise remove it. On a serial computer, each forward propagation takes O(W) time, where W is the number of weights. This is repeated for each of the W weights and each of M training patterns, resulting in O(MW²) time
for each pruning pass. A number of passes are usually required. An even more
cautious method would evaluate the change in error for all weights and patterns
and then delete just the one weight with the least effect. This would be
repeated until the least change in error reaches some threshold and could take
O(MW³) time. This would be very
slow for large networks, so most of the methods described in the following take
a less direct approach.
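To make the cost concrete, here is a minimal sketch of the brute-force procedure in Python. The names evaluate_error, patterns, and tolerance are illustrative placeholders, not from the text; evaluate_error is assumed to run forward propagation over all M training patterns and return a scalar error, so each call costs O(MW).

```python
def brute_force_prune(weights, evaluate_error, patterns, tolerance):
    """Brute-force pruning sketch: zero each weight in turn and keep the
    removal only if the error does not grow by more than `tolerance`.
    One full pass over W weights costs O(MW^2) with an O(MW) error call."""
    base_error = evaluate_error(weights, patterns)  # forward passes over M patterns
    for i in range(len(weights)):                   # repeated for each of W weights
        if weights[i] == 0.0:
            continue                                # already pruned
        saved = weights[i]
        weights[i] = 0.0                            # tentatively remove the weight
        new_error = evaluate_error(weights, patterns)
        if new_error - base_error > tolerance:
            weights[i] = saved                      # error grew too much: restore
        else:
            base_error = new_error                  # accept removal as the new baseline
    return weights
```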
Many of the algorithms can be put into two broad groups. One group
estimates the sensitivity of the error function to removal of elements and
deletes those with the least effect. The other group adds terms to the objective
function that penalize complex solutions. Most of these can be viewed as forms
of regularization. Many penalize large connection weights and are similar to
weight decay rules. A term proportional to the sum of all weight magnitudes, for
example, favors solutions with small weights; those that are nearly zero are
unlikely to influence the output much and can be eliminated. There is some overlap between the two groups because the objective function could include sensitivity terms.
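As one concrete instance of such a penalty, the sketch below adds a term proportional to the sum of weight magnitudes to a data-fitting error. The function names and the value of lam are illustrative assumptions, not taken from the text.

```python
import numpy as np

def penalized_objective(weights, data_error, lam=1e-3):
    """Data-fitting error plus a penalty proportional to the sum of all
    weight magnitudes, which favors solutions with small weights.
    `lam` is an illustrative trade-off constant (an assumption)."""
    return data_error + lam * np.sum(np.abs(weights))

def penalty_gradient(weights, lam=1e-3):
    """Gradient contribution of the penalty term: a constant-magnitude
    pull toward zero on every nonzero weight, similar to a decay rule."""
    return lam * np.sign(weights)
```

Unlike a quadratic decay term, whose pull is proportional to the weight itself, this penalty pulls every nonzero weight toward zero at a constant rate, so weights the data does not support end up near zero and can then be eliminated.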
In general, the sensitivity methods operate on a trained
network. That is, the network is trained, sensitivities are estimated, and then
weights or nodes are removed. The penalty-term methods, on the other hand,
modify the cost function so that minimization drives unnecessary weights to zero
and, in effect, removes them during training. Even if the weights are not
actually removed, the network acts like a smaller
system.
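A minimal sketch of that effect, assuming training with such a penalty has already driven the unnecessary weights close to zero: masking them leaves the stored network intact while it behaves like a smaller system. The threshold value is an illustrative cutoff, not one from the text.

```python
import numpy as np

def mask_small_weights(weights, threshold=1e-4):
    """Zero out weights whose magnitude a penalty term has driven below
    `threshold` (an illustrative cutoff). The mask records which
    connections effectively survive pruning."""
    mask = np.abs(weights) >= threshold
    return np.where(mask, weights, 0.0), mask
```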