17.4 Extension to General Layered Neural Networks
The results discussed previously, relating training with jittered data to
regularization, hold for any network. The analysis for gain
scaling, however, is valid only for networks with a single hidden layer and a
linear output node. More general feedforward networks have multiple layers and
nonlinear output nodes. Even though the invariance property does not hold for
these networks, these results lend justification to the idea of gain scaling [228], [171] and weight decay as heuristics for improving
generalization.
The gain scaling analysis uses a GCDF (Gaussian cumulative distribution
function) nonlinearity in place of the usual sigmoid nonlinearity. Because
the two functions have similar shapes, this is
not an important difference in terms of representation capability. (Differences
might be observed in training dynamics, however, because the GCDF has flatter
tails.) The precise form of the sigmoid is usually not important as long as it
is monotonically nondecreasing; the usual sigmoid is widely used because its
derivative is easily calculated.
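As a quick numeric illustration of both claims (not part of the original analysis): a logistic sigmoid with gain near 1.702, a standard matching constant, stays within about 0.01 of the GCDF everywhere, while the GCDF saturates much faster in the tails.

```python
# Illustrative sketch comparing the GCDF and the logistic sigmoid; the gain
# 1.702 is a well-known constant that roughly minimizes the maximum
# difference between the two curves.
import numpy as np
from scipy.special import erf

def gcdf(x):
    """Standard Gaussian CDF, Phi(x)."""
    return 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6.0, 6.0, 2001)
print(np.max(np.abs(gcdf(x) - logistic(1.702 * x))))  # about 0.0095

# Tail behavior: the GCDF flattens out much faster than the logistic.
for t in (3.0, 4.0):
    print(t, 1.0 - gcdf(t), 1.0 - logistic(1.702 * t))
```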
The GCDF nonlinearity is used here because it has a convenient
shape invariance property under convolution with a Gaussian input noise density.
There may be other nonlinearities that, although not having this shape
invariance property, are such that their expected response can still be
calculated efficiently using a similar approach. If, for example,
g(x) ∗ p_n(x) = h(x), the function h(x) may differ in form from g(x) but
still be reasonably easy to calculate.
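As a concrete check of the GCDF shape invariance, the following Monte Carlo sketch (the gain s and noise level sigma are illustrative values, not fixed by the text) compares the averaged noisy response with the gain-scaled closed form Φ(x/√(s² + σ²)):

```python
# Monte Carlo check of the GCDF shape invariance under Gaussian input noise:
# E[g(x + n)] is again a GCDF, with gain scaled from s to sqrt(s^2 + sigma^2).
import numpy as np
from scipy.special import erf

def gcdf(x, s=1.0):
    """GCDF nonlinearity with gain parameter s: Phi(x / s)."""
    return 0.5 * (1.0 + erf(x / (s * np.sqrt(2.0))))

rng = np.random.default_rng(0)
s, sigma = 0.8, 0.5                       # illustrative values
x = np.linspace(-4.0, 4.0, 9)
noise = rng.normal(0.0, sigma, size=(200_000, 1))

mc = gcdf(x + noise, s).mean(axis=0)       # averaged noisy responses
closed = gcdf(x, np.sqrt(s**2 + sigma**2)) # gain-scaled GCDF

print(np.max(np.abs(mc - closed)))  # small, e.g. ~1e-3 at this sample size
```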
As a specific example, if g(x) is a step function and p_n(x) is uniform on
[−α, α] (both in one dimension), then h(x) is a semilinear ramp function: 0
for x < −α, equal to (x + α)/(2α) for −α ≤ x ≤ α, and 1 for x > α.
The expected network response can then be computed as a linear sum of h(x) nonlinearities rather than a linear sum of
g(x) nonlinearities. Different nonlinearities
must be used to calculate the normal and expected responses, but this is still
much faster than averaging over many presentations of noisy samples.
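A minimal sketch of this step/uniform example (α here is the half-width of the uniform noise density, a parameterization consistent with the ramp's breakpoints) verifies that averaging noisy step responses reproduces the ramp h:

```python
# The ramp h = step convolved with uniform noise on [-alpha, alpha]; h can
# replace the step g when computing the expected response, instead of
# averaging over many noisy presentations.
import numpy as np

def step(x):
    return (x >= 0.0).astype(float)

def ramp(x, alpha):
    """Closed-form expected response: the semilinear ramp."""
    return np.clip((x + alpha) / (2.0 * alpha), 0.0, 1.0)

rng = np.random.default_rng(0)
alpha = 0.5                               # illustrative half-width
x = np.linspace(-1.5, 1.5, 7)
noise = rng.uniform(-alpha, alpha, size=(200_000, 1))

mc = step(x + noise).mean(axis=0)   # brute-force average over noisy inputs
print(np.max(np.abs(mc - ramp(x, alpha))))  # agrees to Monte Carlo error
```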
The scaling results can also be applied to radial basis
functions [271], [272], [300], which generally use Gaussian PDF hidden units and a
linear output summation. The convolution of two spherical Gaussian PDFs with
variances σ₁² and σ₂² produces a third Gaussian PDF with variance
σ₃² = σ₁² + σ₂², so the expected response of these networks to noise is
easily calculated using similar shape-invariant scaling.
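A corresponding sanity check for the RBF case (the variances and center below are arbitrary illustrative values): the expected response of a Gaussian-PDF unit to Gaussian input noise matches the same unit widened to variance σ₁² + σ₂².

```python
# Monte Carlo check that a Gaussian PDF unit's expected response to Gaussian
# input noise is again Gaussian in form, with the variances adding.
import numpy as np

def gauss_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

rng = np.random.default_rng(0)
var1, var2, center = 0.6, 0.3, 1.0        # illustrative values
x = np.linspace(-2.0, 4.0, 7)
noise = rng.normal(0.0, np.sqrt(var2), size=(400_000, 1))

mc = gauss_pdf(x + noise, center, var1).mean(axis=0)  # average over noise
closed = gauss_pdf(x, center, var1 + var2)            # widened Gaussian
print(np.max(np.abs(mc - closed)))  # small Monte Carlo error
```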