16.8 Training with Noisy Data
Many studies (e.g., [299], [118], [310], [387], [345], [246], [287], [267]) have noted that adding small amounts of input noise
(jitter) to the training data often aids generalization and fault tolerance.
Training with small amounts of added input noise embodies a smoothness
assumption because we assume that slightly different inputs give approximately
the same output. If the noise distribution is smooth, the network will
interpolate among training points according to a smooth function of the
distance to each training point.
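To make the procedure concrete, the following minimal sketch trains a small single-hidden-layer network on a toy regression problem, adding fresh Gaussian jitter to the inputs on every pass. The noise level, network size, and learning rate are arbitrary illustrative choices, not values taken from the text.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy 1-D regression problem: a small set of clean samples of sin(x).
    X = np.linspace(-3.0, 3.0, 20).reshape(-1, 1)
    T = np.sin(X)

    # Single-hidden-layer network with sigmoid hidden units.
    n_hidden = 10
    W1 = rng.normal(0.0, 0.5, (1, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1)); b2 = np.zeros(1)

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    sigma = 0.1   # standard deviation of the input jitter (illustrative)
    lr = 0.1      # learning rate (illustrative)

    for epoch in range(5000):
        # Jitter: add fresh Gaussian noise to the inputs on every pass, so
        # the network effectively sees a smoothed version of the target.
        Xn = X + rng.normal(0.0, sigma, X.shape)

        # Forward pass.
        H = sigmoid(Xn @ W1 + b1)
        Y = H @ W2 + b2

        # Backward pass for mean squared error.
        dY = (Y - T) / len(X)
        dW2 = H.T @ dY
        db2 = dY.sum(axis=0)
        dH = (dY @ W2.T) * H * (1.0 - H)
        dW1 = Xn.T @ dH
        db1 = dH.sum(axis=0)

        # Gradient-descent update.
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

No noise is added at test time; the jitter is applied only during training.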
With jitter, the effective target function is the
convolution of the actual target with the noise density [307], [306]. This is typically a smoothing operation. Averaging the
network output over the input noise gives rise to terms related to the magnitude
of the gradient of the transfer function and thus approximates regularization
[307], [306], [45].
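In outline, using illustrative notation rather than the book's, the effective target is the convolution

\tilde{t}(x) = \int t(x + n)\, p(n)\, dn,

and, for zero-mean noise of variance \sigma^2 per input component and a squared-error cost, a second-order expansion of the expected error gives

E_n\!\left[ (y(x + n) - t)^2 \right] \approx (y(x) - t)^2 + \sigma^2 \sum_i \left( \frac{\partial y}{\partial x_i} \right)^2 + \sigma^2 \, (y(x) - t) \sum_i \frac{\partial^2 y}{\partial x_i^2},

so when the residual y(x) - t is small the dominant extra term penalizes the squared gradient of the network function, which is the sense in which jitter approximates regularization.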
Training with jitter helps prevent overfitting in large networks
by providing additional constraints because the effective target function is a
continuous function defined over the entire input space whereas the original
target function is defined only at the specific training points. This constrains
the network and forces it to use excess degrees of freedom to approximate the
smoothed target function rather than forming an arbitrarily complex surface that
just happens to fit the sampled training data. Even though the network may be
large, it models a simpler system.
Training with noisy inputs also gives rise to effects similar to
weight decay and gain scaling. Gain scaling [228], [171] is a heuristic that has been proposed as a way of
improving generalization. (Something like gain scaling is also used in [252] to "moderate" the outputs of a classifier.) Effects
similar to training with jitter (and thus similar to regularization) can be
achieved in single-hidden-layer networks by scaling the sigmoid gains [305], [306]. This is usually much more efficient than tediously
averaging over many noisy samples. The appropriate amount of scaling depends on
σ², the variance of the input noise, and the resulting gain reduction has
properties similar to weight decay. The development of weight decay terms as
a result of training single-layer linear perceptrons with input noise is shown
in [167]. The effects of training with input noise and their relation
to target smoothing, regularization, gain scaling, and weight decay are
considered in more detail in chapter 17.
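As a rough illustration of the gain scaling mentioned above, consider a single unit with logistic sigmoid f, weight vector w, and bias b, with i.i.d. Gaussian jitter of variance \sigma^2 on each input component. The following uses the common probit-style approximation to the logistic sigmoid and is a sketch, not necessarily the exact form used in [305], [306]. The net-input noise then has variance \sigma^2 \|w\|^2, and averaging over it gives approximately

E_n\!\left[ f\!\left( w^\top (x + n) + b \right) \right] \approx f\!\left( \frac{w^\top x + b}{\sqrt{1 + \pi \sigma^2 \|w\|^2 / 8}} \right),

that is, the unit's effective gain is reduced by a factor that shrinks as either the noise variance or the weight magnitude grows, which is why the effect resembles weight decay.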