
Neural Smithing: Supervised Learning in Feedforward Artificial Neural Networks
Russell D. Reed and Robert J. Marks II
Copyright © 1999 Massachusetts Institute of Technology


List of Figures

Figure 1.1: (a) A real neuron collects input signals from other neurons through a dendritic tree, integrates the information, and distributes its response to other neurons via the axon. (b) An artificial neuron model.
Figure 1.2: The sigmoid is a continuous, bounded, monotonic function of its input x. It saturates at 0 for large negative inputs and at 1 for large positive inputs. Near zero, it is approximately linear.
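As a quick numerical illustration of the properties named in figure 1.2 (a minimal sketch in Python/NumPy, not from the book):

    import numpy as np

    def sigmoid(x):
        # Continuous, bounded, monotonic; see figure 1.2.
        return 1.0 / (1.0 + np.exp(-x))

    print(sigmoid(-10.0))       # ~0.000045: saturates at 0 for large negative inputs
    print(sigmoid(10.0))        # ~0.999955: saturates at 1 for large positive inputs
    print(sigmoid(0.1) - 0.5)   # ~0.025 = 0.25 * 0.1: approximately linear (slope 1/4) near zero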
Figure 1.3: A layered feedforward network.
Figure 2.1: Supervised learning is often viewed as a function approximation problem. Given a set {(xi, ti)}, i = 1, …, M, of training pairs with inputs xi and target outputs ti, the goal is to find a function f(x) that captures the input-output relationships illustrated in the training examples, f(xi) ≈ ti. If the search is successful, the new function can then be used to estimate the correct output for new points not in the original training set. Ideally, the functional form may also be more compact and faster to evaluate.
Figure 2.2: Supervised learning model. In supervised learning, a desired output is specified for every pattern encountered during training. Differences between the network output and training target are treated as errors to be minimized by the training algorithm.
Figure 2.3: Supervised learning can be applied to many different error functions. The figure illustrates a piecewise linear error function with upper and lower tolerance limits; the error is zero when f(x) is within the limits. Functions like this are sometimes useful in engineering applications.
Figure 2.4: In supervised learning with a distal teacher the network output drives another system, T, which produces the final output. This makes the teacher's job easier because targets can be specified at a higher, less detailed, level. When T is a known function, the low-level signals needed for network training can be derived from the high-level errors.
Figure 2.5: Learning with a distal teacher, robot arm example: (a) a robot arm, (b) a neural network translates input commands to joint angles that put the manipulator in the desired position after transformation by the physics of the robot arm.
Figure 3.1: A single-layer perceptron has no hidden layers. One layer of weights connects the inputs directly to the outputs. The outputs are independent so this network can be treated as three separate networks.
Figure 3.2: Node function. Each node i computes a weighted sum of its inputs and passes the result through a nonlinearity, typically a bounded monotonic function such as the sigmoid.
Figure 3.3: Projection of x onto w. The output of a single unit is determined by the inner product of the input vector x and the weight vector w: u = wTx = ||w|| ||x|| cos φ. All inputs x with the same projection onto w produce the same output. The locus of points with equivalent projections onto w defines a hyperplane perpendicular to w.
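A minimal sketch of the node computation in figures 3.2 and 3.3 (illustrative names, Python/NumPy assumed):

    import numpy as np

    def unit_output(w, x, bias=0.0):
        # Inner product u = wTx = ||w|| ||x|| cos(phi), then a sigmoid.
        u = np.dot(w, x) + bias
        return 1.0 / (1.0 + np.exp(-u))

    w = np.array([1.0, -2.0])
    x1 = np.array([1.0, 0.0])
    x2 = np.array([3.0, 1.0])   # different input, same projection: w.x2 = 3 - 2 = 1 = w.x1
    print(unit_output(w, x1), unit_output(w, x2))   # identical outputs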
Figure 3.4: Effect of bias weights. A linear threshold unit divides its input space into two half-spaces: (a) without bias, the dividing surface must pass through the origin and certain data sets will not be separable; (b) with bias, the dividing surface can be offset from the origin to obtain better classification.
Figure 3.5: Graded responses: (a) a hard-limiter divides the input space with a hyperplane, (b) a sigmoid gives a similar response, but has a smoother transition at the boundary, and (c) a sigmoid with small weights gives an almost linear response for a wide range of inputs.
Figure 3.6: (a) Linearly separable classes can be separated by a hyperplane. In two dimensions, the hyperplane is a line. (b) Classes that are not linearly separable cannot be separated by a hyperplane.
Figure 3.7: The two-input exclusive-OR function is a simple binary function that is not linearly separable and thus not learnable by a single-layer perceptron. In more than two dimensions, it generalizes to the parity function, which is 1 when the number of active inputs is odd and 0 when it is even.
Figure 3.8: Capacity of a hyperplane. Shown is the probability that a random dichotomy on N points in d-dimensions is linearly separable, for various values of d. The probability is greater than 1/2 for N < 2(d + 1), and less than 1/2 for N > 2(d + 1). As d increases, the transition from high to low probability becomes steeper and approaches a step function.
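The curves in figure 3.8 follow Cover's function-counting result; a sketch of the computation (assuming the standard formula P(N, d) = 2^(1-N) * sum over k = 0..d of C(N-1, k) for hyperplanes with bias):

    from math import comb

    def p_separable(N, d):
        # Probability that a random dichotomy of N points in general
        # position in d dimensions is linearly separable.
        if N <= d + 1:
            return 1.0
        return 2.0 ** (1 - N) * sum(comb(N - 1, k) for k in range(d + 1))

    for d in (2, 10, 50):
        print(d, p_separable(2 * (d + 1), d))   # exactly 0.5 at the transition N = 2(d + 1)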
Figure 3.9: The separation problem: (a) A weight vector must be found that separates the positive samples (open circles) from the negative samples (filled circles). (b) The transformation z = tx reflects the negative examples across the origin and changes the problem to one of finding a weight vector such that w·z > 0 for each transformed pattern z. D(w) is the minimum distance from any point z to the separating hyperplane defined by w. (c) The difficulty of a problem is determined by how much w can be varied and still meet the separating criterion. Easy problems can be satisfied by many w vectors.
Figure 4.1: MLP structure. A multilayer perceptron is a cascade of single-layer perceptrons. There is a layer of input nodes and a layer of output nodes with one or more hidden layers in between.
Figure 4.2: Three layers of linear threshold units are sufficient to define arbitrary regions. Units in the first hidden layer divide the input space with hyperplanes, units in the second hidden layer can form convex regions bounded by these hyperplanes, and output units can combine the regions defined by the second layer into arbitrarily shaped, possibly unconnected, regions. Here, the boundaries are piecewise linear, but any smooth boundary can be approximated with enough units (based on [247]).
Figure 4.3: Single-hidden-layer networks can recognize nonconvex decision regions. Here, eight hidden units in a single layer create a nonconvex square donut. Each dashed line is the decision boundary of one of the hidden units. The hidden unit is active (1) on the side of the line indicated by the pointer and connects to the output unit with the weight indicated next to the pointer. The other numbers show the summed input to the output unit in different regions. Thresholding at 8.5 creates the square donut.
Figure 4.4: Wieland and Leighton [407] provide this example of a nonconvex region recognized by a single-hidden-layer network. Thresholding at 8.7 creates the 'C' shaped region: (a) a nonconvex decision region, and (b) the network. There are 10 hidden nodes. Each has one connection with weight w = ± 1 from either the x or y input. Numbers inside the circles are the node thresholds.
Figure 4.5: Disjoint regions recognized by single-hidden-layer networks. In addition to nonconvex regions, single-hidden-layer networks can also recognize disjoint (unconnected) regions.
Figure 4.6: Certain functions can be implemented exactly by small networks with two hidden layers but require an infinite number of nodes to approximate if constrained to a single hidden layer [255], [254]: (a) a simple function that can be implemented with two hidden layers each containing two threshold units, (b) an approximation by a network with just one hidden layer, (from [255]).
Figure 4.7: AND and OR logic functions. Any logic function can be implemented by a circuit of AND, OR, and NOT gates. Threshold units are at least as powerful as logic circuits because they can implement all the necessary elementary functions, as well as others: (a) the AND function, (b) linear threshold unit implementation of AND, (c) the OR function, and (d) linear threshold unit implementation of OR. The NOT function can be obtained from a singleinput unit with a large negative weight and a zero threshold.
Figure 5.1: Feedforward indexing in an unlayered network. The nodes in a feedforward network can always be indexed so that i > j if the state of node i depends on the state of node j (perhaps indirectly). Arbitrary connections are allowed from nodes with low indexes to nodes with higher indexes, but not vice versa; i > j implies wji ≡ 0. This network has no particular function, but illustrates short-cut connections, apparently lateral (but still feedforward) connections, and the fact that outputs can be taken from internal nodes.
Figure 5.2: Forward propagation. In the forward pass, the input pattern is propagated through the network to obtain the output. Each node computes a weighted sum of its inputs and passes this through a nonlinearity, typically a sigmoid or tanh function.
Figure 5.3: Backward propagation. Node deltas are calculated in the backward pass, starting at the output nodes and proceeding backwards. At each hidden node i, δi is calculated as a weighted linear combination of the values δk, k > i, of the nodes k to which node i sends outputs. Note that the δ values travel backward against the normal direction of the connecting links.
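A minimal sketch of the two passes in figures 5.2 and 5.3, for a single-hidden-layer network with tanh hidden units, a linear output, and an SSE error (illustrative names, Python/NumPy assumed):

    import numpy as np

    def forward(x, W1, W2):
        h = np.tanh(W1 @ x)   # hidden activations
        y = W2 @ h            # linear output
        return h, y

    def backward(x, t, W1, W2):
        # Deltas start at the output layer and propagate backward.
        h, y = forward(x, W1, W2)
        delta_out = y - t                            # dE/dy for E = 0.5 (y - t)^2
        delta_hid = (W2.T @ delta_out) * (1 - h**2)  # tanh'(u) = 1 - tanh(u)^2
        return np.outer(delta_hid, x), np.outer(delta_out, h)   # dE/dW1, dE/dW2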
Figure 5.4: Batch-mode back-propagation is a close approximation to true gradient descent. At each step, the weights are adjusted in the direction that minimizes the error the fastest. When the learning rate is small, the weights trace a smooth trajectory down the gradient of the error surface.
Figure 5.5: In on-line learning, weight updates occur after each pattern presentation. The single-pattern derivative terms can be viewed as noisy estimates of the gradient. They are not parallel to it in general, but on average they have a positive projection onto it (because the gradient is the sum of the single-pattern terms) so the error usually decreases after most weight changes. Some terms may have negative projections or large orthogonal deviations, however, which may cause the error to increase occasionally.
Figure 5.6: In on-line learning, patterns are generally presented in a random, changing order and weights are updated after each pattern presentation. Instead of smoothly rolling down the error gradient, the weight vector dances along a semi-random path, mostly moving downhill, but occasionally jumping uphill. Upon reaching a low spot, the weight vector jitters around the minimum but is unable to settle unless the step size is reduced. Circles show the weights at the start of each epoch and line segments show the single-pattern steps within each epoch.
Figure 5.7: When patterns are presented in a cyclic order during on-line learning, as above, the sequence of steps in epoch t + 1 tends to be similar to the sequence in epoch t, so the weight trajectory has a semi-periodic behavior. A danger is that the trajectory will converge to a limit cycle, as shown, and be unable to reach the minimum. One way to break the cycle is to change the pattern selection order. Reduction of the learning rate may also break the cycle or, if done gradually, may reduce the diameter of the cycle, allowing it to close in around the minimum. Circles show the weights at the start of each epoch and line segments show the single-pattern steps within each epoch.
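The batch updates of figure 5.4 and the on-line updates of figures 5.5-5.7, sketched using the backward() function given after figure 5.3 (illustrative; patterns is a list of (x, t) pairs and rng is a NumPy random generator):

    import numpy as np

    def batch_epoch(patterns, W1, W2, eta):
        # Batch mode: accumulate the single-pattern gradients, then take one step.
        gW1, gW2 = np.zeros_like(W1), np.zeros_like(W2)
        for x, t in patterns:
            g1, g2 = backward(x, t, W1, W2)
            gW1 += g1
            gW2 += g2
        return W1 - eta * gW1, W2 - eta * gW2

    def online_epoch(patterns, W1, W2, eta, rng):
        # On-line mode: update after each pattern presentation, in random order.
        for i in rng.permutation(len(patterns)):
            x, t = patterns[i]
            g1, g2 = backward(x, t, W1, W2)
            W1, W2 = W1 - eta * g1, W2 - eta * g2
        return W1, W2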
Figure 6.1: Average training time (solid) and convergence probability (dashed) versus learning rate for α = 0, 0.5, 0.9, 0.99. A 4/4/1 network was trained on the 4-bit parity problem with batch back-propagation and the MSE error function. Each point is the average of 100 runs with different random initial weights. Note: Use of the MSE error function rather than SSE normalizes the learning rate by the size of the training set. See section 6.1.1 for simulation details.
Figure 6.2: Actual training times versus learning rate for various momentum values. At high learning rate and momentum, most networks either do not converge or converge to poor solutions and become stuck. Of the networks that do converge, however, most do so very quickly. Each vertical strip shows points for 100 networks initialized with different random weights. Note: Use of the MSE error function rather than SSE scales the learning rate by the size of the training set. See section 6.1.1 for details of the simulation.
Figure 6.3: Average training time (a) and convergence probability (b) versus normalized learning rate for α = 0, 0.5, 0.9, 0.99. With momentum, the effective learning rate is amplified to η' = η/(1 - α). The same effective learning rate can be obtained from different combinations of η and α, which yield roughly the same average training time but differ in the location of the critical point where convergence becomes unlikely. In this example, it appears that small learning rates and large momentum values are more stable. Note: Use of the MSE error function rather than SSE normalizes the learning rate by the size of the training set. Simulation details are described in section 6.1.1.
Figure 6.4: E(t) curves for small SSE learning rates. At low learning rates, the E(t) curves are smooth, but convergence is slow. As η increases, convergence time decreases but convergence is less reliable, with occasional jumps in error. All networks were initialized with the same random weights.
Figure 6.5: E(t) curves for near-critical SSE learning rates. Below some critical learning rate, convergence time decreases as the learning rate η increases. At some point, though, the system becomes unstable and fails to converge. Note that the change is abrupt; at η = 0.18 it converges quickly but at η = 0.19 it overshoots the minimum and appears to get trapped on a plateau. The same network and initial weight vector as in figure 6.4 were used.
Figure 6.6: With on-line learning, patterns are chosen randomly and the weights are updated after each pattern presentation. This introduces noise into the training process, leading to noisy E(t) curves. In general, larger learning rates cause higher noise levels.
Figure 6.7: E(t) curves for on-line learning. On-line learning is stochastic so different runs with identical initial weights and training parameters will follow different trajectories. Shown are four different trials of on-line back-propagation applied to the 4-input parity problem with η = 0.3, α = 0. All networks started with the same initial random weights. This learning rate appears high for this example—the noise level is high and three of the nets get stuck in a local minimum.
Figure 6.8: On-line learning E(t) curves with and without momentum. Momentum also has a learning rate amplification effect in on-line learning. With a small learning rate (η = 0.01, using SSE) and no momentum, the weight changes are small and convergence is slow. With the same learning rate and α = 0.95, the weight changes are larger and this example converges quickly. Both examples were initialized with the same random weights.
Figure 6.9: With momentum, the current weight change Δw(t) is a combination of a step down the negative gradient, plus a fraction 0 ≤ α < 1 of the previous weight change. For α ≈ 1, opposing weight change components (horizontal) approximately cancel while complementary components (vertical) sum, producing a smoothing effect on the weight trajectory.
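The update of figure 6.9, sketched (illustrative names, not from the book):

    def momentum_step(w, grad, dw_prev, eta=0.1, alpha=0.9):
        # Delta-w(t) = -eta * gradient + alpha * Delta-w(t-1), with 0 <= alpha < 1.
        dw = -eta * grad + alpha * dw_prev
        return w + dw, dw

With a roughly constant gradient the steps approach -η·grad/(1 - α), which is the effective learning rate η' = η/(1 - α) referred to in figure 6.3.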
Figure 6.10: Gradient descent trajectories: (a) with a small step size (0.01), pure gradient descent follows a smooth trajectory, but progress may be very slow; (b) cross-stitching: when the step size is too big (0.1) and the error surface has valleys, the trajectory may oscillate wildly from one side to the other while creeping slowly along the length of the valley to the minimum. Upon reaching the neighborhood of a minimum, it may overshoot many times before settling down.
Figure 6.11: Gradient descent trajectories with momentum. With momentum, opposite (side to side) changes tend to cancel while complementary changes (along the length of the valley) tend to sum. The overall effect is to stabilize the oscillations and accelerate convergence. (a) When the step size is small, momentum acts to accelerate convergence (step size 0.01 and momentum 0.99, cf. figure 6.10a). (b) Small amounts of momentum also help to damp oscillations when the step size is too large (step size 0.1 and momentum 0.2, cf. figure 6.10b).
Figure 6.12: At the low end of the momentum range, increases in α generally lead to faster convergence. Here the E(t) curves are smooth. Occasional error spikes may occur but the system recovers quickly. All trajectories start from the same random weights.
Figure 6.13: E(t) curves for large momentum values. Convergence becomes unreliable when the momentum is too large for the learning rate η. This system converges quickly with α = 0.6 but not with higher values. With α = 0.99, E(t) oscillates strongly and the system jumps from a poor minimum to a worse one at about t = 100. All trajectories start from the same random weights.
Figure 6.14: Oscillation in E(t) due to momentum. With momentum, the weight vector has inertia, which allows it to coast up hillsides in the E(w) surface. The larger the momentum and the smaller the learning rate, the farther it can rise and the longer it takes to stop. This, in combination with the E(w) surface, can lead to oscillation. Mathematically, smaller η and larger α give the dynamic system a longer time-constant, visible here in a lower oscillation frequency. (These values were chosen to exaggerate the effect; they are not necessarily recommended.)
Figure 7.1: Initialization from a decision tree. The function computed by the decision tree can be duplicated by a network of linear threshold units [340]. The tree computes a Boolean function of the propositions evaluated at its decision nodes so a two-hidden-layer network is sufficient if the tree's node decisions can be computed by linear threshold elements. The network can be built by inspection of the tree. Training is not required: (a) a decision tree, and (b) the equivalent network.
Figure 8.1: (a) The error surface of a classifier often has many flat plateaus separated by steep cliffs. (b) Increasing the number of samples creates more steps and moves them closer together (adapted from [181], [183]).
Figure 8.2: With a smaller tanh gain (0.1 in this case), the step transitions are smoother (cf. figure 8.1a). Gain scaling is equivalent to scaling all weights by a constant factor, though, so it does not change the basic shape of the error surface. In the figure, it corresponds to zooming in for a closer view of the origin.
Figure 8.3: Replacing the {-1, 1} targets with {-0.9, 0.9} values produces a small dip at the bottom of each cliff. For illustration purposes, the tanh gain was reduced to 1/2 to make the dip wider and smaller targets were used to make it deeper.
Figure 8.4: When the samples are not linearly separable, the error surface has radial troughs and ridges. A gradient based optimization method could easily get stuck in one of the troughs corresponding to a poor solution (adapted from [181], [183]).
Figure 8.5: Single-pattern weight update vectors. With an SSE cost function, the E(w) surface is the sum of individual surfaces for each pattern and the total gradient is the sum of the single-pattern gradients. On a hillside (a), most of the vectors point in a dominant direction. On a ridge (b) or at the bottom of a valley (c), there are often two bundles of vectors pointing in opposite directions. At a local minimum (d) the vectors sum to zero; they may be large and evenly distributed in all directions, or they may all go to zero. Point (e) shows a relatively flat spot.
Figure 8.6: Local and global minima. A global minimum of a function can be defined as its lowest point, the input that gives the lowest possible output. A local minimum is a point that is lower than all its neighbors, but higher than the global minimum.
Figure 8.7: The error surface for a 1/1/1 network trained on the identity mapping illustrates that local minima do in fact exist. This surface has a good minimum in the lower left quadrant, a poor minimum in the upper right quadrant, and a saddle point at the origin: (a) the contour plot, and (b) a view looking across the origin down the axis of the poor minimum.
Figure 10.1: Conjugate directions. Powell's method uses the parallel subspace property of quadratic functions to find a set of conjugate directions without evaluating the Hessian. Given a quadratic function xTHx, a direction d, and two points x1 and x2, if one does a line search from x1 along direction d to obtain a minimum y1 and another line search from x2 along direction d to obtain y2 then direction (y2 - y1) is H-conjugate to d [311].
Figure 10.2: Gradient descent trajectories: (a) With a small step size (0.01), gradient descent follows a smooth trajectory, but progress may be very slow. (b) Cross-stitching: When the step size is too big (0.1) and the error surface has valleys, the trajectory may oscillate wildly from one side to the other while creeping slowly along the length of the valley to the minimum. Once it reaches the minimum, it may overshoot many times before settling.
Figure 10.3: Gradient descent trajectories with momentum. With momentum, opposite (side-to-side) changes tend to cancel while complementary changes (along the length of the valley) tend to sum. The overall effect is to stabilize the oscillations and accelerate convergence: (a) When the step size is small, momentum acts to accelerate convergence (step size 0.01 and momentum 0.99, cf. figure 10.2a); (b) small amounts of momentum also help to damp oscillations when the step size is too large (step size 0.1 and momentum 0.2, cf. figure 10.2b).
Figure 10.4: Cauchy's method uses the rule: wherever you are, evaluate the gradient and then move along that line until you find a minimum. In each major iteration, it evaluates the gradient once and then does a line search to find the minimum along that line.
Figure 10.5: The conjugate gradient method converges in just N iterations (2 in this case) for an N-dimensional quadratic error function, but each iteration involves a line search that may require many evaluations of the error function.
Figure 10.6: For quadratic error functions, Newton's method converges in a single step. For more general nonlinear functions, Newton's method converges very quickly where the Hessian matrix is positive definite (e.g., in the vicinity of a minimum) but may diverge elsewhere.
Figure 10.7: The Levenberg-Marquardt method is a compromise between Newton's method, which converges very quickly in the vicinity of a minimum but may diverge elsewhere, and gradient descent, which converges everywhere, albeit slowly. In general, the trajectory starts out like gradient descent but gradually approaches the Newton direction.
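A minimal sketch of a Levenberg-Marquardt step in its usual Gauss-Newton form (residuals and jacobian are illustrative callables and lam is the blending parameter; Python/NumPy assumed, not the book's own code):

    import numpy as np

    def lm_step(w, residuals, jacobian, lam):
        r = residuals(w)    # vector of per-pattern errors
        J = jacobian(w)     # dr/dw
        # Solve (J'J + lam*I) dw = -J'r: large lam behaves like gradient
        # descent, small lam like Newton's method near a minimum.
        A = J.T @ J + lam * np.eye(len(w))
        return w + np.linalg.solve(A, -J.T @ r)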
Figure 11.1: Selection. The genetic algorithm selects units for reproduction with probability proportional to their fitness. Units with higher fitness (corresponding to more slots on the wheel) are more likely to be chosen than units with lower fitness, but all units have some chance of being chosen.
Figure 11.2: Crossover. Crossover mixes genetic codes inherited from two parents by crossing the bit strings at one or two random points. The bit strings encode characteristics of the parents so the offspring receives traits of both parents, but is not identical to either. The crossing point is random, so the mix of characteristics transferred varies with each mating.
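Minimal sketches of the selection and crossover operations in figures 11.1 and 11.2 (illustrative, Python/NumPy assumed; bit strings represented as text):

    import numpy as np
    rng = np.random.default_rng()

    def select(population, fitness):
        # Roulette wheel: probability proportional to fitness, so every
        # unit has some chance of being chosen.
        p = np.asarray(fitness, dtype=float)
        return population[rng.choice(len(population), p=p / p.sum())]

    def crossover(parent_a, parent_b):
        # One-point crossover: cross the bit strings at a random point.
        cut = rng.integers(1, len(parent_a))
        return parent_a[:cut] + parent_b[cut:]

    pop = ["1010", "1100", "0011"]
    fit = [3.0, 2.0, 1.0]
    child = crossover(select(pop, fit), select(pop, fit))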
Figure 11.3: The function considered in the example.
Figure 11.4: Representation of a neural network as a LISP expression. W represents a multiplication operation by an input weight and P represents a summation and the node nonlinearity.
Figure 12.1: In many constructive algorithms, new units are added when the rate of improvement of the training error becomes small. Normalization by the magnitude of the error E(t0) obtained after the previous unit was added makes the triggering condition less dependent on the size of the error.
Figure 12.2: Cascade-correlation adds hidden units one at a time. Each new unit receives connections (shown as solid dots) from the external inputs and all existing hidden units. Its input weights are chosen to maximize the covariance of its response with the residual output error. Weights into existing hidden units remain unchanged. Then the weights from all the hidden units (new and old) to the output node are retrained to minimize the error. If the resulting error is acceptable, the network is complete; otherwise a new hidden unit is added and the process repeated.
Figure 12.3: The two-spirals problem [233] is sometimes used as a benchmark for constructive algorithms because it requires careful coordination of many hidden nodes and is difficult to learn with simple back-propagation in an MLP network. (It is not representative of most real-world problems, however.) In a single-hidden-layer architecture, 40 or more hidden nodes are generally needed and training times are long. Most successful solutions use more than one hidden layer; some use short-cut connections.
Figure 12.4: The upstart algorithm [128] starts with a single output unit Z. If errors remain after training, then daughter units X and Y are inserted to correct it. Ideally, X corrects Z when it is wrongly ON and Y corrects Z when it is wrongly OFF. If either X or Y cannot correct all the errors assigned to them, additional subunits are introduced to correct their errors and so on. The result is a tree structure for which there is an equivalent single-hidden-layer network.
Figure 12.5: The tiling algorithm [265] adds a new hidden layer at each iteration. A master unit in each layer attempts to classify as many patterns as possible based on input from the preceding layer. If it cannot classify all patterns correctly, then 'ancillary' units are added until the layer produces a 'faithful' representation such that any two input patterns with different targets produce a different pattern of activity on the layer. This then serves as input for the next layer. It is always possible to create a master unit which makes fewer errors than the master unit in the previous layer so convergence is guaranteed if enough layers are added.
Figure 12.6: The algorithm of Marchand, Golea, and Ruján [256] creates a single-hidden-layer network of linear threshold units by adding hidden units sequentially. Each new unit slices off a group of training patterns that share the same target value. Filled and empty circles represent training points with positive and negative targets. Lines show the hidden unit decision surfaces; the numbers to the side indicate the order in which the hidden units were created. All points are classified correctly, although some are very close to the boundary.
Figure 12.7: Constructive methods can be based on a Voronoi tessellation of the training points. The Voronoi diagram of a set of base points partitions the space into cells depending on which base point is closest. Each cell is a convex region bounded by hyperplanes. A layered network can be constructed which forms the same partition and generates the required outputs in each cell.
Figure 13.1: Effect of pruning: (a) response of an overtrained 2/50/10/1 network, (b) response after pruning (659 of the 671 original weights and 56 of the 64 original nodes have been removed to produce a 2/2/2/1 network—8 nodes including the bias node), (c) response after further training.
Figure 13.2: Pruning error versus weight saliency for optimal brain damage (OBD). Shown are the changes in error due to deletion of a weight versus the OBD saliency for a 10/3/10 network trained on the simple autoencoder problem.
Figure 13.3: Pruning error versus weight magnitude. One of the simplest pruning heuristics is that small weights can be deleted since they are likely to have the least effect on the output. Shown are the errors due to deletion of a weight versus weight magnitude for a 10/3/10 network trained on the simple autoencoder problem. There are two separate, approximately linear trends. The upper group contains input-to-hidden weights while the lower group contains hidden-to-output weights.
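The two pruning criteria compared in figures 13.2 and 13.3, sketched (h_diag is the diagonal of the Hessian of the error with respect to the weights; illustrative names, Python/NumPy assumed):

    import numpy as np

    def obd_saliency(w, h_diag):
        # Optimal brain damage: s_i = 0.5 * H_ii * w_i^2, the predicted
        # increase in error if weight i is set to zero.
        return 0.5 * h_diag * w**2

    def magnitude_saliency(w):
        # Simpler heuristic: small weights are assumed to matter least.
        return np.abs(w)

    def prune_one(w, saliency):
        w = w.copy()
        w[np.argmin(saliency)] = 0.0   # delete the least salient weight
        return w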
Figure 13.4: The normal weight-decay penalty term penalizes large weights heavily, which discourages their use even when they might be helpful. The weight-elimination penalty term in equation 13.19 saturates so large weights do not incur excess penalties once they grow past a certain value. This allows large weights when needed, while still encouraging the decay of small weights.
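The two penalty terms of figure 13.4, sketched (lam is the regularization strength and w0 the scale constant; assuming the usual form of the weight-elimination term in equation 13.19):

    import numpy as np

    def weight_decay_penalty(w, lam):
        # Grows without bound, so large weights are heavily discouraged.
        return lam * np.sum(w**2)

    def weight_elimination_penalty(w, lam, w0=1.0):
        # Saturates as |w| >> w0, so a weight that has already grown
        # large incurs little additional penalty.
        r = (w / w0)**2
        return lam * np.sum(r / (1.0 + r))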
Figure 13.5: A pruning problem. The original network is underconstrained for the 2-input XOR problem and chooses a correct but unexpected solution. (The training points are the four corners of the square.) A naive pruning algorithm is able to remove many redundant elements, but the resulting network is unlikely to generalize better than the original.
Figure 14.1: Samples alone do not provide enough information to uniquely determine an interpolating function. An infinite number of functions can be fit through the sample points; all are equally valid according to the sample set error and other criteria are needed to choose among them.
Figure 14.2: Underfitting and overfitting by a minimum-MSE approximation using M evenly spaced Gaussian basis functions with widths inversely proportional to M: (a) underfitting (M = 3), the approximation is too simple, (b) perhaps a reasonable fit (M = 5), and (c) possible overfitting (M = 30).
Figure 14.3: Generalization error versus network complexity for a minimum-MSE fit using M evenly spaced Gaussian basis functions for various sample sizes N. For intermediate sample sizes, the curve has a minimum at some intermediate value. (Each point is the average of 200 trials. Training samples are evenly spaced with some jitter but no additive noise. The target function is shown in the inset.)
Figure 14.4: For neural networks used as function approximators, generalization usually means interpolation, not extrapolation. A 1/5/1 network with tanh hidden nodes and a linear output was trained on 50 points uniformly distributed in the first half cycle of f(x) = sin(2πx). The fit is good near the training data, but poor elsewhere.
Figure 14.5: A 2/50/10/1 network with 671 weights trained on 31 points is very underconstrained. Although the points are nearly linearly separable, with some overlap, the decision surface is very nonlinear and is unlikely to generalize well. The solid line shows the network decision surface, the 0.5 contour. Other contours are omitted because the transitions are steep.
Figure 14.6: Generalization error versus sample size N for the Gaussian basis function approximation system of figure 14.3.
Figure 14.7: Training and test set errors versus sample size N for selected M values of the Gaussian basis function approximation system of figure 14.3. In general, the training set error is small for small sample sizes and rises as the number of samples increases while the test set error is large for small sample sizes and decreases as the number of samples increases. At large N, both approach the same asymptotic value.
Figure 14.8: Generalization error versus M for various values of the additive noise variance σ. With a fixed sample size and increasing amounts of additive noise, minima of the generalization error curves occur at lower M values. The fitting function is the system of Gaussian basis functions used in figure 14.3. Training sample size N = 20. Each point is the average of 200 trials.
Figure 14.9: As training progresses, the generalization error may decrease to a minimum and then increase again as the network adapts to idiosyncrasies of the training data. Training and test errors are shown for an underconstrained network trained by back-propagation on 30 examples of the iris data and tested on the 120 remaining examples. (An underconstrained net and small learning rate were used to demonstrate overtraining. Smaller networks can learn the data in a much shorter time.)
Figure 14.10: One explanation of overtraining is that the training error surface is similar to the true error surface but somewhat distorted so the apparent minimum is offset from the true minimum. The figure shows hypothetical error surfaces. Depending on how the network is initialized, the weight trajectory may pass over a true minimum on its way to the apparent (training error) minimum.
Figure 14.11: Different training algorithms may yield different generalization results. Shown are responses for a 2/2/1 net trained by various algorithms on the 2-bit XOR problem. Batch and on-line training appear to give reasonably smooth symmetric responses here (values in parentheses are the learning rate and momentum). RProp and conjugate gradient descent (CGD) appear to create less symmetric surfaces and allow sharp transitions near the training points at the corners. Networks (h), (j), and (k) appear sensitive to the training data and are unlikely to generalize well if the input values are changed slightly.
Figure 16.1: Effect of training with weight decay. A 2/50/10/1 network was trained using normal back-propagation for 200 epochs. Then weight decay was set to 1E-4 and training resumed for a total of 5000 epochs. Unlike figure 14.5, the decision surface is very simple and does not show obvious signs of overtraining.
Figure 16.2: Effects of weight decay: (a) response of a 2/50/10/1 network trained by batch back-propagation until all patterns were correctly classified at about 11,000 epochs; (b) response after 20,000 epochs of a network trained from the same starting point with weight decay increasing to 0.001 at 4000 epochs; and (c) response of the network in (b) after 1000 more epochs with the learning rate decreased to 0.01.
Figure 17.1: Convolution tends to be a smoothing operation. A step function, t(x), convolved with a Gaussian, pn(x), produces the Gaussian cumulative distribution. This resembles the original step function, but it is a smooth function similar to the sigmoid.
Figure 17.2: A nearest neighbor classification problem: (a) the Voronoi map for 14 points; and (b) the expected value of the classification given the noisy input as calculated in (17.5) for a Gaussian noise distribution with Σ = 0.1. (c) The convolution of the training set with the same Gaussian noise. The zero contours of (b) and (c) coincide. (1s and 0s indicate classes; values 1 and -1 were actually used.)
Figure 17.3: Smoothing effects of training with jitter: (a) an intentionally overtrained 2/50/10/1 feedforward network chooses an overly complex boundary and can be expected to generalize poorly; (b) the same network trained with Gaussian (Σ = 0.1) input noise forms a much smoother boundary and better generalization can be expected; and (c) the expression in equation 17.5 for the expected value of the target function at an arbitrary point x.
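Training with jitter, as in figure 17.3(b), amounts to adding fresh input noise at each presentation; a sketch using the backward() function given after figure 5.3 (illustrative, Python/NumPy assumed):

    import numpy as np
    rng = np.random.default_rng()

    def jittered_epoch(patterns, W1, W2, eta, sigma=0.1):
        # Perturb the input with fresh Gaussian noise each time a pattern
        # is presented; targets are left unchanged.
        for x, t in patterns:
            x_noisy = x + rng.normal(0.0, sigma, size=x.shape)
            g1, g2 = backward(x_noisy, t, W1, W2)
            W1, W2 = W1 - eta * g1, W2 - eta * g2
        return W1, W2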
Figure 17.4: The conventional sigmoid 1/(1 + e^-x) and the Gaussian cumulative distribution function (GCDF) (with σ = 4/√(2π), matching the sigmoid's slope of 1/4 at the origin) have very similar shapes and give similar results when used as the node nonlinearities. The GCDF is useful in this analysis because it is shape invariant when convolved with a spherical Gaussian noise density.
Figure 17.5: Equivalence of weight scaling and jitter averaging: (a) the transfer function of the original network and (b) its contour plot; (c) the average response with additive Gaussian input noise, σ = 0.1, averaged over 2000 noisy samples per grid point and (d) its contour plot; and (e) the expected response computed by scaling and (f) its contour plot.
Figure 17.6: Weight-decay effects of training with jitter: (a) weights for the overtrained network of figure 17.3(a), σ = 0.7262; and (b) weights for the jitter-trained network of figure 17.3(b), σ = 0.3841.
Figure 17.7: Training with static noise: (a) response of an underconstrained 2/50/10/1 net. The boundary is complex and transitions are steep, but the data is almost linearly separable. (b) Response of the same net trained on an enlarged data set obtained by replacing each original training point by 30 noisy points (σ = 0.1). The boundary is simpler and transitions are more gradual, but a few kinks remain. (1s and 0s denote the training points, the training values were 0.9 and -0.9.)
Figure 17.8: Cross-validation with jittered data. An artificial validation data set was created by generating 30 jittered points from each of the original 31 training points: (a) response of an underconstrained 2/50/10/1 net trained to convergence, and (b) response of the net with the best RMS error on the validation set (1s and 0s denote the training points, the training values were 0.9 and -0.9).
Figure 17.9: Smoothing an overtrained response. Given an overtrained net, a better estimate of the true function at a point x might be obtained by averaging a number of probes around x using a noisy input. The figure shows the expected response of the network in figure 17.7(a) to a noisy input (σ = 0.1). (1s and 0s denote the training points; the training values were 0.9 and -0.9.)
Figure B.1: Correlated data are partially redundant because knowledge of one variable gives approximate knowledge of the other variables. In two dimensions, perfectly correlated data lie on a straight line; in n dimensions, they lie on a lower dimensional subspace.
Figure B.2: Effect of nonzero-mean on the eigenvectors of the correlation matrix. (a) For zero-mean data, the eigenvectors of the correlation matrix indicate the main axes of data variation and the eigenvalues reflect the variance along each dimension. (b) For nonzero-mean data, the eigenvectors are rotated and the eigenvalues are influenced by the length of the offset vector.
Figure B.3: An autoencoder network maps an input vector to itself. Given an input, the goal is to reproduce it at the output. A small hidden layer acts as a bottleneck and forces the network to find a compressed representation of the input pattern. With a linear network and a least squares error function, the ideal internal representation is related to the principal components of the training data.
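A sketch of the linear case described in figure B.3 (for zero-mean data and a least squares error, the optimal k-unit linear bottleneck spans the top k principal components; illustrative, Python/NumPy assumed):

    import numpy as np

    def linear_autoencoder(X, k):
        # Rows of X are patterns; center, then take the top-k right
        # singular vectors as the encoder/decoder weights.
        Xc = X - X.mean(axis=0)
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Vt[:k], Vt[:k].T    # encoder (k x n), decoder (n x k)

    X = np.random.default_rng(0).normal(size=(100, 5))
    enc, dec = linear_autoencoder(X, 2)
    X_hat = (X - X.mean(0)) @ enc.T @ dec.T + X.mean(0)   # best rank-2 reconstruction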
Figure B.4: Although principal components analysis sometimes produces directions useful for discriminating between classes, this is not always true. In (a) the main contribution to the variance of the input data comes from the separation between class means so the principal component directions are useful for discriminating between the classes. In (b), however, the major axis of variation is along direction v1 but the classes are separated along the minor direction v2.
Figure B.5: The projection vectors found by discriminant analysis (MDA) and principal components analysis (PCA) for a simple data set.
Figure B.6: Projection histograms for the MDA and PCA directions shown in figure B.5: (a) the principal components projection shows cluster overlap, and (b) the discriminant analysis projection separates the clusters well.
Figure B.7: A single-hidden-layer linear network trained to perform classification with a 1-of-N target representation implements a form of discriminant analysis [385, 134].
