Learning representations by back-propagating errors

Metadata

Reading status: read complete
Year: 1986
Compute regime: Pre-2012 CPU and statistical foundations (pre_2012_cpu_statistical_foundations)
PDF: 1986-backprop_errors_1986.pdf
Extracted text: 1986-backprop_errors_1986.txt
PDF URL: https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf
OpenAlex: https://openalex.org/W1498436455
Citation count source/date: OpenAlex 2026-06-15
Citation count: 30757
Reading card created: 2026-06-15

Compute Setup

The paper does not name a machine, processor, memory size, runtime, or wall-clock training time. The linked extracted text file is effectively empty, so this card is grounded in the linked local PDF page images. Under the project rule, the setting is inferred as a 1985-1986 CPU research workstation or minicomputer environment with tiny enumerated datasets and no GPU/accelerator. The paper discusses compatibility with parallel hardware, but does not claim the reported experiments used it.

The problem scale, unlike the hardware, is stated explicitly. The symmetry example learns over 64 binary input vectors and requires 1,425 sweeps, with weights adjusted from accumulated gradients after each sweep. The family-tree example trains a five-layer feed-forward network on 100 of 104 possible triples for 1,500 sweeps, with the learning rate (epsilon) and momentum coefficient (alpha) changed after the first 20 sweeps and weight decay applied after each weight change.
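
As a rough sense of scale, the sweep counts above can be turned into total case presentations. This is plain arithmetic on the figures quoted in this card, not a number the paper itself reports; the variable names are illustrative.

```python
# Case presentations implied by the case and sweep counts quoted above.
symmetry_cases, symmetry_sweeps = 64, 1425
family_cases, family_sweeps = 100, 1500

symmetry_presentations = symmetry_cases * symmetry_sweeps
family_presentations = family_cases * family_sweeps

print(symmetry_presentations)  # 91200
print(family_presentations)    # 150000
```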

Bottleneck

The bottleneck is credit assignment in hidden layers. If only input and output units have externally specified states, a learning system needs a way to decide how each internal unit should change. Earlier perceptron-style procedures could adjust direct input-output weights, but hidden units made learning harder because their desired states were not provided by the task. The paper frames this as the difference between fixed "feature analysers" and truly hidden units.

There is also a compute bottleneck. For a finite case set, the total error sums squared differences between actual and desired states over every case and output unit, and learning requires the partial derivative of that error with respect to every weight. Backpropagation reuses local derivatives layer by layer, so the gradient for each case can be computed by one forward pass followed by one backward pass, and the per-case gradients are then summed over cases.
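
For reference, the total error can be written as follows, with c indexing cases, j indexing output units, y the actual state of an output unit and d its desired state. The second line is the per-case derivative from which the backward pass starts.

```latex
E = \frac{1}{2} \sum_{c} \sum_{j} \left( y_{j,c} - d_{j,c} \right)^{2}
\qquad
\frac{\partial E}{\partial y_{j}} = y_{j} - d_{j} \quad \text{(single case, subscript } c \text{ suppressed)}
```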

Method Adaptation

The method is what would now be called a computational-graph procedure for layered differentiable units. In the forward pass, each layer's states are computed from lower-layer outputs and weights; units within a layer can be set in parallel, while layers are set sequentially from bottom to top. Once output activations are known, the backward pass computes output derivatives, then uses the chain rule to propagate derivatives to earlier layers and compute dE/dw for each weight.
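
The forward and backward passes described above can be sketched in a few lines of plain Python. This is a minimal illustration assuming logistic units, squared error on one case, and no bias terms; the function names, the 6-2-1 layer sizes, the seed and the test input are choices made here, not code from the paper. A finite-difference check on one weight is included to confirm that the backward pass computes the intended derivative.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(weights, inputs):
    """Return all layer states, bottom to top.
    weights[l][j][i] connects unit i in layer l to unit j in layer l + 1."""
    states = [list(inputs)]
    for layer in weights:
        below = states[-1]
        states.append([logistic(sum(w * y for w, y in zip(row, below)))
                       for row in layer])
    return states

def backward(weights, states, desired):
    """Return dE/dw for every weight, with E = 0.5 * sum_j (y_j - d_j)^2."""
    grads = [None] * len(weights)
    dE_dy = [y - d for y, d in zip(states[-1], desired)]
    for l in reversed(range(len(weights))):
        above, below = states[l + 1], states[l]
        # Through the logistic nonlinearity: dy/dx = y * (1 - y).
        dE_dx = [g * y * (1.0 - y) for g, y in zip(dE_dy, above)]
        # Each weight's derivative is local: dE/dx_j times the state below.
        grads[l] = [[dx * yi for yi in below] for dx in dE_dx]
        # Propagate to the layer below by summing over outgoing connections.
        dE_dy = [sum(dE_dx[j] * weights[l][j][i] for j in range(len(above)))
                 for i in range(len(below))]
    return grads

def error(weights, inputs, desired):
    out = forward(weights, inputs)[-1]
    return 0.5 * sum((y - d) ** 2 for y, d in zip(out, desired))

rng = random.Random(0)                      # illustrative seed
sizes = [6, 2, 1]                           # illustrative layer sizes
weights = [[[rng.uniform(-0.3, 0.3) for _ in range(sizes[l])]
            for _ in range(sizes[l + 1])]
           for l in range(len(sizes) - 1)]
inputs, desired = [1, 0, 1, 1, 0, 1], [1.0]

states = forward(weights, inputs)
grads = backward(weights, states, desired)

# Central finite-difference check on one first-layer weight.
h = 1e-6
weights[0][1][2] += h
e_plus = error(weights, inputs, desired)
weights[0][1][2] -= 2 * h
e_minus = error(weights, inputs, desired)
weights[0][1][2] += h                       # restore the original value
numeric = (e_plus - e_minus) / (2 * h)
print(abs(numeric - grads[0][1][2]))
```

Note that the backward pass reads every stored layer state from the forward pass, which is why the states list is kept rather than discarded.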

The paper distinguishes two update modes. One can change weights after every case, which avoids keeping separate storage for the accumulated derivatives, but the reported experiments accumulate dE/dw over all cases before changing weights. The acceleration method adds a momentum-like term: the current gradient modifies the velocity of the point in weight space rather than its position, with alpha a decay factor between 0 and 1 that sets the relative contribution of earlier gradients.
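
The accumulated-gradient mode with the acceleration term can be sketched as below. The update rule is delta_w(t) = -epsilon * dE/dw(t) + alpha * delta_w(t - 1), with t counting sweeps. Everything else here is an assumption made for illustration: a single weight, a linear unit, two made-up cases, and epsilon and alpha values picked so that the toy problem converges.

```python
def batch_gradient(w, cases):
    """dE/dw accumulated over all cases, for y = w * x and
    E = 0.5 * sum over cases of (y - d)^2."""
    return sum((w * x - d) * x for x, d in cases)

def momentum_step(w, velocity, grad, epsilon, alpha):
    """One weight change: the gradient alters the velocity, and the
    velocity (not the raw gradient) moves the weight."""
    velocity = -epsilon * grad + alpha * velocity
    return w + velocity, velocity

# Hypothetical (input, desired output) pairs; the best-fitting weight is 2.
cases = [(1.0, 2.0), (2.0, 4.0)]

w, velocity = 0.0, 0.0
for sweep in range(100):
    grad = batch_gradient(w, cases)          # one full sweep through the cases
    w, velocity = momentum_step(w, velocity, grad, epsilon=0.05, alpha=0.5)

print(w)
```

Changing weights after every case would instead call momentum_step inside the loop over cases, with grad computed from one case at a time.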

The recurrent-net discussion makes the memory issue explicit: unrolled intermediate states must be stored for the backward pass, and tied weights across layers require averaging their gradients before updating the shared weight.
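
The tied-weight rule can be illustrated with a small sketch. It assumes one shared weight that appears as three copies in an unrolled network, each copy with its own dE/dw from the backward pass; the function name and the numbers are hypothetical.

```python
def tied_update(copies, grads, epsilon):
    """Move every copy of a shared weight by the same amount,
    proportional to the average of the copies' gradients."""
    average = sum(grads) / len(grads)
    return [w - epsilon * average for w in copies]

copies = [0.5, 0.5, 0.5]      # the same weight at three unrolled time steps
grads = [0.2, -0.1, 0.5]      # hypothetical per-copy dE/dw values
updated = tied_update(copies, grads, epsilon=0.1)

print(updated)
```

Because every copy receives the identical change, copies that start equal stay equal, which is what keeps the unrolled network equivalent to the recurrent one.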

Evidence

The symmetry task detects mirror symmetry in a six-dimensional binary input vector. Because there are 64 possible inputs, the authors train on the complete input space. The learned solution uses two hidden units and one output unit; Figure 1 states that training required 1,425 sweeps through the 64 cases. The learned weights encode symmetry by giving mirrored input positions equal-magnitude, opposite-sign effects on the hidden units.
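
The structure of the symmetry solution can be checked by enumeration. The sketch below builds all 64 six-bit inputs and shows that a unit whose mirrored input positions carry equal-magnitude, opposite-sign weights receives zero net input exactly on the symmetric vectors. The weight values are illustrative, not the learned ones; they are set in a 1:2:4 ratio on each side of the midpoint so that no asymmetric pattern can cancel to zero.

```python
from itertools import product

# All 2^6 binary input vectors.
cases = list(product([0, 1], repeat=6))

def is_symmetric(v):
    return v == v[::-1]

symmetric = [v for v in cases if is_symmetric(v)]

# Illustrative hidden-unit weights: mirrored positions have equal
# magnitude and opposite sign (position 0 with 5, 1 with 4, 2 with 3).
hidden_weights = [1.0, 2.0, 4.0, -4.0, -2.0, -1.0]

def net_input(v):
    return sum(w * x for w, x in zip(hidden_weights, v))

# Vectors for which the mirrored contributions cancel exactly.
balanced = [v for v in cases if net_input(v) == 0.0]

print(len(cases), len(symmetric), balanced == symmetric)
```

Any nonzero net input therefore signals asymmetry, in either direction.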

The family-tree task encodes triples of person 1, relationship, and person 2 across two isomorphic family trees. Figure 3 describes a five-layer network with 24 input units for the first person, 12 for the relationship, hidden layers for people and relations, a central layer of 12, a penultimate layer of 6, and output units for the second person. The paper reports training on 100 of 104 triples for 1,500 sweeps and generalizing correctly to the four withheld triples. Figure 4 interprets hidden units as capturing family branch/generation and English/Italian distinctions not explicit in the labels.

The source also gives a compute caveat: the procedure is not rapid second-order optimization. The authors say the simple method does not converge as fast as methods that use second derivatives, but is much simpler and can be implemented by local computations in parallel hardware.

Historical Effect

This paper established backpropagation as a practical way to train internal representations in layered differentiable networks. Its compute contribution is the reusable backward pass: once forward activations are stored, gradients for all weights can be computed by local chain-rule operations. That made hidden units trainable without manually specifying intermediate targets.

The small experiments demonstrated the qualitative effect: symmetry detectors and distributed family-tree features emerged from gradient computation rather than hand design. Later neural networks used vastly larger devices and datasets, but the same forward/backward computational pattern remained central.

Limits

The paper reports toy-scale demonstrations and no named hardware. There is no accelerator, memory, throughput, or wall-clock evidence in the source, so any device statement beyond the era-level inference would be unsupported. The experiments use complete or nearly complete enumerated case sets, not noisy large-scale datasets.

The authors also note optimization limits. Gradient descent is not guaranteed to find a global minimum; the authors report that poor local minima were rare in their experience and arose mainly in networks with just enough connections to perform the task. The procedure is also not presented as a plausible full model of biological learning. Finally, storing intermediate activations for the backward computation is already visible as a requirement; at 1986 toy scale this is small, but it becomes a central memory cost of later deep learning.