Gradient-based learning applied to document recognition

Metadata

  • Slug: lenet_document_recognition_1998
  • Year: 1998
  • Venue: Proceedings of the IEEE
  • Authors: Yann LeCun, Leon Bottou, Yoshua Bengio, Patrick Haffner
  • Reading status: read complete, extraction poor
  • Compute regime: Pre-2012 CPU and statistical foundations
  • Primary sources: PDF, extracted text

Compute Setup

The extracted text is badly corrupted and does not reliably expose a hardware section. I checked the local extracted text and the PDF text output; no explicit processor, accelerator, or training-machine claim is recoverable enough to cite. By the project rule, the device context is inferred from the 1997-1998 research period: CPU/workstation-era neural-network training and document-recognition deployment, with no GPU evidence in the usable local source.

The source still gives useful compute structure. The readable architecture figure labels LeNet-5 as a pipeline from 32x32 input through convolution, subsampling, convolution, subsampling, fully connected layers, and a 10-class output. The visible labels include C1 feature maps 6@28x28, C3 feature maps 16@10x10, S4 feature maps 16@5x5, C5 with 120 units, F6 with 84 units, and output 10. The readable comparison fragments also list LeNet-5 alongside the values 60 and 401, and boosted LeNet-4 alongside 51 and 460. The 60 is consistent with about 60 thousand trainable parameters. The 401 is better read as thousands of multiply-accumulate operations per recognized character than as a connection count, since the paper's own connection total for LeNet-5 is 340,908. Because the prose is corrupted, these should be treated as table-fragment evidence rather than a clean hardware/runtime report.
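The feature-map sizes in the figure labels can be reproduced with simple shape arithmetic. The sketch below is illustrative only: it assumes 5x5 kernels with stride 1 and no padding (which is what 32 -> 28 implies) and 2x2 non-overlapping subsampling. The S2 line it prints is not among the labels readable in the extraction; it follows from those assumptions. Function names are hypothetical.

```python
def conv_out(size, kernel):
    """Side length after a 'valid' convolution (no padding, stride 1)."""
    return size - kernel + 1


def subsample_out(size, window=2):
    """Side length after non-overlapping window x window subsampling."""
    return size // window


def lenet5_shapes(input_size=32, kernel=5):
    """Return (layer, maps, side) tuples for the convolutional trunk.

    Map counts come from the figure labels; the kernel and window sizes
    are assumptions inferred from the labelled map sizes.
    """
    stages = [
        ("C1", 6, "conv"),
        ("S2", 6, "sub"),
        ("C3", 16, "conv"),
        ("S4", 16, "sub"),
        ("C5", 120, "conv"),
    ]
    side = input_size
    shapes = []
    for name, maps, kind in stages:
        if kind == "conv":
            side = conv_out(side, kernel)
        else:
            side = subsample_out(side)
        shapes.append((name, maps, side))
    return shapes


for name, maps, side in lenet5_shapes():
    print(f"{name}: {maps}@{side}x{side}")
```

Under these assumptions C5 ends up as 120 maps of size 1x1, which is why it behaves like a 120-unit fully connected layer on top of S4.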

Bottleneck

The bottleneck is practical document recognition under limited 1990s compute: robust end-to-end recognition of handwritten and machine-printed material without relying on hand-built features for every document variation. A fully connected neural net over 28x28 or 32x32 pixels spends parameters on arbitrary pixel-to-hidden connections and ignores image locality. Template methods and k-NN can improve accuracy but often push cost to inference-time search or deformation matching.

LeNet's compute bottleneck is therefore parameter efficiency and local invariance. The model needs enough capacity to recognize shifted and distorted digits, but the training and inference budget is CPU-era. Weight sharing and local receptive fields reduce parameter count, while subsampling reduces spatial resolution as features become more abstract. The architecture is a response to memory, arithmetic, and data-regime constraints at once.
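To make the parameter-efficiency argument concrete, the sketch below counts trainable parameters for the first layer alone under three connection schemes: shared 5x5 kernels, the same local receptive fields without weight sharing, and a dense layer from all 32x32 inputs to all 6x28x28 outputs. The 5x5 kernel size and the one-bias-per-unit convention are assumptions, and the function name is hypothetical; only the 32x32 input and 6@28x28 output sizes come from the card.

```python
def c1_parameter_counts(in_side=32, out_side=28, maps=6, kernel=5):
    """Trainable parameters for a first layer producing maps@out_side x out_side."""
    window = kernel * kernel
    units = maps * out_side * out_side

    # Weight sharing: one kernel and one bias per feature map.
    shared = maps * (window + 1)

    # Local receptive fields, but every output unit owns its weights and bias.
    local_unshared = units * (window + 1)

    # Fully connected: every output unit sees every input pixel, plus a bias.
    dense = units * (in_side * in_side + 1)

    return {"shared": shared, "local_unshared": local_unshared, "dense": dense}


counts = c1_parameter_counts()
for scheme, n in counts.items():
    print(f"{scheme:>15}: {n:,} parameters")
```

Locality alone cuts the dense count by roughly a factor of 40, and sharing then cuts it by another factor of 784 (one per output position), which is the sense in which the architecture responds to memory constraints.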

Method Adaptation

LeNet-style convolutional networks fit this setup by using local receptive fields and shared weights, so one learned detector is reused across spatial positions. This converts image recognition from a large dense matrix problem into repeated small-kernel computations over feature maps. On CPUs, that trades arbitrary connection memory for regular loops over arrays. It also improves sample efficiency because each filter sees many spatial examples per image.
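The "regular loops over arrays" can be sketched in a few lines of plain Python. This is a minimal illustration of one shared kernel sliding over one input map, not the paper's implementation: it computes a valid cross-correlation (no kernel flip, as is usual in neural-network code), omits the nonlinearity, and uses hypothetical function names.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Slide one shared kernel over a 2-D image (lists of lists), no padding."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = bias
            # The same kernel weights are reused at every (r, c) position.
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            out[r][c] = acc
    return out


def mac_count(in_side, kernel_side):
    """Multiply-accumulates for one kernel over one square input map."""
    out_side = in_side - kernel_side + 1
    return out_side * out_side * kernel_side * kernel_side


image = [[1.0] * 32 for _ in range(32)]
kernel = [[1.0] * 5 for _ in range(5)]
feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # spatial size of the output map
print(mac_count(32, 5))                       # arithmetic cost for this one map
```

The kernel holds only 25 weights, yet it is applied at 784 positions, so each weight receives gradient contributions from many image locations per example.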

The alternating convolution/subsampling structure adapts to both invariance and compute. C1 produces six 28x28 feature maps from the 32x32 input; later stages reduce resolution through subsampling and increase feature diversity through more maps. By the time the model reaches C5 and F6, the spatial grid is small enough for fully connected classification. The architecture therefore spends dense computation only after convolutional stages have compressed the image.
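A layer-by-layer tally shows where the trainable parameters sit. The sketch below relies on details from the standard published description of LeNet-5 that are not recoverable from the card's extracted text, so treat them as assumptions: 5x5 kernels, one coefficient and one bias per subsampling map, a sparse C3 connection scheme in which six maps read 3 S2 maps, nine read 4, and one reads all 6, and an output layer whose 84x10 prototype parameters are excluded from the trainable count.

```python
K = 5 * 5  # weights in one 5x5 kernel (assumed kernel size)

# Number of S2 maps feeding each of the 16 C3 maps (assumed sparse scheme).
C3_FAN_IN = [3] * 6 + [4] * 9 + [6]


def lenet5_trainable_parameters():
    """Per-layer trainable parameter counts under the stated assumptions."""
    return {
        "C1": 6 * (K + 1),                        # 6 kernels, each with a bias
        "S2": 6 * 2,                              # coefficient + bias per map
        "C3": sum(n * K + 1 for n in C3_FAN_IN),  # n kernels + 1 bias per map
        "S4": 16 * 2,
        "C5": 120 * (16 * K + 1),                 # each unit sees all of S4
        "F6": 84 * (120 + 1),
    }


counts = lenet5_trainable_parameters()
total = sum(counts.values())
dense_part = counts["C5"] + counts["F6"]

for layer, n in counts.items():
    print(f"{layer}: {n:,}")
print(f"total: {total:,}")
print(f"C5 + F6 share: {dense_part / total:.1%}")
```

Under these assumptions nearly all of the parameter budget sits in C5 and F6, while C1 through S4 are cheap in parameters because their weights are shared across positions. The dense stages are affordable only because they operate on 16 maps of 5x5 rather than on the raw 32x32 image.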

The paper also situates LeNet inside larger document-recognition systems. The card should be read as a document-processing compute story, not just MNIST classification. A compact trainable recognizer can be composed with segmentation, graph-transformer-style parsing, and global training of document pipelines. The same principle recurs later in deep learning: move hand-engineered stages into differentiable components when the compute budget can support it.

Evidence

Only table and figure fragments are reliable in the extracted text. The clearest benchmark fragment compares classifiers on digit recognition by test error rate. It lists 28x28 fully connected neural nets at 4.7%, 4.5%, 3.05%, and 2.95% error depending on hidden-layer sizes, with distortion variants improving some values. The same fragment lists LeNet-1 at 1.7%, LeNet-4 at 1.1%, LeNet-4 with local or k-NN last layers at 1.1%, LeNet-5 at 0.95%, distorted LeNet-5 at 0.8%, and boosted distorted LeNet-4 at 0.7%. This is the main source-backed evidence that convolution plus subsampling bought accuracy at a compact model scale.

The architecture figure gives the compute mechanism behind the error table: 32x32 input, 6@28x28 C1 maps, 16@10x10 C3 maps, 16@5x5 S4 maps, a 120-unit C5 layer, an 84-unit F6 layer, and a 10-unit output. The table fragments around the model comparison also show LeNet models with far fewer learned parameters than large fully connected baselines while keeping error lower. Because the local extraction is mojibake-heavy, the safest claim is qualitative-plus-fragmentary: LeNet's structure reduces dense connectivity and improves recognition accuracy, but exact prose claims about runtime or hardware should not be made from this source.
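The parameter gap against the fully connected baselines can be estimated with a short calculation. The hidden-layer sizes below (300, 1000, 300-100, 500-150) and their pairing with the four error rates are taken from the published comparison table rather than from the card's fragments, and one bias per unit is assumed, so the counts are indicative only. The helper name is hypothetical.

```python
def mlp_parameters(layer_sizes):
    """Weights plus biases for a fully connected net with the given layer sizes."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


# (label, layer sizes, test error in percent); sizes and pairing are assumptions.
BASELINES = [
    ("28x28-300-10", [784, 300, 10], 4.7),
    ("28x28-1000-10", [784, 1000, 10], 4.5),
    ("28x28-300-100-10", [784, 300, 100, 10], 3.05),
    ("28x28-500-150-10", [784, 500, 150, 10], 2.95),
]

LENET5_PARAMETERS = 60_000  # approximate figure quoted in the card

for label, sizes, error in BASELINES:
    n = mlp_parameters(sizes)
    ratio = n / LENET5_PARAMETERS
    print(f"{label}: {n:,} parameters ({ratio:.1f}x LeNet-5), {error}% error")
```

Even the smallest dense baseline carries several times LeNet-5's parameter count while reporting a much higher error rate, which is the qualitative pattern the fragments support.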

Historical Effect

This paper is the CPU-era convolutional network anchor. It established weight sharing, subsampling, and end-to-end gradient training as a device-efficient visual recognition structure long before GPU deep learning made large CNNs mainstream. In compute-structure terms, LeNet is important because it shows that architecture can substitute for brute force: exploit locality and translation structure to get useful accuracy within a small-parameter, CPU-era budget.

Its influence runs directly to later GPU CNNs. AlexNet did not invent convolutional compute; it scaled an already-established architecture family with more data, larger models, ReLUs, GPUs, and high-throughput training. LeNet is the earlier demonstration that convolutional weight sharing suits the locality and translation structure of visual data.

Limits

The local extracted text is mojibake-heavy; numbers and figure labels are partially readable but prose is unreliable. Hardware and runtime claims should be checked against a cleaner source before being used in summaries. The paper predates ImageNet-scale supervised visual training, commodity GPU acceleration, and modern data augmentation pipelines. Its benchmark evidence is strongest for small grayscale digit/document-recognition tasks, not for large-scale natural-image recognition.

Links