A fast learning algorithm for deep belief nets

Download PDF

A fast learning algorithm for deep belief nets

Metadata

Compute Setup

The paper gives a rare explicit CPU-era device statement. The MNIST network was trained in Matlab on a 3GHz Xeon processor. Greedy layer-by-layer training took "a few hours per layer" on that machine, and after up-down fine-tuning plus additional training on all 60,000 examples, total learning time was about a week. This places the result firmly before commodity GPU deep learning.

The model scale was modest by later standards but large for the time. The architecture in the table is 784 -> 500 -> 500 <-> 2000 <-> 10, with roughly 1.7M weights according to the paper. The MNIST setup uses 60,000 training images and 10,000 official test images. Initial training used 44,000 images divided into 440 balanced mini-batches, each containing 10 examples of each digit, with weights updated after each mini-batch. Each RBM layer was trained for 30 sweeps through the training set, then the whole network was fine-tuned for 300 epochs with the up-down algorithm, followed by 59 more epochs on all 60,000 training images.

Bottleneck

The bottleneck is probabilistic inference and credit assignment in deep directed models under CPU constraints. Densely connected directed belief nets suffer from explaining away, which makes posterior inference over hidden variables intractable except in special cases. MCMC could in principle sample from the posterior, but it is too expensive to put inside every learning update on a 3GHz CPU. Standard variational approximations can be poor, especially at high hidden layers.

The hardware constraint shows up in the experimental choices. MNIST is treated as a permutation-invariant problem: no convolution, weight sharing, or geometric preprocessing is used. That makes the result a clean test of deep generative learning, but it also means the model cannot exploit the image-specific locality that would later make CNNs efficient on GPUs. The authors explicitly say they had not explored distorted-data augmentation for their generative model because many distortion types would need investigation and the fine-tuning algorithm was currently too slow.

Method Adaptation

The method is a compute adaptation as much as a learning algorithm. Instead of fitting a deep directed network all at once, the paper trains one layer at a time as a restricted Boltzmann machine. Contrastive divergence replaces full Boltzmann-machine maximum likelihood because it works well and is much faster. After a layer is trained, its hidden activations become the "data" for the next layer. This greedy pretraining gives the deep model weights that are already in a good region before the slow global fine-tuning begins.

The architecture is also designed for parallel local updates, even though the reported implementation is CPU Matlab. In an RBM, units within a layer have no intra-layer connections, so all hidden units can be updated in parallel given visible states, and all visible units can be updated in parallel given hidden states. The implementation did not exploit modern GPU kernels, but the algorithmic shape anticipated later accelerator-friendly neural computation.

After greedy training, the paper unties recognition and generative weights and applies a contrastive version of wake-sleep, called the up-down algorithm. The fine-tuning remains expensive, but the greedy stage reduces the search problem enough that fine-tuning becomes feasible within about a week rather than being hopelessly unstable or too slow from random initialization.

Evidence

The paper's evidence is tightly linked to its compute claim. On the basic, permutation-invariant MNIST task, the greedy pretrained network initially reaches 2.49% test error after a few hours per layer. After 300 epochs of up-down fine-tuning and the final 59 epochs on all 60,000 training cases, it reaches 1.25% error on the official 10,000-image test set. The authors report an intermediate validation-selected network at 1.39%, then the final 1.25% network after full-data training.

The comparison table makes the result historically significant. The generative model at 784 -> 500 -> 500 <-> 2000 <-> 10 reaches 1.25% on the permutation-invariant setting. The paper cites a support vector machine at 1.4%, backprop networks at 1.51% and 1.53%, and another backprop baseline at 2.95%. The authors also acknowledge that domain-specific image methods with distorted data and convolutional structure can do better, including LeNet-style results around 0.95% and later distorted-data methods below that. But those methods use image priors or augmentation; the DBN result shows that unsupervised layerwise pretraining could make a deep, dense generative model competitive on CPU-era hardware.

The training schedule is itself evidence of the bottleneck. Raising Gibbs iterations in the associative memory from three to six to ten during successive 100-epoch blocks reduced validation error, but each increase added sampling cost. Better approximate inference improved the model, but every extra Gibbs step consumed scarce wall-clock time.

Historical Effect

Deep belief nets made deep learning look trainable again before the GPU/CNN breakthrough. Their historical role was not that RBMs became the final architecture for vision or language, but that greedy unsupervised pretraining provided a practical recipe under limited compute and poor random-initialization behavior. The method decomposed a hard global optimization problem into tractable local ones, then used a slower global pass.

In the compute spine, this is a pre-2012 CPU foundation. The paper shows a characteristic pattern of the era: clever probabilistic structure and layerwise training compensated for weak devices. Later GPUs reduced the need for this exact pretraining scheme, but the paper helped establish depth, learned representations, minibatch training, and generative pretraining as central ideas.

Limits

The limits are largely compute and inductive-bias limits. Training takes about a week in Matlab on a 3GHz Xeon for MNIST, a small dataset by modern standards. The fine-tuning algorithm is explicitly too slow for broad exploration of distorted-data augmentation, and the paper does not demonstrate scaling to natural images, speech, or larger corpora. Because the reported model is permutation-invariant, it leaves performance on the table relative to convolutional methods that use image geometry.

The learning procedure also remains complex. It combines RBM pretraining, contrastive divergence, stochastic binary units, an associative memory, Gibbs sampling schedules, validation-selected hyperparameters, and up-down fine-tuning. The paper makes deep generative learning feasible on CPU hardware, but not simple, fast, or broadly automated. Later accelerator-era backpropagation would replace much of this machinery with larger datasets, rectified units, normalization, and direct supervised or self-supervised training.

Links

  • Compute regime: history/compute_regimes/pre_2012_cpu_statistical_foundations/README.md
  • Source PDF and extracted text are listed in metadata above.
  • Queue status: read_complete.