DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Metadata

Reading status: read complete
Year: 2019
Compute regime: Efficient and edge inference (efficient_edge_inference)
PDF: 2019-distilbert_2019.pdf
Extracted text: 2019-distilbert_2019.txt
PDF URL: https://arxiv.org/pdf/1910.01108.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The paper explicitly gives both pretraining and inference devices. DistilBERT is trained on English Wikipedia plus Toronto Book Corpus on "8 16GB V100 GPUs" for approximately 90 hours. The training recipe uses dynamic masking, drops next sentence prediction, and relies on gradient accumulation for very large effective batches, up to 4K examples per batch. The paper contrasts this with RoBERTa, reported as requiring 1 day on 1024 32GB V100 GPUs, so distillation lowers the accelerator envelope without making pretraining free.
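The gradient accumulation mentioned above can be illustrated with a toy example. This is a minimal sketch, not the authors' training code: it assumes a hypothetical one-parameter linear model with a mean-squared-error loss, and the data and micro-batch size are made up for illustration. It shows that summing per-micro-batch gradients, each weighted by its share of the examples, reproduces the full-batch gradient.

```python
# Toy illustration of gradient accumulation (hypothetical model and data,
# not the DistilBERT training code).

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch_size):
    """Accumulate gradients over micro-batches, weighting each by its size."""
    total = 0.0
    n = len(xs)
    for start in range(0, n, micro_batch_size):
        mb_x = xs[start:start + micro_batch_size]
        mb_y = ys[start:start + micro_batch_size]
        # Scale so the sum over micro-batches equals the full-batch mean.
        total += mse_grad(w, mb_x, mb_y) * (len(mb_x) / n)
    return total

xs = [0.5 * i for i in range(16)]      # made-up inputs
ys = [3.0 * x + 1.0 for x in xs]       # made-up targets
w = 0.25

full = mse_grad(w, xs, ys)
accum = accumulated_grad(w, xs, ys, micro_batch_size=4)
print(abs(full - accum) < 1e-9)
```

In a real framework the optimizer step would be taken only after the last micro-batch. Memory use is then set by the micro-batch size, while the update behaves like one computed on the large batch.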

For inference, the paper times a full pass over GLUE STS-B on an Intel Xeon E5-2690 v3 Haswell at 2.9GHz with batch size 1. It also builds a question-answering application and compares BERT-base and DistilBERT on an iPhone 7 Plus, excluding tokenization from measured neural inference time.

Bottleneck

BERT-base is large enough to stress deployment budgets: 110M parameters and a deep stack of Transformer blocks, each running attention and feed-forward sublayers over every token. The paper frames growing compute and memory requirements as a barrier to wider adoption. The bottleneck is not only training FLOPs; it is also parameter storage, activation movement through many layers, and low-batch inference latency.

The STS-B CPU measurement exposes the batch-size-one case where batching cannot hide per-layer overhead. The iPhone QA measurement adds local storage and memory pressure. DistilBERT targets the part of BERT that most directly scales wall-clock time: depth. The authors state that changing hidden size has less effect on computation efficiency than changing factors such as number of layers, because many linear algebra kernels are already well optimized.

Method Adaptation

DistilBERT keeps the BERT Transformer layout but makes the student shallower. It removes token-type embeddings and the pooler and cuts the number of layers by a factor of two. That preserves the dense GEMM-friendly structure of BERT while reducing the number of sequential layer-normalization, attention, and feed-forward passes.
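A rough parameter count shows why halving the depth removes about 40% of the parameters rather than 50%: the embedding table is untouched. The sketch below assumes the standard BERT-base uncased configuration (vocabulary 30,522, hidden size 768, feed-forward size 3,072, 512 positions, 2 token types), none of which is stated in this card. It counts only the encoder, without any language-modeling or task head.

```python
# Rough encoder parameter count under an ASSUMED BERT-base uncased config.
VOCAB, HIDDEN, FFN, MAX_POS, TYPE_VOCAB = 30522, 768, 3072, 512, 2

def layer_params(hidden=HIDDEN, ffn=FFN):
    """Parameters in one Transformer block (weights and biases)."""
    attention = 4 * (hidden * hidden + hidden)   # Q, K, V and output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    layer_norms = 2 * (2 * hidden)               # two LayerNorms, scale + bias
    return attention + feed_forward + layer_norms

def encoder_params(num_layers, token_types=True, pooler=True):
    """Embeddings + blocks (+ optional token-type embeddings and pooler)."""
    # Word embeddings, position embeddings, embedding LayerNorm.
    embeddings = VOCAB * HIDDEN + MAX_POS * HIDDEN + 2 * HIDDEN
    if token_types:
        embeddings += TYPE_VOCAB * HIDDEN
    total = embeddings + num_layers * layer_params()
    if pooler:
        total += HIDDEN * HIDDEN + HIDDEN
    return total

bert = encoder_params(12)
distil = encoder_params(6, token_types=False, pooler=False)
print(bert, distil, round(1 - distil / bert, 3))
```

Under these assumptions the counts land close to the 110M and 66M figures quoted in this card. The embeddings, roughly 24M parameters, are shared by both models, which is why six fewer blocks give a reduction of about 39% rather than half.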

Training is shaped by the same limits. DistilBERT is trained with three combined losses: a distillation loss against BERT's soft token predictions, the usual masked language modeling loss, and a cosine loss that aligns the directions of student and teacher hidden states. The student is initialized from every other teacher layer, which is possible because the two networks share the same hidden dimensionality, and training uses gradient accumulation to reach effective batches of up to 4K examples on the 8 V100s.
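The three training signals above (soft teacher predictions, masked language modeling, and hidden-state cosine alignment) can be sketched for a single masked position. This is an illustrative sketch in plain Python, not the authors' implementation. The temperature of 2.0 and the equal loss weights are placeholder assumptions, since this card does not record the values used, and the logits and hidden vectors are invented.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature):
    """Cross-entropy of the student against the teacher's softened distribution."""
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return -sum(ti * math.log(si) for ti, si in zip(t, s))

def mlm_loss(student_logits, target_index):
    """Standard cross-entropy against the true masked token."""
    return -math.log(softmax(student_logits)[target_index])

def cosine_loss(student_hidden, teacher_hidden):
    """1 - cosine similarity; zero when the two vectors point the same way."""
    dot = sum(a * b for a, b in zip(student_hidden, teacher_hidden))
    norm_s = math.sqrt(sum(a * a for a in student_hidden))
    norm_t = math.sqrt(sum(b * b for b in teacher_hidden))
    return 1.0 - dot / (norm_s * norm_t)

def triple_loss(student_logits, teacher_logits, target_index,
                student_hidden, teacher_hidden,
                temperature=2.0, w_ce=1.0, w_mlm=1.0, w_cos=1.0):
    # temperature and weights are hypothetical defaults, not values from the paper
    return (w_ce * distillation_loss(student_logits, teacher_logits, temperature)
            + w_mlm * mlm_loss(student_logits, target_index)
            + w_cos * cosine_loss(student_hidden, teacher_hidden))

# Invented example: a 4-token vocabulary and 3-dimensional hidden states.
student_logits = [2.0, 0.5, -1.0, 0.1]
teacher_logits = [2.5, 0.3, -1.2, 0.0]
loss = triple_loss(student_logits, teacher_logits, 0,
                   [0.2, 0.9, -0.4], [0.3, 1.0, -0.5])
print(round(loss, 4))
```

The cosine term only depends on direction, not magnitude, so it can be applied directly between student and teacher hidden states because both have the same width.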

Evidence

The main evidence is a size/latency/accuracy trade-off. On the GLUE dev set, BERT-base has a macro-score of 79.5 and DistilBERT 77.0, summarized as retaining 97% of BERT performance with 40% fewer parameters. On downstream tasks, BERT-base reaches 93.46 IMDb accuracy and 81.2/88.5 SQuAD EM/F1; DistilBERT reaches 92.82 on IMDb and 77.7/85.8 on SQuAD, with an additional SQuAD distillation step raising the latter to 79.1/86.9.

The compute evidence is direct. In Table 3, ELMo has 180M parameters and takes 895 seconds for the STS-B CPU pass; BERT-base has 110M parameters and takes 668 seconds; DistilBERT has 66M parameters and takes 410 seconds. For mobile QA, excluding tokenization, DistilBERT is 71% faster than BERT-base on an iPhone 7 Plus, and the model is 207 MB before quantization. The ablation table supports the recipe: random initialization costs 3.69 GLUE macro-score points relative to triple loss with teacher initialization.
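The headline ratios can be recomputed from the figures quoted in this card. The short sketch below only does that arithmetic; the dictionary and variable names are illustrative.

```python
# Figures as quoted in this card from the paper's tables.
glue = {"BERT-base": 79.5, "DistilBERT": 77.0}
params_m = {"ELMo": 180, "BERT-base": 110, "DistilBERT": 66}
stsb_seconds = {"ELMo": 895, "BERT-base": 668, "DistilBERT": 410}

retained = glue["DistilBERT"] / glue["BERT-base"]
param_cut = 1 - params_m["DistilBERT"] / params_m["BERT-base"]
speedup = stsb_seconds["BERT-base"] / stsb_seconds["DistilBERT"]
time_saved = 1 - stsb_seconds["DistilBERT"] / stsb_seconds["BERT-base"]

print(f"GLUE score retained: {retained:.1%}")
print(f"Parameter reduction: {param_cut:.1%}")
print(f"STS-B CPU speedup:   {speedup:.2f}x")
print(f"STS-B time saved:    {time_saved:.1%}")
```

Note that "faster" is ambiguous: on these numbers DistilBERT completes about 1.63 times as many passes per unit time as BERT-base, while the wall-clock time for one pass falls by about 39%. The same caution applies when reading the 71% mobile figure.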

Historical Effect

DistilBERT helped turn pretrained-model compression into a default NLP engineering pattern. It applied distillation at the pretraining stage and tied it directly to deployment metrics: fewer parameters, lower CPU latency, lower mobile latency, and a model size that can plausibly ship to a phone. The compute lesson is that compression should target the dominant runtime dimension. For BERT-style encoders, depth is a sequential dependency chain, so halving layers buys latency in a way that small hidden-size changes may not.

Limits

The compression is substantial but not absolute. A 207 MB model can still be large for mobile distribution, and the authors note that quantization could reduce it further. The iPhone timing excludes tokenization, so end-to-end latency also depends on preprocessing and serving code. DistilBERT also gives up task accuracy: SQuAD drops 3.5 EM and 2.7 F1 points before task-specific distillation, and DistilBERT's RTE score is well below that of BERT-base.

The training hardware is also not negligible. Eight 16GB V100s for about 90 hours is much cheaper than some contemporary large pretraining runs, but it is still cluster training rather than commodity laptop training. DistilBERT should be read as an efficient deployment and reuse strategy, not as a claim that high-quality pretrained encoders can be produced without accelerator-backed pretraining.
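The scale of the gap can be made concrete as GPU-hours, using the figures quoted in the Compute Setup section. This is a back-of-the-envelope sketch: it treats "1 day" as 24 hours and ignores that the two runs used V100s with different memory sizes (16GB versus 32GB), so the ratio is indicative only.

```python
def gpu_hours(num_gpus, hours):
    """Total accelerator time, ignoring differences between GPU models."""
    return num_gpus * hours

distilbert = gpu_hours(8, 90)      # 8 x 16GB V100 for about 90 hours
roberta = gpu_hours(1024, 24)      # 1024 x 32GB V100 for 1 day, as quoted by the paper

print(distilbert, roberta, round(roberta / distilbert, 1))
```

On these numbers the DistilBERT run costs a few hundred GPU-hours against tens of thousands for RoBERTa, which supports both halves of the argument: far cheaper, yet still well beyond a single workstation.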