Distillation
Compute interpretation
Train-expensive, deploy-compact pattern: spend compute on a large teacher model or ensemble, then transfer its behavior into a cheaper student that is deployed in its place.
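One common way to realize this transfer is the soft-target loss described in "Distilling the Knowledge in a Neural Network": the student is trained to match the teacher's temperature-softened output distribution. The sketch below is illustrative only. The function names, the default temperature of 2.0, and the example logits are assumptions, not taken from this page or from any specific codebase.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of a temperature-scaled softmax (numerically stable)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_sum_exp for s in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable across temperatures.
    The default temperature here is an arbitrary illustrative choice.
    """
    log_p_teacher = log_softmax(teacher_logits, temperature)
    log_p_student = log_softmax(student_logits, temperature)
    kl = sum(
        math.exp(lt) * (lt - ls)
        for lt, ls in zip(log_p_teacher, log_p_student)
    )
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [1.0, 2.0, 3.0]
matching_student = [1.0, 2.0, 3.0]
mismatched_student = [3.0, 2.0, 1.0]

print(distillation_loss(matching_student, teacher))    # zero when the distributions agree
print(distillation_loss(mismatched_student, teacher))  # positive when they differ
```

In practice this term is usually mixed with an ordinary hard-label loss, and the reasoning-model reports listed under the reading cards are generally described as distilling by fine-tuning students on teacher-generated outputs rather than by matching logits, so treat this as one instance of the pattern rather than the only one.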
Supporting reading cards
- Distilling the Knowledge in a Neural Network (2015, efficient_edge_inference)
- SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (2016, efficient_edge_inference)
- DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019, efficient_edge_inference)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
- Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
- s1: Simple test-time scaling (2025, inference_time_compute_post_training)
- Qwen3 Technical Report (2025, hyperscale_dense_llm_training)
- Gemma 3 Technical Report (2025, efficient_edge_inference)
Obsolete or less central under later compute
Track this only through linked reading cards; do not treat this method page as standalone evidence.