Distillation

Compute interpretation

A train-expensive, deploy-compact pattern: compute is spent up front on large teachers or ensembles, and their behavior is then transferred into smaller students that are cheaper to deploy.
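As an illustration of the teacher-to-student transfer, here is a minimal sketch of a soft-target distillation loss in the spirit of the first reading card below (Distilling the Knowledge in a Neural Network, 2015). The function names, the `alpha` weighting, and the default `temperature` are assumptions chosen for illustration, not values taken from any of the listed cards. The sketch uses only the Python standard library and operates on a single example.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term.

    alpha and temperature are illustrative hyperparameters (assumed).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)

    # Soft term: KL(teacher || student) on temperature-softened distributions.
    soft = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )

    # Hard term: cross-entropy against the ground-truth label at temperature 1.
    hard = -math.log(softmax(student_logits)[label])

    # Scaling the soft term by T^2 keeps its gradient magnitude comparable
    # to the hard term as the temperature changes.
    return alpha * (temperature ** 2) * soft + (1 - alpha) * hard


# Usage: a student that matches the teacher incurs a smaller loss than one
# that ranks the classes in the opposite order.
teacher = [1.0, 2.0, 3.0]
matched = distillation_loss([1.0, 2.0, 3.0], teacher, label=2)
reversed_ = distillation_loss([3.0, 2.0, 1.0], teacher, label=2)
print(matched < reversed_)
```

A higher temperature flattens the teacher distribution so that the relative probabilities of the non-target classes carry more of the training signal. In practice the teacher logits would come from a frozen large model or an averaged ensemble, and the loss would be averaged over a batch.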

Supporting reading cards

Distilling the Knowledge in a Neural Network (2015, efficient_edge_inference)
SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size (2016, efficient_edge_inference)
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter (2019, efficient_edge_inference)
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (2025, inference_time_compute_post_training)
Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025, inference_time_compute_post_training)
s1: Simple test-time scaling (2025, inference_time_compute_post_training)
Qwen3 Technical Report (2025, hyperscale_dense_llm_training)
Gemma 3 Technical Report (2025, efficient_edge_inference)

Obsolete or less central under later compute

Track this only through linked reading cards; do not treat this method page as standalone evidence.