Efficient and edge inference
Deployment constraints dominate: latency, memory footprint, weight precision (quantization), adapter size, and KV-cache pressure.
10 papers
Regime 9 of 10
Device/setup
Deployment-constrained CPUs, mobile SoCs, single GPUs, and memory-limited finetuning/serving environments.
Bottleneck
Latency, model size, bandwidth, VRAM/RAM footprint, and adaptation cost matter more than maximum training throughput.
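The footprint terms above can be made concrete with back-of-the-envelope arithmetic. The sketch below is only an estimate under stated assumptions: the 7B-parameter model shape (32 layers, 32 KV heads, head dimension 128) is a hypothetical example, not taken from any paper in this list, and the weight estimate ignores the per-group scales and zero points that real quantized formats store.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store the weights alone at a given precision."""
    return num_params * bits_per_weight // 8


def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """Bytes held by the key/value cache during autoregressive decoding."""
    # Leading factor 2: one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical 7B-parameter decoder (assumed shape, for illustration only).
PARAMS = 7_000_000_000
LAYERS, KV_HEADS, HEAD_DIM = 32, 32, 128

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_bytes(PARAMS, bits) / 1e9:.1f} GB")

cache = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, seq_len=4096, batch=1)
print(f"KV cache, 4096 tokens, batch 1: {cache / 1024**3:.1f} GiB")
```

Under these assumptions, 4-bit weights take 3.5 GB instead of 14 GB at 16 bits, while a single 4,096-token sequence still needs 2 GiB of 16-bit KV cache. That is why cache pressure remains a bottleneck even after the weights are quantized.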
Methods that fit
Knowledge distillation, SqueezeNet, MobileNets, and DistilBERT make models smaller; LoRA and QLoRA make them cheaper to adapt; GPTQ and speculative decoding make them cheaper to serve.
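As one illustration of these methods, knowledge distillation trains a small student to match a teacher's temperature-softened output distribution. The following is a minimal sketch in plain Python, not the reference implementation from any listed paper. It uses the KL divergence between softened distributions, which differs from the soft-target cross-entropy only by a term that is constant in the student. The function names and the default temperature of 2.0 are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return temperature ** 2 * kl


teacher = [4.0, 1.0, 0.5]       # made-up logits for a 3-class example
good_student = [3.8, 1.1, 0.4]
poor_student = [0.5, 4.0, 1.0]
print(distillation_loss(good_student, teacher))
print(distillation_loss(poor_student, teacher))
```

In practice this term is usually combined with the ordinary hard-label loss; the weighting between the two is a tuning choice and is left out here.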
Methods that became obsolete or less central
Full-model finetuning and unconstrained dense serving become less central when deployment memory or latency is fixed.
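A rough parameter count shows why full-model finetuning loses ground under a fixed memory budget. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors, B (d_out × r) and A (r × d_in), so the trainable count per adapted matrix is r · (d_in + d_out). The sketch below assumes a hypothetical 4096 × 4096 projection and rank 8; both values are illustrative, not taken from the papers listed here.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for low-rank factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, chosen only for illustration.
D_OUT, D_IN, RANK = 4096, 4096, 8

full = full_finetune_params(D_OUT, D_IN)
adapter = lora_params(D_OUT, D_IN, RANK)
print(f"full: {full:,}  adapter: {adapter:,}  reduction: {full // adapter}x")
```

For this assumed shape the adapter trains 65,536 parameters against 16,777,216 for the full matrix, a 256-fold reduction. Gradients and optimizer state shrink in the same proportion, since they are kept only for trainable parameters.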
Representative papers
| Rank | Year | Paper | Priority | Status |
|---|---|---|---|---|
| 103 | 2015 | Distilling the Knowledge in a Neural Network | 4 | downloaded / read_complete |
| 104 | 2017 | MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications | 4 | downloaded / read_complete |
| 105 | 2021 | LoRA: Low-Rank Adaptation of Large Language Models | 4 | downloaded / read_complete |
| 106 | 2023 | QLoRA: Efficient Finetuning of Quantized LLMs | 4 | downloaded / read_complete |
| 107 | 2023 | Fast Inference from Transformers via Speculative Decoding | 4 | downloaded / read_complete |
| 108 | 2016 | SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size | 3 | downloaded / read_complete |
| 109 | 2019 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 3 | downloaded / read_complete |
| 110 | 2022 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | 3 | downloaded / read_complete |
| 125 | 2025 | Gemma 3 Technical Report | 4 | downloaded / read_complete |
| 126 | 2025 | BitNet b1.58 2B4T Technical Report | 4 | downloaded / read_complete |
Open questions
- How do edge/mobile efficiency methods connect to modern LLM finetuning and decoding-time efficiency?