Efficient and edge inference

Deployment constraints dominate: latency, memory footprint, quantization precision, adapter size, and KV-cache pressure.

10 papers | Regime 9 of 10

Device/setup

Deployment-constrained CPUs, mobile SoCs, single GPUs, and memory-limited finetuning/serving environments.

Bottleneck

Latency, model size, bandwidth, VRAM/RAM footprint, and adaptation cost matter more than maximum training throughput.
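
To make the memory side of this bottleneck concrete, the following is a rough back-of-the-envelope sketch in Python. The model shape (a 7B-parameter decoder with 32 layers, 32 KV heads, and head dimension 128) and the bytes-per-parameter table are illustrative assumptions, not figures taken from any paper listed here, and the estimate ignores activations, quantization metadata, and runtime overhead.

```python
# Bytes per parameter for a few common weight formats (illustrative values).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

GIB = 1024 ** 3


def weight_bytes(n_params, dtype):
    """Memory needed to hold the weights alone, ignoring quantization metadata."""
    return n_params * BYTES_PER_PARAM[dtype]


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2.0):
    """KV-cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical 7B-parameter decoder (assumed shape, for illustration only).
n_params = 7_000_000_000
for dtype in ("fp16", "int8", "int4"):
    print(f"{dtype}: {weight_bytes(n_params, dtype) / GIB:.2f} GiB of weights")

cache = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096, batch=1)
print(f"KV cache at 4096 tokens (fp16): {cache / GIB:.2f} GiB")
```

Under these assumptions the fp16 KV cache for a single 4096-token sequence comes to 2 GiB, which is why context length and batch size can matter as much as weight precision on a memory-limited device.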

Methods that fit

Knowledge distillation (including DistilBERT) and compact architectures (SqueezeNet, MobileNets) make models smaller; LoRA and QLoRA make them cheaper to adapt; GPTQ and speculative decoding make them cheaper or faster to serve.
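
As one illustration of the first method in this list, here is a minimal sketch of a soft-target distillation loss in plain Python. It assumes the common formulation: a KL divergence between temperature-softened teacher and student distributions, scaled by T^2. This differs from a soft-target cross-entropy only by a term that is constant in the student. The function names and example logits are hypothetical; a real training setup would typically combine this with a hard-label loss inside a deep learning framework.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature; higher T gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
student = [2.0, 1.5, 0.5]
print(distillation_loss(student, teacher))  # positive: the student disagrees with the teacher
print(distillation_loss(teacher, teacher))  # zero when the distributions match
```

The temperature controls how much of the teacher's relative ranking among non-target classes is exposed to the student; the T^2 factor keeps gradient magnitudes comparable when the temperature changes.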

Methods that became obsolete or less central

Full-model finetuning and unconstrained dense serving become less central when deployment memory or latency is fixed.
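
A quick parameter count suggests why full-model finetuning loses ground under a fixed memory budget. The sketch below compares the trainable weights for updating one dense matrix in full against a low-rank adapter in the style of LoRA, where the update is factored as B @ A. The layer size (4096 x 4096) and rank (8) are assumed values chosen for illustration.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights for a rank-r update B @ A, with A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


# Assumed layer shape and rank, for illustration only.
d_in = d_out = 4096
rank = 8

full = full_finetune_params(d_in, d_out)
lora = lora_params(d_in, d_out, rank)
print(f"full: {full:,} trainable weights")
print(f"rank-{rank} adapter: {lora:,} trainable weights")
print(f"reduction: {full // lora}x")
```

For this single layer the adapter trains 65,536 weights instead of 16,777,216, a 256x reduction. Optimizer state and gradients shrink by the same factor, although the frozen base weights still have to fit in memory.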

Representative papers

Rank | Year | Paper | Priority | Status
103 | 2015 | Distilling the Knowledge in a Neural Network | 4 | downloaded / read_complete
104 | 2017 | MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications | 4 | downloaded / read_complete
105 | 2021 | LoRA: Low-Rank Adaptation of Large Language Models | 4 | downloaded / read_complete
106 | 2023 | QLoRA: Efficient Finetuning of Quantized LLMs | 4 | downloaded / read_complete
107 | 2023 | Fast Inference from Transformers via Speculative Decoding | 4 | downloaded / read_complete
108 | 2016 | SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size | 3 | downloaded / read_complete
109 | 2019 | DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter | 3 | downloaded / read_complete
110 | 2022 | GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | 3 | downloaded / read_complete
125 | 2025 | Gemma 3 Technical Report | 4 | downloaded / read_complete
126 | 2025 | BitNet b1.58 2B4T Technical Report | 4 | downloaded / read_complete

Open questions

  • Connect edge/mobile efficiency methods with modern LLM finetuning and decoding-time efficiency.

Papers in this compute regime: 10