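Efficient Inference and Edge Deployment

Deployment constraints become the primary bottleneck: latency, memory footprint, quantization, adapter size, and KV-cache pressure.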

10 papers. Paradigm 9 of 10 computing paradigms.

Efficient and edge inference

English source file: README.md

Device / setting

Deployment-constrained CPUs, mobile SoCs, single GPUs, and memory-constrained fine-tuning and serving environments.

Bottleneck

Latency, model size, bandwidth, VRAM/RAM footprint, and adaptation cost matter more than peak training throughput.
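
To make the memory side of this bottleneck concrete, here is a minimal back-of-the-envelope sketch. It assumes a standard decoder-only transformer that caches one key and one value tensor per layer, and it ignores activations, quantization metadata (scales, zero points), and runtime overhead. The function names and the 7B-parameter model shape are hypothetical and chosen only for illustration.

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store the weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for cached keys and values across all layers."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


if __name__ == "__main__":
    # Hypothetical 7B-parameter decoder; the shape is illustrative only.
    n_params = 7_000_000_000
    for bits in (16, 8, 4):
        gb = weight_bytes(n_params, bits) / 1e9
        print(f"{bits:>2}-bit weights: {gb:.1f} GB")

    cache = kv_cache_bytes(n_layers=32, n_kv_heads=32,
                           head_dim=128, seq_len=4096)
    print(f"KV cache at 4096 tokens: {cache / 2**30:.1f} GiB")
```

With these assumed shapes, the 16-bit KV cache for a single 4096-token sequence is 2 GiB, which is why cache pressure can rival the cost of 4-bit weights on a memory-constrained device.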

Methods that fit

Knowledge distillation, SqueezeNet/MobileNet, DistilBERT, LoRA, QLoRA, GPTQ, and speculative decoding make models smaller, cheaper to adapt, or cheaper to serve.
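
As one illustration of "cheaper to adapt", the sketch below counts trainable parameters for a LoRA-style update. It assumes the usual formulation in which a frozen weight matrix of shape `d_out x d_in` receives a low-rank update `B @ A`, with `B` of shape `d_out x r` and `A` of shape `r x d_in`. The helper names, the 4096 x 4096 layer shape, and the rank are hypothetical examples, not values taken from any listed paper.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole weight matrix is fine-tuned."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the low-rank factors A (r x d_in) and B (d_out x r)."""
    return rank * (d_in + d_out)


def adapter_bytes(shapes: list[tuple[int, int]], rank: int,
                  bytes_per_param: int = 2) -> int:
    """Storage for one adapter covering every (d_in, d_out) matrix in shapes."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes) * bytes_per_param


if __name__ == "__main__":
    # Hypothetical square projection matrix and rank, for illustration only.
    d_in, d_out, rank = 4096, 4096, 8
    full = full_params(d_in, d_out)
    lora = lora_params(d_in, d_out, rank)
    print(f"full fine-tuning: {full:,} params")
    print(f"LoRA (r={rank}): {lora:,} params ({lora / full:.4%} of full)")
```

Under these assumptions a rank-8 adapter trains 1/256 of the parameters of the matrix it modifies, so adapter size rather than base-model size determines the per-task storage cost.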

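Methods that became obsolete or less central

When deployment memory or latency is fixed, full-model fine-tuning and unconstrained dense serving become less important.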

Representative papers

Rank	Year	Paper	Priority	Status
103	2015	Distilling the Knowledge in a Neural Network	4	downloaded / read_complete
104	2017	MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications	4	downloaded / read_complete
105	2021	LoRA: Low-Rank Adaptation of Large Language Models	4	downloaded / read_complete
106	2023	QLoRA: Efficient Finetuning of Quantized LLMs	4	downloaded / read_complete
107	2023	Fast Inference from Transformers via Speculative Decoding	4	downloaded / read_complete
108	2016	SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size	3	downloaded / read_complete
109	2019	DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter	3	downloaded / read_complete
110	2022	GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers	3	downloaded / read_complete
125	2025	Gemma 3 Technical Report	4	downloaded / read_complete
126	2025	BitNet b1.58 2B4T Technical Report	4	downloaded / read_complete

Open questions

Connect edge/mobile efficiency methods with modern LLM fine-tuning and decoding-efficiency directions.

Related papers (10)

2015 Distilling the Knowledge in a Neural Network

2017 MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

2021 LoRA: Low-Rank Adaptation of Large Language Models

2023 QLoRA: Efficient Finetuning of Quantized LLMs

2023 Fast Inference from Transformers via Speculative Decoding

2016 SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

2019 DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

2022 GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

2025 Gemma 3 Technical Report

2025 BitNet b1.58 2B4T Technical Report