Quantization
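English source file: quantization.md
Computational explanation
A deployment technique that relieves memory, bandwidth, and latency pressure by reducing numerical precision.
Supporting reading cards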
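To make "reducing numerical precision" concrete, here is a minimal sketch of symmetric round-to-nearest int8 quantization in plain Python. It is an illustration only, not the method of any reading card listed on this page; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: returns (integer codes, scale)."""
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate real values from integer codes."""
    return [c * scale for c in codes]


# Hypothetical sample weights, chosen only for illustration.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Worst-case reconstruction error; bounded by half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing each value as one int8 code plus a shared scale takes roughly a quarter of the memory of 32-bit floats, at the cost of a rounding error of at most half a step (`scale / 2`) per value. Small values such as `0.003` collapse to code `0`, which is the kind of precision loss that more elaborate schemes try to limit.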
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017, efficient_edge_inference)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022, efficient_edge_inference)
- QLoRA: Efficient Finetuning of Quantized LLMs (2023, efficient_edge_inference)
- Gemma 3 Technical Report (2025, efficient_edge_inference)
- BitNet b1.58 2B4T Technical Report (2025, efficient_edge_inference)
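Content that is outdated or secondary under later computing paradigms
Tracked only through the linked reading cards; this method page is not treated as an independent source of evidence.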