Quantization
Compute interpretation
Deployment adaptation that lowers the numerical precision of model weights (and often activations) to reduce memory, bandwidth, and latency pressure.
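A minimal sketch of the idea, assuming symmetric per-tensor rounding to 8-bit integer codes. This is one simple scheme among many and is not the specific method of any reading card listed on this page. The function names `quantize_int8` and `dequantize` and the sample weights are hypothetical, chosen only for illustration.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] using one shared scale.

    Assumption: symmetric per-tensor quantization, the simplest variant.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero tensor, where any scale works.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


# Hypothetical weights, for illustration only.
weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Each restored value differs from the original by at most half a step.
max_error = max(abs(r - w) for r, w in zip(restored, weights))
print(codes, scale, max_error)
```

Storing each code in one byte instead of a four-byte 32-bit float would cut weight storage by roughly 4x in this sketch, at the cost of a rounding error of at most half the scale per value.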
Supporting reading cards
- MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017, efficient_edge_inference)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022, efficient_edge_inference)
- QLoRA: Efficient Finetuning of Quantized LLMs (2023, efficient_edge_inference)
- Gemma 3 Technical Report (2025, efficient_edge_inference)
- BitNet b1.58 2B4T Technical Report (2025, efficient_edge_inference)
Obsolete or less central under later compute
Track this only through linked reading cards; do not treat this method page as standalone evidence.