Quantization

Compute interpretation

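A deployment adaptation that lowers the numerical precision of model weights and activations, for example from 16-bit floating point to 8-bit or 4-bit integers, to reduce memory footprint, memory bandwidth, and inference latency.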
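To make "lowering numerical precision" concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in plain Python. It is an illustration only, not the scheme used by any of the reading cards below; the function names `quantize_int8` and `dequantize` and the sample weights are hypothetical.

```python
def quantize_int8(values):
    """Map floats to integer codes in [-127, 127] plus one shared scale.

    Symmetric per-tensor scheme, assumed here purely for illustration.
    """
    max_abs = max(abs(v) for v in values)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the int8 range used here.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate floats from integer codes and the shared scale."""
    return [c * scale for c in codes]


# Hypothetical weights chosen for illustration.
weights = [0.8, -1.27, 0.003, 0.41]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

print(codes)     # integer codes, one byte each when stored as int8
print(restored)  # approximations of the original weights
```

Each code fits in one byte instead of the four bytes of a 32-bit float, which is where the memory and bandwidth savings come from. The cost is rounding error of at most half the scale per value, so small entries such as 0.003 can collapse to zero.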

Supporting reading cards

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications (2017, efficient_edge_inference)
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers (2022, efficient_edge_inference)
QLoRA: Efficient Finetuning of Quantized LLMs (2023, efficient_edge_inference)
Gemma 3 Technical Report (2025, efficient_edge_inference)
BitNet b1.58 2B4T Technical Report (2025, efficient_edge_inference)

Obsolete or less central under later compute

Track this only through linked reading cards; do not treat this method page as standalone evidence.