GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Metadata

  • Slug: gptq_2022
  • Year: 2022
  • Venue: ICLR
  • Authors: Elias Frantar, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
  • Reading status: read complete
  • Compute regime: Efficient and edge inference (efficient_edge_inference)
  • Primary sources: PDF, extracted text
  • Reading card created: 2026-06-15

Compute Setup

The paper explicitly names its calibration and inference hardware. All models, including OPT-175B and BLOOM-176B, are quantized on a single NVIDIA A100 80GB. Inference latency is reported on NVIDIA A100 80GB and NVIDIA A6000 48GB GPUs. The baseline memory problem is stark: the FP16 weights of a 175B model occupy about 326GB, which exceeds the capacity of even the largest single GPUs and forces multi-GPU inference when the model is left uncompressed.
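The 326GB figure can be checked with a quick calculation. This is a minimal sketch that assumes a round count of 175e9 parameters and reads "GB" as 2**30 bytes; the helper name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(n_params: float, bits_per_weight: float) -> float:
    """Weight storage in GiB (2**30 bytes) for a dense model."""
    return n_params * bits_per_weight / 8 / 2**30

# Assumption: exactly 175e9 parameters, all stored at 16 bits.
fp16_gib = weight_memory_gib(175e9, 16)
print(f"FP16 weights: {fp16_gib:.0f} GiB")
```

The result covers weights only; activations and KV cache come on top of it.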

The calibration dataset is tiny relative to pretraining: 128 random 2048-token C4 segments. To keep memory below that of full-precision inference, the implementation loads one Transformer block (6 layers) at a time, accumulates the Hessians of those layers, quantizes them, then runs the quantized block to produce the inputs for the next block. The key scale result is that 175B models are quantized in about four GPU hours on one A100.
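The block-by-block loop described above can be sketched on a toy model. This is an illustration under stated assumptions, not the paper's implementation: a "block" here is a chain of square linear layers with ReLU, the names rtn, run_block, and calibrate are hypothetical, and plain round-to-nearest stands in for the real quantizer (it ignores the Hessian it is given).

```python
import numpy as np

def rtn(W, bits=4):
    """Stand-in quantizer: round each row to a uniform min-max grid."""
    qmax = 2**bits - 1
    wmin = W.min(axis=1, keepdims=True)
    scale = (W.max(axis=1, keepdims=True) - wmin) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard constant rows
    return np.clip(np.round((W - wmin) / scale), 0, qmax) * scale + wmin

def run_block(weights, X):
    """Toy block: linear layers with ReLU. X has shape (features, samples)."""
    for W in weights:
        X = np.maximum(W @ X, 0.0)
    return X

def calibrate(blocks, X, quantize):
    """Quantize one block at a time, feeding quantized outputs forward."""
    quantized = []
    for weights in blocks:                      # only this block must be resident
        hessians, inp = [], X
        for W in weights:
            hessians.append(2.0 * inp @ inp.T)  # H = 2 X X^T from this layer's inputs
            inp = np.maximum(W @ inp, 0.0)
        q_weights = [quantize(W, H) for W, H in zip(weights, hessians)]
        X = run_block(q_weights, X)             # next block sees quantized activations
        quantized.append(q_weights)
    return quantized

rng = np.random.default_rng(0)
blocks = [[rng.standard_normal((8, 8)) for _ in range(6)] for _ in range(3)]
X_calib = rng.standard_normal((8, 16))
q_blocks = calibrate(blocks, X_calib, lambda W, H: rtn(W))
```

The point of the structure is that peak memory is set by one block plus the calibration activations, not by the whole model.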

Bottleneck

GPTQ's bottleneck is large-LM inference memory, especially for low-batch autoregressive generation. FP16 OPT-175B needs about 326GB for weights before KV cache, so serving requires several GPUs purely for capacity. Once the model is distributed, single-token generation at batch size 1 is dominated by matrix-vector products, not large matrix-matrix products. The paper says these operations are primarily limited by memory bandwidth.

Lower-bit storage becomes a latency win only if the kernel actually moves fewer bytes during the matrix-vector product. A method that stores 8-bit weights but expands them into a bandwidth-heavy representation can reduce GPU count without reducing token latency. GPTQ therefore treats quantization quality and kernel layout as one systems problem: preserve perplexity at 3-4 bits and make generation bandwidth-bound on compressed weights rather than FP16 weights.
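A back-of-the-envelope bound makes the bandwidth argument concrete. The sketch below estimates only the time to stream every weight once per generated token. The bandwidth value is an assumed placeholder for an A100-class GPU, not a figure from the paper, and the helper name weight_read_ms is hypothetical. It also assumes that sharded layers execute one after another, so adding GPUs does not shorten the per-token weight read.

```python
def weight_read_ms(n_params: float, bits_per_weight: float, bandwidth_gb_s: float) -> float:
    """Time to stream every weight once from GPU memory, in milliseconds."""
    bytes_moved = n_params * bits_per_weight / 8
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

N_PARAMS = 175e9          # assumed round parameter count
BANDWIDTH_GB_S = 2000.0   # assumed placeholder bandwidth, not a paper figure

fp16_ms = weight_read_ms(N_PARAMS, 16, BANDWIDTH_GB_S)
int3_ms = weight_read_ms(N_PARAMS, 3, BANDWIDTH_GB_S)
ratio = fp16_ms / int3_ms
print(f"FP16 floor: {fp16_ms:.1f} ms/token, 3-bit floor: {int3_ms:.1f} ms/token")
print(f"ideal ratio: {ratio:.2f}x")
```

These are floors on weight traffic alone. Real kernels also pay for dequantization, activations, and KV-cache reads, so measured latencies sit above them and measured speedups below the ideal 16/3 ratio.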

Method Adaptation

Algorithmically, GPTQ is a one-shot post-training weight quantizer that uses approximate second-order information. It scales Optimal Brain Quantization (OBQ) ideas to Transformer matrices by quantizing columns in a single fixed, arbitrary order instead of OBQ's greedy per-row order, and by updating the remaining weights after each column to compensate for the quantization error.

The compute adaptations are the important part. First, arbitrary order lets all rows be quantized in the same order, so the same inverse-Hessian information can be reused across rows rather than recomputed per row. Second, lazy batch updates address the low compute-to-memory-access ratio of direct OBQ-style updates. The paper quantizes blocks of B=128 columns at a time and defers global matrix updates until the block is complete. This does not reduce theoretical compute, but it improves GPU utilization and gives an order-of-magnitude speedup on very large models by avoiding memory-throughput stalls.

Third, a Cholesky reformulation stabilizes the inverse-Hessian updates at billion-parameter scale. The paper reports that, under the direct update path, numerical indefiniteness almost certainly occurs in some layers for models larger than a few billion parameters. Precomputing the needed inverse-Hessian rows with Cholesky kernels, combined with mild dampening, is robust enough for huge models and is also faster.
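The three adaptations above (shared column order, lazy block updates, Cholesky-based inverse Hessian) can be put together in a small NumPy sketch. It is an illustration under stated assumptions rather than the authors' code: the function names, the per-row min-max grid, the toy block size of 16 (the paper uses B=128), and the 1% damping factor are all choices made for this example.

```python
import numpy as np

def row_grid(W, bits):
    """Per-row asymmetric min-max grid: returns (scale, wmin, qmax)."""
    qmax = 2**bits - 1
    wmin = W.min(axis=1)
    scale = (W.max(axis=1) - wmin) / qmax
    return np.where(scale == 0, 1.0, scale), wmin, qmax

def to_grid(w, scale, wmin, qmax):
    """Round to the nearest grid point and return dequantized values."""
    return np.clip(np.round((w - wmin) / scale), 0, qmax) * scale + wmin

def rtn(W, bits=4):
    """Round-to-nearest baseline on the same grid."""
    scale, wmin, qmax = row_grid(W, bits)
    return to_grid(W, scale[:, None], wmin[:, None], qmax)

def gptq_quantize(W, H, bits=4, blocksize=128, damp=0.01):
    """Quantize columns left to right, compensating in not-yet-quantized columns."""
    W = W.astype(np.float64).copy()
    n_rows, n_cols = W.shape
    scale, wmin, qmax = row_grid(W, bits)
    # Dampening: add a small multiple of the mean diagonal before inverting.
    H = H + damp * np.mean(np.diag(H)) * np.eye(n_cols)
    # Upper Cholesky factor U of H^-1 (H^-1 = U^T U), shared by all rows.
    U = np.linalg.cholesky(np.linalg.inv(H)).T
    Q = np.zeros_like(W)
    for i1 in range(0, n_cols, blocksize):
        i2 = min(i1 + blocksize, n_cols)
        Err = np.zeros((n_rows, i2 - i1))
        for j in range(i1, i2):
            w = W[:, j].copy()
            q = to_grid(w, scale, wmin, qmax)
            Q[:, j] = q
            err = (w - q) / U[j, j]
            W[:, j:i2] -= np.outer(err, U[j, j:i2])  # update inside the block only
            Err[:, j - i1] = err
        W[:, i2:] -= Err @ U[i1:i2, i2:]             # lazy update of later columns
    return Q

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
X = rng.standard_normal((64, 64)) @ rng.standard_normal((64, 256))  # correlated inputs
Q = gptq_quantize(W, 2.0 * X @ X.T, bits=4, blocksize=16)
err_gptq = np.sum(((W - Q) @ X) ** 2)
err_rtn = np.sum(((W - rtn(W)) @ X) ** 2)
print(f"layer output error, compensated: {err_gptq:.3e}, RTN: {err_rtn:.3e}")
```

With an identity Hessian the compensation terms vanish and the routine reduces to round-to-nearest, which is a convenient sanity check. On correlated inputs the compensated error would be expected to come out lower than RTN, but this toy says nothing about the paper's reported numbers.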

Finally, the inference adaptation is a bespoke quantized-matrix/full-precision-vector GPU kernel. Activations remain full precision. The kernel dynamically dequantizes weights as needed but reads far fewer bytes from memory, which is exactly the right trade for batch-1 generative inference.
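The data-movement idea behind such a kernel can be imitated in NumPy. This sketch is not the paper's CUDA kernel: it uses 4-bit codes because two of them pack evenly into one byte (the paper's latency results use 3-bit), and the names pack_int4 and matvec_packed, along with the per-row scale and zero point, are assumptions for illustration.

```python
import numpy as np

def pack_int4(q):
    """Pack pairs of 4-bit codes (0..15) along each row into single bytes."""
    q = q.astype(np.uint8)
    return q[:, 0::2] | (q[:, 1::2] << 4)

def matvec_packed(packed, scale, zero, x):
    """y = dequant(W) @ x, unpacking codes on the fly; x stays full precision."""
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.float32)
    q[:, 0::2] = packed & 0x0F   # low nibble holds the even column
    q[:, 1::2] = packed >> 4     # high nibble holds the odd column
    return (scale * (q - zero)) @ x

rng = np.random.default_rng(0)
codes = rng.integers(0, 16, size=(4, 8))
scale = rng.random((4, 1)).astype(np.float32) + 0.1
zero = np.full((4, 1), 8.0, dtype=np.float32)
x = rng.standard_normal(8).astype(np.float32)

packed = pack_int4(codes)
y = matvec_packed(packed, scale, zero, x)
y_ref = (scale * (codes.astype(np.float32) - zero)) @ x
```

Only the packed bytes plus one scale and zero point per row need to be read from memory. The dequantization arithmetic is extra compute, which is cheap relative to memory traffic in the batch-1 regime the section describes.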

Evidence

The calibration evidence is direct: OPT-175B and BLOOM-176B can be quantized in about four GPU hours on one A100 80GB using only 128 random 2048-token C4 samples. The runtime table reports 1-3B models in minutes and 175B models in a few hours. The paper contrasts this with ZeroQuant-LKD taking about 3 hours for a 1.3B model on the same hardware.

Memory evidence is the clearest practical result. A 3-bit OPT-175B model requires about 63GB, including the embeddings and output layer, which are kept in FP16. The KV cache for a maximum 2048-token history adds about 9GB. The total fits on a single 80GB A100, whereas FP16 execution requires 5 A100 80GB GPUs and LLM.int8() requires 3. The paper notes that the 4-bit model would not fully fit on one A100, which is why the 3-bit result is the capacity breakthrough.
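The fit argument is simple arithmetic and can be checked directly. The 63GB and 9GB values are the paper's figures quoted above. The 175e9 parameter count, the reading of "GB" as 2**30 bytes, and the helper name raw_weights_gib are assumptions of this sketch.

```python
import math

GIB = 2**30
A100_GIB = 80         # capacity of one A100 80GB, treated as GiB
MODEL_3BIT_GIB = 63   # paper figure, includes FP16 embeddings and output layer
KV_CACHE_GIB = 9      # paper figure for a 2048-token history

def raw_weights_gib(n_params: float, bits: float) -> float:
    """Storage for the weights alone, ignoring embeddings and cache."""
    return n_params * bits / 8 / GIB

total_3bit = MODEL_3BIT_GIB + KV_CACHE_GIB   # 72, under the 80 available
raw_4bit = raw_weights_gib(175e9, 4)         # already above 80 before any cache
fp16_gpus = math.ceil(raw_weights_gib(175e9, 16) / A100_GIB)

print(f"3-bit + KV cache: {total_3bit} GiB of {A100_GIB}")
print(f"4-bit weights alone: {raw_4bit:.1f} GiB")
print(f"A100s needed for FP16 weights: {fp16_gpus}")
```

Under these assumptions the 4-bit weights exceed one A100 even before the KV cache is added, which matches the paper's remark that only the 3-bit model fits.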

Latency evidence comes from Table 6 at batch size 1 and generated sequence length 128. On A100 80GB, FP16 OPT-175B takes 230 ms per token on 5 GPUs, while the 3-bit GPTQ kernel takes 71 ms on 1 GPU, a 3.24x speedup and a 5-to-1 GPU reduction. On A6000 48GB, FP16 takes 589 ms on 8 GPUs, while the 3-bit kernel takes 130 ms on 2 GPUs, a 4.53x speedup and an 8-to-2 GPU reduction. The paper emphasizes that communication costs are negligible in this HuggingFace-accelerate-like setup and that almost all speedup is due to the compressed matrix-vector kernels.

Accuracy evidence supports the claim that low-bit weights remain useful. On OPT-175B WikiText2, 4-bit GPTQ loses only 0.03 perplexity versus FP16, while round-to-nearest loses 2.2 points. At 3-bit, round-to-nearest collapses on OPT-175B, while GPTQ maintains usable perplexity, especially with group size 128.

Historical Effect

GPTQ made post-training low-bit quantization a credible serving tool for very large open language models. Its historical effect is less "quantization exists" than "3-bit weights can move a 175B dense model from a multi-GPU capacity problem to a single-A100 or two-A6000 inference problem." Memory bandwidth and kernel design became first-class serving concerns.

It also clarified that the target workload matters. GPTQ's custom kernels are designed for low-batch autoregressive generation where weight reads dominate. The paper explicitly warns that large-batch or non-generative workloads may be compute-bound instead, in which case the right deployment strategy can differ.

Limits

  • The speedup is primarily from reduced memory movement, not faster arithmetic.
  • Activations are not quantized.
  • The custom kernel results target batch-size-1 generation; large-batch workloads may be compute-bound and less directly helped.
  • The single-A100 result relies on 3-bit weights; 4-bit OPT-175B would not fully fit on one A100 alongside the KV cache.
  • Quality varies by model and bit width; RTN failures show that the quantization method matters, not just the number of bits.

Links