Gemma 3 Technical Report

Compute Setup

Gemma 3 states its training hardware directly. Table 2 lists TPUv5e for the 1B model on 512 chips, TPUv5e for the 4B model on 2048 chips, TPUv4 for the 12B model on 6144 chips, and TPUv5p for the 27B model on 6144 chips. The table also exposes the sharding shape: data, sequence, and replica shards are tuned per model, with the 27B run using data 24, sequence 8, and replica 32. The paper says each configuration is optimized to minimize training step time.
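The 27B sharding figures can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes the three shard axes multiply to the total chip count (an interpretation, not something stated above); the names `shards_27b` and `total_devices` are illustrative only.

```python
# Shard counts per axis for the 27B run, as quoted from Table 2 above.
shards_27b = {"data": 24, "sequence": 8, "replica": 32}
chips_27b = 6144  # chip count quoted for the 27B model

def total_devices(shards):
    """Multiply the per-axis shard counts to get the implied device count."""
    total = 1
    for count in shards.values():
        total *= count
    return total

# Assumption: data x sequence x replica shards should cover every chip once.
print(total_devices(shards_27b), chips_27b)
```

Under that assumption the product matches the quoted chip count, which suggests the three axes fully tile the 27B pod allocation.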

The system stack is Google's TPU-pod training setup: optimizer state sharded with an implementation of ZeRO-3, multi-pod data-replica reduction using the Pathways approach, the JAX and Pathways single-controller programming model, the GSPMD partitioner, and the MegaScale XLA compiler. For multimodal training, SigLIP image embeddings are precomputed and the frozen vision encoder is shared across the 4B, 12B, and 27B models, so image processing adds no cost to language-model training. Deployment is also part of the setup: the models target consumer-grade hardware such as phones, laptops, and high-end GPUs.

Bottleneck

The dominant bottleneck is long-context inference memory. Gemma 3 aims for 128K context, but the report says KV-cache memory tends to explode with long context. This matters because the family is explicitly lightweight and deployable, not just a TPU training exercise. Table 3 shows how large the problem becomes at a context of 32,768 tokens: Gemma 3 27B needs 54.0 GB for bf16 weights alone and 72.7 GB once the KV cache is included; even the 12B model rises from 24.0 GB to 38.9 GB.

The multimodal path creates a second deployment bottleneck. A fixed 896 x 896 SigLIP encoder is efficient and simple, but non-square or high-resolution images can lose small text and objects. Pan and Scan helps quality by using crops, but it is an inference-time-only optimization that can be disabled for speed. The overall design is therefore constrained by TPU training throughput, inference memory, local-device RAM/VRAM, and image prefill cost.

Method Adaptation

Gemma 3 adapts its architecture to long-context consumer deployment rather than making every layer global. It interleaves five local sliding-window attention layers for every global layer. Local layers use a 1024-token span, while global layers attend over the full context. The technical value is not only KV-cache reduction: short-range language modeling stays dense while long-range retrieval across the whole context remains possible. The paper raises the RoPE base frequency on global layers from 10k to 1M, keeps local layers at 10k, starts pretraining with 32K sequences, and scales the 4B, 12B, and 27B models to 128K at the end of pretraining.
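The interleaving rule and the two attention patterns can be sketched in a few lines. This is an illustrative sketch, not the report's implementation: the function names, the 12-layer example, and the exact window convention (a token counts itself as part of its 1024-token span) are assumptions.

```python
def layer_types(num_layers, locals_per_global=5):
    """Label each layer: five local layers, then one global layer, repeated."""
    period = locals_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local"
            for i in range(num_layers)]

def can_attend(query_pos, key_pos, layer_type, window=1024):
    """Causal attention rule for one query/key pair.

    Global layers see every earlier token; local layers see only the
    most recent `window` positions (assumed to include the query itself).
    """
    if key_pos > query_pos:
        return False  # causal: never attend to future tokens
    if layer_type == "global":
        return True
    return query_pos - key_pos < window

# Hypothetical 12-layer stack, chosen only to show two full 5:1 periods.
print(layer_types(12))
print(can_attend(2000, 0, "local"), can_attend(2000, 0, "global"))
```

In a local layer, the token at position 2000 cannot reach position 0, but the next global layer in the stack can, which is how distant information still propagates.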

The same compute logic governs vision. A 400M SigLIP encoder turns each image into 256 soft tokens. Higher-resolution encoders use average pooling to keep this token count fixed, preventing the vision path from expanding sequence length without bound. Pan and Scan spends extra inference work only when aspect ratio or text readability demands it; technically, it improves document and OCR-style visual understanding rather than merely saving memory.
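The fixed token budget can be illustrated with simple grid arithmetic. The sketch below assumes a 14-pixel patch size (not stated above), so an 896 x 896 image yields a 64 x 64 patch grid that is average-pooled down to 16 x 16 = 256 positions. Each patch carries a single scalar here purely for illustration; real encoder features are vectors, and all function names are hypothetical.

```python
def patch_grid_side(image_size=896, patch_size=14):
    """Patches per side; patch_size=14 is an assumption, not from the text."""
    return image_size // patch_size

def avg_pool_grid(grid, out_side):
    """Average-pool a square grid (list of rows) down to out_side x out_side."""
    k = len(grid) // out_side  # pooling kernel width and stride
    pooled = []
    for r in range(out_side):
        row = []
        for c in range(out_side):
            block = [grid[r * k + i][c * k + j]
                     for i in range(k) for j in range(k)]
            row.append(sum(block) / len(block))
        pooled.append(row)
    return pooled

side = patch_grid_side()
# Toy per-patch features: one float per patch, numbered in row-major order.
features = [[float(r * side + c) for c in range(side)] for r in range(side)]
pooled = avg_pool_grid(features, 16)
num_tokens = len(pooled) * len(pooled[0])
print(side * side, "patches ->", num_tokens, "soft tokens")
```

With a larger input resolution the pooling kernel `k` grows, but `num_tokens` stays at 256, which is the property the paragraph above describes.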

Compression is treated as a release format, not an afterthought. Alongside raw checkpoints, the paper provides quantized versions obtained by about 5000 steps of quantization-aware training. The formats target common open inference engines and include per-channel int4, per-block int4, and switched fp8. Distillation is also central: during pretraining, 256 logits are sampled per token, weighted by teacher probabilities, and the student learns the renormalized teacher distribution over those samples. Post-training uses an improved distillation recipe plus RL-style reward optimization.
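The difference between per-channel and per-block int4 can be shown with a generic symmetric absmax quantizer. This is a sketch of plain round-to-nearest quantization, not the report's quantization-aware training recipe; the block size of 32, the [-7, 7] code range, and all function names are assumptions made for illustration.

```python
def quantize_block_int4(values):
    """Symmetric absmax quantization of one group to integer codes in [-7, 7]."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 7 if absmax > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale

def quantize_per_block(weights, block_size=32):
    """Give every run of `block_size` weights its own scale."""
    return [quantize_block_int4(weights[start:start + block_size])
            for start in range(0, len(weights), block_size)]

def dequantize(blocks):
    """Map integer codes back to floats using each group's scale."""
    return [code * scale for codes, scale in blocks for code in codes]

def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)

# One toy weight row with a single outlier in its second half.
row = [0.1] * 32 + [8.0] + [0.1] * 31
per_channel = dequantize([quantize_block_int4(row)])  # one scale for the row
per_block = dequantize(quantize_per_block(row, block_size=32))
print(mean_abs_error(row, per_channel), mean_abs_error(row, per_block))
```

With one scale for the whole row, the outlier forces every small weight to round to zero; with per-block scales, only the block containing the outlier pays that cost, at the price of storing more scales.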

Evidence

The KV-cache ablations support the architectural choice. The paper reports that changing local-to-global ratios has minimal perplexity impact in text-only ablations, and that shorter sliding windows can be used without a large perplexity penalty. At 32K context, a global-only 2B configuration has about 60% memory overhead, while local/global designs with shorter windows reduce the overhead to less than 15%. Figure 6 then compares KV memory growth for the 5:1, sliding-window-1024 design against a global-only 2B model up to 128K.
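The shape of this comparison can be reproduced with a position count. The sketch below tallies how many token positions must be cached across all layers, assuming every layer has the same per-token KV size so that constant cancels in the ratio. The 24-layer depth is a hypothetical value, and the output is not meant to reproduce the paper's percentages, which are measured differently.

```python
def cached_positions(context_len, num_layers, locals_per_global=None, window=1024):
    """Sum over layers of the token positions whose keys/values must be kept.

    locals_per_global=None models a global-only stack; otherwise each group
    of (locals_per_global + 1) layers contains exactly one global layer.
    """
    if locals_per_global is None:
        return num_layers * context_len
    num_global = num_layers // (locals_per_global + 1)
    num_local = num_layers - num_global
    # Local layers never need more than `window` cached positions.
    return num_global * context_len + num_local * min(context_len, window)

NUM_LAYERS = 24  # hypothetical depth, chosen as a multiple of 6
for context_len in (8192, 32768, 131072):
    dense = cached_positions(context_len, NUM_LAYERS)
    mixed = cached_positions(context_len, NUM_LAYERS, locals_per_global=5)
    print(context_len, dense, mixed, round(mixed / dense, 3))
```

The global-only count grows with every layer times the context length, while the interleaved count grows only through its few global layers, so the gap widens as context increases.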

The memory table shows why quantization matters for actual deployment. For 27B, bf16 weights are 54.0 GB and bf16 plus 32K KV cache is 72.7 GB. Int4 lowers those to 14.1 GB and 32.8 GB; SFP8 gives 27.4 GB and 46.1 GB. For 4B, bf16 plus KV cache is 12.7 GB, while int4 plus KV is 7.3 GB.
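The arithmetic behind these figures is easy to check. The sketch below uses only the numbers quoted above, plus the fact that bf16 stores two bytes per parameter; the nominal parameter counts (27B, 12B) and the helper names are simplifications for illustration.

```python
# (weights only, weights + 32K KV cache) in GB, as quoted from Table 3 above.
table3_gb = {
    ("27B", "bf16"): (54.0, 72.7),
    ("27B", "int4"): (14.1, 32.8),
    ("27B", "SFP8"): (27.4, 46.1),
    ("12B", "bf16"): (24.0, 38.9),
}

def kv_overhead_gb(weights_gb, total_gb):
    """Memory attributable to the KV cache: total minus weights."""
    return round(total_gb - weights_gb, 1)

def bf16_weights_gb(params_billions):
    """bf16 uses 2 bytes per parameter, so GB = 2 x billions of parameters."""
    return 2.0 * params_billions

for (model, fmt), (weights, total) in table3_gb.items():
    print(model, fmt, kv_overhead_gb(weights, total))

print(bf16_weights_gb(27), bf16_weights_gb(12))
```

In these quoted numbers, the KV term for 27B stays the same across bf16, int4, and SFP8, so weight quantization shrinks only the weight term and the cache becomes the larger share of the int4 total.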

Quality evidence is broad. Gemma 3 27B IT reaches an Elo of 1338 on Chatbot Arena, above Gemma 2 27B IT at 1220 in the same table. In Table 6, Gemma 3 27B IT reports 67.5 on MMLU-Pro, 29.7 on LiveCodeBench, 89.0 on MATH, 75.1 on Global MMLU-Lite, and 64.9 on MMMU. Long-context results hold up at 128K but are far from saturated: Table 15 gives Gemma 3 27B PT 72.9 on RULER 128K and 60.0 on MRCR 128K, while 27B IT scores 66.0 and 59.3. For vision, Pan and Scan improves 27B DocVQA from 85.6 to 90.4 and InfoVQA from 59.4 to 76.4.

Historical Effect

Gemma 3 is a useful 2025 marker for the point where "small open model" no longer meant short context or text-only capability. Its compute story is the adaptation of a TPU-trained model family to ordinary serving devices: local/global attention to cap KV memory, fixed 256-token image embeddings to bound multimodal sequence cost, QAT checkpoints for common inference engines, and distillation to recover quality in compact models. Its technical story is the integration of long context, multilinguality, vision, and instruction following in one lightweight open family.

Historically, the report makes edge inference part of the base model design rather than a later compression pass. It also documents a cloud-to-device workflow: large TPU pods train the models, but the released artifacts are shaped for phones, laptops, high-end GPUs, and llama.cpp-style quantized runtimes.

Limits

The 128K claim has clear boundaries. The 1B model supports only 32K context, and the paper says performance generalizes to 128K but degrades rapidly when scaling further. For the instruction-tuned models, scores at 128K are lower than at 32K on both RULER and MRCR in Table 15. The report also warns that even with decontamination there is always a risk of benchmark contamination in pretraining probes.

The model is not free of memory pressure. A raw 27B checkpoint with 32K KV cache still exceeds 70 GB, so many practical deployments depend on quantization and task-specific context lengths. Pan and Scan improves document/image tasks but increases inference work when enabled. Finally, the paper gives chip counts and software stack, but not full wall-clock time, energy, or utilization, so training efficiency cannot be reconstructed from the report alone.
