Gemini: A Family of Highly Capable Multimodal Models

Metadata

Reading status: read complete
Year: 2023
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2023-gemini_2023.pdf
Extracted text: 2023-gemini_2023.txt
PDF URL: https://arxiv.org/pdf/2312.11805.pdf
OpenAlex:
Citation count source/date:
Citation count:
Reading card created: 2026-06-15

Compute Setup

The report explicitly centers TPUv5e and TPUv4 infrastructure. It says Gemini models were trained using TPUv5e and TPUv4 depending on model size and configuration, and that Gemini Ultra used a large fleet of Google-owned TPUv4 accelerators across multiple datacenters. The hardware section describes TPUv4 SuperPods of 4096 chips, each connected to a dedicated optical switch that can reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologies in about 10 seconds. Ultra combines SuperPods across multiple datacenters using Google's intra-cluster and inter-cluster networks.
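The SuperPod figures above imply some simple arithmetic: 4096 chips divided into 4x4x4 cubes gives 64 cubes per SuperPod. The sketch below works that out and, as an illustration only, enumerates the box-shaped chip layouts that use all 64 cubes. The helper names and the assumption that a slice tiles every cube into a rectangular box are mine, not the report's; the report says only that the optical switch can form arbitrary 3D torus topologies.

```python
CHIPS_PER_SUPERPOD = 4096   # stated in the report
CUBE_SHAPE = (4, 4, 4)      # stated in the report


def cubes_per_superpod(chips=CHIPS_PER_SUPERPOD, cube=CUBE_SHAPE):
    """Number of 4x4x4 cubes needed to account for every chip in a SuperPod."""
    chips_per_cube = cube[0] * cube[1] * cube[2]
    if chips % chips_per_cube:
        raise ValueError("chip count is not a whole number of cubes")
    return chips // chips_per_cube


def torus_shapes(num_cubes, cube=CUBE_SHAPE):
    """Chip-level (x, y, z) shapes with x <= y <= z that tile all cubes in a box.

    Assumption for illustration: every cube is used and the layout is a
    rectangular box. Real slice scheduling is not described at this level.
    """
    shapes = []
    for a in range(1, num_cubes + 1):
        if num_cubes % a:
            continue
        rest = num_cubes // a
        for b in range(a, rest + 1):
            if rest % b:
                continue
            c = rest // b
            if c >= b:
                shapes.append((a * cube[0], b * cube[1], c * cube[2]))
    return shapes


if __name__ == "__main__":
    n = cubes_per_superpod()
    print(n, "cubes per SuperPod")
    for shape in torus_shapes(n):
        print(shape)
```

Under these assumptions the most symmetric full-pod layout is 16x16x16 chips; elongated shapes such as 4x4x256 are also possible in principle.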

The software stack is JAX, Pathways, GSPMD, XLA, and a MegaScale XLA compiler pass for static collective scheduling. The report says model parallelism is used within SuperPods and data parallelism across SuperPods. Conventional periodic checkpointing to persistent storage would not maintain high goodput at this scale, so Gemini keeps redundant in-memory copies of the model state and recovers directly from an intact replica after an unplanned hardware failure. The report credits this change with raising goodput for the largest-scale training job from 85% to 97%.

Important omissions remain. The Gemini Ultra model card says compute requirements are not reported and model stats are not reported. Parameter count, exact training tokens, total FLOPs, and full chip count for Ultra are therefore not available in the local source.

Bottleneck

At Gemini Ultra scale, the bottleneck is the datacenter as much as the model. Adding accelerators lowers the mean time between failures for the overall job, because a fault on any one device can interrupt a synchronous step. Synchronous training must keep model-parallel and data-parallel collectives moving across SuperPods and datacenters. Checkpointing to persistent storage becomes too slow for high goodput. Silent data corruption (SDC), rare on a single device, becomes frequent enough at fleet scale that the report expects SDC events to affect training every week or two.
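The claim that more accelerators shorten the job's mean time between failures can be made concrete with a standard reliability model. The sketch below assumes independent devices with a constant failure rate (an exponential model), which is my simplification and not something the report states. The 50,000-hour device MTBF is a made-up placeholder, not a Gemini or TPU figure.

```python
import math


def job_mtbf_hours(device_mtbf_hours, num_devices):
    """MTBF of a job that stops whenever any one of its devices fails.

    Assumes independent devices with constant failure rates, so the
    failure rates add and the job MTBF shrinks as 1 / num_devices.
    """
    return device_mtbf_hours / num_devices


def prob_no_failure(device_mtbf_hours, num_devices, window_hours):
    """Probability that no device fails during a window of the given length."""
    return math.exp(-num_devices * window_hours / device_mtbf_hours)


if __name__ == "__main__":
    DEVICE_MTBF = 50_000.0  # hypothetical placeholder, in hours
    for n in (4096, 8192, 16384):
        print(n, "devices:",
              round(job_mtbf_hours(DEVICE_MTBF, n), 2), "h job MTBF,",
              round(prob_no_failure(DEVICE_MTBF, n, 24.0), 4),
              "chance of a failure-free day")
```

Whatever the true per-device figure is, doubling the device count halves the job-level MTBF under this model, which is why recovery time starts to dominate goodput at fleet scale.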

The architecture is also shaped by inference and memory constraints. Gemini is a decoder-only Transformer family trained to support a 32K context length, using efficient attention mechanisms such as multi-query attention. It is natively multimodal across text, image, audio, and video. The family is split by compute target: Ultra for highest capability, Pro for cost and latency, and Nano for memory-constrained on-device deployment.
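To see why multi-query attention matters at 32K context, it helps to size the key-value cache. The sketch below uses the usual accounting (keys plus values, per layer, per KV head, per token). All model dimensions in it are hypothetical, since the report does not publish layer counts, head counts, or head sizes for any Gemini model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Bytes needed to cache keys and values for one forward context.

    The leading factor of 2 covers keys and values. With multi-query
    attention all query heads share one KV head, so num_kv_heads == 1.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_elem * batch_size)


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only to make the ratio visible.
    LAYERS, HEADS, HEAD_DIM, CONTEXT = 32, 32, 128, 32 * 1024

    mha = kv_cache_bytes(LAYERS, HEADS, HEAD_DIM, CONTEXT)  # one KV head per query head
    mqa = kv_cache_bytes(LAYERS, 1, HEAD_DIM, CONTEXT)      # single shared KV head

    print("multi-head cache :", mha / 2**30, "GiB")
    print("multi-query cache:", mqa / 2**20, "MiB")
    print("reduction factor :", mha // mqa)
```

The reduction factor equals the number of query heads regardless of the other dimensions, which is the property that makes long contexts cheaper to serve.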

Method Adaptation

Gemini's method choices reflect both training and serving hardware. Model parallelism within a SuperPod keeps tightly coupled shards on high-speed TPU interconnect, while data parallelism across SuperPods uses datacenter networking for lower-volume synchronization. GSPMD partitions the training step, and the compiler schedules collectives to overlap with computation and reduce step-time variation.
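The split between model parallelism inside a SuperPod and data parallelism across SuperPods can be pictured as two families of communication groups over a grid of devices. The sketch below is a deliberately simplified illustration: the flat device numbering, the function name, and the idea that a whole SuperPod forms one model-parallel group are my assumptions, and the real GSPMD partitioning is not described at this level in the report.

```python
def build_groups(num_superpods, chips_per_superpod):
    """Return (model_groups, data_groups) over flat device ids.

    Device id = superpod_index * chips_per_superpod + shard_index.
    model_groups: one group per SuperPod; members hold different shards of
                  the same replica and talk over the fast in-pod interconnect.
    data_groups:  one group per shard index; members hold the same shard in
                  different replicas and synchronize across SuperPods.
    """
    model_groups = [
        [pod * chips_per_superpod + shard for shard in range(chips_per_superpod)]
        for pod in range(num_superpods)
    ]
    data_groups = [
        [pod * chips_per_superpod + shard for pod in range(num_superpods)]
        for shard in range(chips_per_superpod)
    ]
    return model_groups, data_groups


if __name__ == "__main__":
    # Toy sizes for readability; a real SuperPod has 4096 chips.
    model_groups, data_groups = build_groups(num_superpods=2, chips_per_superpod=4)
    print("model-parallel groups:", model_groups)
    print("data-parallel groups: ", data_groups)
```

Every device appears in exactly one group of each kind, so the heavy intra-replica traffic stays inside a pod while only the cross-replica synchronization crosses the datacenter network.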

Reliability mechanisms become part of the algorithmic surface. Redundant in-memory replicas replace slow checkpoint recovery for many failures. Deterministic replay helps isolate incorrect computations after SDC. Proactive SDC scanners and hot standbys remove suspect hardware. The report frames this deterministic infrastructure as crucial to stable Ultra training.
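The role of deterministic replay in isolating SDC can be shown with a toy example: if a training step is a pure function of its inputs and a step-derived seed, then re-running it must reproduce the same bytes, and any mismatch points at faulty hardware rather than at randomness. Everything below (the update rule, the seeding scheme, the injected corruption, and the function names) is a hypothetical stand-in and does not reflect Gemini's actual tooling.

```python
import hashlib
import random
import struct


def train_step(state, step, seed=0, corrupt=False):
    """Toy deterministic update: randomness depends only on (seed, step)."""
    rng = random.Random(seed * 1_000_003 + step)
    new_state = [w - 0.01 * rng.uniform(-1.0, 1.0) for w in state]
    if corrupt:
        new_state[0] += 1e-3  # simulated silent corruption of one value
    return new_state


def digest(state):
    """Hash of the exact float64 bit patterns, so any changed bit is visible."""
    return hashlib.sha256(struct.pack(f"{len(state)}d", *state)).hexdigest()


def replay_matches(state, step, observed_state, seed=0):
    """Re-run the step from the same inputs and compare against what was observed."""
    return digest(train_step(state, step, seed)) == digest(observed_state)


if __name__ == "__main__":
    start = [0.0, 1.0, -2.0]
    healthy = train_step(start, step=5)
    faulty = train_step(start, step=5, corrupt=True)
    print("healthy output replays identically:", replay_matches(start, 5, healthy))
    print("faulty output replays identically: ", replay_matches(start, 5, faulty))
```

The check only works because the step is fully deterministic; with nondeterministic kernels or unordered reductions a mismatch would not be evidence of corruption.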

At the model level, Gemini is trained multimodally from the beginning rather than attaching vision/audio components only after text pretraining. It supports 32K context and uses efficient attention mechanisms. Nano is trained in two versions, 1.8B and 3.25B parameters, by distilling from larger Gemini models, making the family cover both TPU fleet training and on-device inference.

Evidence

The systems evidence is the goodput result: for the largest-scale training job, redundant in-memory recovery improves overall goodput from 85% to 97% compared with prior PaLM/PaLM-2-style approaches. The report also describes SDC handling as an observed training requirement rather than a hypothetical risk.
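The 85% to 97% goodput change is easier to appreciate when converted into throughput and lost time. The short calculation below uses only the two percentages from the report; the function names are mine, and goodput is treated simply as the fraction of wall-clock time spent on useful training steps.

```python
def relative_gain(old_goodput, new_goodput):
    """Extra useful work per wall-clock hour, as a fraction of the old rate."""
    return new_goodput / old_goodput - 1.0


def wall_clock_saving(old_goodput, new_goodput):
    """Fraction of wall-clock time saved for the same amount of useful work."""
    return 1.0 - old_goodput / new_goodput


def lost_time_ratio(old_goodput, new_goodput):
    """How many times smaller the non-productive share of time becomes."""
    return (1.0 - old_goodput) / (1.0 - new_goodput)


if __name__ == "__main__":
    OLD, NEW = 0.85, 0.97  # goodput figures stated in the report
    print(f"useful work per hour: +{relative_gain(OLD, NEW):.1%}")
    print(f"wall-clock time saved: {wall_clock_saving(OLD, NEW):.1%}")
    print(f"lost time shrinks by a factor of {lost_time_ratio(OLD, NEW):.1f}")
```

In other words, the same fleet delivers roughly 14% more useful training per hour, and the share of time lost to failures and recovery falls from 15% to 3%.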

The benchmark evidence shows why the infrastructure was built. Gemini Ultra is reported to advance state of the art in 30 of 32 benchmarks in the report, including all 20 multimodal benchmarks examined. On MMLU, Ultra reaches 90.04%, above the human-expert benchmark of 89.8% and above the prior reported state of the art of 86.4%. The MMLU result is tied to inference-time compute: the report uses uncertainty-routed chain-of-thought, which draws k chain-of-thought samples (for example 8 or 32), takes the majority answer when its consensus exceeds a validation-tuned threshold, and otherwise falls back to the greedy maximum-likelihood answer produced without chain of thought. In the appendix, Ultra improves from 84.0% with greedy sampling to 90.0% with uncertainty-routed chain-of-thought at 32 samples, while 32 chain-of-thought samples alone reach only 85.0%.
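The routing rule described above is small enough to sketch directly. The version below takes already-extracted final answers from k chain-of-thought samples plus one greedy answer, and measures confidence as the vote share of the most common answer. The function name, the use of `>=` at the threshold, and the example threshold of 0.7 are my assumptions; the report says only that the threshold is selected on a validation split.

```python
from collections import Counter


def uncertainty_routed_answer(cot_answers, greedy_answer, threshold):
    """Pick the majority chain-of-thought answer if consensus is high enough.

    cot_answers:   final answers extracted from k sampled chains of thought
    greedy_answer: answer from a single greedy decode
    threshold:     minimum vote share needed to trust the majority
    """
    if not cot_answers:
        return greedy_answer
    answer, votes = Counter(cot_answers).most_common(1)[0]
    consensus = votes / len(cot_answers)
    return answer if consensus >= threshold else greedy_answer


if __name__ == "__main__":
    THRESHOLD = 0.7  # hypothetical value; the report tunes this on validation data

    confident = ["B", "B", "B", "C", "B", "B", "C", "B"]   # 6 of 8 agree
    scattered = ["A", "B", "C", "B", "D", "A", "C", "B"]   # best answer has 3 of 8

    print(uncertainty_routed_answer(confident, "C", THRESHOLD))  # majority is used
    print(uncertainty_routed_answer(scattered, "C", THRESHOLD))  # greedy fallback
```

The design trades extra inference compute (k samples per question) for accuracy, and the fallback keeps low-consensus questions from being decided by a noisy plurality.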

Other reported evidence includes GSM8K at 94.4% with chain-of-thought and self-consistency, MATH at 53.2% with 4-shot prompting, 32% on 2022-2023 AMC problems, HumanEval at 74.4%, Natural2Code at 74.9%, and MMMU at 62.4%. Both training-scale compute and inference-time sample routing are part of the final capability story.

Historical Effect

Gemini is a clear example of a foundation model becoming a datacenter-scale distributed system. Its historical importance for compute structure is not just that it used TPUs, but that training reliability, optical topology, compiler scheduling, deterministic replay, and hot standbys are presented as necessary ingredients for the model to exist. The method is inseparable from the fleet.

It also links hyperscale dense multimodal training with inference-time reasoning. The MMLU result depends on uncertainty-routed multi-sample reasoning, while Ultra, Pro, and Nano cover different latency and memory budgets.

Limits

The report omits parameter counts for Pro and Ultra, exact training tokens, total training FLOPs, and precise compute requirements. This card should therefore avoid turning "large fleet" into a numeric chip total beyond the stated SuperPod size and multi-datacenter description. Benchmark interpretation also has caveats. The report discusses contamination risk and chooses not to report some results after leakage analysis. It also notes that a small amount of additional task-specific fine-tuning can change HellaSwag validation results, which suggests that benchmark scores are sensitive to pretraining dataset composition.

Finally, Gemini is a product-facing model family with post-training, safety, and deployment constraints that are not fully reproducible from the paper. The source supports a strong systems-compute story, but not a complete open accounting of model size or training cost.