Source: History of AI Compute Architecture

                                                                                                                                                            2025-03-12




                                         Gemma 3 Technical Report
                                         Gemma Team, Google DeepMind1


We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging
in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider
coverage of languages and longer context – at least 128K tokens. We also change the architecture of
the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by
increasing the ratio of local to global attention layers, and keeping the span on local attention short.
The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2
for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe
significantly improves the math, chat, instruction-following and multilingual abilities, making
Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across
benchmarks. We release all our models to the community.

arXiv:2503.19786v1 [cs.CL] 25 Mar 2025


1. Introduction

We present the newest version of Gemma open language models (Gemma Team, 2024a), co-designed with
the family of Gemini frontier models (Gemini Team, 2023). This new version comes in sizes comparable
to Gemma 2 (Gemma Team, 2024b), with the addition of a 1B model. These models are designed to run on
standard consumer-grade hardware such as phones, laptops, and high-end GPUs. This version comes
with several new abilities to the Gemma family; namely, multimodality, long context, and
multilinguality, while preserving or surpassing the performance of prior versions.

In terms of multimodality, most Gemma 3 models are compatible with a tailored version of the
SigLIP vision encoder (Zhai et al., 2023). The language models treat images as a sequence of
soft tokens encoded by SigLIP. We reduce the inference cost of image processing by condensing
the vision embeddings into a fixed size of 256 vectors. The encoder works at a fixed resolution
and we take inspiration from LLaVA (Liu et al., 2024) to enable flexible resolutions with a Pan
and Scan (P&S) method.

The second main architectural improvement is an increase in context size to 128K tokens, without
reducing performance. A challenge with long context is the memory explosion of the KV cache
during inference. To reduce this issue, we interleave multiple local layers between each global
layer, and assign a smaller span of only 1024 tokens to the local layers. Therefore, only the
global layers attend to long context, and we have 1 global for every 5 local layers.

The pre-training optimization recipe is similar to Gemma 2, with some modifications in the
architecture design. We use the same tokenizer as Gemini 2.0, and we also revisit our data mixture
to improve the multilingual capabilities of the models, while introducing image understanding.
All Gemma 3 models are trained with knowledge distillation (Hinton et al., 2015).

In post-training, we focus our efforts on improving mathematics, reasoning, and chat abilities,
as well as integrating the new capabilities of Gemma 3, long-context, and image inputs. We
use a novel post-training approach that brings gains across all capabilities, including math,
coding, chat, instruction following, and multilingual. The resulting Gemma 3 instruction-tuned
models are both powerful and versatile, outperforming their predecessors by a wide margin.

In the following sections, we provide a brief overview of our models, including the architecture
and pre- and post-training recipes. We also provide detailed evaluations across a wide variety
of quantitative and qualitative benchmarks. We discuss our approach to safe and responsible
deployment and outline the broader implications of Gemma 3, its limitations, and advantages.
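
As a rough illustration of why this interleaving shrinks the KV cache, the sketch below counts how many token positions each layer has to keep cached. It assumes that every layer uses the same number of KV heads and the same head size, so that memory is proportional to the position count. The 6-layer block and the function names are hypothetical; the 5:1 ratio and the 1024-token span come from the text above.

```python
def cached_positions(context_len: int, num_layers: int,
                     local_per_global: int = 5, window: int = 1024) -> int:
    """Count token positions kept in the KV cache, summed over all layers."""
    total = 0
    for layer in range(num_layers):
        # Every (local_per_global + 1)-th layer is global; the rest are local.
        is_global = (layer + 1) % (local_per_global + 1) == 0
        # Global layers cache the full context; local layers only their window.
        total += context_len if is_global else min(context_len, window)
    return total


def cache_ratio(context_len: int, num_layers: int = 6) -> float:
    """Interleaved cache size relative to an all-global model of equal depth."""
    return cached_positions(context_len, num_layers) / (context_len * num_layers)


if __name__ == "__main__":
    for tokens in (1024, 32 * 1024, 128 * 1024):
        print(f"{tokens:>7} tokens: {cache_ratio(tokens):.3f} of global-only")
```

Under these assumptions the ratio tends toward 1/6 as the context grows, because only the single global layer in each block keeps a cache that scales with context length.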


                                         1 See Contributions and Acknowledgments section for full author list. Please send correspondence to gemma-3-report@google.com.

                                         © 2025 Google DeepMind. All rights reserved



Model    Vision Encoder    Embedding Parameters    Non-embedding Parameters
1B                    0                    302M                        698M
4B                 417M                    675M                      3,209M
12B                417M                  1,012M                     10,759M
27B                417M                  1,416M                     25,600M

Table 1 | Parameter counts for the Gemma 3 models. Our vocabulary has 256k entries.
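
To make the split in Table 1 easier to read, the short sketch below adds up the three columns per model and computes the share taken by the embedding table. The numbers are copied from Table 1; the dictionary layout and function names are only illustrative.

```python
# Parameter counts in millions, copied from Table 1.
TABLE_1 = {
    "1B": {"vision_encoder": 0, "embedding": 302, "non_embedding": 698},
    "4B": {"vision_encoder": 417, "embedding": 675, "non_embedding": 3209},
    "12B": {"vision_encoder": 417, "embedding": 1012, "non_embedding": 10759},
    "27B": {"vision_encoder": 417, "embedding": 1416, "non_embedding": 25600},
}


def total_millions(model: str) -> int:
    """Sum of all three columns for one model, in millions of parameters."""
    return sum(TABLE_1[model].values())


def embedding_share(model: str) -> float:
    """Fraction of the total parameters that sit in the embedding table."""
    return TABLE_1[model]["embedding"] / total_millions(model)


if __name__ == "__main__":
    for name in TABLE_1:
        print(f"{name}: {total_millions(name)}M total, "
              f"{embedding_share(name):.1%} in embeddings")
```

The embedding share falls as the models grow, since the vocabulary is shared while the transformer stack gets larger.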


Figure 1 | Example of visual interaction with Gemma 3 27B IT model.

2. Model Architecture

Gemma 3 models follow the same general decoder-only transformer architecture as previous
iterations (Vaswani et al., 2017), with most architecture elements similar to the first two
Gemma versions. We use a Grouped-Query Attention (GQA) (Ainslie et al., 2023) with post-norm
and pre-norm with RMSNorm (Zhang and Sennrich, 2019). Inspired by Dehghani et al. (2023),
Wortsman et al. (2023) and Chameleon Team (2024), we replace the soft-capping of Gemma 2
with QK-norm. In this section, we focus on some key differences from previous versions.

5:1 interleaving of local/global layers. We alternate between a local sliding window
self-attention (Beltagy et al., 2020) and global self-attention (Luong et al., 2015), with a
pattern of 5 local layers for every global layer, starting with a local layer as the first
layer of the model.

Long context. Gemma 3 models support context length of 128K tokens, with the exception of the
1B model that has 32K. We increase RoPE base frequency from 10k to 1M on global self-attention
layers, and keep the frequency of the local layers at 10k. We follow a process similar to the
positional interpolation of Chen et al. (2023) to extend the span of the global self-attention
layers.

2.1. Vision modality

Vision encoder. We use a 400M variant of the SigLIP encoder (Zhai et al., 2023), a Vision
Transformer (Dosovitskiy, 2020) trained with a variation of the CLIP loss (Radford et al., 2021).
The Gemma vision encoder takes as input square images resized to 896 × 896, and is finetuned on
data from visual assistant tasks. For simplicity, we share the vision encoder across our 4B, 12B,
and 27B models, keeping it frozen during training.

Pan & Scan (P&S). The Gemma vision encoder operates at a fixed resolution of 896 × 896. This
results in artifacts when processing non-square aspect ratios and high-resolution images, leading
to unreadable text, or small objects disappearing. We address this issue with an adaptive
windowing algorithm during inference. This algorithm segments images into non-overlapping crops
of equal size, covering the whole image, and resizes them to 896 × 896 pixels to pass them to the
encoder. This windowing is applied only when necessary, and we control the maximum number of
crops. It is an inference-time only optimization and can be disabled for faster inference.
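
Returning to the 5:1 interleaving and the RoPE settings described earlier in this section, the sketch below lays out one way to enumerate the per-layer configuration. It is only an illustration: `LayerSpec`, `layer_schedule`, and the 12-layer depth in the usage example are hypothetical names and values, while the pattern (five local layers, then one global, starting with a local layer), the RoPE base frequencies (10k local, 1M global), and the 1024-token local span are taken from the report.

```python
from dataclasses import dataclass
from typing import List, Optional

LOCAL_ROPE_BASE = 10_000.0      # base frequency kept for local layers
GLOBAL_ROPE_BASE = 1_000_000.0  # base frequency raised for global layers


@dataclass(frozen=True)
class LayerSpec:
    index: int
    kind: str               # "local" or "global"
    rope_base: float
    window: Optional[int]   # sliding-window span; None means full context


def layer_schedule(num_layers: int, local_per_global: int = 5,
                   window: int = 1024) -> List[LayerSpec]:
    """Build the per-layer attention plan, starting with a local layer."""
    period = local_per_global + 1
    specs = []
    for i in range(num_layers):
        # The last slot of each period is global; all earlier slots are local.
        if i % period == local_per_global:
            specs.append(LayerSpec(i, "global", GLOBAL_ROPE_BASE, None))
        else:
            specs.append(LayerSpec(i, "local", LOCAL_ROPE_BASE, window))
    return specs


if __name__ == "__main__":
    for spec in layer_schedule(12):  # 12 layers is an arbitrary example depth
        print(spec)
```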





                             Shards
Model   Type     #Chips   Data   Seq.   Replica
1B      TPUv5e      512     16     16         2
4B      TPUv5e     2048     16     16         8
12B     TPUv4      6144     16     16        24
27B     TPUv5p     6144     24      8        32

Table 2 | Training infrastructure with sharding by data, sequence (Seq.), and replica.

         Raw (GB)   Quantized (GB)
Model    bf16       Int4    Int4 (blocks=32)    SFP8
1B        2.0        0.5          0.7            1.0
+KV       2.9        1.4          1.6            1.9
4B        8.0        2.6          2.9            4.4
+KV      12.7        7.3          7.6            9.1
12B      24.0        6.6          7.1           12.4
+KV      38.9       21.5         22.0           27.3
27B      54.0       14.1         15.3           27.4
+KV      72.7       32.8         34.0           46.1

Table 3 | Memory footprints (in GB) comparison between raw (bfloat16) and quantized checkpoints
for weights and KV caching (+KV) at 32,768 context size, quantized in 8 bits.

2.2. Pre-training

We follow a similar recipe as in Gemma 2 for pre-training with knowledge distillation.

Training data. We pre-train our models on a slightly larger token budget than Gemma 2, i.e.,
we train on 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T tokens
for the 1B. The increase in tokens accounts for the mix of images and text used during
pre-training. We also increase the amount of multilingual data to improve language coverage. We
add both monolingual and parallel data, and we handle the imbalance in language representation
using a strategy inspired by Chung et al. (2023).

Tokenizer. We use the same tokenizer as Gemini 2.0: a SentencePiece tokenizer with split digits,
preserved whitespace, and byte-level encodings (Kudo and Richardson, 2018). The resulting
vocabulary has 262k entries. This tokenizer is more balanced for non-English languages.

Filtering. We use filtering techniques that reduce the risk of unwanted or unsafe utterances and
remove certain personal information and other sensitive data. We decontaminate evaluation sets
from our pre-training data mixture, and reduce the risk of recitation by minimizing the
proliferation of sensitive outputs. We also apply a quality reweighing step inspired by Sachdeva
et al. (2024) to reduce occurrences of low quality data.

Distillation. We sample 256 logits per token, weighted by teacher probabilities. The student
learns the teacher’s distribution within these samples via cross-entropy loss. The teacher’s
target distribution is set to zero probability for non-sampled logits, and renormalized.

2.3. Quantization Aware Training

Along with the raw checkpoints, we also provide quantized versions of our models in different
standard formats. These versions are obtained by finetuning each model for a small number of
steps, typically 5,000, using Quantization Aware Training (QAT) (Jacob et al., 2018). We use
probabilities from the non-quantized checkpoint as targets, and adapt the data to match the
pre-training and post-training distributions. Based on the most popular open source quantization
inference engines (e.g. llama.cpp), we focus on three weight representations: per-channel int4,
per-block int4, and switched fp8. In Table 3, we report the memory filled by raw and quantized
models for each weight representation with and without a KV-cache for a sequence of 32k tokens.

2.4. Compute Infrastructure

We train our models with TPUv4, TPUv5e, and TPUv5p as outlined in Table 2. Each model
configuration is optimized to minimize training step time. For the vision encoder, we pre-compute
the embeddings for each image and directly train with the embeddings, adding no cost to the
training of the language models.

The optimizer state is sharded using an implementation of ZeRO-3 (Ren et al., 2021). For
multi-pod training, we perform a data replica reduction over the data center network, using the
Pathways approach of Barham et al. (2022). We use the ‘single controller’ programming paradigm
of Jax (Roberts et al., 2023) and Pathways (Barham et al., 2022), along with the GSPMD
partitioner (Xu et al., 2021) and the MegaScale XLA compiler (XLA, 2019).

3. Instruction-Tuning

Pre-trained models are turned into instruction-tuned models with an improved post-training
approach compared to our prior recipe (see Table 6).

Techniques. Our post-training approach relies on an improved version of knowledge distillation
(Agarwal et al., 2024; Anil et al., 2018; Hinton et al., 2015) from a large IT teacher, along
with an RL finetuning phase based on improved versions of BOND (Sessa et al., 2024), WARM (Ramé
et al., 2024b), and WARP (Ramé et al., 2024a).

Reinforcement learning objectives. We use a variety of reward functions to improve helpfulness,
math, coding, reasoning, instruction-following, and multilingual abilities, while minimizing
model harmfulness. This includes learning from weight averaged reward models (Ramé et al., 2024b)
trained with human feedback data, code execution feedback (Gehring et al., 2024), and
ground-truth rewards for solving math problems (DeepSeek-AI, 2025; Lambert et al., 2024).

Data filtering. We carefully optimize the data used in post-training to maximize model
performance. We filter examples that show certain personal information, unsafe or toxic model
outputs, mistaken self-identification data, and duplicated examples. Including subsets of data
that encourage better in-context attribution, hedging, and refusals to minimize hallucinations
also improves performance on factuality metrics, without degrading model performance on other
metrics.

[BOS] token. For both PT and IT models, text starts with a [BOS] token, which needs to be added
explicitly since the text “[BOS]” does not map to the [BOS] token. For instance, Flax has an
option, add_bos=True, to add this token automatically when tokenizing. An example of the
formatting for an IT model is shown in Table 4.

PT versus IT Formatting. All models share the same tokenizer, with some control tokens dedicated
to IT formatting. A key difference is that PT models output a <eos> token at the end of
generation, while IT models output a <end_of_turn> at the end of the generation, as shown for IT
in Table 4. Fine-tuning either model type thus also requires adding their respective end tokens.

     Context              Formatting
     User turn            <start_of_turn>user
     Model turn           <start_of_turn>model
     End of turn          <end_of_turn>

     Example of discussion:
     User: Who are you?
     Model: My name is Gemma!
     User: What is 2+2?
     Model: 2+2=4.

     Model input:
     [BOS]<start_of_turn>user
     Who are you?<end_of_turn>
     <start_of_turn>model
     My name is Gemma!<end_of_turn>
     <start_of_turn>user
     What is 2+2?<end_of_turn>
     <start_of_turn>model

     Model output:
     2+2=4.<end_of_turn>

Table 4 | Formatting for Gemma IT models. Explicitly add the [BOS] token after tokenization, or
use the add_bos=True option in the tokenizer. Do not tokenize the text "[BOS]".

4. Evaluation of final models

In this section, we evaluate the IT models over a series of automated benchmarks and human
evaluations across a variety of domains, as well as static benchmarks such as MMLU.

4.1. LMSYS Chatbot Arena

In this section, we report the performance of our IT 27B model on LMSys Chatbot Arena (Chiang
et al., 2024) in blind side-by-side evaluations by human raters against other state-of-the-art
models. We report Elo scores in Table 5. Gemma 3 27B





     Rank    Model                                  Elo     95% CI     Open   Type    #params/#activated
     1       Grok-3-Preview-02-24                  1412    +8/-10       -       -             -
     1       GPT-4.5-Preview                       1411    +11/-11      -       -             -
     3       Gemini-2.0-Flash-Thinking-Exp-01-21   1384     +6/-5       -       -             -
     3       Gemini-2.0-Pro-Exp-02-05              1380     +5/-6       -       -             -
     3       ChatGPT-4o-latest (2025-01-29)        1377     +5/-4       -       -             -
     6       DeepSeek-R1                           1363     +8/-6      yes     MoE        671B/37B
     6       Gemini-2.0-Flash-001                  1357     +6/-5       -       -             -
     8       o1-2024-12-17                         1352     +4/-6       -       -             -
     9       Gemma-3-27B-IT                        1338     +8/-9      yes    Dense         27B
     9       Qwen2.5-Max                           1336     +7/-5       -       -             -
     9       o1-preview                            1335     +4/-3       -       -             -
     9       o3-mini-high                          1329     +8/-6       -       -             -
     13      DeepSeek-V3                           1318     +8/-6      yes     MoE        671B/37B
     14      GLM-4-Plus-0111                       1311     +8/-8       -       -             -
     14      Qwen-Plus-0125                        1310     +7/-5       -       -             -
     14      Claude 3.7 Sonnet                     1309    +9/-11       -       -             -
     14      Gemini-2.0-Flash-Lite                 1308     +5/-5       -       -             -
     18      Step-2-16K-Exp                        1305     +7/-6       -       -             -
     18      o3-mini                               1304     +5/-4       -       -             -
     18      o1-mini                               1304     +4/-3       -       -             -
     18      Gemini-1.5-Pro-002                    1302     +3/-3       -       -             -
     ...
     28      Meta-Llama-3.1-405B-Instruct-bf16     1269     +4/-3      yes    Dense         405B
     ...
     38      Llama-3.3-70B-Instruct                1257     +5/-3      yes    Dense          70B
     ...
     39      Qwen2.5-72B-Instruct                  1257     +3/-3      yes    Dense          72B
     ...
     59      Gemma-2-27B-it                        1220     +3/-2      yes    Dense          27B

Table 5 | Evaluation of Gemma 3 27B IT model in the Chatbot Arena (Chiang et al., 2024). All the
models are evaluated against each other through blind side-by-side evaluations by human raters. Each
model is attributed a score, based on the Elo rating system. Gemma-3-27B-IT numbers are preliminary
results received on March 8, 2025.
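
To give a feel for what a gap in Table 5 means, the sketch below converts an Elo difference into an expected head-to-head score. It assumes the textbook Elo logistic curve with a 400-point scale; how Chatbot Arena actually fits its ratings is not described here and may differ, so the output is indicative only. The function name is illustrative.

```python
def expected_score(elo_a: float, elo_b: float) -> float:
    """Expected score of A against B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))


if __name__ == "__main__":
    gemma3_27b_it = 1338  # Elo from Table 5
    gemma2_27b_it = 1220  # Elo from Table 5
    p = expected_score(gemma3_27b_it, gemma2_27b_it)
    print(f"Expected score of Gemma-3-27B-IT vs Gemma-2-27B-it: {p:.2f}")
```

Under this model, equal ratings give an expected score of 0.5, and the two directions of any pairing sum to 1.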


IT (1338) is among the top 10 best models, with a score above other non-thinking open models,
such as DeepSeek-V3 (1318), LLaMA 3.1 405B (1269), and Qwen2.5-72B (1257), which are much larger
models. Finally, the Elo of Gemma 3 is significantly higher than that of Gemma 2 (1220). Note
that Elo scores do not take into account visual abilities, which none of the aforementioned
models have.

4.2. Standard benchmarks

In Table 6, we show the performance of our final models across a variety of benchmarks compared
to our previous model iteration, and Gemini 1.5. We do not compare directly with external models
that often report their own evaluation settings, since running them in our setting does not
guarantee a fair comparison. We encourage the reader to follow third-party static leaderboards
for a fairer comparison across models. We include additional evaluations of our models on other
benchmarks in the appendix.

5. Ablations

In this section, we focus on the impact of our architecture changes, as well as some of the
vision abilities new to this model.

5.1. Pre-training ability probing

We use several standard benchmarks as probes during pre-training to ensure our models capture
general abilities, and in Figure 2, we compare the quality of pre-trained models from Gemma 2 and
3 across these general abilities, namely, science,





                      Gemini 1.5      Gemini 2.0                Gemma 2                                Gemma 3
                      Flash   Pro    Flash    Pro        2B                 9B       27B       1B       4B      12B          27B
   MMLU-Pro           67.3    75.8    77.6   79.1       15.6            46.8         56.9     14.7     43.6     60.6        67.5
   LiveCodeBench      30.7    34.2    34.5   36.0       1.2             10.8         20.4      1.9     12.6     24.6        29.7
   Bird-SQL (dev)     45.6    54.4    58.7   59.3       12.2            33.8         46.7      6.4     36.3     47.9        54.4
   GPQA Diamond       51.0    59.1    60.1   64.7       24.7            28.8         34.3     19.2     30.8     40.9        42.4
   SimpleQA            8.6    24.9    29.9   44.3        2.8             5.3          9.2      2.2      4.0      6.3        10.0
   FACTS Grounding    82.9    80.0    84.6   82.8       43.8            62.0         62.4     36.4     70.1     75.8        74.9
   Global MMLU-Lite   73.7    80.8    83.4   86.5       41.9            64.8         68.6     34.2     54.5     69.5        75.1
   MATH               77.9    86.5    90.9   91.8       27.2            49.4         55.6     48.0     75.6     83.8        89.0
   HiddenMath         47.2    52.0    63.5   65.2        1.8            10.4         14.8     15.8     43.0     54.5        60.3
   MMMU (val)         62.3    65.9    71.7   72.7         -                 -         -         -      48.8     59.6        64.9

Table 6 | Performance of instruction fine-tuned (IT) models compared to Gemini 1.5, Gemini 2.0, and
Gemma 2 on zero-shot benchmarks across different abilities.




Figure 2 | Summary of the performance of different pre-trained models from Gemma 2 and 3 across
general abilities. These plots are meant to give a simplified summary and details are in the appendix.


code, factuality, multilinguality, reasoning, and                           0.1                                        2B
vision. The details of the performance across the                                                                      9B

                                                               Perplexity
different public benchmarks used in these plots                             0.0
are summarized in the appendix. Overall, we see
that the new versions improve in most categories,                           0.1
despite the addition of vision. We particularly                                1:1          3:1           5:1               7:1
focus on multilinguality in this version, and this                                             Local:Global
directly impacts the quality of our models. How-
ever, despite the use of decontamination tech-        Figure 3 | Impact of Local:Global ratio on the
niques, there is always a risk of contamination       perplexity on a validation set. The impact is mini-
of these probes (Mirzadeh et al., 2024), making       mal, even with 7-to-1 local to global. This ablation
more definitive conclusions harder to assess.         is run with text-only models.



5.2. Local:Global attention layers                    ent ratios of local to global attention layers. 1:1
                                                      is used in Gemma 2 models, and 5:1 is used in
We measure the impact of changes to local and         Gemma 3. We observe minimal impact on per-
global self-attention layers on performance and       plexity when changing this ratio.
memory consumption during inference.
                                                      Sliding window size. In Fig. 4, we compare
Local:Global ratio. In Fig. 3, we compare differ-     different sliding window sizes for the local at-


                                                                                                                                   6
                                                                                                 Gemma 3 Technical Report



tention layers in different global:local ratio configurations. The sliding window can be reduced significantly without impacting perplexity.

[Figure 6: KV cache memory (MB) versus context length (1K to 128K) for "2B L:G=5:1, sw=1024" and "2B global only".]

Figure 6 | KV cache memory versus context length. We show the memory usage of the KV cache for our architecture (L:G=5:1, sw=1024) and a transformer with global attention only, as used in LLaMa or Gemma 1.

[Figure 4: perplexity versus sliding window size (512, 1024, 2048, 4096) for "2B L:G=1:1" and "2B L:G=3:1".]

Figure 4 | Impact of Sliding Window size on perplexity measured on a validation set. We consider two 2B models, with 1:1 and 1:3 local to global layer
ratios. This ablation is run with text-only models.                                                             quences and then scale the 4B, 12B, and 27B mod-
                                                                                                                els up to 128K tokens at the end of pre-training
Impact on KV cache memory. In Fig. 5, we show                                                                   while rescaling RoPE (Chen et al., 2023). We
the balance between the memory used by the                                                                      find a scaling factor of 8 to work well in practice.
model and the KV cache during inference with a                                                                  Note that compared to Gemma 2, we have also
context of 32k tokens. The “global only” configu-                                                               increased the RoPE base frequency of global self-
ration is the standard configuration used across                                                                attention layers from 10k to 1M, while keeping
most dense models. The “1:1, sw=4096” is used                                                                   10k for the local self-attention layers. In Figure 7,
in Gemma 2. We observe that the “global only”                                                                   we show the impact on perplexity for different
configuration results in a memory overhead of                                                                   context lengths. Our models generalize to 128K,
60%, while this is reduced to less than 15% with                                                                but rapidly degrade as we continue to scale.
1:3 and sliding windows of 1024 (“sw=1024”).
In Fig. 6, we compute the memory used by the
KV cache as a function of the context length with
either our 2B architecture (L:G=5:1, sw=1024)
versus a “global only” 2B model.
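The trends in Fig. 5 and Fig. 6 follow from a simple count: a global layer caches keys and values for every token in the context, while a local layer only needs the most recent sw tokens. The sketch below is a back-of-the-envelope estimate under stated assumptions, not the measurement code behind the figures. The layer count, number of KV heads, head dimension, and 2-byte values are hypothetical placeholders, and each group of layers is assumed to hold `local_per_global` local layers followed by one global layer.

```python
def kv_cache_bytes(context_len, num_layers, local_per_global, window,
                   num_kv_heads, head_dim, bytes_per_value=2):
    """Rough KV-cache size in bytes for a mix of local and global attention layers.

    local_per_global=None models a "global only" stack. All shape arguments are
    hypothetical inputs chosen by the caller, not values taken from the report.
    """
    if local_per_global is None:
        num_global = num_layers
    else:
        # Assumed pattern: local_per_global local layers, then one global layer.
        num_global = num_layers // (local_per_global + 1)
    num_local = num_layers - num_global

    # Keys and values (factor 2) for one token in one layer.
    bytes_per_token = 2 * num_kv_heads * head_dim * bytes_per_value

    # Global layers keep the whole context; local layers keep at most `window` tokens.
    cached_tokens = num_global * context_len + num_local * min(context_len, window)
    return bytes_per_token * cached_tokens


# Hypothetical 2B-like shape, used only to show the trend.
shape = dict(num_layers=24, num_kv_heads=4, head_dim=256)
for context_len in (1024, 32 * 1024, 128 * 1024):
    global_only = kv_cache_bytes(context_len, local_per_global=None, window=0, **shape)
    hybrid = kv_cache_bytes(context_len, local_per_global=5, window=1024, **shape)
    print(f"{context_len:>7} tokens: global only {global_only / 2**20:8.0f} MiB, "
          f"L:G=5:1 sw=1024 {hybrid / 2**20:8.0f} MiB")
```

With these placeholder shapes the two configurations coincide at 1K tokens, where the whole context fits inside the window, and diverge as the context grows, which is the qualitative shape of Fig. 6.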
[Figure 5: inference memory (MB), split into model and KV cache, for the configurations "global only", "1:1, sw=4096", "1:1 sw=1024", "1:3 sw=4096", and "1:3 sw=1024".]

Figure 5 | Model versus KV cache memory during inference with a pre-fill KV cache of size 32k. We consider a 2B model with different local to global ratios and sliding window sizes (sw). We compare to global only, which is the standard used in Gemma 1 and Llama. This ablation is run with a text-only model.

Figure 7 | Long context performance of pre-trained models before and after RoPE rescaling.
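As a minimal sketch of the two RoPE changes described in Section 5.3 (a larger base frequency for global layers and rescaling by a factor of 8), the code below computes rotary angles in plain Python. It assumes that rescaling divides the position index by the factor, as in positional interpolation (Chen et al., 2023), and that adjacent dimension pairs are rotated together. The function and parameter names are illustrative and do not come from the report.

```python
import math


def rope_angles(position, head_dim, base=10_000.0, scale=1.0):
    """Rotation angles (radians) for one token position.

    base  -- RoPE base frequency (the report uses 10k for local, 1M for global layers).
    scale -- assumed positional-interpolation factor: positions are divided by it.
    """
    scaled_position = position / scale
    return [scaled_position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(vector, position, base=10_000.0, scale=1.0):
    """Rotate consecutive (even, odd) pairs of `vector` by the angles for `position`."""
    rotated = []
    for i, theta in enumerate(rope_angles(position, len(vector), base, scale)):
        x, y = vector[2 * i], vector[2 * i + 1]
        rotated.append(x * math.cos(theta) - y * math.sin(theta))
        rotated.append(x * math.sin(theta) + y * math.cos(theta))
    return rotated


# With scale=8, position 128K is mapped onto the angles of position 16K.
long_ctx = rope_angles(128 * 1024, head_dim=8, base=1_000_000.0, scale=8.0)
reference = rope_angles(16 * 1024, head_dim=8, base=1_000_000.0)
print(long_ctx == reference)
```

Under this reading, rescaling compresses long positions back into the angle range of shorter ones, while the larger base lowers the rotation frequencies used by the global layers.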


5.3. Enabling long context                                                                                      5.4. Small versus large teacher

Instead of training with 128K sequences from                                                                    A common finding is that, to train a small model,
scratch, we pre-train our models with 32K se-                                                                   it is preferable to distill from a smaller teacher.





[Figure 8: relative difference of perplexity versus total training tokens (B), with a logarithmic x-axis from 10^1 to 10^2.]

Figure 8 | Small versus large teacher. Relative difference of perplexity when using a small and large teacher as a function of the number of training tokens. Smaller numbers mean distilling from a larger teacher is better.

We suspect this is because these studies are often performed in settings where the regularization effect of using a worse teacher surpasses the benefit of using a better teacher. We train a student with two teachers of different sizes, one large and one small, for different training horizons. In Fig. 8, we observe that for short training horizons, the smaller teacher is better, but the trend is reversed for longer training.

5.5. Vision encoder

   Resolution    DocVQA    InfoVQA    TextVQA
      256         31.9      23.1       44.1
      448         45.4      31.6       53.5
      896         59.8      33.7       58.0

Table 7 | Impact of image encoder input resolution. We measure performance using a short schedule 2B Gemma model on a few evaluation benchmarks to observe the effect of input image resolution on vision encoder pre-training.

Impact of image resolution. We use a vision encoder based on SigLIP (Zhai et al., 2023). The vision encoder is frozen, and only the language model is trained. Each image in this multimodal data is represented by 256 image tokens from the respective vision encoder. The higher resolution encoders thus use average pooling to reduce their output to 256 tokens. For instance, the 896 resolution encoder has a 4x4 average pooling on its output. As shown in Table 7, higher resolution encoders perform better than smaller ones.

                  DocVQA    InfoVQA    TextVQA
   4B              72.8      44.1       58.9
   4B w/ P&S       81.0      57.0       60.8
   Δ              (+8.2)    (+12.9)    (+1.9)
   27B             85.6      59.4       68.6
   27B w/ P&S      90.4      76.4       70.2
   Δ              (+4.8)    (+17.0)    (+1.6)

Table 8 | Impact of P&S. 4-shot evaluation results on the valid set, with and without P&S on a pre-trained checkpoint. Boosts are on tasks associated with images with varying aspect ratios, or involving reading text on images.

Pan & Scan. P&S enables capturing images at close to their native aspect ratio and image resolution. In Table 8, we compare our 27B IT model with and without P&S. As expected, the ability to treat images with close to native resolution greatly helps with tasks that require some form of reading text on images, which is particularly important for visual language models.

6. Memorization and Privacy

Large language models may produce near-copies of some text used in training (Biderman et al., 2023; Carlini et al., 2021, 2022; Ippolito et al., 2022; Nasr et al., 2023). Several prior reports have released audits that quantify this risk by measuring the memorization rate (Anil et al., 2023; Chowdhery et al., 2022; Gemini Team, 2023, 2024; Gemma Team, 2024a,b; LLaMa Team, 2024). This “memorization rate”1 is defined as the ratio of generations from the model that match its training data compared to all model generations using the following setup. We follow the methodology described in Gemma Team

   1 "We do not state or imply [here] that a model "contains" its training data in the sense that there is a copy of that data in the model. Rather, a model memorizes attributes of its training data such that in certain cases it is statistically able to generate such training data when following rules and using information about features of its training data that it does contain."





  [Figure 9: Total Memorization Rate. % memorized                              designed to have high recall and does not con-
  (log scale) per model, split by memorization                                 sider the context in which the information may
  type (exact, approximate), for Gemma 3, Gemma 2,                             appear, which leads to many false positives. Thus,
  Gemini 1.5 Flash, Gemma, and PaLM Small models.]                             we are likely overestimating the true amount of
                                                                               potentially personal information contained in the
  Figure 9 | Total memorization rates for both ex-                             outputs classified as memorized. SDP also pro-
  act and approximate memorization. Gemma 3                                    vides broad severity levels: low, medium, and
  models memorize significantly less than all prior                            high. We classify text as personal if SDP clas-
  models. *No results for approximate memoriza-                                sifies it as personal information at any severity
  tion on these models.                                                        level. We observed no personal information in
                                                                               the outputs characterized as memorization for all
                                                                               Gemma 3 models. This indicates a low rate of
                                                                               personal data, below our detection thresholds, in
                                                                               outputs classified as memorization.
                                                                               7. Responsibility, Safety, Security

 (2024b) to measure it. Specifically, we subsam-                               Responsibility, safety, and security are of utmost
 ple a large portion of training data distributed                              importance in the development of Gemma mod-
 uniformly across different corpora and test for                               els. To reduce risks to Gemma 3 users, we have
 discoverable extraction (Nasr et al., 2023) of this                           continued to integrate enhanced internal safety
 content using a prefix of length 50 and a suffix of                           processes that span the development workflow,
 length 50. We denote text as either “exactly mem-                             in line with recent Google AI models (Gemini
 orized” if all tokens in the continuation match                               Team, 2024). This focuses on safety mitigation at
 the source suffix or “approximately memorized”                                training time, and robust and transparent model
 if they match up to an edit distance of 10%.                                  evaluations for the new image-to-text capabilities
                                                                               we have introduced.
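The two labels defined above ("exactly memorized" and "approximately memorized") can be made concrete with a short sketch. It assumes token-level Levenshtein distance and reads "an edit distance of 10%" as at most 10% of the suffix length (5 edits for a 50-token suffix). The report does not spell out these details, so the function names and the normalization are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences."""
    previous = list(range(len(b) + 1))
    for i, token_a in enumerate(a, start=1):
        current = [i]
        for j, token_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                         # delete token_a
                current[j - 1] + 1,                      # insert token_b
                previous[j - 1] + (token_a != token_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def classify(continuation, source_suffix, max_edit_percent=10):
    """Label a generated continuation against the true training suffix."""
    if continuation == source_suffix:
        return "exact"
    distance = edit_distance(continuation, source_suffix)
    # Integer comparison avoids floating-point rounding at the threshold.
    if distance * 100 <= max_edit_percent * len(source_suffix):
        return "approximate"
    return "not memorized"


def memorization_rate(labels):
    """Fraction of samples labeled exactly or approximately memorized."""
    hits = sum(label in ("exact", "approximate") for label in labels)
    return hits / len(labels)


suffix = list(range(50))             # stand-in for 50 token ids
near_copy = [-1] * 5 + suffix[5:]    # 5 substituted tokens
far_copy = [-1] * 6 + suffix[6:]     # 6 substituted tokens
labels = [classify(c, suffix) for c in (suffix, near_copy, far_copy)]
print(labels, memorization_rate(labels))
```

In a real audit the continuation would be the model's 50-token generation from a 50-token training prefix, and the rate would be computed over the whole subsample.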
    Figure 9 compares the memorization rates
 across Gemma and Gemini models; these models
 are ordered in reverse chronological order, with                              7.1. Governance & Assessment
 the newest Gemma 3 models on the left. We find
 that Gemma 3 models memorize long-form text                                   Our approach to assessing the benefits and risks
 at a much lower rate than prior models (note the                              of Gemma is reflective of that outlined for Gemma
 log y-axis). We observe only a marginal differ-                               1 (Gemma Team, 2024a), taking into account the
 ence in the memorization rates between the 4B,                                changes in supported modalities. We continue to
 12B, and 27B models, with 1B memorizing less                                  believe that openness in AI can spread the bene-
 than these larger models. Further, we find that a                             fits of these technologies across society, but must
 larger proportion of text is characterized as ap-                             be evaluated against the risk of malicious uses
 proximately memorized, with a relative increase                               that can cause harm on both individual and in-
 in approximate memorization compared to exact                                 stitutional levels (Weidinger et al., 2021). Since
 memorization of roughly 24x on average.                                       the inaugural Gemma launch, we have seen these
                                                                               models drive a number of socially beneficial ap-
   We also study the rate at which the generations                             plications, such as our own ShieldGemma 2, a 4B
 may contain personal information. To identify po-                             image safety classifier built with Gemma 3, which
 tentially personal information, we use the Google                             provides a ready-made solution for image safety,
 Cloud Sensitive Data Protection (SDP) service.2                               outputting safety labels across dangerous content,
 SDP uses broad detection rules to identify text                               sexually explicit, and violence categories.
 that may contain personal information. SDP is
                                                                                 Releasing Gemma 3 models required specific
              2 https://cloud.google.com/sensitive-data-protection             attention to changes in model capabilities and





close monitoring of the evolving risks of existing multimodal LLMs (Lin et al., 2024), as well as an understanding of the ways in which models are being used in the wild. Although we are yet to receive any reports of malicious use for Gemma, we remain committed to investigating any such reporting, and work with the academic and developer communities, as well as conduct our own monitoring, to flag such cases.

   Despite advancements in capabilities, we believe that, given the number of larger powerful open models available, this release will have a negligible effect on the overall risk landscape.

7.2. Safety policies and train-time mitigations

A key pillar of Gemma’s approach to safety is to align fine-tuned models with Google’s safety policies, in line with Gemini models (Gemini Team, 2023). They are designed to help prevent our models from generating harmful content, i.e.,

  • Child sexual abuse and exploitation
  • Revealing personally identifiable information that can lead to harm (e.g., Social Security numbers)
  • Hate speech and harassment
  • Dangerous or malicious content (including promoting self-harm or instructing in harmful activities)
  • Sexually explicit content
  • Medical advice that runs contrary to scientific or medical consensus

We undertook considerable safety filtering of our pre-training data to reduce the likelihood of our pre-trained and fine-tuned checkpoints producing harmful content. For fine-tuned models, we also use both SFT and RLHF to steer the model away from undesirable behavior.

7.3. Assurance Evaluations

We also run our IT models through a set of baseline assurance evaluations to understand the potential harms that our models can cause. As we champion open models, we also recognize that the irreversible nature of weight releases requires rigorous risk assessment. Our internal safety processes are designed accordingly, and for previous Gemma models we have also undertaken evaluations of capabilities relevant to extreme risks (Phuong et al., 2024; Shevlane et al., 2023). As we continue to develop and share open models, we will follow the heuristic that thoroughly evaluating a more capable model often provides sufficient assurance for less capable ones. As such, we prioritised a streamlined set of evaluations for Gemma 3, reserving in-depth dangerous capability assessments for cases where a specific model may present a potentially heightened risk (as described below on CBRN evaluations). We balance development speed with targeted safety testing, ensuring our evaluations are well-focused and efficient, while upholding the commitments laid out in our Frontier Safety Framework.

Baseline Evaluations

Baseline assurance captures the model violation rate for safety policies, using a large number of synthetic adversarial user queries, and human raters to label the answers as policy violating or not. Overall, the Gemma 3 violation rate is significantly low on these safety policies.

Chemical, Biological, Radiological and Nuclear (CBRN) knowledge

Owing to enhanced performance on STEM-related tasks, we evaluated knowledge relevant to biological, radiological, and nuclear risks using an internal dataset of closed-ended, knowledge-based multiple choice questions. For evaluations of chemical knowledge, we employed a closed-ended knowledge-based approach on chemical hazards developed by Macknight et al. Our evaluation suggests that the knowledge of Gemma 3 models in these domains is low.

7.4. Our approach to responsible open models

Designing safe, secure, and responsible applications requires a system-level approach, working to mitigate risks associated with each specific use case and environment. We will continue to adopt assessments and safety mitigations proportionate to the potential risks from our models, and





will only share these with the community when we are confident that the benefits significantly outweigh the foreseeable risks.

8. Discussion and Conclusion

In this work, we have presented Gemma 3, the latest addition to the Gemma family of open language models for text, image, and code. In this version, we focus on adding image understanding and long context while improving multilinguality and STEM-related abilities. Our model sizes and architectures are designed to be compatible with standard hardware, and most of our architecture improvements are tailored to fit this hardware while maintaining performance.

References

Realworldqa. https://x.ai/news/grok-1.5v.

M. Acharya, K. Kafle, and C. Kanan. Tallyqa: Answering complex counting questions. In AAAI, 2018.

R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In ICLR, 2024.

J. Ainslie, J. Lee-Thorp, M. de Jong, Y. Zemlyanskiy, F. Lebrón, and S. Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023.

R. Anil, G. Pereyra, A. Passos, R. Ormandi, G. E. Dahl, and G. E. Hinton. Large scale distributed neural network training through online distillation. arXiv preprint arXiv:1804.03235, 2018.

R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.

M. Artetxe, S. Ruder, and D. Yogatama. On the cross-lingual transferability of monolingual representations. In ACL, 2020.

A. Asai, J. Kasai, J. H. Clark, K. Lee, E. Choi, and H. Hajishirzi. Xor qa: Cross-lingual open-retrieval question answering. arXiv preprint arXiv:2010.11856, 2020.

J. Austin, A. Odena, M. I. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. J. Cai, M. Terry, Q. V. Le, and C. Sutton. Program synthesis with large language models. CoRR, abs/2108.07732, 2021.

P. Barham, A. Chowdhery, J. Dean, S. Ghemawat, S. Hand, D. Hurt, M. Isard, H. Lim, R. Pang, S. Roy, B. Saeta, P. Schuh, R. Sepassi, L. E. Shafey, C. A. Thekkath, and Y. Wu. Pathways: Asynchronous distributed dataflow for ml, 2022.

I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.

S. Biderman, U. Prashanth, L. Sutawika, H. Schoelkopf, Q. Anthony, S. Purohit, and E. Raff. Emergent and predictable memorization in large language models. NeurIPS, 36:28072–28090, 2023.

Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. PIQA: reasoning about physical commonsense in natural language. CoRR, abs/1911.11641, 2019.

N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, et al. Extracting training data from large language models. In USENIX, 2021.

N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, and C. Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022.

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf,





G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba. Evaluating large language models trained on code. CoRR, abs/2107.03374, 2021.

S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.

X. Chen, H. Fang, T.-Y. Lin, R. Vedantam, S. Gupta, P. Dollár, and C. L. Zitnick. Microsoft coco captions: Data collection and evaluation server. ArXiv, abs/1504.00325, 2015.

W.-L. Chiang, L. Zheng, Y. Sheng, A. N. Angelopoulos, T. Li, D. Li, H. Zhang, B. Zhu, M. Jordan, J. E. Gonzalez, and I. Stoica. Chatbot arena: An open platform for evaluating llms by human preference, 2024.

F. Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.

A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, P. Schuh, K. Shi, S. Tsvyashchenko, J. Maynez, A. Rao, P. Barnes, Y. Tay, N. Shazeer, V. Prabhakaran, E. Reif, N. Du, B. Hutchinson, R. Pope, J. Bradbury, J. Austin, M. Isard, G. Gur-Ari, P. Yin, T. Duke, A. Levskaya, S. Ghemawat, S. Dev, H. Michalewski, X. Garcia, V. Misra, K. Robinson, L. Fedus, D. Zhou, D. Ippolito, D. Luan, H. Lim, B. Zoph, A. Spiridonov, R. Sepassi, D. Dohan, S. Agrawal, M. Omernick, A. M. Dai, T. S. Pillai, M. Pellat, A. Lewkowycz, E. Moreira, R. Child, O. Polozov, K. Lee, Z. Zhou, X. Wang, B. Saeta, M. Diaz, O. Firat, M. Catasta, J. Wei, K. Meier-Hellstern, D. Eck, J. Dean, S. Petrov, and N. Fiedel. Palm: Scaling language modeling with pathways, 2022.

H. W. Chung, N. Constant, X. Garcia, A. Roberts, Y. Tay, S. Narang, and O. Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining, 2023.

C. Clark, K. Lee, M. Chang, T. Kwiatkowski, M. Collins, and K. Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044, 2019.

K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.

M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In ICML, 2023.

D. Deutsch, E. Briakou, I. Caswell, M. Finkelstein, R. Galor, J. Juraska, G. Kovacs, A. Lui, R. Rei, J. Riesa, S. Rijhwani, P. Riley, E. Salesky, F. Trabelsi, S. Winkler, B. Zhang, and M. Freitag. Wmt24++: Expanding the language coverage of wmt24 to 55 languages & dialects, 2025.

A. Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In ACL, 2019.

B. Fatemi, M. Kazemi, A. Tsitsulin, K. Malkan, J. Yim, J. Palowitch, S. Seo, J. Halcrow, and B. Perozzi. Test of time: A benchmark for evaluating llms on temporal reasoning. arXiv preprint arXiv:2406.09170, 2024.

X. Fu, Y. Hu, B. Li, Y. Feng, H. Wang, X. Lin, D. Roth, N. A. Smith, W.-C. Ma, and R. Krishna. Blink: Multimodal large language models can see but not perceive. ArXiv, abs/2404.12390, 2024.





J. Gehring, K. Zheng, J. Copet, V. Mella, T. Cohen, and G. Synnaeve. Rlef: Grounding code llms in execution feedback with reinforcement learning. arXiv preprint arXiv:2410.02089, 2024.

Gemini Team. Gemini: A family of highly capable multimodal models, 2023.

Gemini Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.

Gemma Team. Gemma: Open models based on gemini research and technology, 2024a.

Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024b.

O. Goldman, U. Shaham, D. Malkin, S. Eiger, A. Hassidim, Y. Matias, J. Maynez, A. M. Gilady, J. Riesa, S. Rijhwani, L. Rimell, I. Szpektor, R. Tsarfaty, and M. Eyal. Eclektic: a novel challenge set for evaluation of cross-lingual knowledge transfer, 2025.

N. Goyal, C. Gao, V. Chaudhary, P.-J. Chen, G. Wenzek, D. Ju, S. Krishnan, M. Ranzato, F. Guzmán, and A. Fan. The flores-101 evaluation benchmark for low-resource and multilingual machine translation. ACL, 2022.

Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In CVPR, 2017.

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020.

D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021.

J. Hessel, A. Marasović, J. D. Hwang, L. Lee, J. Da, R. Zellers, R. Mankoff, and Y. Choi. Do androids laugh at electric sheep? humor "understanding" benchmarks from the new yorker caption contest. arXiv preprint arXiv:2209.06293, 2022.

G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

C.-P. Hsieh, S. Sun, S. Kriman, S. Acharya, D. Rekesh, F. Jia, Y. Zhang, and B. Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024.

D. Ippolito, F. Tramèr, M. Nasr, C. Zhang, M. Jagielski, K. Lee, C. A. Choquette-Choo, and N. Carlini. Preventing verbatim memorization in language models gives a false sense of privacy. arXiv preprint arXiv:2210.17546, 2022.

B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In CVPR, 2018.

M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. CoRR, abs/1705.03551, 2017.

M. Kazemi, H. Alvari, A. Anand, J. Wu, X. Chen, and R. Soricut. Geomverse: A systematic evaluation of large models for geometric reasoning. arXiv preprint arXiv:2312.12241, 2023.

M. Kazemi, N. Dikkala, A. Anand, P. Dević, I. Dasgupta, F. Liu, B. Fatemi, P. Awasthi, D. Guo, S. Gollapudi, and A. Qureshi. Remi: A dataset for reasoning with multiple images. ArXiv, abs/2406.09175, 2024a.

M. Kazemi, Q. Yuan, D. Bhatia, N. Kim, X. Xu, V. Imbrasaite, and D. Ramachandran. Boardgameqa: A dataset for natural language reasoning with contradictory information. NeurIPS, 36, 2024b.

M. Kazemi, B. Fatemi, H. Bansal, J. Palowitch, C. Anastasiou, S. V. Mehta, L. K. Jain, V. Aglietti, D. Jindal, P. Chen, et al. Big-bench extra hard. arXiv preprint arXiv:2502.19187, 2025.

A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. ArXiv, abs/1603.07396, 2016.





E. Kıcıman, R. Ness, A. Sharma, and C. Tan. Causal reasoning and large language models: Opening a new frontier for causality. arXiv preprint arXiv:2305.00050, 2023.

T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. 2018.

T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M.-W. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov. Natural questions: A benchmark for question answering research. ACL, 2019.

N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, et al. Tülu 3: Pushing frontiers in open language model post-training. arXiv preprint arXiv:2411.15124, 2024.

Z. Lin, J. Cui, X. Liao, and X. Wang. Malla: Demystifying real-world large language model integrated malicious services, 2024.

H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. NeurIPS, 36, 2024.

LLaMa Team. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

M. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. 2015.

Macknight, Aung, and Gomes. Personal Communication.

K. Marino, M. Rastegari, A. Farhadi, and R. Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.

A. Masry, X. L. Do, J. Q. Tan, S. Joty, and E. Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. ACL, 2022.

M. Mathew, D. Karatzas, R. Manmatha, and C. V. Jawahar. Docvqa: A dataset for vqa on document images. WACV, 2020.

M. Mathew, V. Bagal, R. Tito, D. Karatzas, E. Valveny, and C. Jawahar. Infographicvqa. In WACV, 2022.

I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models. arXiv preprint arXiv:2410.05229, 2024.

M. Nasr, N. Carlini, J. Hayase, M. Jagielski, A. F. Cooper, D. Ippolito, C. A. Choquette-Choo, E. Wallace, F. Tramèr, and K. Lee. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023.

A. Nie, Y. Zhang, A. S. Amdekar, C. Piech, T. B. Hashimoto, and T. Gerstenberg. Moca: Measuring human-language model alignment on causal and moral judgment tasks. NeurIPS, 36, 2024.

R. Paiss, A. Ephrat, O. Tov, S. Zada, I. Mosseri, M. Irani, and T. Dekel. Teaching clip to count to ten. ICCV, 2023.

M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabilities, 2024.

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.

A. Ramé, J. Ferret, N. Vieillard, R. Dadashi, L. Hussenot, P.-L. Cedoz, P. G. Sessa, S. Girgin, A. Douillard, and O. Bachem. WARP: On the benefits of weight averaged rewarded policies, 2024a.

A. Ramé, N. Vieillard, L. Hussenot, R. Dadashi, G. Cideron, O. Bachem, and J. Ferret. WARM: On the benefits of weight averaged reward models. In ICML, 2024b.





D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark. ArXiv, abs/2311.12022, 2023.

J. Ren, S. Rajbhandari, R. Y. Aminabadi, O. Ruwase, S. Yang, M. Zhang, D. Li, and Y. He. Zero-offload: Democratizing billion-scale model training. In USENIX, 2021.

A. Roberts, H. W. Chung, G. Mishra, A. Levskaya, J. Bradbury, D. Andor, S. Narang, B. Lester, C. Gaffney, A. Mohiuddin, et al. Scaling up models and data with t5x and seqio. JMLR, 2023.

N. Sachdeva, B. Coleman, W.-C. Kang, J. Ni, L. Hong, E. H. Chi, J. Caverlee, J. McAuley, and D. Z. Cheng. How to train data-efficient llms. arXiv preprint arXiv:2402.09668, 2024.

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WINOGRANDE: an adversarial winograd schema challenge at scale. CoRR, abs/1907.10641, 2019.

E. Sánchez, B. Alastruey, C. Ropers, P. Stenetorp, M. Artetxe, and M. R. Costa-jussà. Linguini: A benchmark for language-agnostic linguistic reasoning. arXiv preprint arXiv:2409.12126, 2024.

M. Sap, H. Rashkin, D. Chen, R. L. Bras, and Y. Choi. Socialiqa: Commonsense reasoning about social interactions. CoRR, abs/1904.09728, 2019.

P. G. Sessa, R. Dadashi, L. Hussenot, J. Ferret, N. Vieillard, A. Ramé, B. Shariari, S. Perrin, A. Friesen, G. Cideron, S. Girgin, P. Stanczyk, A. Michi, D. Sinopalnikov, S. Ramos, A. Héliou, A. Severyn, M. Hoffman, N. Momchev, and O. Bachem. Bond: Aligning llms with best-of-n distillation, 2024.

K. Shah, N. Dikkala, X. Wang, and R. Panigrahy. Causal language modeling can elicit search and reasoning capabilities on logic puzzles. arXiv preprint arXiv:2409.10502, 2024.

T. Shevlane, S. Farquhar, B. Garfinkel, M. Phuong, J. Whittlestone, J. Leung, D. Kokotajlo, N. Marchal, M. Anderljung, N. Kolt, L. Ho, D. Siddarth, S. Avin, W. Hawkins, B. Kim, I. Gabriel, V. Bolina, J. Clark, Y. Bengio, P. Christiano, and A. Dafoe. Model evaluation for extreme risks, 2023.

F. Shi, M. Suzgun, M. Freitag, X. Wang, S. Srivats, S. Vosoughi, H. W. Chung, Y. Tay, S. Ruder, D. Zhou, D. Das, and J. Wei. Language models are multilingual chain-of-thought reasoners. In ICLR, 2023.

A. Singh, V. Natarjan, M. Shah, Y. Jiang, X. Chen, D. Parikh, and M. Rohrbach. Towards vqa models that can read. In CVPR, 2019.

H. Singh, N. Gupta, S. Bharadwaj, D. Tewari, and P. Talukdar. Indicgenbench: a multilingual benchmark to evaluate generation capabilities of llms on indic languages. arXiv preprint arXiv:2404.16816, 2024a.

S. Singh, A. Romanou, C. Fourrier, D. I. Adelani, J. G. Ngui, D. Vila-Suero, P. Limkonchotiwat, K. Marchisio, W. Q. Leong, Y. Susanto, R. Ng, S. Longpre, W.-Y. Ko, M. Smith, A. Bosselut, A. Oh, A. F. T. Martins, L. Choshen, D. Ippolito, E. Ferrante, M. Fadaee, B. Ermis, and S. Hooker. Global mmlu: Understanding and addressing cultural and linguistic biases in multilingual evaluation, 2024b.

A. Steiner, A. S. Pinto, M. Tschannen, D. Keysers, X. Wang, Y. Bitton, A. Gritsenko, M. Minderer, A. Sherbondy, S. Long, S. Qin, R. Ingle, E. Bugliarello, S. Kazemzadeh, T. Mesnard, I. Alabdulmohsin, L. Beyer, and X. Zhai. PaliGemma 2: A Family of Versatile VLMs for Transfer. arXiv preprint arXiv:2412.03555, 2024.

M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, and J. Wei. Challenging big-bench tasks and whether chain-of-thought can solve them, 2022.

G. Tyen, H. Mansoor, P. Chen, T. Mak, and V. Cărbune. Llms cannot find reasoning errors, but can correct them! arXiv preprint arXiv:2311.08516, 2023.

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. 2017.





K. Vodrahalli, S. Ontanon, N. Tripuraneni, K. Xu, S. Jain, R. Shivanna, J. Hui, N. Dikkala, M. Kazemi, B. Fatemi, et al. Michelangelo: Long context evaluations beyond haystacks via latent structure queries. arXiv preprint arXiv:2409.12640, 2024.

Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, et al. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. In NeurIPS, 2024.

L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh, Z. Kenton, S. Brown, W. Hawkins, T. Stepleton, C. Biles, A. Birhane, J. Haas, L. Rimell, L. A. Hendricks, W. Isaac, S. Legassick, G. Irving, and I. Gabriel. Ethical and social risks of harm from language models, 2021.

C. White, S. Dooley, M. Roberts, A. Pal, B. Feuer, S. Jain, R. Shwartz-Ziv, N. Jain, K. Saifullah, S. Naidu, et al. Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314, 2024.

M. Wortsman, P. J. Liu, L. Xiao, K. Everett, A. Alemi, B. Adlam, J. D. Co-Reyes, I. Gur, A. Kumar, R. Novak, et al. Small-scale proxies for large-scale transformer training instabilities. arXiv preprint arXiv:2309.14322, 2023.

XLA. Xla: Optimizing compiler for tensorflow, 2019. URL https://www.tensorflow.org/xla.

Y. Xu, H. Lee, D. Chen, B. A. Hechtman, Y. Huang, R. Joshi, M. Krikun, D. Lepikhin, A. Ly, M. Maggioni, R. Pang, N. Shazeer, S. Wang, T. Wang, Y. Wu, and Z. Chen. GSPMD: general and scalable parallelization for ML computation graphs. 2021.

Y. Yamada, Y. Bao, A. K. Lampinen, J. Kasai, and I. Yildirim. Evaluating spatial understanding of large language models. arXiv preprint arXiv:2310.14540, 2023.

K. Yang, O. Russakovsky, and J. Deng. Spatialsense: An adversarially crowdsourced benchmark for spatial relation recognition. ICCV, 2019.

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun, C. Wei, B. Yu, R. Yuan, R. Sun, M. Yin, B. Zheng, Z. Yang, Y. Liu, W. Huang, H. Sun, Y. Su, and W. Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. CVPR, 2023.

R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In ACL, 2019.

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In CVPR, 2023.

B. Zhang and R. Sennrich. Root mean square layer normalization. 2019.

J. Zhang, L. Jain, Y. Guo, J. Chen, K. L. Zhou, S. Suresh, A. Wagenmaker, S. Sievert, T. Rogers, K. Jamieson, et al. Humor in ai: Massive scale crowd-sourced preferences and benchmarks for cartoon captioning. arXiv preprint arXiv:2406.10522, 2024.

W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023.





Core contributors
Aishwarya Kamath∗
Johan Ferret∗
Shreya Pathak∗
Nino Vieillard∗
Ramona Merhej∗
Sarah Perrin∗
Tatiana Matejovicova∗
Alexandre Ramé∗
Morgane Rivière∗
Louis Rouillard∗
Thomas Mesnard∗
Geoffrey Cideron∗
Jean-bastien Grill∗
Sabela Ramos∗
Edouard Yvinec∗
Michelle Casbon∗
Etienne Pot
Ivo Penchev
Gaël Liu
Francesco Visin
Kathleen Kenealy
Lucas Beyer
Xiaohai Zhai
Anton Tsitsulin
Robert Busa-Fekete
Alex Feng
Noveen Sachdeva
Benjamin Coleman
Yi Gao
Basil Mustafa
Iain Barr
Emilio Parisotto
David Tian
Matan Eyal
Colin Cherry
Jan-Thorsten Peter
Danila Sinopalnikov
Surya Bhupatiraju
Rishabh Agarwal
Mehran Kazemi
Dan Malkin
Ravin Kumar
David Vilar
Idan Brusilovsky
Jiaming Luo
Andreas Steiner

∗ co-first authors.

Contributors (alphabetical order)
Abe Friesen
Abhanshu Sharma
Abheesht Sharma
Adi Mayrav Gilady
Adrian Goedeckemeyer
Alaa Saade
Alex Feng
Alexander Kolesnikov
Alexei Bendebury
Alvin Abdagic
Amit Vadi
András György
André Susano Pinto
Anil Das
Ankur Bapna
Antoine Miech
Antoine Yang
Antonia Paterson
Ashish Shenoy
Ayan Chakrabarti
Bilal Piot
Bo Wu
Bobak Shahriari
Bryce Petrini
Charlie Chen
Charline Le Lan
Christopher A. Choquette-Choo
CJ Carey
Cormac Brick
Daniel Deutsch
Danielle Eisenbud
Dee Cattle
Derek Cheng
Dimitris Paparas
Divyashree Shivakumar Sreepathihalli
Doug Reid
Dustin Tran
Dustin Zelle
Eric Noland
Erwin Huizenga
Eugene Kharitonov
Frederick Liu
Gagik Amirkhanyan
Glenn Cameron
Hadi Hashemi
Hanna Klimczak-Plucińska
Harman Singh
Harsh Mehta







Harshal Tushar Lehri
Hussein Hazimeh
Ian Ballantyne
Idan Szpektor
Ivan Nardini
Jean Pouget-Abadie
Jetha Chan
Joe Stanton
John Wieting
Jonathan Lai
Jordi Orbay
Joseph Fernandez
Josh Newlan
Ju-yeong Ji
Jyotinder Singh
Kat Black
Kathy Yu
Kevin Hui
Kiran Vodrahalli
Klaus Greff
Linhai Qiu
Marcella Valentine
Marina Coelho
Marvin Ritter
Matt Hoffman
Matthew Watson
Mayank Chaturvedi
Michael Moynihan
Min Ma
Nabila Babar
Natasha Noy
Nathan Byrd
Nick Roy
Nikola Momchev
Nilay Chauhan
Noveen Sachdeva
Oskar Bunyan
Pankil Botarda
Paul Caron
Paul Kishan Rubenstein
Phil Culliton
Philipp Schmid
Pier Giuseppe Sessa
Pingmei Xu
Piotr Stanczyk
Pouya Tafti
Rakesh Shivanna
Renjie Wu
Renke Pan
Reza Rokni
Rob Willoughby
Rohith Vallu
Ryan Mullins
Sammy Jerome
Sara Smoot
Sertan Girgin
Shariq Iqbal
Shashir Reddy
Shruti Sheth
Siim Põder
Sijal Bhatnagar
Sindhu Raghuram Panyam
Sivan Eiger
Susan Zhang
Tianqi Liu
Trevor Yacovone
Tyler Liechty
Uday Kalra
Utku Evci
Vedant Misra
Vincent Roseberry
Vlad Feinberg
Vlad Kolesnikov
Woohyun Han
Woosuk Kwon
Xi Chen
Yinlam Chow
Yuvein Zhu
Zichuan Wei
Zoltan Egyed

Support
Victor Cotruta
Minh Giang
Phoebe Kirk
Anand Rao
Kat Black
Nabila Babar
Jessica Lo
Erica Moreira
Luiz Gustavo Martins
Omar Sanseviero
Lucas Gonzalez
Zach Gleicher
Tris Warkentin

Sponsors






Vahab Mirrokni
Evan Senter
Eli Collins
Joelle Barral
Zoubin Ghahramani
Raia Hadsell
Yossi Matias
D. Sculley
Slav Petrov
Noah Fiedel
Noam Shazeer
Oriol Vinyals
Jeff Dean
Demis Hassabis
Koray Kavukcuoglu
Clement Farabet


Technical advisors
Elena Buchatskaya
Jean-Baptiste Alayrac
Rohan Anil
Dmitry (Dima) Lepikhin
Sebastian Borgeaud
Olivier Bachem


Lead
Armand Joulin


Technical leads
Alek Andreev
Cassidy Hardin
Robert Dadashi
Léonard Hussenot







Appendix

Details of pre-trained performances.

          Gemma 2             Gemma 3
          2B    9B    27B     1B    4B    12B   27B
HellaS    72.9  81.9  86.4    62.3  77.2  84.2  85.6
BoolQ     75.6  77.5  76.2    63.2  72.3  78.8  82.4
PIQA      78.1  81.9  83.5    73.8  79.6  81.8  83.3
SIQA      51.8  53.3  53.8    48.9  51.9  53.4  54.9
TQA       60.2  76.5  83.8    39.8  65.8  78.2  85.5
NQ        17.2  29.2  34.7    9.48  20.0  31.4  36.1
ARC-C     55.8  69.1  71.4    38.4  56.2  68.9  70.6
ARC-E     80.6  88.3  88.6    73.0  82.4  88.3  89.0
WinoG     65.4  73.9  79.4    58.2  64.7  74.3  78.8
BBH       42.4  69.4  74.8    28.4  50.9  72.6  77.7
Drop      53.2  71.5  75.2    42.4  60.1  72.2  77.2

Table 9 | Factuality, common-sense performance and reasoning after pre-training phase.

Factuality and common-sense. In Table 9, we report the performance of our new pre-trained models compared to previous versions. We consider several standard benchmarks, namely HellaSwag (Zellers et al., 2019), BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2019), SIQA (Sap et al., 2019), TriviaQA (Joshi et al., 2017), Natural Questions (Kwiatkowski et al., 2019), ARC-C and ARC-E (Chollet, 2019), WinoGrande (Sakaguchi et al., 2019), BBH (Suzgun et al., 2022), DROP (Dua et al., 2019). Evaluation details are described in Table 19. Overall, our models are in the same ballpark as Gemma 2, which is encouraging since these abilities are not the focus of the improvements brought in this version.

STEM and code. The details of our performance on STEM and Code are in Table 10. We consider several standard benchmarks, namely MMLU (Hendrycks et al., 2020), MMLU-Pro (Wang et al., 2024), AGIEval (Zhong et al., 2023), MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), GPQA (Rein et al., 2023), MBPP (Austin et al., 2021), HumanEval (Chen et al., 2021). Evaluation details are described in Table 19. Overall we see a consistent improvement over STEM abilities across our pre-trained models. On code, we see a similar improvement for the 4B and 12B models but not on the 27B.

              Gemma 2             Gemma 3
              2B    9B    27B     4B    12B   27B
MMLU          52.2  71.2  75.2    59.6  74.5  78.6
MMLUpro       22.2  43.7  49.4    29.2  45.3  52.2
AGIE          31.6  53.1  55.1    42.1  57.4  66.2
MATH          16.4  36.4  42.1    24.2  43.3  50.0
GSM8K         25.0  70.2  74.6    38.4  71.0  82.6
GPQA Diamond  12.5  24.8  26.3    15.0  25.4  24.3
MBPP          31.0  51.2  60.8    46.0  60.4  65.6
HumanE        19.5  40.2  51.2    36.0  45.7  48.8

Table 10 | STEM and code performance after pre-training phase.

                  4B      12B     27B
COCO caption      102     111     116
DocVQA            72.8    82.3    85.6
InfoVQA           44.1    54.8    59.4
MMMU              39.2    50.3    56.1
TextVQA           58.9    66.5    68.6
RealWorldQA       45.5    52.2    53.9
ReMI              27.3    38.5    44.8
AI2D              63.2    75.2    79.0
ChartQA           63.6    74.7    76.3
VQAv2             63.9    71.2    72.9
BLINK             38.0    35.9    39.6
OK-VQA            51.0    58.7    60.2
TallyQA           42.5    51.8    54.3
SpatialSense VQA  50.9    60.0    59.4
CountBench VQA    26.1    17.8    68.0

Table 11 | Multimodal performance after pre-training phase. The scores are on the val split of each dataset without P&S.

Image understanding. In Table 11, we report performance across a variety of visual question answer benchmarks for the different models that were trained with a vision encoder, namely COCO Caption (Chen et al., 2015), DocVQA (Mathew et al., 2020), InfographicVQA (Mathew et al., 2022), MMMU (Yue et al., 2023), TextVQA (Singh et al., 2019), RealWorldQA (Rea), ReMI (Kazemi et al., 2024a),





AI2D (Kembhavi et al., 2016), ChartQA (Masry et al., 2022), VQA v2 (Goyal et al., 2017), BLINK
(Fu et al., 2024), OK-VQA (Marino et al., 2019), TallyQA (Acharya et al., 2018), SpatialSense
VQA (Yang et al., 2019), CountBench VQA (Paiss et al., 2023). Evaluation details are described
in Table 20.

                   PaliGemma 2             Gemma 3
                 2B     9B     27B      4B     12B    27B
 DocVQA         81.6   86.3   85.1     86.1   89.0   89.5
 InfoVQA        41.4   53.1   50.2     55.6   61.6   64.6
 TextVQA        76.3   76.3   75.1     79.1   81.6   83.2
 ChartQA        70.7   79.1   71.3     79.8   83.5   83.4
 AI2D           76.0   84.4   84.6     80.9   85.6   86.5
 OKVQA          64.1   68.6   70.6     65.2   69.3   71.1
 CountBenchQA   82.0   85.3   87.4     79.4   83.5   87.8
 COCO caption   143.   145.   145.     143.   143.   144.
 VQAv2          84.8   85.8   85.8     84.1   84.9   85.1
 Tally QA       80.6   82.4   82.1     79.0   81.3   81.7

Table 12 | Performance of pre-trained checkpoints after fine-tuning on multi-modal benchmarks
(without P&S). PaliGemma 2 was transferred at 896x896 resolution for the first four benchmarks,
and at 448x448 resolution for the others.

Comparison to PaliGemma 2. We fine-tune multimodal Gemma 3 pre-trained checkpoints following
the protocol from Steiner et al. (2024) – only learning rate is swept, otherwise the same
transfer settings are used. The results in Table 12 show that Gemma 3 excels at benchmarks
involving document understanding, even outperforming the larger PaliGemma 2 variant. Note that
due to average pooling in the vision encoder the Gemma 3 4B and 12B models are about 10x cheaper
to transfer compared with the PaliGemma 2 9B and 27B models at the same 896 x 896 resolution.
Gemma 3 also performs better on AI2D and OKVQA, but PaliGemma 2 performs slightly better on
VQAv2 and COCO caption.

Multilinguality. In Table 13 we report the performance of the pre-trained models on multilingual
tasks. We apply in-context learning with multi-shot prompting and present results on the
following benchmarks: MGSM (Shi et al., 2023), Global-MMLU-Lite (Singh et al., 2024b), WMT24++
(Deutsch et al., 2025), FLoRes (Goyal et al., 2022), XQuAD (Artetxe et al., 2020), ECLeKTic
(Goldman et al., 2025), IndicGenBench (Singh et al., 2024a), XOR QA (Asai et al., 2020).
Evaluation details are described in Table 19.

                   Gemma 2                   Gemma 3
             2B     9B     27B      1B     4B     12B    27B
 MGSM       18.7   57.3   68.0     2.04   34.7   64.3   74.3
 GMMLU      43.3   64.0   69.4     24.9   57.0   69.4   75.7
 WMT24++    38.8   50.3   53.0     36.7   48.4   53.9   55.7
 Flores     30.2   41.3   44.3     29.5   39.2   46.0   48.8
 XQuAD      53.7   72.2   73.9     43.9   68.0   74.5   76.8
 ECLeKTic   8.29   14.0   17.1     4.69   11.0   17.2   24.4
 IndicGB    47.4   59.3   62.1     41.4   57.2   61.7   63.4

Table 13 | Multilingual performance after the pre-training phase. IndicGenBench is an average
over benchmarks reported in Table 14.

                       Gemma 2                   Gemma 3
                 2B     9B     27B      1B     4B     12B    27B
 XQuAD Indic    54.3   73.1   74.9     43.1   68.3   75.2   77.8
 XORQA in-en    66.2   69.3   72.5     56.3   68.3   69.8   70.4
 XORQA in-xx    31.2   40.8   44.3     27.1   39.8   43.8   46.0
 Flores Indic   38.1   54.0   56.9     39.0   52.3   58.0   59.5

Table 14 | Detailed IndicGenBench performance after the pre-training phase.

Long context. In Table 15 we report the performance of pre-trained and fine-tuned models on long
context benchmarks. We include RULER (Hsieh et al., 2024) and MRCR (Vodrahalli et al., 2024)
benchmarks evaluating at 32K and 128K sequence lengths.

8.1. Performance of IT models

We report in Table 18, additional benchmarks on our IT models. Note that N2C refers to
Natural2Code, the Gemini 1.0 internal held-out dataset, which uses author-generated sources
instead of web-based information. BBEH refers to BIG-Bench Extra Hard (Kazemi et al., 2025), a
challenging LLM reasoning benchmark that aggregates several reasoning tasks (Fatemi et al., 2024;
Hessel et al., 2022; Kazemi et al., 2023, 2024b; Kıcıman et al., 2023; Nie et al., 2024; Sánchez
et al., 2024; Shah et al., 2024; Tyen et al., 2023; White et al., 2024; Yamada et al., 2023;
Zhang et al., 2024). ECLeKTic refers to Goldman et al. (2025). We report the micro average
score. More evaluation details are described in Table 21.

                        Gemma 3 PT               Gemma 3 IT
         Context     4B     12B    27B       4B     12B    27B
 RULER     32K      67.1   90.6   85.9      61.4   80.3   91.1
 RULER    128K      51.7   80.7   72.9      46.8   57.1   66.0
 MRCR      32K      44.7   59.8   63.2      49.8   53.7   63.2
 MRCR     128K      40.6   56.9   60.0      44.6   49.8   59.3

Table 15 | Performance of pre-trained (PT) and instruction fine-tuned (IT) models on long context
benchmarks at different context lengths.

                            4B     12B    27B
 MMMU (val)                48.8   59.6   64.9
 DocVQA                    75.8   87.1   86.6
 InfoVQA                   50.0   64.9   70.6
 TextVQA                   57.8   67.7   65.1
 AI2D                      74.8   84.2   84.5
 ChartQA                   68.8   75.7   78.0
 VQAv2 (val)               62.4   71.6   71.0
 MathVista (testmini)      50.0   62.9   67.6

Table 16 | Performance of instruction fine-tuned (IT) models on multimodal benchmarks. If not
mentioned, these results are on the final test set of each dataset with P&S applied.

                            4B     12B    27B
 Perception Test MCVQA     50.6   54.9   58.1
 ActivityNet-QA            46.3   50.4   52.8

Table 17 | Performance of instruction fine-tuned (IT) models on vision understanding benchmarks
using 0 shot with 16 frames linspace. Perception Test consists of real-world videos designed to
show perceptually interesting situations and we report results on the multiple choice video QA
benchmark in terms of top-1 accuracy. ActivityNet-QA reports standard gpt-evaluation.


8.2. Performance of IT models on video understanding

Additional multimodal evaluations. Gemma
3 IT models were evaluated on common vision
benchmarks following the evaluation protocol of
Gemini 1.5 (Gemini Team, 2024). Table 16 gives the
results obtained with P&S activated.




                                  Gemma 2                                 Gemma 3
                            2B       9B        27B                 1B     4B     12B    27B
       MMLU                56.1     71.3       76.2                38.8   58.1   71.9   76.9
       MBPP                36.6     59.2       67.4                35.2   63.2   73.0   74.4
       HumanEval           20.1     40.2       51.8                41.5   71.3   85.4   87.8
       N2C                 46.8     68.3       77.3                56.0   70.3   80.7   84.5
       LiveCodeBench       7.0      20.0       29.0                 5.0   23.0   32.0   39.0
       GSM8K               62.6     88.1       91.1                62.8   89.2   94.4   95.9
       MATH                27.2     49.4       55.6                48.0   75.6   83.8   89.0
       HiddenMath           2.0      8.0       12.0                15.0   42.0   51.0   56.0
       BBH                 41.4     69.0       74.9                39.1   72.2   85.7   87.6
       BBEH                 5.9      9.8       14.8                 7.2   11.0   16.3   19.3
       IFEval              80.4     88.4       91.1                80.2   90.2   88.9   90.4
       GMMLU-Lite          41.9     64.8       68.6                34.2   54.5   69.5   75.1
       ECLeKTic             5.3     11.8       17.6                1.4    4.6    10.3   16.7
       WMT24++             37.4     48.7       51.7                35.9   46.8   51.6   53.4

Table 18 | Performance of instruction fine-tuned (IT) models of different sizes on more internal and
external benchmarks.
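
The text accompanying Table 18 says that a micro average score is reported for an aggregate benchmark. As a minimal sketch of what that choice means, the snippet below contrasts a micro average (pooling all examples across sub-tasks) with a macro average (averaging per-task accuracies). The task names and counts are hypothetical and are not taken from the report.

```python
from typing import Dict, Tuple

# Each entry maps a (hypothetical) sub-task name to (num_correct, num_examples).
TaskResults = Dict[str, Tuple[int, int]]


def micro_average(results: TaskResults) -> float:
    """Pool every example across tasks, then compute a single accuracy."""
    correct = sum(c for c, _ in results.values())
    total = sum(n for _, n in results.values())
    if total == 0:
        raise ValueError("no examples to average over")
    return correct / total


def macro_average(results: TaskResults) -> float:
    """Compute accuracy per task, then take the unweighted mean over tasks."""
    if not results:
        raise ValueError("no tasks to average over")
    return sum(c / n for c, n in results.values()) / len(results)


# Illustrative numbers only: one large easy task and one small hard task.
example = {"task_a": (90, 100), "task_b": (1, 10)}
micro = micro_average(example)  # 91 correct out of 110 examples
macro = macro_average(example)  # mean of 0.9 and 0.1
```

With a micro average, sub-tasks with more examples weigh more heavily, so the two averages can differ noticeably when task sizes are unbalanced.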




   Evaluation                   Metric                Type       n-shot     COT       Norm
   MBPP                       pass@1                sampling      3-shot
   HumanEval                  pass@1                sampling      0-shot
   HellaSwag                 Accuracy                scoring      10-shot           Char-Len
   BoolQ                     Accuracy                scoring       0-shot           Char-Len
   PIQA                      Accuracy                scoring       0-shot           Char-Len
   SIQA                      Accuracy                scoring       0-shot           Char-Len
   TriviaQA                  Accuracy               sampling      5-shot
   Natural Questions         Accuracy               sampling      5-shot
   ARC-C                     Accuracy                scoring      25-shot           Char-Len
   ARC-E                     Accuracy                scoring       0-shot           Char-Len
   WinoGrande                Accuracy                scoring       5-shot           Char-Len
   BBH                       Accuracy               sampling     few-shot   Yes
   DROP                    Token F1 score           sampling      1-shot
   AGIEval                   Accuracy               sampling     3-5-shot
   MMLU                      Accuracy                scoring       5-shot           Char-Len
   MATH                      Accuracy               sampling      4-shot    Yes
   GSM8K                     Accuracy               sampling      8-shot    Yes
   GPQA Diamond              Accuracy               sampling      5-shot    Yes
   MMLU-Pro                  Accuracy               sampling      5-shot    Yes
   MGSM                      Accuracy               sampling      8-shot
   FLoRes               CHaRacter-level F-score     sampling       1-shot
   Global-MMLU-Lite          Accuracy                scoring       5-shot           Char-Len
   XQuAD                CHaRacter-level F-score     sampling       5-shot
   WMT24++              CHaRacter-level F-score     sampling       5-shot
   ECLeKTic                ECLeKTic score           sampling      2-shot          First-line/strip
   XQuAD Indic          CHaRacter-level F-score     sampling       5-shot
   XOR QA IN-EN         CHaRacter-level F-score     sampling       5-shot
   XOR QA IN-XX         CHaRacter-level F-score     sampling       5-shot
   FLoRes Indic         CHaRacter-level F-score     sampling       5-shot
   RULER                     Accuracy               sampling      0-shot
   MRCR                     MRCR score              sampling     few-shot

Table 19 | Details on text benchmarks. Char-Len stands for Character Length Normalization and COT
stands for Chain-Of-Thought prompting.
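
To illustrate the "scoring" evaluations with Char-Len normalization listed in Table 19, the sketch below picks the answer option whose summed token log-probability, divided by the option's character length, is highest. The report does not give the exact formula, so dividing by `len(option)` is an assumption here, and the function name and log-probability values are hypothetical.

```python
from typing import Sequence


def char_len_normalized_choice(
    options: Sequence[str],
    token_logprobs: Sequence[Sequence[float]],
) -> int:
    """Return the index of the option with the highest log-likelihood per character.

    token_logprobs[i] holds the per-token log-probabilities a model assigned
    to options[i]. Normalizing by character count is an assumed reading of
    "Character Length Normalization".
    """
    if len(options) != len(token_logprobs):
        raise ValueError("one log-probability list is required per option")
    best_index, best_score = -1, float("-inf")
    for index, (text, logprobs) in enumerate(zip(options, token_logprobs)):
        score = sum(logprobs) / max(len(text), 1)  # guard against empty options
        if score > best_score:
            best_index, best_score = index, score
    return best_index


# Made-up values: the short option has the higher raw log-likelihood (-3.0 vs -4.0),
# but the longer option wins once scores are divided by character length.
options = ["Paris", "The city of Lyon"]
logprobs = [[-2.0, -1.0], [-1.0, -1.0, -1.0, -1.0]]
chosen = char_len_normalized_choice(options, logprobs)
```

The example is chosen so that normalization changes the outcome: without it, longer options are penalized simply for containing more tokens.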




                  Evaluation             Metric           Type            n-shot
                  COCO Caption         Cider score     sampling          4-shot
                  DocVQA               ANLS score      sampling          4-shot
                  InfographicVQA       ANLS score      sampling          4-shot
                  MMMU                  Accuracy       sampling      3-shot text only
                  TextVQA               Accuracy       sampling          4-shot
                  RealWorldQA           Accuracy       sampling      4-shot text only
                  ReMI                  Accuracy       sampling          4-shot
                  AI2D                  Accuracy       sampling          4-shot
                  ChartQA               Accuracy       sampling          4-shot
                  VQA v2                Accuracy       sampling          4-shot
                  BLINK                 Accuracy       sampling          0-shot
                  OK-VQA                Accuracy       sampling          4-shot
                  TallyQA               Accuracy       sampling          4-shot
                  SpatialSense VQA      Accuracy       sampling          4-shot
                  CountBench VQA        Accuracy       sampling          0-shot

Table 20 | Details on vision benchmarks. No Chain-Of-Thought prompting nor normalization.
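
Table 20 lists ANLS (Average Normalized Levenshtein Similarity) as the metric for DocVQA and InfographicVQA. The following is a minimal sketch of the commonly used definition: for each question, take the best similarity `1 - NL` over the reference answers, where `NL` is the edit distance divided by the longer string's length, and count it as 0 when `NL` reaches the threshold of 0.5. The lowercasing and whitespace stripping are assumptions about preprocessing, and the example strings are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def anls(
    predictions: Sequence[str],
    references: Sequence[Sequence[str]],
    threshold: float = 0.5,
) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need one non-empty reference list per prediction")
    total = 0.0
    for prediction, answers in zip(predictions, references):
        pred = prediction.strip().lower()
        best = 0.0
        for answer in answers:
            ref = answer.strip().lower()
            longest = max(len(pred), len(ref))
            distance = levenshtein(pred, ref) / longest if longest else 0.0
            similarity = 1.0 - distance if distance < threshold else 0.0
            best = max(best, similarity)
        total += best
    return total / len(predictions)


# Hypothetical answers: an exact match, a one-character slip, and a miss.
score = anls(["Gemma 3", "2019", "cat"], [["gemma 3"], ["2018"], ["dog"]])
```

The threshold gives partial credit for small recognition errors while scoring clearly wrong answers as zero.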




             Evaluation                   Metric                   Type      n-shot     COT
             MMLU                       Accuracy                  sampling   0-shot
             MBPP                        pass@1                   sampling   3-shot
             HumanEval                   pass@1                   sampling   0-shot
             N2C                         pass@1                   sampling   0-shot
             LiveCodeBench        Average over 8 samples          sampling   0-shot     Yes
             GSM8K                      Accuracy                  sampling   0-shot     Yes
             GPQA Diamond               Accuracy                  sampling   0-shot     Yes
             MATH                       Accuracy                  sampling   0-shot
             HiddenMath                 Accuracy                  sampling   0-shot
             BBH                        Accuracy                  sampling   0-shot
             BBEH                       Accuracy                  sampling   0-shot
             IFEval                     Accuracy                  sampling   0-shot
             Global-MMLU-lite           Accuracy                  sampling   0-shot     Yes
             ECLeKTic                ECLeKTic score               sampling   0-shot
             WMT24++              CHaRacter-level F-score         sampling   0-shot

Table 21 | Details on instruction fine-tuned (IT) benchmarks. No normalization.
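
Tables 19 and 21 report pass@1 for the code benchmarks. One common way to estimate pass@k from n samples per problem, c of which pass the tests, is the unbiased estimator introduced with HumanEval (Chen et al., 2021): 1 - C(n-c, k) / C(n, k). Whether the Gemma 3 evaluations use this estimator or a single sample per problem is not stated here, so the sketch below is an assumption for illustration. For k = 1 it reduces to c / n, the mean pass rate over samples.

```python
from math import comb
from typing import Sequence, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem with n samples, c of them correct."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(per_problem: Sequence[Tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, each given as (n_samples, n_correct)."""
    if not per_problem:
        raise ValueError("no problems given")
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)


# Hypothetical run: 8 samples per problem, with 2, 0 and 8 passing samples.
results = [(8, 2), (8, 0), (8, 8)]
pass_at_1 = mean_pass_at_k(results, k=1)
```

Drawing several samples and averaging reduces the variance of the reported pass@1 compared with scoring a single sample per problem.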



