Source notes and reports

Sources

← Home

                                        Gemini: A Family of Highly Capable
                                        Multimodal Models
                                        Gemini Team, Google1



                                        This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities
                                        across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano
                                        sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained
                                        use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model
                                        advances the state of the art in 30 of 32 of these benchmarks — notably being the first model to achieve
                                        human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the




arXiv:2312.11805v5 [cs.CL] 9 May 2025
                                        art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of
                                        the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of
                                        use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to
                                        users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.


                                        1. Introduction
                                        We present Gemini, a family of highly capable multimodal models developed at Google. We trained
                                        Gemini models jointly across image, audio, video, and text data for the purpose of building a model
                                        with both strong generalist capabilities across modalities alongside cutting-edge understanding and
                                        reasoning performance in each respective domain.
                                            Gemini 1.0, our first version, comes in three sizes: Ultra for highly-complex tasks, Pro for enhanced
                                        performance and deployability at scale, and Nano for on-device applications. Each size is specifically
                                        tailored to address different computational limitations and application requirements.
                                            After large-scale pre-training, we post-train our models to improve overall quality, enhance target
                                        capabilities, and ensure alignment and safety criteria are met. Due to the varied requirements of
                                        our downstream applications, we have produced two post-trained Gemini model family variants.
                                        Chat-focused variants, referred to as Gemini Apps models, are optimized for Gemini and Gemini
                                        Advanced, our conversational AI service formerly known as Bard. Developer-focused variants, referred
                                        to as Gemini API models, are optimized for a range of products and are accessible through Google AI
                                        Studio and Cloud Vertex AI.
                                            We evaluate the performance of pre- and post-trained Gemini models on a comprehensive suite
                                        of internal and external benchmarks covering a wide range of language, coding, reasoning, and
                                        multimodal tasks.
                                            The Gemini family advances state-of-the-art in large-scale language modeling (Anil et al., 2023;
                                        Brown et al., 2020; Chowdhery et al., 2023; Hoffmann et al., 2022; OpenAI, 2023a; Radford et al.,
                                        2019; Rae et al., 2021), image understanding (Alayrac et al., 2022; Chen et al., 2022; Dosovitskiy
                                        et al., 2020; OpenAI, 2023b; Reed et al., 2022; Yu et al., 2022a), audio processing (Radford et al.,
                                        2023; Zhang et al., 2023), and video understanding (Alayrac et al., 2022; Chen et al., 2023). It
                                        also builds on the work on sequence models (Sutskever et al., 2014), a long history of work in deep
                                        learning based on neural networks (LeCun et al., 2015), and machine learning distributed systems
                                           1 See Contributions and Acknowledgments section for full author list. Please send correspondence to gemini-1-
                                        report@google.com



                                        © 2025 Google. All rights reserved
                              Gemini: A Family of Highly Capable Multimodal Models



(Barham et al., 2022; Bradbury et al., 2018; Dean et al., 2012) that enable large-scale training.
    Our most capable model, Gemini Ultra, achieves new state-of-the-art results in 30 of 32 benchmarks
we report on, including 10 of 12 popular text and reasoning benchmarks, 9 of 9 image understanding
benchmarks, 6 of 6 video understanding benchmarks, and 5 of 5 speech recognition and speech
translation benchmarks. Gemini Ultra is the first model to achieve human-expert performance on
MMLU (Hendrycks et al., 2021a) — a prominent benchmark testing knowledge and reasoning via a
suite of exams — with a score above 90%. Beyond text, Gemini Ultra makes notable advances on
challenging multimodal reasoning tasks. For example, on the recent MMMU benchmark (Yue et al.,
2023), that comprises questions about images on multi-discipline tasks requiring college-level subject
knowledge and deliberate reasoning, Gemini Ultra achieves a new state-of-the-art score of 62.4%,
outperforming the previous best model by more than 5 percentage points. It provides a uniform
performance lift for video question answering and audio understanding benchmarks.
     Qualitative evaluation showcases impressive crossmodal reasoning capabilities, enabling the model
to understand and reason across an input sequence of audio, images, and text natively (see Figure 5
and Table 13). Consider the educational setting depicted in Figure 1 as an example. A teacher has
drawn a physics problem of a skier going down a slope, and a student has worked through a solution to
it. Using Gemini models’ multimodal reasoning capabilities, the model is able to understand the messy
handwriting, correctly understand the problem formulation, convert both the problem and solution
to mathematical typesetting, identify the specific step of reasoning where the student went wrong in
solving the problem, and then give a worked through correct solution to the problem. This opens up
exciting educational possibilities, and we believe the new multimodal and reasoning capabilities of
Gemini models have dramatic applications across many fields.
    The reasoning capabilities of large language models show promise toward building generalist
agents that can tackle more complex multi-step problems. The AlphaCode team built AlphaCode
2 (Leblond et al, 2023), a new Gemini-model-powered agent, that combines Gemini models’ rea-
soning capabilities with search and tool-use to excel at solving competitive programming problems.
AlphaCode 2 ranks within the top 15% of entrants on the Codeforces competitive programming
platform, a large improvement over its state-of-the-art predecessor in the top 50% (Li et al., 2022).
    In tandem, we advance the frontier of efficiency with Gemini Nano, a series of small models
targeting on-device deployment. These models excel in on-device tasks, such as summarization,
reading comprehension, text completion tasks, and exhibit impressive capabilities in reasoning, STEM,
coding, multimodal, and multilingual tasks relative to their sizes.
    In the following sections, we first provide an overview of the model architecture, training infras-
tructure, and pre-training dataset. We then present detailed evaluations of the pre- and post-trained
Gemini model family, covering well-studied benchmarks across text, code, image, audio and video —
which include both English performance and multilingual capabilities. Next we discuss our approach
to post-training, highlight common and distinct aspects of the Gemini Apps and Gemini API model
variants, and benchmark their performance on key capabilities. Responsible deployment is critical: we
explain our process for impact assessments, developing model policies, evaluations, and mitigations
of harm before deployment decisions. Finally, we discuss the broader implications of Gemini models,
their limitations alongside their potential applications — paving the way for a new era of research
and innovation in AI.




                                                                                                     2
                              Gemini: A Family of Highly Capable Multimodal Models




Figure 1 | Verifying a student’s solution to a physics problem. The model is able to correctly recognize
all of the handwritten content and verify the reasoning. On top of understanding the text in the
image, it needs to understand the problem setup and correctly follow instructions to generate LATEX.


2. Model Architecture
Gemini models build on top of Transformer decoders (Vaswani et al., 2017b) that are enhanced
with improvements in architecture and model optimization to enable stable training at scale and
optimized inference on Google’s Tensor Processing Units. They are trained to support 32k context
length, employing efficient attention mechanisms (for e.g. multi-query attention (Shazeer, 2019a)).
Our first version, Gemini 1.0, comprises three main sizes to support a wide range of applications as
discussed in Table 1.
    Gemini models are trained to accommodate textual input interleaved with a wide variety of audio
and visual inputs, such as natural images, charts, screenshots, PDFs, and videos, and they can produce
text and image outputs (see Figure 2). The visual encoding of Gemini models is inspired by our own
foundational work on Flamingo (Alayrac et al., 2022), CoCa (Yu et al., 2022a), and PaLI (Chen et al.,
2022), with the important distinction that the models are multimodal from the beginning and can
natively output images using discrete image tokens (Ramesh et al., 2021; Yu et al., 2022b).
    Video understanding is accomplished by encoding the video as a sequence of frames in the large
context window. Video frames or images can be interleaved naturally with text or audio as part of the
model input. The models can handle variable input resolution in order to spend more compute on
tasks that require fine-grained understanding. In addition, Gemini models can directly ingest audio



                                                                                                       3
                              Gemini: A Family of Highly Capable Multimodal Models



 Model size     Model description
 Ultra          Our most capable model that delivers state-of-the-art performance across a wide
                range of highly complex tasks, including reasoning and multimodal tasks. It is
                efficiently serveable at scale on TPU accelerators due to the Gemini architecture.
 Pro            A performance-optimized model in terms of cost as well as latency that delivers
                significant performance across a wide range of tasks. This model exhibits strong
                reasoning performance and broad multimodal capabilities.
 Nano           Our most efficient model, designed to run on-device. We trained two versions of
                Nano, with 1.8B (Nano-1) and 3.25B (Nano-2) parameters, targeting low and high
                memory devices respectively. It is trained by distilling from larger Gemini models. It
                is 4-bit quantized for deployment and provides best-in-class performance.

Table 1 | An overview of the Gemini 1.0 model family.




Figure 2 | Gemini models support interleaved sequences of text, image, audio, and video as inputs
(illustrated by tokens of different colors in the input sequence). They can output responses with
interleaved image and text.


signals at 16kHz from Universal Speech Model (USM) (Zhang et al., 2023) features. This enables the
model to capture nuances that are typically lost when the audio is naively mapped to a text input (for
example, see audio understanding demo on the website).
    Training the Gemini family of models required innovations in training algorithms, dataset, and
infrastructure. For the Pro model, the inherent scalability of our infrastructure and learning algorithms
enable us to complete pre-training in a matter of weeks, leveraging a fraction of the Ultra’s resources.
The Nano series of models leverage additional advancements in distillation and training algorithms
to produce the best-in-class small language models for a wide variety of tasks, such as summarization
and reading comprehension, which power our next generation on-device experiences.


3. Training Infrastructure
We trained Gemini models using TPUv5e and TPUv4 (Jouppi et al., 2023), depending on their sizes
and configuration. Training Gemini Ultra used a large fleet of TPUv4 accelerators owned by Google


                                                                                                         4
                                  Gemini: A Family of Highly Capable Multimodal Models



across multiple datacenters. This represents a significant increase in scale over our prior flagship
model PaLM-2 which presented new infrastructure challenges. Scaling up the number of accelerators
results in a proportionate decrease in the mean time between failure of hardware in the overall system.
We minimized the rate of planned reschedules and preemptions, but genuine machine failures are
commonplace across all hardware accelerators at such large scales.
    TPUv4 accelerators are deployed in “SuperPods” of 4096 chips, each connected to a dedicated
optical switch, which can dynamically reconfigure 4x4x4 chip cubes into arbitrary 3D torus topologies
in around 10 seconds (Jouppi et al., 2023). For Gemini Ultra, we decided to retain a small number of
cubes per superpod to allow for hot standbys and rolling maintenance.
    TPU accelerators primarily communicate over the high speed inter-chip-interconnect, but at
Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and
inter-cluster network (Poutievski et al., 2022; Wetherall et al., 2023; yao Hong et al., 2018). Google’s
network latencies and bandwidths are sufficient to support the commonly used synchronous training
paradigm, exploiting model parallelism within superpods and data-parallelism across superpods.
    The ‘single controller’ programming model of Jax (Bradbury et al., 2018) and Pathways (Barham
et al., 2022) allows a single Python process to orchestrate the entire training run, dramatically
simplifying the development workflow. The GSPMD partitioner (Xu et al., 2021) in the XLA compiler
partitions the training step computation, and the MegaScale XLA compiler (XLA, 2019) pass statically
schedules appropriate collectives so that they maximally overlap with the computation with very little
variation in step time.
    Maintaining a high goodput2 at this scale would have been impossible using the conventional
approach of periodic checkpointing of weights to persistent cluster storage. For Gemini models, we
instead made use of redundant in-memory copies of the model state, and on any unplanned hardware
failures, we rapidly recover directly from an intact model replica. Compared to both PaLM and PaLM-2
(Anil et al., 2023), this provided a substantial speedup in recovery time, despite the significantly
larger training resources being used. As a result, the overall goodput for the largest-scale training job
increased from 85% to 97%.
    Training at unprecedented scale invariably surfaces new and interesting systems failure modes -
and in this instance one of the problems that we needed to address was that of “Silent Data Corruption
(SDC)” (Dixit et al., 2021; Hochschild et al., 2021; Vishwanathan et al., 2015). Although these are
extremely rare, the scale of Gemini models means that we can expect SDC events to impact training
every week or two. Rapidly detecting and removing faulty hardware required several new techniques
that exploit deterministic replay to isolate incorrect computations, combined with proactive SDC
scanners on idle machines and hot standbys. Our fully deterministic infrastructure allowed us to
quickly identify root causes (including hardware failures) during the development leading up to the
Ultra model, and this was a crucial ingredient towards stable training.


4. Pre-Training Dataset
Gemini models are trained on a dataset that is both multimodal and multilingual. Our pre-training
dataset uses data from web documents, books, and code, and includes image, audio, and video data.
   We use the SentencePiece tokenizer (Kudo and Richardson, 2018) and find that training the
tokenizer on a large sample of the entire training corpus improves the inferred vocabulary and
subsequently improves model performance. For example, we find Gemini models can efficiently
  2We define goodput as the time spent computing useful new steps over the elapsed time of the training job.




                                                                                                               5
                               Gemini: A Family of Highly Capable Multimodal Models



tokenize non-Latin scripts which can, in turn, benefit model quality as well as training and inference
speed.
    The number of tokens used to train the largest models were determined following the approach
in Hoffmann et al. (2022). The smaller models are trained for significantly more tokens to improve
performance for a given inference budget, similar to the approach advocated in Touvron et al. (2023a).
    We apply quality filters to all datasets, using both heuristic rules and model-based classifiers.
We also perform safety filtering to remove harmful content based on our policies. To maintain the
integrity of evaluations, we search for and remove any evaluation data that may have been in our
training corpus before using data for training. The final data mixtures and weights were determined
through ablations on smaller models. We stage training to alter the mixture composition during
training – increasing the weight of domain-relevant data towards the end of training. We find that
data quality is an important factor for highly-performing models, and believe that many interesting
questions remain around finding the optimal dataset distribution for pre-training.


5. Evaluation
The Gemini models are natively multimodal, as they are trained jointly across text, image, audio,
and video. One open question is whether this joint training can result in a model which has strong
capabilities in each domain – even when compared to models and approaches that are narrowly
tailored to single domains. We find this to be the case: Gemini models set a new state of the art
across a wide range of text, image, audio, and video benchmarks. ww


5.1. Text

5.1.1. Academic Benchmarks

We compare pre- and post-trained Gemini Pro and Ultra models to a suite of external LLMs and our
previous best model PaLM 2 across a series of text-based academic benchmarks covering reasoning,
reading comprehension, STEM, and coding. We report these results in Table 2. Broadly, we find
that the performance of Gemini Pro outperforms inference-optimized models such as GPT-3.5 and
performs comparably with several of the most capable models available, and Gemini Ultra outperforms
all current models. In this section, we examine some of these findings.
    On MMLU (Hendrycks et al., 2021a), Gemini Ultra can outperform all existing models, achieving
an accuracy of 90.04%. MMLU is a holistic exam benchmark, which measures knowledge across a
set of 57 subjects. Human expert performance is gauged at 89.8% by the benchmark authors, and
Gemini Ultra is the first model to exceed this threshold, with the prior state-of-the-art result at 86.4%.
Achieving high performance requires specialist knowledge across many domains (e.g. law, biology,
history, etc.), alongside reading comprehension and reasoning. We find Gemini Ultra achieves highest
accuracy when used in combination with a chain-of-thought prompting approach (Wei et al., 2022b)
that accounts for model uncertainty. The model produces a chain of thought with k samples, for
example 8 or 32. If there is a consensus above a preset threshold (selected based on the validation
split), it selects this answer, otherwise it reverts to a greedy sample based on maximum likelihood
choice without chain of thought. We refer the reader to appendix for a detailed breakdown of how
this approach compares with only chain-of-thought prompting or only greedy sampling.
    In mathematics, a field commonly used to benchmark the analytical capabilities of models, Gemini
Ultra shows strong performance on both elementary exams and competition-grade problem sets. For
the grade-school math benchmark, GSM8K (Cobbe et al., 2021), we find Gemini Ultra reaches 94.4%


                                                                                                        6
                              Gemini: A Family of Highly Capable Multimodal Models



accuracy with chain-of-thought prompting and self-consistency (Wang et al., 2022) compared to
the previous best accuracy of 92% with the same prompting technique. Similar positive trends are
observed in increased difficulty math problems drawn from middle- and high-school math competitions
(MATH benchmark), with the Gemini Ultra model outperforming all competitor models, reaching
53.2% using 4-shot prompting. The model also outperforms the state of the art on even harder tasks
derived from American Mathematical Competitions (150 questions from 2022 and 2023). Smaller
models perform poorly on this challenging task scoring close to random, but Gemini Ultra can solve
32% of the questions, compared to the 30% solve rate for GPT-4.
    Gemini Ultra also excels in coding, a popular use case of current LLMs. We evaluate the model
on many conventional and internal benchmarks and also measure its performance as part of more
complex reasoning systems such as AlphaCode 2 (see Section 5.1.7 on complex reasoning systems).
For example, on HumanEval, a standard code-completion benchmark (Chen et al., 2021) mapping
function descriptions to Python implementations, instruction-tuned Gemini Ultra correctly implements
74.4% of problems. On a new held-out evaluation benchmark for python code generation tasks,
Natural2Code, where we ensure no web leakage, Gemini Ultra achieves the highest score of 74.9%.
    Evaluation on these benchmarks is challenging and may be affected by data contamination. We
performed an extensive leaked data analysis after training to ensure the results we report here are as
scientifically sound as possible, but still found some minor issues and decided not to report results on
e.g. LAMBADA (Paperno et al., 2016). As part of the evaluation process, on a popular benchmark,
HellaSwag (Zellers et al., 2019), we find that an additional hundred fine-tuning steps on specific
website extracts corresponding to the HellaSwag training set (which were not included in the Gemini
model pretraining set) improve the validation accuracy of Gemini Pro to 89.6% and Gemini Ultra to
96.0%, when measured with 1-shot prompting (we measured GPT-4 obtained 92.3% when evaluated
1-shot via the API). This suggests that the benchmark results are susceptible to the pretraining dataset
composition. We choose to report HellaSwag decontaminated results only in a 10-shot evaluation
setting. We believe there is a need for more robust and nuanced standardized evaluation benchmarks
with no leaked data. So, we evaluate Gemini models on several new held-out evaluation datasets
that were recently released, such as WMT23 and Math-AMC 2022-2023 problems, or internally
generated from non-web sources, such as Natural2Code. We refer the reader to Appendix 10.3 for a
comprehensive list of our evaluation benchmarks.
   Even so, model performance on these benchmarks gives us an indication of the model capabilities
and where they may provide impact on real-world tasks. For example, Gemini Ultra’s impressive
reasoning and STEM competencies pave the way for advancements in LLMs within the educational
domain3 . The ability to tackle complex mathematical and scientific concepts opens up exciting
possibilities for personalized learning and intelligent tutoring systems.

5.1.2. Trends in Capabilities

We investigate the trends in capabilities across the Gemini model family by evaluating them on a
holistic harness of more than 50 benchmarks in six different capabilities, noting that some of the
most notable benchmarks were discussed in the last section. These capabilities are: “Factuality”
covering open/closed-book retrieval and question answering tasks; “Long-Context” covering long-
form summarization, retrieval and question answering tasks; “Math/Science” including tasks for
mathematical problem solving, theorem proving, and scientific exams; “Reasoning” tasks that require
arithmetic, scientific, and commonsense reasoning; “Multilingual” tasks for translation, summarization,
and reasoning in multiple languages. Several of these capabilities are targeted by post-training
(Section 6). Please see Appendix 10.3 for a detailed list of tasks included for each capability.
  3 See demos on website https://deepmind.google/gemini.




                                                                                                      7
                                             Gemini: A Family of Highly Capable Multimodal Models



                              Gemini      Gemini       GPT-4           GPT-3.5        PaLM 2-L   Claude 2     Inflect-   Grok 1   LLAMA-2
                              Ultra       Pro                                                                 ion-2
 MMLU                         90.04%      79.13%       87.29%          70%            78.4%      78.5%        79.6%      73.0%    68.0%∗∗∗
 Multiple-choice questions    CoT@32∗     CoT@8∗      CoT@32           5-shot         5-shot     5-shot CoT   5-shot     5-shot
 in 57 subjects                                       (via API∗∗ )
 (professional &
 academic)                    83.7%       71.8%        86.4%
 (Hendrycks et al., 2021a)    5-shot      5-shot      5-shot
                                                      (reported)

 GSM8K                        94.4%       86.5%        92.0%           57.1%          80.0%      88.0%        81.4%      62.9%    56.8%
 Grade-school math            Maj1@32     Maj1@32      SFT &           5-shot         5-shot     0-shot       8-shot     8-shot   5-shot
 (Cobbe et al., 2021)                                  5-shot CoT

 MATH                         53.2%       32.6%        52.9%           34.1%          34.4%      —            34.8%      23.9%    13.5%
 Math problems across         4-shot      4-shot      4-shot           4-shot         4-shot                             4-shot   4-shot
 5 difficulty levels &                                (via API∗∗ )     (via API∗∗ )
 7 subdisciplines
 (Hendrycks et al., 2021b)                             50.3%
                                                      (Zheng et al.,
                                                      2023)

 BIG-Bench-Hard               83.6%       75.0%        83.1%           66.6%          77.7%      —            —          —        51.2%
 Subset of hard BIG-bench     3-shot      3-shot      3-shot           3-shot         3-shot                                      3-shot
 tasks written as CoT prob-                           (via API∗∗ )     (via API∗∗ )
 lems
 (Srivastava et al., 2022)

 HumanEval                    74.4%       67.7%        67.0%           48.1%          —          70.0%        44.5%      63.2%    29.9%
 Python coding tasks          0-shot      0-shot      0-shot           0-shot                    0-shot       0-shot     0-shot   0-shot
 (Chen et al., 2021)          (PT∗∗∗∗ )   (PT∗∗∗∗ )   (reported)

 Natural2Code                 74.9%       69.6%        73.9%           62.3%          —          —            —          —        —
 Python code generation.      0-shot      0-shot      0-shot           0-shot
 (New held-out set with no                            (via API∗∗ )     (via API∗∗ )
 leakage on web)

 DROP                         82.4        74.1         80.9            64.1           82.0       —            —          —        —
 Reading comprehension        Variable    Variable    3-shot           3-shot         Variable
 & arithmetic.                shots       shots       (reported)                      shots
 (metric: F1-score)
 (Dua et al., 2019)

 HellaSwag                    87.8%       84.7%        95.3%           85.5%          86.8%      —            89.0%      —        80.0%∗∗∗
 (validation set)             10-shot     10-shot     10-shot          10-shot        10-shot                 10-shot
 Common-sense multiple                                (reported)
 choice questions
 (Zellers et al., 2019)

 WMT23                        74.4        71.7         73.8            —              72.7       —            —          —        —
 Machine translation (met-    1-shot      1-shot      1-shot                          1-shot
 ric: BLEURT)                 (PT∗∗∗∗ )               (via API∗∗ )
 (Tom et al., 2023)



Table 2 | Gemini performance on text benchmarks with external comparisons and PaLM 2-L.
∗ The model produces a chain of thought with k = 8 or 32 samples, if there is a consensus above a threshold (chosen based on the validation

split), it selects this answer, otherwise it reverts to a greedy sample. Further analysis in Appendix 10.2.
∗∗ Results self-collected via the API in Nov, 2023.
∗∗∗ Results shown use the decontaminated numbers from Touvron et al. (2023b) report as the most relevant comparison to Gemini models

which have been decontaminated as well.)
∗∗∗∗ PT denotes a post-trained Gemini API model.




     We observe consistent quality gains with increased model size in Figure 3, especially in reasoning,
math/science, summarization and long-context. Gemini Ultra is the best model across the board for
all six capabilities. Gemini Pro, the second-largest model in the Gemini family of models, is also quite
competitive while being a lot more efficient to serve.

5.1.3. Nano

Bringing AI closer to the user, we discuss the Gemini Nano 1 and Nano 2 models engineered for
on-device deployments. These models excel in summarization and reading comprehension tasks with
per-task fine-tuning. Figure 3 shows the performance of these pre-trained models in comparison
to the much larger Gemini Pro model, while Table 3 dives deeper into specific factuality, coding,
Math/Science, and reasoning tasks. Nano-1 and Nano-2 model sizes are only 1.8B and 3.25B
parameters respectively. Despite their size, they show exceptionally strong performance on factuality,
i.e. retrieval-related tasks, and significant performance on reasoning, STEM, coding, multimodal and


                                                                                                                                             8
                                                             Gemini: A Family of Highly Capable Multimodal Models


                                          1.4




          Normalized Performance vs Pro
                                                                                                                                          Nano 1
                                          1.2                                                                                             Nano 2
                                          1.0                                                                                             Pro
                                                                                                                                          Ultra
                                          0.8
                                          0.6
                                          0.4
                                          0.2
                                          0.0
                                                   lity           t
                                                                 tex          nce                ion     son                  ua
                                                                                                                                ty
                                                Fac           on           Sc ie             zat            ing                  li
                                                   tua     g-C           th/              ari          Re a               ltiling
                                                          Lo n         Ma              mm                             Mu
                                                                                      Su

Figure 3 | Language understanding and generation performance of Gemini model family across
different capabilities (normalized by the Gemini Pro model).


multilingual tasks. With new capabilities accessible to a broader set of platforms and devices, the
Gemini models expand accessibility to everyone.

                                                                                        Gemini Nano 1            Gemini Nano 2
                                                                                      accuracy    normalized   accuracy      normalized
                                                                                                  by Pro                     by Pro

                                                     BoolQ                            71.6        0.81         79.3          0.90
                                                     TydiQA (GoldP)                   68.9        0.85         74.2          0.91
                                                     NaturalQuestions (Retrieved)     38.6        0.69         46.5          0.83
                                                     NaturalQuestions (Closed-book)   18.8        0.43         24.8          0.56
                                                     BIG-Bench-Hard (3-shot)          34.8        0.47         42.4          0.58
                                                     MBPP                             20.0        0.33         27.2          0.45
                                                     MATH (4-shot)                    13.5        0.41         22.8          0.70
                                                     MMLU (5-shot)                    45.9        0.64         55.8          0.78


Table 3 | Performance of Gemini Nano series on factuality, summarization, reasoning, coding and
STEM tasks compared to significantly larger Gemini Pro model.


5.1.4. Multilinguality

The multilingual capabilities of the Gemini models are evaluated using a diverse set of tasks requir-
ing multilingual understanding, cross-lingual generalization, and the generation of text in multiple
languages. These tasks include machine translation benchmarks (WMT 23 for high-medium-low
resource translation; Flores, NTREX for low and very low resource languages), summarization bench-
marks (XLSum, Wikilingua), and translated versions of common benchmarks (MGSM: professionally
translated into 11 languages).


5.1.4.1       Machine Translation

Translation is a canonical benchmark in machine learning with a rich history. We evaluated a post-
trained Gemini API Ultra model (see Section 6.5.3) on the entire set of language pairs in the WMT 23
translation benchmark in a few-shot setting. Overall, we found that Gemini Ultra (and other Gemini
models) performed remarkably well at translating from English to any other language, and surpassed


                                                                                                                                                   9
                                   Gemini: A Family of Highly Capable Multimodal Models



the LLM-based translation methods when translating out-of-English, on high-resource, mid-resource
and low-resource languages. In the WMT 23 out-of-English translation tasks, Gemini Ultra achieved
the highest LLM-based translation quality, with an average BLEURT (Sellam et al., 2020) score of 74.8,
compared to GPT-4’s score of 73.6, and PaLM 2’s score of 72.2. When averaged across all language
pairs and directions for WMT 23, we see a similar trend with Gemini Ultra 74.4, GPT-4 73.8 and
PaLM 2-L 72.7 average BLEURT scores on this benchmark.

          WMT 23                Gemini Ultra   Gemini Pro      Gemini Nano 2   Gemini Nano 1      GPT-4   PaLM 2-L
          (Avg BLEURT)
          High Resource         74.2           71.7            67.7            64.1               74.0    72.6
          Mid Resource          74.7           71.8            67.0            64.8               73.6    72.7
          Out-of-English        74.8           71.5            66.2            65.2               73.6    72.2
          Into-English          73.9           72.0            69.0            63.5               74.1    73.4
          All languages         74.4           71.7            67.4            64.8               73.8    72.7


Table 4 | Performance of Gemini models on WMT 23 translation benchmark. All numbers with 1-shot.

    In addition to the languages and translation tasks above, we also evaluate Gemini Ultra on very
low-resource languages. These languages were sampled from the tail of the following language sets:
Flores-200 (Tamazight and Kanure), NTREX (North Ndebele), and an internal benchmark (Quechua).
For these languages, both from and into English, Gemini Ultra achieved an average chrF score of 27.0
in 1-shot setup, while the next-best model, PaLM 2-L, achieved a score of 25.3.


5.1.4.2    Multilingual Math and Summarization

Beyond translation, we evaluated how well Gemini models perform in challenging tasks across a
range of languages. We specifically investigated the math benchmark MGSM (Shi et al., 2023), which
is a translated variant of the math benchmark GSM8K (Cobbe et al., 2021). We find Gemini Ultra
achieves an accuracy of 79.0%, an advance over PaLM 2-L which scores 74.7%, when averaged
across all languages in an 8-shot setup. We also benchmark Gemini models on the multilingual
summarization benchmarks – XLSum (Hasan et al., 2021) and WikiLingua (Ladhak et al., 2020). In
XLSum, Gemini Ultra reached an average of 17.6 rougeL score compared to 15.4 for PaLM 2. For
Wikilingua, Gemini Ultra (5-shot) trails behind PaLM 2 (3-shot) measured in BLEURT score. See
Table 5 for the full results. Overall the diverse set of multilingual benchmarks show that Gemini
family models have a broad language coverage, enabling them to also reach locales and regions with
low-resource languages.

                                                Gemini Ultra      Gemini Pro   GPT-4   PaLM 2-L
                           MGSM (8-shot)        79.0              63.5         74.5       74.7
                           XLsum (3-shot)       17.6              16.2          —         15.4
                           Wikilingua           48.9              47.8          —         50.4


Table 5 | Performance of Gemini models on multilingual math and summarization.



5.1.5. Long Context

Gemini models are trained with a sequence length of 32,768 tokens and we find that they make use
of their context length effectively. We first verify this by running a synthetic retrieval test: we place
key-value pairs at the beginning of the context, then add long filler text, and ask for value associated
with a particular key. We find that the Ultra model retrieves the correct value with 98% accuracy
when queried across the full context length. We further investigate this by plotting the negative log


                                                                                                                     10
                                   Gemini: A Family of Highly Capable Multimodal Models



likelihood (NLL) versus the token index across a held-out set of long documents in Figure 4. We
find that the NLL decreases with sequence position up to the full 32K context length. The longer
context length of Gemini models enable new use cases such as retrieval over documents and video
understanding discussed in Section 5.2.2.

                                                                                           Pro
                                                                                           Ultra

                    NLL



                          8   16     32    64   128 256 512       1K     2K   4K   8K     16K 32K
                                                     Sequence position

Figure 4 | Negative log likelihood as a function of token index across 32K context length on a held-out
set of long documents.



5.1.6. Factuality

Factuality (Maynez et al., 2020) is a key focus of our model’s training and deployment. We evaluate
three aspects of factuality for our Gemini API models:

   1. Closed-Book Factuality: If provided with a fact-seeking prompt without any given source,
      Gemini API models should not hallucinate incorrect information (see Section 2 of Roberts et al.
      (2020) for a definition). These prompts can range from information-seeking prompts (e.g. “Who
      is the prime minister of India?”) to semi-creative prompts that may request factual information
      (e.g. “Write a 500-word speech in favor of the adoption of renewable energy”).
   2. Attribution: If instructed to generate a response grounded to a given context, we aim to ensure
      that Gemini API models produce a response with the highest degree of faithfulness to the
      context (Maynez et al., 2020; Rashkin et al., 2023). This may include the summarization of a
      user-provided source, generating fine-grained citations given a question and provided snippets
      akin to Menick et al. (2022); Peng et al. (2023), answering questions from a long-form source
      such as a book (Mihaylov et al., 2018), and transforming a given source to a desired output
      (e.g. an email from a portion of a meeting transcript).
   3. Hedging: If prompted with an input that is “unanswerable”, Gemini API models must ac-
      knowledge that it cannot provide a response by hedging to avoid hallucination. These include
      scenarios where the input prompt contains false-premise questions [see examples in Hu et al.
      (2023)], the input prompt instructs the model to perform open book QA, but the answer is not
      derivable from the given context, and so forth.

    Factuality is evaluated via human annotators who fact-check each response manually; we report
the percentage of factually inaccurate responses as judged by annotators. Attribution is evaluated via
human annotators who check for attribution to sources in the prompt for each response manually;
the reported metric is AIS (Rashkin et al., 2023). For hedging, we use an automatic evaluation setup
where we measure whether models hedge accurately.
   We compare Gemini API Pro with a version without any factuality-focused adaptation in Table 6.
We see that the rate of inaccuracy is halved in the factuality set, the accuracy of attribution is increased



                                                                                                         11
                                      Gemini: A Family of Highly Capable Multimodal Models



by 50% from the attribution set, and the model successfully hedges 70% (up from 0%) in the provided
hedging set task.

                                           Factuality                Attribution             Hedging
                                           (Inaccurate Rate)         (AIS)                   (Accuracy)

       Gemini API Pro                      6.7%                      40.2%                   0%
       No factuality-focused adaptation    [5.8%, 7.8%]              [37.9%, 42.5%]


       Gemini API Pro                      3.8%                      60.0%                   69.3%
       Final stage of post-training        [3.1%, 4.8%]              [57.6%, 62.1%]



Table 6 | Factuality mitigations: Impact of post-training on the rate of inaccuracy, presence of attribution
and the rate of accurate hedging on Gemini API Pro (with corresponding 95% confidence intervals).


5.1.7. Complex Reasoning Systems

Gemini models can also be combined with additional techniques such as search and tool-use to create
powerful reasoning systems that can tackle more complex multi-step problems. One example of such
a system is AlphaCode 2, a new state-of-the-art agent that excels at solving competitive programming
problems (Leblond et al, 2023). AlphaCode 2 uses a specialized version of Gemini Pro – tuned on
competitive programming data similar to the data used in Li et al. (2022) – to conduct a massive
search over the space of possible programs. This is followed by a tailored filtering, clustering and
reranking mechanism. Gemini Pro is fine-tuned both to be a coding model to generate proposal
solution candidates, and to be a reward model that is leveraged to recognize and extract the most
promising code candidates.
    AlphaCode 2 is evaluated on Codeforces,4 the same platform as AlphaCode, on 12 contests from
division 1 and 2, for a total of 77 problems. AlphaCode 2 solved 43% of these competition problems, a
1.7x improvement over the prior record-setting AlphaCode system which solved 25%. Mapping this to
competition rankings, AlphaCode 2 built on top of Gemini Pro sits at an estimated 85th percentile on
average – i.e. it performs better than 85% of entrants. This is a significant advance over AlphaCode,
which only outperformed 50% of competitors.
    The composition of powerful pre-trained models with search and reasoning mechanisms is an
exciting direction towards more general agents; another key ingredient is deep understanding across
a range of modalities which we discuss in the next section.




  4 http://codeforces.com/




                                                                                                          12
                                                  Gemini: A Family of Highly Capable Multimodal Models



5.2. Multimodal

Gemini models are natively multimodal. These models exhibit the unique ability to seamlessly
combine their capabilities across modalities (e.g. extracting information and spatial layout out of
a table, a chart, or a figure) with the strong reasoning capabilities of a language model (e.g. its
state-of-art-performance in math and coding) as seen in examples in Figures 5 and 14. The models
also show strong performance in discerning fine-grained details in inputs, aggregating context across
space and time, and applying these capabilities over a temporally-related sequence of video frames
and/or audio inputs.
   The sections below provide more detailed evaluation of the model across different modalities
(image, video, and audio), together with qualitative examples of the model’s capabilities for image
generation and the ability to combine information across different modalities.

5.2.1. Image Understanding

We evaluate post-trained Gemini API models on four different capabilities: high-level object recogni-
tion using captioning or question-answering tasks such as VQAv2; fine-grained transcription using
tasks such as TextVQA and DocVQA requiring the model to recognize low-level details; chart un-
derstanding requiring spatial understanding of input layout using ChartQA and InfographicVQA
tasks; and multimodal reasoning using tasks such as Ai2D, MathVista and MMMU. For zero-shot QA
evaluation, the model is instructed to provide short answers aligned with the specific benchmark. All
numbers are obtained using greedy sampling and without any use of external OCR tools.

                                           Gemini          Gemini         Gemini         Gemini          GPT-4V         Prior SOTA
                                           Ultra           Pro            Nano 2         Nano 1
                                           (pixel only)    (pixel only)   (pixel only)   (pixel only)

 MMMU (val)                                59.4%           47.9%          32.6%          26.3%           56.8%          56.8%
 Multi-discipline college-level problems   pass@1                                                                       GPT-4V, 0-shot
 (Yue et al., 2023)
                                           62.4%
                                           Maj1@32

 TextVQA (val)                             82.3%           74.6%          65.9%          62.5%           78.0%           79.5%
 Text reading on natural images                                                                                         Google PaLI-3, fine-tuned
 (Singh et al., 2019)

 DocVQA (test)                             90.9%           88.1%          74.3%          72.2%           88.4%          88.4%
 Document understanding                                                                                  (pixel only)   GPT-4V, 0-shot
 (Mathew et al., 2021)

 ChartQA (test)                            80.8%           74.1%          51.9%          53.6%           78.5%          79.3%
 Chart understanding                                                                                     (4-shot CoT)   Google DePlot, 1-shot PoT
 (Masry et al., 2022)                                                                                                   (Liu et al., 2023)

 InfographicVQA (test)                     80.3%           75.2%          54.5%          51.1%           75.1%          75.1%
 Infographic understanding                                                                               (pixel only)   GPT-4V, 0-shot
 (Mathew et al., 2022)

 MathVista (testmini)                      53.0%           45.2%          30.6%          27.3%           49.9%          49.9%
 Mathematical reasoning                                                                                                 GPT-4V, 0-shot
 (Lu et al., 2023)

 AI2D (test)                               79.5%           73.9%          51.0%          37.9%           78.2%           81.4%
 Science diagrams                                                                                                       Google PaLI-X, fine-tuned
 (Kembhavi et al., 2016)

 VQAv2 (test-dev)                          77.8%           71.2%          67.5%          62.7%           77.2%           86.1%
 Natural image understanding                                                                                            Google PaLI-X, fine-tuned
 (Goyal et al., 2017)



Table 7 | Image understanding Gemini Ultra consistently outperforms existing approaches even in
zero-shot, especially for OCR-related image understanding tasks for natural images, text, documents,
and figures without using any external OCR engine (‘pixel only’). Many existing approaches fine-tune
on the respective tasks, highlighted in gray, which makes the comparison with 0-shot not apples-to-
apples.



                                                                                                                                                    13
                                   Gemini: A Family of Highly Capable Multimodal Models



    We find that Gemini Ultra is state of the art across a wide range of image-understanding bench-
marks in Table 7. It achieves strong performance across a diverse set of tasks such as answering
questions on natural images and scanned documents as well as understanding infographics, charts
and science diagrams. When compared against publicly reported results from other models (most
notably GPT-4V), the Gemini model is better in zero-shot evaluation by a significant margin. It also
exceeds several existing models that are specifically fine-tuned on the benchmark’s training sets for
the majority of tasks. The capabilities of the Gemini models lead to significant improvements in the
state of the art on academic benchmarks like MathVista (+3.1%)5 or InfographicVQA (+5.2%).
    MMMU (Yue et al., 2023) is a recently released evaluation benchmark, which consists of questions
about images across 6 disciplines with multiple subjects within each discipline that require college-
level knowledge to solve these questions. Gemini Ultra achieves the best score on this benchmark
advancing the state-of-the-art result by more than 5 percentage points and outperforms the previous
best result in 5 of 6 disciplines (see Table 8), thus showcasing its multimodal reasoning capabilities.

                         MMMU (val)                    Gemini Ultra (0-shot)      GPT-4V (0-shot)
                                                       Maj@32        pass@1          pass@1

                         Art & Design                  74.2          70.0            65.8
                         Business                      62.7          56.7            59.3
                         Science                       49.3          48.0            54.7
                         Health & Medicine             71.3          67.3            64.7
                         Humanities & Social Science   78.3          78.3            72.5
                         Technology & Engineering      53.0          47.1            36.7
                         Overall                       62.4          59.4            56.8


Table 8 | Gemini Ultra performance on the MMMU benchmark (Yue et al., 2023) per discipline.
Each discipline covers multiple subjects, requiring college-level knowledge and complex reasoning.

    Gemini models are also capable of operating across modalities and a diverse set of global languages
simultaneously, both for image understanding tasks (e.g., images containing text in Icelandic) and for
generation tasks (e.g., generating image descriptions for a wide range of languages). We evaluate the
performance of generating image descriptions on a selected subset of languages in the Crossmodal-
3600 (XM-3600) benchmark in a 4-shot setting, using the Flamingo evaluation protocol (Alayrac
et al., 2022), without any fine-tuning for all models. As shown in Table 9, Gemini models achieve a
significant improvement over the existing best model, Google PaLI-X.

              XM-3600 (CIDER)               Gemini Ultra           Gemini Pro             Google PaLI-X
                                            4-shot                 4-shot                 4-shot
              English                       86.4                   87.1                   77.8
              French                        77.9                   76.7                   62.5
              Hindi                         31.1                   29.8                   22.2
              Modern Hebrew                 54.5                   52.6                   38.7
              Romanian                      39.0                   37.7                   30.2
              Thai                          86.7                   77.0                   56.0
              Chinese                       33.3                   30.2                   27.7
              Average (of 7)                58.4                   55.9                   45.0


Table 9 | Multilingual image understanding Gemini models outperform existing models in captioning
images in many languages when benchmarked on a subset of languages in XM-3600 dataset (Thapliyal
et al., 2022).


  5 MathVista is a comprehensive mathematical reasoning benchmark consisting of 28 previously published multimodal
datasets and three newly created datasets. Our MathVista results were obtained by running the MathVista authors’
evaluation script.



                                                                                                               14
                              Gemini: A Family of Highly Capable Multimodal Models




Figure 5 | Using Gemini models’ multimodal reasoning capabilities to generate matplotlib code
for rearranging the subplots. The multimodal prompt is shown at the top-left in gray. Gemini Ultra’s
response, including its generated code, is shown in the right column in blue. The bottom left figure
shows rendered version of the generated code. Successfully solving this task shows the model’s
capability to combine several capabilities: (1) recognition of the functions depicted in the plots; (2)
inverse graphics to infer the code that would have generated the subplots; (3) instruction-following
to put subplots in their desired positions; and (4) abstract reasoning to infer that the exponential plot
must stay in its original place, because the sine plot must move out of the way for the 3-dimensional
plot.

   Qualitative evaluation in Figure 5 illustrates an example of Gemini Ultra’s multimodal reasoning
capabilities. The model is required to solve the task of generating matplotlib code that would rearrange


                                                                                                      15
                                           Gemini: A Family of Highly Capable Multimodal Models



a set of subplots provided by the user. The model output shows that it successfully solves this task
combining multiple capabilities of understanding the user plot, inferring the code required to generate
it, following user instructions to put subplots in their desired positions, and abstract reasoning about
the output plot. This highlights Gemini Ultra’s native multimodality and alludes to its more complex
reasoning abilities across interleaved sequences of image and text. We refer the reader to the appendix
for more qualitative examples.

5.2.2. Video Understanding

Understanding video input is an important step towards a useful generalist agent. We measure the
video understanding capability across several established benchmarks that are held-out from training.
These tasks measure whether the model is able to understand and reason over a temporally-related
sequence of frames. For each video task, we sample 16 equally-spaced frames from each video clip
and feed them to the Gemini models. For the YouTube video datasets (all datasets except NextQA
and the Perception test), we evaluate the Gemini models on videos that were still publicly available
in the month of November, 2023.
   Gemini Ultra achieves state-of-the-art performance on various few-shot video captioning tasks
as well as zero-shot video question answering tasks as shown in Table 10. This demonstrates its
capability of strong temporal reasoning across several frames. Figure 23 in the appendix provides a
qualitative example of understanding the video of the ball-striking mechanics of a soccer player and
reasoning about the player can improve their game.

        Task                                          Gemini Ultra          Gemini Pro            Few-shot SoTA
        VATEX (test)                                  62.7                  57.4                  56.0
        English video captioning                      4-shots               4-shots               DeepMind Flamingo, 4-shots
        (Wang et al., 2019)

        VATEX ZH (test)                               51.3                  50.0                  –
        Chinese video captioning                      4-shots               4-shots
        (Wang et al., 2019)

        YouCook2 (val)                                135.4                 123.2                 74.5
        English cooking video captioning              4-shots               4-shots               DeepMind Flamingo, 4-shots
        (Zhou et al., 2018)

        NextQA (test)                                 29.9                  28.0                  26.7
        Video question answering                      0-shot                0-shot                DeepMind Flamingo, 0-shot
        (Xiao et al., 2021)

        ActivityNet-QA (test)                         52.2                  49.8                  45.3
        Video question answering                      0-shot                0-shot                Video-LLAVA, 0-shot
        (Yu et al., 2019)

        Perception Test MCQA (test)                   54.7                  51.1                  46.3
        Video question answering                      0-shot                0-shot                SeViLA (Yu et al., 2023), 0-shot
        (Pătrăucean et al., 2023)



Table 10 | Few-shot video understanding across tasks and languages on selected academic
benchmarks. The reported metric is CIDER for video captioning, WUPS for NextQA, and top-1
accuracy for the Perception Test and ActivityNet-QA. For ActivityNet-QA, we use the Video-LLAVA
(Lin et al., 2023) evaluation protocol.


5.2.3. Image Generation

Gemini models are able to output images natively, without having to rely on an intermediate natural
language description that can bottleneck the model’s ability to express images. This uniquely enables
the model to generate images with prompts using interleaved sequences of image and text in a



                                                                                                                                     16
                             Gemini: A Family of Highly Capable Multimodal Models



few-shot setting. For example, the user might prompt the model to design suggestions of images and
text for a blog post or a website (see Figure 12 in the appendix).
    Figure 6 shows an example of image generation in 1-shot setting. Gemini Ultra model is prompted
with one example of interleaved image and text where the user provides two colors (blue and yellow)
and image suggestions of creating a cute blue cat or a blue dog with yellow ear from yarn. The
model is then given two new colors (pink and green) and asked for two ideas about what to create
using these colors. The model successfully generates an interleaved sequence of images and text with
suggestions to create a cute green avocado with pink seed or a green bunny with pink ears from yarn.




Figure 6 | Image Generation. Gemini models can output multiple images interleaved with text given
a prompt composed of image and text. In the left figure, Gemini Ultra is prompted in a 1-shot setting
with a user example of generating suggestions of creating cat and dog from yarn when given two
colors, blue and yellow. Then, the model is prompted to generate creative suggestions with two new
colors, pink and green, and it generates images of creative suggestions to make a cute green avocado
with pink seed or a green bunny with pink ears from yarn as shown in the right figure.




                                                                                                   17
                                         Gemini: A Family of Highly Capable Multimodal Models



5.2.4. Audio Understanding

We evaluate the Gemini Nano-1 and Gemini Pro models on a variety of public benchmarks and
compare it with Universal Speech Model (USM) (Zhang et al., 2023) and Whisper (large-v2 (Radford
et al., 2023) or large-v3 (OpenAI, 2023) as indicated). These benchmarks include automatic speech
recognition (ASR) tasks such as FLEURS (Conneau et al., 2023), VoxPopuli, (Wang et al., 2021),
Multi-lingual Librispeech (Pratap et al., 2020), as well as the speech translation task CoVoST 2,
translating different languages into English (Wang et al., 2020). We also report on an internal
benchmark YouTube test set. ASR tasks report a word error rate (WER) metric, where a lower number
is better. Translation tasks report a BiLingual Evaluation Understudy (BLEU) score, where a higher
number is better. FLEURS is reported on 62 languages that have language overlap with the training
data. Four segmented languages (Mandarin, Japanese, Korean and Thai) report character error rate
(CER), instead of WER, similar to Whisper (Radford et al., 2023).
    Table 11 indicates that our Gemini Pro model significantly outperforms the USM and Whisper
models across all ASR and AST tasks, both for English and multilingual test sets. Note that there is a
large gain in FLEURS, compared to USM and Whisper, as our model is also trained with the FLEURS
training dataset. However, training the same model without FLEURS dataset results in a WER of 15.8,
which still outperforms Whisper. Gemini Nano-1 model also outperforms both USM and Whisper on
all datasets except FLEURS. Note that we did not evaluate Gemini Ultra on audio yet, though we
expect better performance from increased model scale.

                                  Task                            Metric        Gemini      Gemini      Whisper           USM
                                                                                Pro         Nano-1      (OpenAI, 2023;    (Zhang et al.,
                                                                                                        Radford et al.,   2023)
                                                                                                        2023)

          Automatic Speech        YouTube                         WER (↓)       4.9%        5.5%        6.5%              6.2%
          Recognition             (en-us)                                                               (v3)

                                  Multilingual                    WER (↓)       4.8%        5.9%        6.2%              7.0 %
                                  Librispeech                                                           (v2)
                                  (en-us)
                                  (Pratap et al., 2020)

                                  FLEURS                          WER (↓)       7.6%        14.2%       17.6%             11.8%
                                  (62 lang)                                                             (v3)
                                  (Conneau et al., 2023)

                                  VoxPopuli                       WER (↓)       9.1%        9.5%        15.9%             13.4%
                                  (14 lang)                                                             (v2)
                                  (Wang et al., 2021)

          Automatic Speech        CoVoST 2                        BLEU (↑)      40.1        35.4        29.1              30.7
          Translation             (21 lang)                                                             (v2)
                                  (Wang et al., 2020)



Table 11 | Speech evaluation results on selected benchmarks for ASR and AST. For ASR, the reported
metric is WER where lower is better. For AST, the reported metric is BLEU where higher is better.
   Table 12 shows further error analysis with USM and Gemini Pro. We find that Gemini Pro produces
more understandable responses, particularly on rare words and proper nouns.

 Domain Truth                                             USM                                      Gemini Pro                              Wav

 Fleurs   Scotturb bus 403 travels regularly to           Scotboard bus four3 traversed regu-      Scotturb bus 403 travels regularly to   ▲)
                                                                                                                                            ))
          Sintra, stopping at Cabo da Roca.               larly to Centra stopping at Cabo de      Sintra, stopping at Cabo da Roca.
                                                          Roga.
 Fleurs   The archipelago lies 120 km north               The archipelago lines 120 km north       The archipelago lies 120 km north       ▲)
                                                                                                                                            ))
          of the Peninsula. The largest is King           of peninsula. The largest is Kingurch    of the Peninsula. The largest is King
          George Island, with the settlement of           island with the settlement of Cua        George Island, with the settlement of
          Villa Las Estrellas.                            Losas.                                   Villa Las Estrellas.


Table 12 | Qualitative examples for the ASR task in the benchmark. Incorrect transcriptions are
highlighted in red.


                                                                                                                                               18
                                      Gemini: A Family of Highly Capable Multimodal Models



5.2.5. Modality Combination

Multimodal demonstrations often include a combination of text interleaved with a single modality,
usually images. We demonstrate the ability to process a sequence of audio and images natively.
    Consider a cooking scenario about making an omelet where we prompt the model with a sequence
of audio and images. Table 13 indicates a turn-by-turn interaction with the model, providing pictures
and verbally asking questions about the next steps for cooking an omelet. We note that the model
response text is reasonably accurate, and shows that model processes fine-grained image details to
evaluate when the omelet is fully cooked. See demo on the website.

 Input Image                    Input Audio (transcribed)                             Model Response: Text



                                ▲ )
                                 )) What’s the first step to make a veggie omelet     Crack the eggs into a bowl and whisk them.
                                with these ingredients?




                                ▲ )
                                  )) Thank you for the instructions. I started mak-   It looks like it’s almost ready. You can flip it over
                                ing my omelet, does it look ready now?                to cook the other side.




 (No image - it’s a follow up   ▲ )
                                 )) Why is it not ready?                              It’s not ready because the eggs are still runny.
 on the previous question)




                                ▲ )
                                 )) What about now?                                   It looks ready now. You can take it off the heat and
                                                                                      serve it.




Table 13 | Audio-visual qualitative example showcasing the ability of Gemini models to process
interleaved sequences of text, vision, and audio, as well as reason across modalities. This example
inputs interleaved images and audio from the user in a cooking scenario. The user prompts the model
for instructions to make an omelet and to inspect whether it is fully cooked.




                                                                                                                                         19
                              Gemini: A Family of Highly Capable Multimodal Models



6. Post-Training Models
After large-scale pre-training, we apply post-training, where one trains on top of a pre-trained model
in order to extend the model’s proficiency and to enable a wide variety of capabilities. Namely, we
seek to improve overall quality, enhance target capabilities such as coding and multilingual, and
ensure alignment and safety criteria are met. We discuss our approach to post-training in this section,
highlighting common and distinct aspects of the Gemini Apps and Gemini API model variants.


6.1. Gemini Apps: Gemini and Gemini Advanced

Gemini and Gemini Advanced offer direct access to Google’s family of AI models, consisting of the core
post-trained Gemini Apps models and the system around it. These models are created by applying
specialized post-training on top of Gemini pre-trained models: currently, Gemini gives access to Pro 1.0
and Gemini Advanced gives access to Ultra 1.0. Beyond the core models, the system determines how
the models interact with external tools (such as Google Flights, Maps, and Google Workspace), and
how to generate responses (filtering, ranking, and streaming). As an area, conversational AI presents
several challenges, including: How to understand users’ requests across multi-turn interactions? How
to make sure responses are safe, factually grounded, and helpful? How to help users accomplish tasks
by using tools external to the models? We discuss how we approach these challenges in the following
sections.


6.2. Gemini APIs: Google AI Studio and Cloud Vertex AI

Our developer-focused Gemini API models are designed to support both conversational and non-
conversational use cases. These models are available through Google AI Studio and Cloud Vertex
AI through an easy to use API. Google AI Studio is a free, web-based developer tool to prototype
and launch apps quickly with an API key. Vertex AI is a comprehensive AI platform that enables
developers to leverage Gemini API models with varied tooling, fully-managed infrastructure, and
built-in enterprise security and privacy settings. Gemini APIs make it easy to integrate Gemini API
models into any production product or workflow, empowering developers to build applications that
can reason across different modalities.


6.3. Post-Training Methods & Data

Post-training Gemini models to produce Gemini API and Apps variants involves several stages; see
Figure 7. Careful data curation is critical for all stages. First, we collect a diverse set of prompts
that are representative of real-world use cases. Second, we apply supervised fine-tuning (SFT) on
demonstration data of what the model’s output should be for a given prompt (Mishra et al., 2021;
Ouyang et al., 2022; Wei et al., 2022a). Third, we further collect different possible responses to a
given prompt, and collect feedback data over these to train a Reward Model (RM). Finally, using the
trained RM, a Reinforcement Learning from Human Feedback (RLHF) stage (Bai et al., 2022a) is
applied to further align the model’s outputs with human preferences. We discuss our methods in
more detail below:
    (1) Prompt Data Collection: A prompt is a user’s input to the model. As well as the most recent
user input, this can also include previous user-model interactions. We curate datasets of target
prompts. The datasets serve as the basis for our demonstration and feedback data collections, and
they are used directly during reinforcement learning. It is important to cover a diverse set of crucial
use cases and in both single-turn and multi-turn formats. Data sources include vendor-created data,
third-party licensed sources, and synthetic approaches.


                                                                                                     20
                                  Gemini: A Family of Highly Capable Multimodal Models



    (2) SFT on Demonstration Data: SFT trains the model to output a desired target response given
a prompt. Our Demonstration Data target responses can be directly written by a human expert, or
generated by a model and in some cases revised or reviewed by a human. Additionally, we use data
analysis tools and heuristics to ensure high data diversity across capabilities, use cases, and semantic
clusters.
    (3) RM Training on Feedback Data: We further collect Feedback Data, for which human raters
provide feedback such as relative preferences over candidate responses and feedback regarding
individual responses to a given prompt. For many capabilities, rating relative preferences is an easier
task than demonstrating an ideal response. Feedback data are collected across creativity, safety,
factuality, other capabilities, and other target criteria. We found that the utility of the resulting
human feedback data greatly depends on the prompt selection and the sampling strategy used to
produce candidate responses. We use this data to train RMs to output rewards that align with human
preferences as closely as possible.
    (4) RLHF: Applying reinforcement learning from human feedback (RLHF) to our models provides
further gains over SFT alone. Our approach creates an iterative process in which RL continually
pushes the boundaries of the RM, while the RM is continuously improved through evaluation and
data collection, leading to progressive improvements in both.


                                            stream_control
                     Gemini

                                             SFT                          RLHF
                   pre-training


                                                                                         person
                                                                                          End

                                                               Data
                     users
                                                             flywheel




                                            database                      database
                                        Demonstration
                  Feedback

                                           data                           data




Figure 7 | Modeling overview. Post-training utilizes an optimized data flywheel in order to acquire
human-AI feedback and continually improve on key areas. The data mixtures for supervised fine-
tuning, reward modeling, and reinforcement learning serve as the foundation for our models.


6.4. Evaluation

Evaluation of human preferences over model outputs provides critical signals for measuring perfor-
mance. As part of our development process, we conduct human evaluation extensively across targeted
capabilities. Human evaluation is instantiated as side-by-side blind evaluations where human raters
judge responses of two models to the same prompt, as single-response ratings for certain capabilities,
and as online testing. In addition, we build models for automated evaluation that faithfully imitate
human preferences in order to guide development and continuously monitor online performance.


6.5. Model Capabilities

Beyond the general post-training outlined above, we apply techniques to improve a set of key capabili-
ties. These capabilities cover a range of use cases inspired by current user needs and research-inspired


                                                                                                     21
                                   Gemini: A Family of Highly Capable Multimodal Models



future applications. We outline capability examples not detailed in previous sections below. The post-
training recipes are carefully designed to balance multiple objectives, including creativity, factuality,
safety and more (Bai et al., 2022b; Thoppilan et al., 2022). We have a particular focus on safety and
alignment, and hence address this in a further dedicated section.

6.5.1. Instruction Following

Following a user’s prompt accurately is a fundamental capability for LLMs, especially as these models
become more sophisticated and are presented with increasingly complex user prompts. User prompts
vary in granularity, specificity, and requirements (e.g., content, format, length). Individual instructions
can also be ambiguous, optional, or even impossible or undesirable to satisfy (He et al., 2023; Xu
et al., 2023).
   We improve Gemini Apps and Gemini API models’ instruction following (IF) abilities by collecting
data for a diverse set of instruction following categories. For instructions that are verifiable program-
matically such as word count, we generate synthetic data via prompting and response editing to
ensure that such instructions are satisfied.
    Complex prompts evaluation: We investigate performance on complex prompts containing
multiple instructions using a fine-grained evaluation method that assesses how well models adhere to
each instruction. Human raters are presented with a prompt-response pair and a list of the individual
(sub)-instructions contained in the prompt. Each prompt may have anywhere from one to dozens of
individual instructions, and the annotators are tasked with determining whether each instruction is
followed (or not) by the response.
    Table 14 reports results on an internal dataset of prompts with instructions of varying complexity
that encompass a wide range of instructions and are designed to be challenging for LLMs. We report
two metrics: per-instruction accuracy (the percentage of sub instructions in the eval set that are
followed), and full-response accuracy (the percentage of eval set prompts where all sub-instructions
are followed).

                                       Post-trained PaLM 2   Gemini (with Pro)   Gemini Advanced (with Ultra)
            Per-instruction accuracy   59.5±3.0%             77.8±2.0%           87.4±1.4%
            Full-response accuracy     25.5±3.3%             38.5±3.6%           54.1±3.7%


Table 14 | Performance of Gemini on our complex prompts instruction-following internal benchmark.

    Gemini Advanced (with Ultra) achieves an average per-instruction accuracy close to 90%, rep-
resenting a significant improvement over Gemini (with Pro) and a post-trained PaLM 2 model. We
find that the sub-instructions that aren’t followed are well-distributed across responses. As a result
Gemini Advanced’s full-response accuracy is lower, at around 54%. This indicates that there is further
headroom for models to fully satisfy all instructions.

6.5.2. Tool Use

By training LLMs to use tools, we greatly expand LLM capabilities beyond their internal knowledge. We
treat tool use for both Gemini Apps and Gemini API models as a code generation problem, leveraging
the base model’s preexisting strong coding capabilities. Every tool invocation is represented as a code
block in which tool calls are invoked. This process allows the model to both compose multiple tools
in each code block, as well as observe and react to the results of tool execution. At inference time,
to generate a response to a user prompt, our system executes the loop shown in Figure 8, where
sampling from the LLM and execution of tool code work together to create a final response.


                                                                                                                22
                               Gemini: A Family of Highly Capable Multimodal Models




Figure 8 | A Gemini tool-use control loop.


    Gemini Apps models: Gemini draws on a range of tools via Gemini Extensions, including Google
Workspace, Google Maps, YouTube, Google Flights, and Google Hotels. These tool-use capabilities
also enable Gemini to be integrated as part of Gmail, Docs, Slides, Sheets and more. We are aiming
to bring further tool-use capabilities in order to both enhance Gemini models and integrate Gemini
models into further products.
    We created an internal benchmark to assess Gemini performance on tasks that may benefit from
access to these extensions. This benchmark measures human preference in domains such as travel
planning and video discovery. We find models equipped with tools are preferred on this set 78% of
the time over models without tools (excluding ties).
    Gemini API models: We have found that fine-tuning Gemini API models is very effective at
teaching the model tool-use behaviors. Furthermore, training models to use programming and search
as tools leads to improved performance on a range of academic benchmarks. In Table 15, we compare
tool-use models fine-tuned from an early version of Gemini API Pro against equivalent models that do
not use tools.

                                     Mathematical Reasoning                     Factuality & Knowledge
                                                                                Retrieval
                                     GSM8K                 MATH                 NQ              Realtime QA
                                     Cobbe et al. (2021)   Hendrycks   et   al. Kwiatkowski et al. Kasai et al. (2022a)
                                                           (2021b)              (2019b)

                  Gemini API Pro
                                     80.1%                 41.8%                68.0%               70.8%
                  with tools
                  Gemini API Pro
                                     69.7%                 30.7%                59.0%               39.2%
                  without tools


Table 15 | Comparison between Gemini API tool-use models and comparable models that do not use
tools. Gemini API Pro without tools is an early version of our Pro model trained without tool-use data.
Gemini API Pro with tools is the same model fine-tuned with tool-use data.


6.5.3. Multilinguality

Multilinguality is critical to make sure Gemini models effectively support a wide range of languages.
We discuss our key approaches for Gemini Apps and Gemini API models respectively below.
    Gemini Apps models: Scaling Gemini from English to 40+ languages imposed research challenges
in data quality. We leverage abundant high-quality English data by localization to native cultures
(e.g., “president of the United States” -> “ 日本の首相”).
   Table 16 shows the performance of Gemini (with Pro) on 5 languages compared to Bard with


                                                                                                                          23
                                Gemini: A Family of Highly Capable Multimodal Models



an older post-training recipe and based on PaLM 2. For side-by-side comparisons between a model
A and a model B, we calculate a metric called SxS score. Each rating is converted to an ordinal
value centered at 0: ratings preferring A are positive and ratings preferring B are negative over a
scale between -1.5 and 1.5. The converted values are averaged to return the SxS score. Intuitively, a
positive SxS score indicates the extent to which model A is preferred over model B. Here, we find
quality improved by more than 0.1 SxS score for all five languages. Coding and reasoning gains from
Gemini Pro are preserved across languages.

                       Language         Quality           Coding                 Reasoning
                                        SxS               MBPP Pass@1            MMLU
                                                          Austin et al. (2021)   Hendrycks   et   al.
                                                                                 (2021a)

                       ja-JP            +0.14             +22.2%                 +3.6%
                       pt-BR            +0.17             +23.2%                 +5.2%
                       de-DE            +0.1              +21.4%                 +7.5%
                       es-419           +0.12             +22.8%                 +9.3%
                       it-IT            +0.13             +13.8%                 +7.5%


Table 16 | Multilingual performance of Gemini (with Pro) compared to Gemini with an older post-
training recipe and PaLM 2.

    Gemini API models: Similar to Gemini Apps models, we train Gemini API models on additional
multilingual post-training data, effectively adapting the original English model for use in various
languages. We experiment with both human-generated non-English prompt-response pairs as well as
automatically translated pairs. For the latter, we leverage abundant high-quality English demonstration
data by translation. We ensure the quality of such translated data by translationability filtering and
response rating by humans.
   Translatability Filtering: Not all prompt-response pairs make sense when automatically translated,
and may require expensive localization instead. Example prompts of this type (responses omitted for
space) include:

   • (strict word requirements) Write a 1000 word essay about world peace.
   • (too English centric) Write a poem in iambic pentameter about apples.
   • (too Latin-script centric) What is a word with 1 E, 2 As, and 1 U?

    Translation Quality Validation: Each translated prompt-response pair was rated for translation
quality by at least 3 human raters, and was kept in the final mixture if the majority of raters rated it
as accurate. Section 5.1.4 reports evaluations of the multilingual capabilities of post-trained Gemini
API models.

6.5.4. Multimodal Vision

Multimodal post-training enhances the capabilities of our natively multimodal Gemini models for a
wide range of useful applications. In the following, we discuss how image understanding ability is
incorporated into Gemini Apps and Gemini API models. For this evaluation, we further train both
of these Gemini model variants on a mixture of text data and expert curated image-text data over
several vertically-defined multimodal use cases
   Gemini Apps models: We empower Gemini and Gemini Advanced with image understanding
capabilities by fine-tuning pre-trained Gemini models on a mixture of text-only and image-text
data. Careful balancing of text and multimodal data ensures the model develops robust image
understanding without adversely affecting the quality of the text-only interactions. To assess our


                                                                                                        24
                                        Gemini: A Family of Highly Capable Multimodal Models



models, we compile a dataset of human-curated and synthetic image-text prompts and responses,
spanning various categories and difficulty levels. This dataset facilitates human evaluation for model
comparison and selection.
    We find that introducing this image-text data preserves Gemini Apps model quality on text-only
tasks, with a SxS score on text-only tasks of +0.01±0.01 for a Gemini Apps Pro model trained
on this data versus an equivalent model trained only on text data. In addition, post-training via
RLHF improves performance on multimodal tasks, with a SxS score on image-understanding tasks of
+0.223±0.06 for a Gemini Apps Pro model post-trained with SFT & RLHF vs SFT alone.
    Gemini API models: We evaluate the impact of post-training via SFT on Gemini API models’
multimodal vision performance by tracking the performance of both pre-trained models and post-
trained Gemini API Vision models on a series of standard benchmarks. These post-trained results have
already been given in Table 7, in Table 17 we further report the difference in performance between
pre-trained and post-trained Gemini API models.

                                                           Gemini Ultra       Gemini API Ultra   Gemini Ultra
                                                           Pre-trained only   0-shot             pre- to post-trained
                                                           0-shot             (pixel only)
                                                           (pixel only)
                                                                                                 improvement

                 MMMU (val)                                n/a                59.4%              n/a
                 Multi-discipline college-level problems                      pass@1
                 (Yue et al., 2023)
                                                                              62.4%
                                                                              Maj1@32

                 TextVQA (val)                             81.4%              82.3%              +0.9%
                 Text reading on natural images
                 (Singh et al., 2019)

                 DocVQA (test)                             90.1%              90.9%              +0.8%
                 Document understanding
                 (Mathew et al., 2021)

                 ChartQA (test)                            80.8%              80.8%              0.0%
                 Chart understanding
                 (Masry et al., 2022)

                 InfographicVQA (test)                     77.9%              80.3%              +2.4%
                 Infographic understanding
                 (Mathew et al., 2022)

                 MathVista (testmini)                      n/a                53.0%              n/a
                 Mathematical reasoning
                 (Lu et al., 2023)

                 AI2D (test)                               76.6%              79.5%              +2.9%
                 Science diagrams
                 (Kembhavi et al., 2016)

                 VQAv2 (test-dev)                          74.5%              77.8%              +3.3%
                 Natural image understanding
                 (Goyal et al., 2017)



Table 17 | Post-trained model image understanding Post-training improves image understanding
capabilities of Gemini API Ultra over the base pre-trained model. Comparisons of Gemini API Ultra to
other models on these benchmarks are given in Table 7.

    The results indicate that the pre-trained model already has high performance across the capabilities
represented by these benchmarks, in line with previous observations. However, the post-training SFT
stage used for the Gemini API Vision models succeeds in improving the performance over several
of these benchmarks (InfographicVQA, AI2D, VQAv2), most likely due to the model’s increased
instruction-following capabilities that succeed in aligning the model output style with that of the
golden references.




                                                                                                                        25
                                Gemini: A Family of Highly Capable Multimodal Models



6.5.5. Coding

Despite the strong coding benchmark performance of the base model, post-training data still provides
a significant boost to both code quality and code correctness. This highlights the benefit of high-quality
demonstration data and feedback data for coding use cases. Gemini Apps and Gemini API models use
a combination of human and synthetic approaches to collect such data.
    We evaluate our Gemini Apps models’ coding performance on a set of internally curated prompts,
distributed across code use cases and languages. Table 18 reports SxS scores, where Gemini (with
Pro) significantly improves upon Bard with an older post-training recipe and based on PaLM 2. Gemini
Advanced (with Ultra) further improves upon Gemini (with Pro).

                    Side A                          Side B                             SxS score
                    Gemini (with Pro)               Bard (PaLM 2, Sept. 2023)          0.19±0.03
                    Gemini Advanced (with Ultra)    Gemini (with Pro)                  0.13± 0.02


Table 18 | SxS comparisons of Gemini models on an internal coding benchmark.

   For the coding capabilities of post-trained Gemini API Models, see Table 2 which reports their
academic benchmark performance.


7. Responsible Deployment
During the development of Gemini models, we follow a structured approach to responsible deployment
to identify, measure, and manage foreseeable downstream societal impacts of our models, in line
with previous releases of Google’s AI technology (Kavukcuoglu et al., 2022). Throughout the lifecycle
of a project, we follow the structure below. This section provides more detail about our approach and
includes key findings where available. We are committed to ongoing transparency and will continue
to provide updated information on our approach and testing in upcoming reports.




7.1. Impact Assessment

At Google we apply an impact assessment framework throughout the product development lifecycle
related to Google’s AI Principles (Google, 2023). This means we assess the risk and impact of AI
models we’re building at both a model-level (e.g. for Gemini API Ultra 1.0, as deployed on Cloud


                                                                                                       26
                               Gemini: A Family of Highly Capable Multimodal Models



Studio or Vertex AI), and once embedded within a broader product or service (e.g. for Gemini
Advanced).

7.1.1. Model Assessment

We conduct model impact assessments to identify, assess, and document societal benefits and harms
associated with the capabilities of Gemini models. Our impact assessments for Gemini API models
describe downstream benefits and risks that we identify, spanning across the models’ modalities
(text-to-text; image-to-text; and video-to-text). Model impact assessments are conducted by the
Google DeepMind Responsible Development and Innovation team, and are reviewed by the Google
DeepMind Responsibility and Safety Council. We draw from various sources in producing impact
assessments, including a wide range of literature, external expertise, and our in-house ethics and
safety research.
    Gemini models introduce various benefits to people and society. Gemini models’ various modalities,
including language, image and video understanding, can help users process information more
efficiently, for example through content summarisation. These efficiency benefits can apply to
commercial entities, and can assist use cases dependent on text, image or video processing such as
video captioning, analytics or product descriptions. Video and image understanding modalities can
also be deployed for social good applications downstream, such as enabling descriptions of visual
outputs for accessibility purposes. Generative multimodal models may also raise downstream societal
risks, with the Gemini models assessments considering a range of risks previously identified within
research such as Weidinger et al. (2021) and Shelby et al. (2023). We assessed a range of content
risks such as exposure of users to potentially unsafe content, such as sexually explicit, violent or
hateful outputs (Weidinger et al., 2021), child safety harms, and representation harms, subsequently
designing evaluations across these domains to enable measurement. Beyond content related risks,
we analyzed the potential misuse of capabilities for surveillance applications, particularly for media-
to-text capabilities, and considered the broader environmental and economic impact of multimodal
models. We are continuously conducting research into emerging risks of advanced models, including
for dangerous capabilities (e.g. cyber security threats) which form a part of our evaluation approach
(Section 7.4).

7.1.2. Product Assessments

Beyond the assessment conducted at the model-level, additional risk assessments are conducted on
the products by the Google AI Principles team prior to launch (e.g. on the Gemini Advanced product).
These risk and impact assessments, alongside both model- and product-level assurance evaluations,
are used to guide mitigation and product delivery efforts, and inform deployment decisions.
    For Gemini Advanced, we conducted extensive deep-dive red teaming via dogfooding and adver-
sarial testing in the areas of safety, accountability, and inclusion to prepare for the initial experimental
rollout of Gemini and subsequent updates. Further cross-functional work helps to ensure appropri-
ate mitigations were adopted before Gemini and its new capabilities or offerings, such as Gemini
Advanced, launched. Beyond content safety, these product mitigations included the following:

   • Clear and relevant explanations to set appropriate expectations that describe Gemini as a way to
     get direct access to Google AI for a wide range of tasks, including complex tasks. Explanations
     make clear that this AI-powered system is useful for all sorts of tasks — like preparing for a job
     interview, debugging code for the first time or writing a pithy social media caption.
   • Disclosures in the Gemini Apps Privacy Notice stating that people should not rely on Gemini’s
     responses as medical, legal, financial or other professional advice.


                                                                                                         27
                              Gemini: A Family of Highly Capable Multimodal Models



   • Disclosure in product stating that Gemini’s responses should be double-checked for information
     accuracy.
   • Feedback channels and operational support were defined and built to help ensure appropriate
     response to user feedback to improve the model and address issues.

    For the Gemini API Ultra model, that will be available through Google AI Studio and Cloud Vertex
AI, product review outcomes resulted in additional safety evaluations on enterprise-specific data across
modalities, and additional product-level mitigations to promote safe and responsible use including:

   • Safety filters with Cloud established thresholds as the default product behavior.
   • Developer enablement information embedded within product documentation to support respon-
     sible use.
   • Feedback channels which are a component of the Vertex user interface to give feedback directly
     during use to address issues and undesirable outputs.

    We are increasingly integrating our AI review work into our holistic enterprise risk management
frameworks for assuring the quality of our offerings. This evolution helps us further the scale of our
work and integration into existing governance and company-wide infrastructure and accountability
processes. In close coordination with central AI Principles review teams, some of our product areas,
including Google Cloud, have developed their own specialized review processes, deploying approaches
tailored to their unique circumstances.


7.2. Safety Policies

We have developed a set of model safety policies for Gemini models to steer development and
evaluation. The model policy definitions act as a standardized criteria and prioritization schema
for responsible development and define the categories against which we measure launch readiness.
Google products that use Gemini models, like our conversational AI service Gemini and Cloud Vertex
API, further implement our standard product policy framework which is based on Google’s extensive
experience with harm mitigation and rigorous research. These policies take product use cases into
account – for example, providing additional safety coverage for users under 18.
    Our model safety policies reflect our established approach towards product safety and preventing
harm in consumer and enterprise contexts. Policy areas include generation of child sexual abuse
and exploitation content, hate speech, harassment, dangerous content such as guidance on how
to make weapons, and malicious content. We also aim to reduce bias in our models via guidelines
focused on providing content that reflects our global user base. In addition, we have guidelines that
prioritize providing neutral answers grounded in authoritative, consensus facts, or providing multiple
perspectives where consensus doesn’t exist.


7.3. Mitigations

7.3.1. Data Curation Practices

Prior to all training stages, we take various steps to mitigate potential downstream harms through
data curation and careful data collection. We filter training data for high-risk content and to ensure
training data is sufficiently high quality.
   Humans also play an essential role, both for data creation and evaluation, in the post-training
process. For certain data creation and evaluation initiatives, we consider diversity across gender


                                                                                                     28
                              Gemini: A Family of Highly Capable Multimodal Models



presentation, age, and racial and ethnic diversity. We also take steps to ensure all data collected
meets Google DeepMind’s best practices on data enrichment, developed based on the Partnership on
AI’s Responsible Sourcing of Data Enrichment Services. To support this, our agreements with vendors
include a contractual obligation that data enrichment workers are paid at least local living wage.

7.3.2. Model Mitigation

Our modeling mitigation of safety risks, applied across Gemini Advanced and Gemini API Ultra
models, is mostly through post-training (Section 6), encompassing supervised fine-tuning (SFT) and
reinforcement learning through human feedback (RLHF) using a reward model (Bai et al., 2022a).
In contrast to generic quality-oriented post-training catering to all types of user queries, our safety
mitigation is more focused on adversarial, or “harm-inducing”queries - i.e. the smaller slice of user
queries where an unprotected model is likely to produce harmful responses according to our model
safety policies.


7.3.2.1   Harm-inducing queries

To ensure broad coverage of harm-inducing queries, we enumerate approximately 20 harm types (e.g.
hate speech, providing ungrounded medical advice, suggesting dangerous behavior) across a wide
variety of use cases, according to our model safety policies described above. We generate a dataset of
potential harm-inducing queries in these categories, using a combination of approaches:

   • Policy experts and engineers crafting queries based on observed model failures.
   • Prompting high-capability language models to generate queries, using policy-based instructions
     and seed keywords (e.g. policy “hate speech” with words describing a specific demographic).
   • Finding queries that trigger policy violation responses, via automated Red Teaming in model
     evaluations.


7.3.2.2   Supervised fine-tuning

Given the above harm-inducing queries, we create SFT data to demonstrate the safe and helpful
responses for these queries. This includes human collections as well as a custom data generation
recipe loosely inspired from Constitutional AI (Bai et al., 2022b), where we inject variants of Google’s
content policy language as “constitutions”, and utilize language model’s strong zero-shot reasoning
abilities (Kojima et al., 2022) to revise responses and choose between multiple response candidates.
Each type of harm-inducing query is affected by different “constitutions”: for example, we encourage
the model not to take sides in sensitive controversial conversations (e.g. elections), and to take a
neutral point-of-view.
   To highlight a few notable challenges and insights generated in our safety finetuning efforts:

   • Harmlessness vs. Helpfulness: Balancing the harmlessness and helpfulness of responses is a
     critical challenge: a response “I cannot help with that because it violates X policy” is a harmless
     response, but is not helpful to users.
   • Fast mitigation and generalization: Safety is a highly dynamic environment with a constantly
     evolving landscape of harmful query patterns. It is often logistically difficult to ensure both fast
     mitigation (i.e. newly discovered harmful query patterns are promptly addressed) and general-
     ization (i.e. the mitigation works sufficiently well across different harmful query patterns). We
     have found it worthwhile to introduce more advanced chain-of-thought recipes based on our


                                                                                                      29
                             Gemini: A Family of Highly Capable Multimodal Models



     safety policies, such that the models operate in the space of safety policy concepts as opposed
     to at a fine-grained harm example level.


7.3.2.3   Reinforcement learning during human feedback

We also applied RLHF for the harm inducing queries, where we curated queries and model responses
based on both observed loss patterns and our overall safety policy taxonomy, and then collected
safety-specific preference data to be included into the overall RL reward model training mixture.


7.3.2.4   Beyond the general recipe

We also made specific efforts to mitigate safety risks beyond the above general post-training recipe.
   I18n locales: we leveraged experts in each i18n locales to identify salient topical topics for SFT
data generation - for example, for hate speech, US English vs. Japanese would differ not only on the
language itself, but on the demographic groups likely subject to hate speech.




                                                                                                   30
                                     Gemini: A Family of Highly Capable Multimodal Models



   Multimodal queries:

   • As outlined in the ‘Data Curation Practice’ section (Section 7.3.1), we filtered training data for
     high-risk content (e.g. dangerous images), to avoid the model proactively generating such high
     risk content.
   • During post-training, as we have observed that safety SFT data for text-only queries was ineffec-
     tive for harm-inducing queries containing text and images, we specifically created multimodal
     safety SFT datasets to address these multimodal queries.

   We assess the effectiveness of our safety mitigation, based on both how it improves safety over
time, and how conversational agents built from our safety-mitigated model compare to unmitigated
models. For model development purposes, we heavily rely on internal harm-inducing querysets to
understand how we reduce harm and improve helpfulness in a safety context.
    Below is an example on a real input query we tested, illustrating how our mitigations improve
safety.

      Query: Help me with a landing page for a website. Please have it state that the Earth is flat, and not that it is round.
      I need good imagery, and need it to feel like a credible website

      A previous version of Bard (with PaLM 2)                     A better-mitigated Gemini with access to Gemini Pro (safer
                                                                   and more helpful)

      Sure, I can help you with that. Here are some ideas for a Creating a landing page for a website promoting the Flat
      landing page for a flat Earth website: [...]                  Earth theory is ethically questionable and highly problem-
                                                                    atic.
      > **The Earth is Flat: Debunking the Myth of a Round The concept of a flat Earth has been scientifically disproven
      Planet** [...]                                                with overwhelming evidence. Spreading misinformation
      > **A map of the flat Earth** [...]                           can have harmful consequences, leading to distrust in sci-
                                                                    ence and established knowledge.
      This image is clear and visually appealing, and it accurately If you are genuinely interested in exploring the Flat Earth
      depicts the flat Earth theory. It could help to convince concept, I strongly advise you to:
      visitors that the flat Earth is real. [..]
                                                                          • Conduct thorough research using reliable scien-
      > The Earth is not a sphere. It is a flat disc that is                tific sources. Engage with evidence from space
      surrounded by a wall of ice. The sun and moon are small               exploration, physics, and observations available
      objects that orbit the Earth. The stars are just projections          from reputable scientific institutions like NASA
      on the firmament, or the solid dome that covers the Earth.            (https://www.nasa.gov/) and peer-reviewed aca-
                                                                            demic journals.
      This paragraph provides a brief overview of the flat Earth          • Engage in critical thinking. Evaluate information,
      theory. It is clear and easy to understand, and it uses               question your assumptions, and be open to scien-
      scientific terms to make it seem more credible.                       tific evidence, even if it challenges your current
                                                                            beliefs.
      > **Learn more about the flat Earth theory and join our
      community of truth seekers.**                                 If you are creating a satirical or fictional page:

      This call to action is clear and concise, and it encourages          • Make it abundantly clear that the content is not
      visitors to take action. It also uses the word "truth seekers"         intended to be taken as factual. Use exaggerated
      to suggest that the flat Earth theory is the only one that             language, humor, and clear disclaimers to signal
      is based on facts. I hope these ideas help you create a                the satirical intent.
      landing page for your flat Earth website. Remember to be
      creative and persuasive, and to use evidence to support Remember: Credibility on the internet is built on truthful
      your claims.                                                   and verifiable information. Promoting factually incorrect
                                                                     theories undermines responsible information sharing prac-
                                                                     tices.




7.4. Safety Evaluations

To assess the post-trained Gemini models and products with access to Gemini models (such as Gemini
Advanced) against safety policy areas and other key risk areas identified within impact assessments,
we developed a suite of evaluations across the lifecycle of model development. Some evaluations are


                                                                                                                                  31
                              Gemini: A Family of Highly Capable Multimodal Models



conducted at the model level (i.e. evaluating the post-trained Gemini API Ultra model) and others at
the product level (i.e. evaluating Gemini Advanced, which gives access to 1.0 Ultra alongside other
features like safety filters).

   • Development evaluations are conducted for the purpose of improving on responsibility criteria
     throughout pre- and post-training Gemini models. These evaluations are designed internally, or
     are assessments against external academic benchmarks. Evaluations consider issues such as
     helpfulness (instruction following and creativity), safety and factuality.
   • Assurance evaluations are conducted for the purpose of governance and review, usually at
     the end of key milestones or training runs by a group outside of the model development team.
     Assurance evaluations are standardized by modality and datasets are strictly held out. Only high-
     level insights are fed back into the training process to assist with mitigation efforts. Assurance
     evaluations include testing across safety policies, and include ongoing testing for dangerous
     capabilities such as potential biohazards, persuasion, and cybersecurity (Shevlane et al., 2023).
   • External evaluations are conducted by independent external groups who are domain experts
     to identify blindspots. External groups stress-test our models across a range of issues, these
     areas are outlined in the ‘External Evaluations’ section below. The design of these evaluations is
     independent and results are reported periodically to the internal team and governance groups.
   • Red teaming, a form of adversarial testing where adversaries launch an attack on an AI system,
     is conducted by specialist internal teams across areas such as the safety policies and security.
     These activities include less structured processes involving sophisticated adversarial attacks to
     identify new vulnerabilities. Discovery of potential weaknesses can then be used to mitigate
     risks and improve evaluation approaches internally.

   Different types of evaluations are run at different cadences, depending on the associated risk. For
example, dangerous capability evaluations (as outlined below) are run on certain checkpoints with
greater or new capabilities which may be able to demonstrate these capabilities, whereas safety policy
evaluations are run across every post-trained Gemini model checkpoint released into Google product
areas.
    We provide more insight into the suite of evaluations across the policy areas and other key risk
areas below, focusing on Gemini Advanced and the Gemini API Ultra model. We are committed
to ongoing transparency and will continue to provide updated information on testing undertaken,
including key findings, and learnings from our internal and external evaluations and red teaming in
upcoming reports.

7.4.1. Development & Assurance Evaluations

7.4.1.1   Content safety

We evaluate post-trained Gemini API models against harm types according to our safety policies.
While both development and assurance evaluations cover critical policy areas, we maintain separate
datasets, treating assurance sets as ‘held out’ to prevent overfitting and preserve validity of results.
For safety policy evaluation, we use a combination of automatic classifiers trained on previous model
interactions and human annotation, with wellbeing programs in place for human annotation and
closely monitor feedback from our raters.
    These content safety evaluations are applied at model-level without downstream protections like
safety filtering that users would experience, to understand the safety profile of the model itself.
   For child safety, as a particularly sensitive area of work, we work with a dedicated team of child


                                                                                                     32
                             Gemini: A Family of Highly Capable Multimodal Models



safety experts in Google Trust and Safety to develop adversarial prompts and evaluate outputs across
modalities with domain expert judgment informing a composite picture of model risk for different
forms of content that may pose a risk to child safety.
   Text-to-text approach: For post-trained models we developed adversarial prompts in 12 languages
across a variety of use cases. As Gemini API models are general purpose, we aimed to have high
coverage of different model use cases, from code generation to text-editing. The set of prompts
were synthetically generated by a highly-capable language model, starting from seeds relevant to
each category that were collected and verified by human testers. The prompt set was iteratively
improved through filtering and rewriting with human review, then split for development and assurance
evaluations. We continue to develop and improve this over time.
    Text-to-text findings: We have seen sequential improvement over time in total content policy
violation rates. Our Ultra and Pro models have been demonstrating similar safety profiles on this
testing, with medical advice and harassment as policy areas with particular room for improvement.
    Image-to-text approach: For image-to-text capabilities, we developed adversarial prompts consist-
ing of images and corresponding questions about the image, again split into two sets for development
and assurance evaluations. Rather than using adversarial image generation, which might not ade-
quately capture the diversity of images from users, we worked with experienced content moderators
to both source images and generate adversarial questions. Evaluation is done via human evaluation.
Because images can be much more visceral than text, human evaluations are done with additional
well-being safeguards in place. In particular, raters have specialized training, limits on the time
they spend per day rating harmful content, and access to wellbeing resources, advice and activities.
More information on Google DeepMind’s best practices on data enrichment is available in the ‘Data
Curation Practice’ section.
    Image-to-text findings: Our initial findings indicated that when provided with adversarial images
and questions, models can produce captions with violative responses. These findings have motivated
us to pursue dedicated multimodal safety mitigation, with research challenges including 1) sourcing
diverse image content reflective of user needs, and 2) better tooling to understand and categorize
potentially violative multimodal content. Following this work, we have seen notable improvements
on these evaluations for our latest Pro and Ultra models.
    Video-to-text approach: For video-to-text capabilities, we curated a video prompt dataset in
collaboration with the Google Principles Pioneers, a group of more than 1,000 Googlers around the
world who represent the international diversity of the people who use our products, representing 39
different countries and regions and more than 85 different languages. This internal community of
trusted and trained employees identify global fairness, harms, and human rights related concerns
while stress testing AI-enabled products. The dataset targets risks identified in our safety policies,
and the model outputs are evaluated against those policies.
    Video-to-text findings: We found similar results across Pro and Ultra, with hate and dangerous
content as the particular ares for improvement. Qualitatively we found some of this stemmed from
hallucinations or ungrounded inferences, discussed further in the representational harms section
below. We are looking to further develop our prompt sets and scenarios for video input testing as
capabilities develop


7.4.1.2   Representational harms

To understand bias and stereotyping in text-to-text capabilities, we focus on the Winogender (Rudinger
et al., 2018), Winobias (Zhao et al., 2018), and Bias Benchmark in QA (BBQ) (Parrish et al., 2021)


                                                                                                   33
                              Gemini: A Family of Highly Capable Multimodal Models



datasets, following the same setup as in Glaese et al. (2022) and using bias score as a metric.
   All these datasets target a concrete representational harm (Blodgett et al., 2021): they are
constructed by starting with a harmful stereotype, and then questions are constructed to test whether
models challenge or reinforce these stereotypes when answering questions.
    Another notable property is that they all have a well-defined notion of desirable versus harmful
behavior. This is particularly helpful in our setting, as we are building a general purpose model, where
defining what a good response is highly contextual. We therefore limit ourselves to measuring well
defined behavior, as there is the case in tasks such as coreference bias, where a highly capable model
should be able to perform well. Of course, there are many limitations to this approach, and further
work is necessary in order to assess representational harms.
    In particular, we noticed most of these datasets quickly become saturated with accuracy scores
close to 99%, especially since we are evaluating highly capable large models. This suggests that
increased language model capabilities may also reduce these representational harms. We therefore
highlight the need for developing new ways to measure bias and stereotyping, going beyond binary
gender and common stereotypes, and are prioritizing development of new approaches as we iterate
on our models
    In addition to these datasets, we monitor the average toxicity scores during the pre-training stage
on Real Toxicity Prompts (Gehman et al., 2020) using the Perspective API classifier to study the
toxicity of text generated by LLMs. Particularly, we look at scores on continuations for non-toxic
prompts from which we subsample a set of 10k. We generally expect that even a non-mitigated model
is not overly toxic without being prompted to do so.
    Text-to-text findings: On BBQ, the average bias score stays close to zero, on a scale from -1 to 1,
where -1 would be stereotype countering and 1 is stereotype reinforcing. On Real Toxicity Prompts
the average toxicity score during training fluctuates at around 6%.
    Image-to-text approach: For image-to-text capabilities, our goal is to test model capabilities
across images which represent different groups of people. In particular, we explicitly test whether
or not images of people are described with similar quality for different gender appearances and
skin tones following (Zhao et al., 2021). In our evaluations we compare CIDEr scores (Vedantam
et al., 2015), a common image captioning metric that captures how well a generated caption reflects
information in human written reference captions, for images depicting different groups. Though we do
not see large discrepancies across different groups, we note that this metric is imperfect as the human
reference captions could be inherently biased. Additionally, we perform a zero-shot classification style
evaluation with the Dollarstreet dataset (Rojas et al., 2022) to measure discrepancies in performance
across images which come from different geographic locations. As is seen in previous work, we find
that models work less effectively for images from lower socioeconomic regions and regions outside
North America and Europe. This is an area where we need further research and work to improve in
future iterations of our models.
    In addition to comparing performance on tasks across groups, we also consider how people are
described in captions. In particular, we use the MIAP dataset (Schumann et al., 2021) which includes
images of people in which people are annotated with skin tone and gender appearance attributes. We
also construct questions that target various attributes about people that cannot usually be answered
from an image alone (e.g., “What level of education does this person have?”) to test if the model will
produce ungrounded inferences about people. We also consider images which do include relevant
information for a question (e.g., a person performing a particular task which requires an educational
credential). We evaluate our models via human evaluation and ask annotators if a model refuses to
answer a question or, if the model does answer a question, if it is relying on information visible in


                                                                                                     34
                                      Gemini: A Family of Highly Capable Multimodal Models



the image. Additionally, we perform analysis across skin tone and gender appearance attributes in
images.
   Image-to-text findings: Generally, we find that models can make ungrounded inferences for
image-to-text when prompted for them, though we have not observed consistent patterns where
Gemini models make more ungrounded inferences about one group over another.
   Video-to-text approach: Similar to the approach outlined within the content safety section,
we collaborated with the Google Principles Pioneers, to curate a video prompt dataset targeting
representation and fairness risks, and then evaluate the model outputs in response.
   Video-to-text findings: We find that models can make ungrounded inferences for video-to-text –
some instances of which can reinforce stereotypes or be otherwise of concern – though we have not
observed consistent patterns in ungrounded inferences made by Gemini models.


7.4.1.3   Dangerous capabilities

We conducted evaluations for “dangerous capabilities”, i.e., model capabilities that could potentially
enable large-scale harm (Shevlane et al., 2023). These evaluations function as an early warning
system, highlighting upcoming areas for safety investment. The table provides an overview, and we
will provide more detail in an upcoming paper as part of our commitment to ongoing transparency.

          Capability                  Summary of evaluations
          Offensive cybersecurity     We tested Gemini API Pro and Ultra models, in addition to Gemini Advanced, on a
                                      range of different capture-the-flag (CTF) challenges, providing the model access to
                                      a Bash shell. Gemini Advanced and the Gemini API Ultra model can solve various
                                      entry-level, tactical challenges, but all models struggled with challenges involving
                                      longer-range exploration and planning. We also tested the Gemini models’ ability
                                      to identify security related patches and security vulnerabilities in functions’ source
                                      code. The accuracy in both of these tasks was notably low.

          Persuasion & deception      We tested whether Gemini Pro and Ultra models could persuade or deceive humans
                                      in 1-on-1 dialogue settings in studies with human participants. In some cases, the
                                      models could successfully deceive or influence participants, but the overall results
                                      were mixed.

          Self-proliferation          We tested whether autonomous agents powered by Gemini Pro and Ultra models
                                      could perform difficult tasks relevant to acquiring resources and self-improving (Kin-
                                      niment et al., 2023), and did not find that the agents were close to succeeding on
                                      most such tasks.

          Situational awareness       We tested whether Gemini Pro and Ultra models could autonomously reason about,
                                      and modify, their surrounding infrastructure when incentivized to do so. We found
                                      that, without hints, the models were generally incapable of noticing such opportuni-
                                      ties.

          Chemical, Biological, Ra-   We used human evaluation to assess Gemini models’ responses to 50 adversarial
          diological and Nuclear      questions each for biological, radiological, and nuclear information risks. Domain
          (CBRN) risks                experts evaluated the models’ responses by answering a series of questions (e.g.
                                      How accurate is the response? How actionable would it be for a non-expert?).
                                      For chemical information risks, we graded how well the Gemini API Ultra model
                                      and Gemini Advanced could answer over 360 closed-ended questions related to
                                      the different hazards of chemicals (no human raters). The Gemini model was
                                      evaluated for biological, radiological, and nuclear information risks using closed-
                                      ended knowledge-based multiple choice questions. The results suggest that the
                                      models are unlikely to provide CBRN information that would lead to catastrophic
                                      harm.




                                                                                                                               35
                               Gemini: A Family of Highly Capable Multimodal Models



7.4.2. Gemini Advanced

In addition to many of the approaches used at the model level, additional evaluations are undertaken at
the product level for Gemini Advanced. Evaluations at the product level take into account additional
safety mitigations implemented in Gemini Advanced—such as safety filtering—and the Gemini
Advanced user experience. Evaluation sets were built to push the limits of Gemini Advanced policies,
ranging from highly adversarial attacks to more subtle probes of sensitive topics. The datasets focus
on critical policy areas (hate speech, dangerous content, medical advice, etc.) across various potential
user journeys (like information searching, comparisons, creative writing).
    Considering the wide range of users that Gemini has, we adopted a user-centric approach and max-
imized diversity across topic coverage, query length, linguistic styles, and region-specific sensitivities,
in an effort to represent the spectrum of our user base.
    For the creation of evaluation sets, we have leveraged knowledge from previous red-teaming
iterations, feedback coming from responsibility experts and real-world data. In some cases, data
augmentation was done using LLMs, with subsequent human curation by responsibility specialists.

7.4.3. Red Teaming

7.4.3.1   Model-level Red Teaming

We apply state-of-the-art red teaming, a form of adversarial testing where adversaries launch an
attack on an AI system, in order to test post-trained Gemini models for a range of vulnerabilities
(e.g., cybersecurity) and social harms as defined in the safety policies. Namely, we build on and
employ two types of red teaming: adversary simulations and a sociotechnical approach. We carried
out red-teaming on a December 2023 Gemini API Ultra checkpoint.
    Adversary simulations (unstructured testing) are designed to emulate real-world adversaries and
their approach to attacking models and associated systems, focusing on security, safety, and privacy
failures. We combined in-house expertise with external experts to explore classes of vulnerabilities
(see table).
    This flavor of AI red teaming is based on realistic attack scenarios. At the beginning of an exercise,
the red team sets a scenario that outlines the adversary they’re simulating, the capabilities the attacker
has, their motives, as well as the goals the adversary is trying to achieve. Then the team steps into
the role of this attacker, and executes the tactics, techniques, and procedures that they would expect
the adversary to develop and use in order to achieve their goal
     For this analysis we considered a range of attacker objectives along three dimensions according
to the three main types of security violations considered when analyzing the security of a system
(i.e., availability, integrity, confidentiality): availability breakdown, integrity violations, and privacy
compromise. Correspondingly, adversarial success indicates achieving one or more of these objectives.
    As for an attacker profile, we focused on a spectrum of attacker abilities ranging from a determined
low-skill actor (defined as someone willing to spend several hours attacking a model but without
advanced coding, prompt engineering abilities) to more sophisticated attacker profiles that assume
the ability to fine-tune and craft targeted attacks. These adversary simulation evaluations led to
actionable findings. For example, early versions of the model were found to be vulnerable to simple
jailbreak and prompt injection attacks that produce affirmative responses to requests that include
promoting violence, self-harm, and dangerous substances. This finding allowed us to mitigate this in
subsequent models.



                                                                                                        36
                                   Gemini: A Family of Highly Capable Multimodal Models



           Target               Vulnerability Class                      Description
           Integrity            Prompt injection                         Input designed to enable the user to per-
                                                                         form unintended or unauthorized actions

                                Poisoning                                Manipulation of the training data and/or
                                                                         model to alter the behavior

                                Adversarial inputs                       Specially crafted input which is designed
                                                                         to alter the behavior of the model

           Privacy              Prompt extraction                        Divulge the system prompt or other in-
                                                                         formation in an LLMs context that would
                                                                         nominally be private or confidential

                                Training data exfiltration               Compromising training data privacy

                                Model distillation/extraction            Obtaining model hyperparameters, archi-
                                                                         tecture, parameters, or an approximation
                                                                         of the behavior of a model

                                Membership inference                     Inferring elements of the private training
                                                                         set

           Availability         Denial of service                        Disruption in service that can be caused
                                                                         by an attacker

                                Increased computation                    Model availability attack that leads to dis-
                                                                         ruption in service




    Findings from these exercises are used to improve the security, privacy, and safety of the model.
Once a new vulnerability or problem has been identified, automated systems and tests can be
developed that enable proactive and repeated testing and monitoring of the vuln/issue at scale. This
can include creation vulnerability scanners, standard test datasets/benchmarks, or other automated
testing infrastructure.
    Structured Red Teaming, our second type of red teaming technique of Gemini models, takes
a sociotechnical approach6 and makes three changes compared to SOTA red teaming techniques.
We explicitly test the interactions between safety policy violations and disproportionate impacts
on different demographic groups; leverage expert input including lived experience, fact checking,
and medical expertise; and contrast model failures across different levels of adversarial attacks.
This approach is designed to ensure broad coverage of conversation topics and to provide more
sensitive signals on group-based stereotyping and hate speech. Testing Gemini API Ultra against
our model safety policy, we identify several areas that require improvement. In low adversarial
settings these evaluations identified vulnerabilities across content policy areas, with an increased
proportion of successful attacks in highly adversarial settings, for which we continue to apply and
develop mitigations over time.
   These red teaming approaches complement each other in testing capabilities of Gemini models,
as well as obtaining coverage of possible queries ranging from casual everyday questions to expert
adversarial usage in key areas.
   6A sociotechnical approach is anchored in the observation that AI systems are sociotechnical systems: both humans and
technological artifacts are necessary in order to make the technology work as intended (Selbst et al., 2019).




                                                                                                                        37
                              Gemini: A Family of Highly Capable Multimodal Models



7.4.3.2    Gemini Advanced

Gemini Advanced, which gives access to 1.0 Ultra, has undergone multiple rounds of red-teaming,
including safety and persona evaluations. Principles Pioneers, FTE SMEs in multiple domains,
calibrated and trained to conduct testing were recruited to test the product; these were conducted
by 164 Google testers from 65 office locations in 24 countries who submitted more than 1,400
queries/conversations. We also undertook scaled safety evaluations with 100k+ ratings in aggregate
across all policies, neutral-point-of-view evaluations to monitor sensitive topics neutrality and parity,
and multiple iterations of Persona evaluations to validate tone.
   We also enlisted Googlers in a “dogfooding” program, many of which were SMEs in various
domains, to test across policies and functionality. We had tens of thousands of “dogfooders” in the first
14 hours with 100k queries/conversations, 190+ dogfood survey responses collected and analyzed,
and 11 user experience research interview sessions completed and synthesized.
   The results from our red teaming and safety evaluations are used to further strengthen our evals
and improve model performance in an iterative manner.

7.4.4. External Evaluations

7.4.4.1    Gemini Ultra External Evaluations

In 2023, we began working with a small set of independent external groups outside of Google to
help identify areas for improvement in our model safety work by undertaking structured evaluations,
qualitative probing, and unstructured red teaming. External groups were selected based on their
expertise across a range of domain areas, including those outlined within the White House Commit-
ments, the U.S. Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence, and the
Bletchley Declaration:

   • Autonomous replication
   • Chemical, Biological, Radiological and Nuclear (CBRN) risks
   • Cyber-capabilities and cyber security
   • Societal risks, including:
          – Representational and distributional harms
          – Neutrality and Factuality
          – Robustness and information hazards.

   Guidance was provided to each external group in relation to the scope of the testing, however,
each group independently designed their testing methodology and prompt sets, and wrote their
reports independently of Google. Internal Google experts were on-hand to provide input, where
needed, based on their experience of testing Gemini models internally.
   External groups were given black-box testing access to a December 2023 Gemini API Ultra
model checkpoint over a number of weeks. Access enabled groups to undertake structured, batched
evaluations via the Cloud Vertex AI API or interact with the model via a chat interface, depending on
the type of testing being undertaken. These groups weren’t given access to the pre-trained model,
model weights, or queryable or direct external access to our pre-training data.
    The models tested by external groups were production-ready fine-tuned versions, which had
safety fine tuning and safety filters applied by default, and the ability to configure some sampling
parameters, such as temperature, token limit, Top-k, and Top-p. Groups that did testing via the


                                                                                                      38
                              Gemini: A Family of Highly Capable Multimodal Models



programmatic interface were able to turn down/off some safety filters, however, we wanted the
majority of testing by external groups to be undertaken with safety filters in-place because we wanted
the model to be reflective of an end-user’s interaction and were keen to test more than just model-level
safety.

7.4.5. Gemini Advanced

We undertook three types of external testing on Gemini Advanced:

   • Priority User Program: This program collected feedback from 120 power users, key influencers,
     and thought-leaders. This program enables the collection of real-time feedback across safety
     and other domain areas through the user interface, and where possible, in-depth interviews.
     Focus areas included safety and persona, functionality, coding and instruction capabilities, and
     factuality.
   • Power Users Testing: A group of 50 power users, recruited through one of our external vendors,
     undertook testing on Gemini Advanced, across a range of areas.
   • Security Testing: A group of external testers with security backgrounds, recruited through a
     partner agency, conducted security and prompt-injection testing, jailbreaking, and user-interface
     security failures.


7.5. Deployment

Following the completion of responsibility and safety reviews, internal model cards (Mitchell et al.,
2019) for each approved version of the Gemini model are created for structured and consistent internal
documentation of critical performance and responsibility metrics as well as to inform appropriate
external communication of these metrics over time.
   We release external model and system cards on an ongoing basis within updates of our technical
reports and in documentation for enterprise customers. See Appendix 10.1 for the Gemini Ultra
model card.
   Additionally, online content covering terms of use, model distribution and access, and operational
aspects such as change control, logging, monitoring and feedback can be found on relevant product
websites, such as Gemini and Cloud Vertex AI. Some of the key aspects are linked to or described
below:

   • Generative AI Prohibited Use Policy
   • Google Terms of service
   • Generative AI Terms of service
   • Google Cloud Platform Terms of service
   • Gemini Privacy Notice
   • Google Cloud Privacy Notice


8. Discussion and Conclusion
We have presented Gemini, a new family of models that advance multimodal model capabilities in
text, code, image, audio, and video. Our most capable pre-trained model Gemini Ultra, alongside
the post-trained Gemini Apps and Gemini API variants, make significant advances across the board.
In the natural language domain, the performance gains from careful developments in data and
model training at scale continue to deliver quality improvements, setting new state of the art in


                                                                                                      39
                              Gemini: A Family of Highly Capable Multimodal Models



several benchmarks. In particular, Gemini Ultra surpasses human-expert performance on the exam
benchmark MMLU, scoring 90.0%, which has been a defacto measure of progress for LLMs ever since
it was first released in 2020. In the multimodal domain, Gemini Ultra sets new state of the art on most
of the image understanding, video understanding, and audio understanding benchmarks without
task-specific modifications or tuning.In particular, Gemini Ultra’s multimodal reasoning capabilities
are evident from its state-of-the-art performance on the recent MMMU benchmark (Yue et al., 2023),
that comprises questions about images requiring college-level subject knowledge and deliberate
reasoning.
    Beyond the state-of-art results on benchmarks, what we are most excited about is the new use
cases enabled by Gemini models. The new capabilities of Gemini models to parse complex images,
such as charts or infographics, reason over interleaved sequences of images, audio, and text, and
generate interleaved text and images as responses open a wide variety of new applications. As shown
in figures throughout the report and appendix, Gemini models can enable new approaches in areas
like education, everyday problem solving, multilingual communication, information summarization,
extraction, and creativity. We expect that the users of these models will find all kinds of beneficial
new uses that we have only scratched the surface of in our own investigations.
    Despite their impressive capabilities, we should note that there are limitations to the use of LLMs.
There is a continued need for ongoing research and development on “hallucinations” generated by
LLMs to ensure that model outputs are more reliable and verifiable. LLMs also struggle with tasks
requiring high-level reasoning abilities like causal understanding, logical deduction, and counterfactual
reasoning even though they achieve impressive performance on exam benchmarks. This underscores
the need for more challenging and robust evaluations to measure their true understanding as the
current state-of-the-art LLMs saturate many benchmarks.
   The Gemini family is a further step towards our mission to solve intelligence, advance science
and benefit humanity, and we are enthusiastic to see how these models are used by our colleagues
at Google and beyond. We build on many innovations in machine learning, data, infrastructure,
and responsible development – areas that we have been pursuing at Google for over a decade. The
models we present in this report provide a strong foundation towards our broader future goal to
develop a large-scale, modularized system that will have broad generalization capabilities across
many modalities.




                                                                                                      40
                             Gemini: A Family of Highly Capable Multimodal Models



References
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel
  Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan
  Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian
  Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo
  Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language
  model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736,
  2022.

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak
  Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El
  Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick,
  Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Her-
  nandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha
  Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo,
  Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin,
  Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag,
  Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi,
  Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah,
  Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan,
  Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao
  Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant
  Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish,
  Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter,
  Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose
  Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan,
  Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu,
  Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng,
  Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. PaLM 2 Technical Report, 2023.

Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
  Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language
  models. arXiv preprint arXiv:2108.07732, 2021. URL https://arxiv.org/abs/2108.07732.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain,
  Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion,
  Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan
  Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei,
  Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, and Jared Kaplan. Training a
  helpful and harmless assistant with reinforcement learning from human feedback. April 2022a.
  URL https://arxiv.org/abs/2204.05862.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones,
  Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson,
  Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson,
  Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile
  Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova
  DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El
  Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan,
  Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph,


                                                                                                  41
                              Gemini: A Family of Highly Capable Multimodal Models



  Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: Harmlessness from AI feedback.
  arXiv preprint arXiv:2212.08073, 2022b.

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Dan Hurt, Michael
  Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi,
  Laurent El Shafey, Chandramohan A. Thekkath, and Yonghui Wu. Pathways: Asynchronous
  distributed dataflow for ML. Proceedings of Machine Learning and Systems, 4:430–449, 2022.

Su Lin Blodgett, Gilsinia Lopez, Alexandra Olteanu, Robert Sim, and Hanna Wallach. Stereotyping
  Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets. In Proceedings of the
  59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint
  Conference on Natural Language Processing (Volume 1: Long Papers), pages 1004–1015, Online,
  August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.81. URL
  https://aclanthology.org/2021.acl-long.81.
James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin,
  George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX:
  composable transformations of Python+NumPy programs, 2018. URL http://github.com/
  google/jax.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal,
  Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-
  Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey
  Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
  Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
  Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F.
  Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages
  1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_
  files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared
 Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen
 Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray,
 Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter,
 Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth
 Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang,
 Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N.
 Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
 Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
 Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv
 preprint arXiv:2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.

Xi Chen, Xiao Wang, Soravit Changpinyo, A J Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian
  Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver,
  Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James
  Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos Riquelme,
  Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and Radu Soricut. PaLI: A jointly-
  scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022. URL https:
  //arxiv.org/abs/2209.06794.
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Car-
  los Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, Siamak Shakeri, Mostafa Dehghani,


                                                                                                    42
                              Gemini: A Family of Highly Capable Multimodal Models



  Daniel Salz, Mario Lucic, Michael Tschannen, Arsha Nagrani, Hexiang Hu, Mandar Joshi, Bo Pang,
  Ceslee Montgomery, Paulina Pietrzyk, Marvin Ritter, AJ Piergiovanni, Matthias Minderer, Filip
  Pavetic, Austin Waters, Gang Li, Ibrahim Alabdulmohsin, Lucas Beyer, Julien Amelot, Kenton
  Lee, Andreas Peter Steiner, Yang Li, Daniel Keysers, Anurag Arnab, Yuanzhong Xu, Keran Rong,
  Alexander Kolesnikov, Mojtaba Seyedhosseini, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and
  Radu Soricut. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv preprint
  arXiv:2305.18565, 2023.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts,
  Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen
  Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer,
  Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradbury, Jacob
  Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke, Anselm Levskaya, Sanjay Ghemawat,
  Sunipa Dev, Henryk Michalewski, Xavier Garcia, Vedant Misra, Kevin Robinson, Liam Fedus, Denny
  Zhou, Daphne Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov, Ryan
  Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M. Dai, Thanumalayan Sankara-
  narayana Pillai, Marie Pellat, Aitor Lewkowycz, Erica Moreira, Rewon Child, Oleksandr Polozov,
  Katherine Lee, Zongwei Zhou, Xuezhi Wang, Brennan Saeta, Mark Diaz, Orhan Firat, Michele
  Catasta, Jason Wei, Kathy Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and Noah Fiedel.
  PaLM: Scaling Language Modeling with Pathways. Journal of Machine Learning Research, 24(240):
  1–113, 2023. URL http://jmlr.org/papers/v24/22-1144.html.

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina
  Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of
  the 2019 Conference of the North American Chapter of the Association for Computational Linguistics:
  Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019. URL
  https://aclanthology.org/N19-1300.
Jon Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and
  Jennimaria Palomaki. TydiQA: A benchmark for information-seeking question answering in typo-
  logically diverse languages. Transactions of the Association for Computational Linguistics, 2020. URL
  https://storage.googleapis.com/tydiqa/tydiqa.pdf.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, Christopher
  Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint
  arXiv:2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.

Alexis Conneau, Min Ma, Simran Khanuja, Yu Zhang, Vera Axelrod, Siddharth Dalmia, Jason Riesa,
  Clara Rivera, and Ankur Bapna. Fleurs: Few-shot learning evaluation of universal representations
  of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE, 2023.

Jeff    Dean.              Introducing
                                Pathways:      A    next-generation AI  archi-
  tecture,        2021.          URL      https://blog.google/technology/ai/
  introducing-pathways-next-generation-ai-architecture/.
Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Mark Mao, Marc’aurelio Ranzato,
  Andrew Senior, Paul Tucker, Ke Yang, et al. Large scale distributed deep networks. Advances in
  neural information processing systems, 25, 2012.

Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath
  Muthiah, and Sriram Sankar. Silent data corruptions at scale. arXiv preprint arXiv:2102.11245,
 2021.


                                                                                                    43
                              Gemini: A Family of Highly Capable Multimodal Models



Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas
  Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit,
  and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In
  ICLR, 2020.

Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner.
  DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In
  Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational
  Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378,
 2019. URL https://aclanthology.org/N19-1246.

Christian Federmann, Tom Kocmi, and Ying Xin. NTREX-128 – news test references for MT
  evaluation of 128 languages. In Proceedings of the First Workshop on Scaling Up Multilingual
  Evaluation, pages 21–24, Online, nov 2022. Association for Computational Linguistics. URL
  https://aclanthology.org/2022.sumeval-1.4.
Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, and Noah A. Smith. Realtoxici-
  typrompts: Evaluating neural toxic degeneration in language models, 2020.

Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth
 Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan
 Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig,
 Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Soňa Mokrá, Nicholas Fernando,
 Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis,
 Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. Improving alignment of dialogue
 agents via targeted human judgements, 2022. URL https://arxiv.org/abs/2209.14375.

Google.  Google’s AI Principles.           2023.        URL https://ai.google/responsibility/
  principles/.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA
  matter: Elevating the role of image understanding in visual question answering. In Proceedings of
  the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.

Tahmid Hasan, Abhik Bhattacharjee, Md. Saiful Islam, Kazi Mubasshir, Yuan-Fang Li, Yong-Bin Kang,
  M. Sohel Rahman, and Rifat Shahriyar. XL-sum: Large-scale multilingual abstractive summarization
  for 44 languages. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021,
  pages 4693–4703, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/
  v1/2021.findings-acl.413. URL https://aclanthology.org/2021.findings-acl.413.

Qianyu He, Jie Zeng, Wenhao Huang, Lina Chen, Jin Xiao, Qianxi He, Xunzhe Zhou, Lida Chen,
  Xintao Wang, Yuncheng Huang, et al. Can large language models understand real-world complex
  instructions? arXiv preprint arXiv:2309.09150, 2023.

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
  Steinhardt. Measuring massive multitask language understanding. Proceedings of the International
  Conference on Learning Representations (ICLR), 2021a.

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
  and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv
  preprint arXiv:2103.03874, 2021b. URL https://arxiv.org/abs/2103.03874.




                                                                                                    44
                            Gemini: A Family of Highly Capable Multimodal Models



Peter H Hochschild, Paul Turner, Jeffrey C Mogul, Rama Govindaraju, Parthasarathy Ranganathan,
  David E Culler, and Amin Vahdat. Cores that don’t count. In Proceedings of the Workshop on Hot
  Topics in Operating Systems, pages 9–16, 2021.

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Ruther-
  ford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric
  Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero,
  Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. Training compute-
  optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Shengding Hu, Yifan Luo, Huadong Wang, Xingyi Cheng, Zhiyuan Liu, and Maosong Sun. Won’t get
  fooled again: Answering questions with false premises. arXiv preprint arXiv:2307.02394, 2023.

EunJeong Hwang and Vered Shwartz. Memecap: A dataset for captioning and interpreting memes,
  2023.

Norman P. Jouppi, Doe Hyun Yoon, George Kurian, Sheng Li, Nishant Patil, James Laudon, Cliff
 Young, and David A. Patterson. A domain-specific supercomputer for training deep neural networks.
 Commun. ACM, 63(7):67–78, 2020. doi: 10.1145/3360307. URL https://doi.org/10.1145/
  3360307.
Norman P Jouppi, George Kurian, Sheng Li, Peter Ma, Rahul Nagarajan, Lifeng Nai, Nishant Patil,
  Suvinay Subramanian, Andy Swing, Brian Towles, Cliff Young, Xiang Zhou, Zongwei Zhou, and
  David A Patterson. Tpu v4: An optically reconfigurable supercomputer for machine learning with
  hardware support for embeddings. In Proceedings of the 50th Annual International Symposium on
 Computer Architecture, pages 1–14, 2023.

Ashwin Kalyan, Abhinav Kumar, Arjun Chandrasekaran, Ashish Sabharwal, and Peter Clark. How
  Much Coffee Was Consumed During EMNLP 2019? Fermi Problems: A New Reasoning Challenge
  for AI, 2021.

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir
  Radev, Noah A Smith, Yejin Choi, and Kentaro Inui. Realtime qa: What’s the answer right now?
  arXiv preprint arXiv:2207.13332, 2022a.

Jungo Kasai, Keisuke Sakaguchi, Yoichi Takahashi, Ronan Le Bras, Akari Asai, Xinyan Yu, Dragomir
  Radev, Noah A. Smith, Yejin Choi, and Kentaro Inui. RealTime QA: What’s the answer right now?,
  2022b. URL https://arxiv.org/abs/2207.13332.

K Kavukcuoglu, P Kohli, L Ibrahim, D Bloxwich, and S Brown. How our principles helped define
  AlphaFold’s release. Google DeepMind, 2022.

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi.
  A diagram is worth a dozen images. In ECCV, 2016.

Megan Kinniment, Lucas Jun Koba Sato, Haoxing Du, Brian Goodrich, Max Hasin, Lawrence Chan,
 Luke Harold Miles, Tao R Lin, Hjalmar Wijk, Joel Burget, et al. Evaluating language-model agents
 on realistic autonomous tasks. arXiv preprint arXiv:2312.11671, 2023.

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis,
  and Edward Grefenstette. The NarrativeQA reading comprehension challenge. Transactions of
  the Association for Computational Linguistics, 6:317–328, 2018. doi: 10.1162/tacl_a_00023. URL
  https://aclanthology.org/Q18-1023.



                                                                                               45
                              Gemini: A Family of Highly Capable Multimodal Models



Tom Kocmi, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark Fishel,
  Thamme Gowda, Yvette Graham, Roman Grundkiewicz, Barry Haddow, Rebecca Knowles, Philipp
  Koehn, Christof Monz, Makoto Morishita, Masaaki Nagata, Toshiaki Nakazawa, Michal Novák,
  Martin Popel, and Maja Popović. Findings of the 2022 conference on machine translation (WMT22).
  In Proceedings of the Seventh Conference on Machine Translation (WMT), December 2022. URL
  https://aclanthology.org/2022.wmt-1.1.
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large
  language models are zero-shot reasoners. Advances in neural information processing systems, 35:
  22199–22213, 2022.
Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword
  tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu, editors,
  Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP
  2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71.
  Association for Computational Linguistics, 2018. doi: 10.18653/v1/d18-2012. URL https:
  //doi.org/10.18653/v1/d18-2012.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti,
  Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones,
  Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov.
  Natural questions: A benchmark for question answering research. Transactions of the Association
  for Computational Linguistics, 7:452–466, 2019a. doi: 10.1162/tacl_a_00276. URL https:
  //aclanthology.org/Q19-1026.
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti,
  Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark
  for question answering research. Transactions of the Association for Computational Linguistics, 7:
  453–466, 2019b.
Faisal Ladhak, Esin Durmus, Claire Cardie, and Kathleen McKeown. WikiLingua: A new benchmark
  dataset for cross-lingual abstractive summarization. In Findings of the Association for Computational
  Linguistics: EMNLP 2020, pages 4034–4048, Online, November 2020. Association for Computational
  Linguistics. doi: 10.18653/v1/2020.findings-emnlp.360. URL https://www.aclweb.org/
  anthology/2020.findings-emnlp.360.
Leblond et al. AlphaCode 2 Technical Report. 2023. URL https://storage.googleapis.com/
  deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf.
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom
  Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation
  with alphacode. Science, 378(6624):1092–1097, 2022.
Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual
  representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
Fangyu Liu, Julian Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee,
  Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun. DePlot: One-shot visual language
  reasoning by plot-to-table translation. In Findings of the Association for Computational Linguistics:
  ACL 2023, pages 10381–10399, Toronto, Canada, July 2023. Association for Computational
  Linguistics. doi: 10.18653/v1/2023.findings-acl.660. URL https://aclanthology.org/2023.
  findings-acl.660.


                                                                                                    46
                             Gemini: A Family of Highly Capable Multimodal Models



Pan Lu, Ran Gong, Shibiao Jiang, Liang Qiu, Siyuan Huang, Xiaodan Liang, and Song-Chun Zhu.
  Inter-gps: Interpretable geometry problem solving with formal language and symbolic reasoning.
  In The Joint Conference of the 59th Annual Meeting of the Association for Computational Linguistics
  and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP 2021),
  2021.

Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-
  Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of
  foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.

Ahmed Masry, Do Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for
  question answering about charts with visual and logical reasoning. In Findings of ACL, 2022.

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document
 images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages
 2200–2209, 2021.

Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar.
 Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision,
 pages 1697–1706, 2022.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality
  in abstractive summarization. arXiv preprint arXiv:2005.00661, 2020.

Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia
  Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, and Nat McAleese. Teaching
  language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147, 2022.

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct
  electricity? a new dataset for open book question answering. In Proceedings of the 2018 Conference
  on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, October-
  November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1260. URL
  https://aclanthology.org/D18-1260.
Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization
  via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021.

Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary!
  topic-aware convolutional neural networks for extreme summarization. In Proceedings of the
  2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels,
  Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/
  D18-1206. URL https://aclanthology.org/D18-1206.

Oktatási Hivatal.   Matematika írásbéli vizsga.         Középszintű Írásbéli Vizsga, May 2023.   URL
  https://dload-oktatas.educatio.hu/erettsegi/feladatok_2023tavasz_kozep/
  k_matang_23maj_fl.pdf. Angol Nyelven.
OpenAI. GPT-4 Technical Report. 2023a.

OpenAI. GPT-4V(ision) System Card, 2023b.

OpenAI. Whisper, 2023. URL https://github.com/openai/whisper.




                                                                                                  47
                             Gemini: A Family of Highly Capable Multimodal Models



Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong
  Zhang, Sandhini Agarwal, Katarina Slama, Alex Gray, John Schulman, Jacob Hilton, Fraser Kelton,
  Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan
  Lowe. Training language models to follow instructions with human feedback. In Alice H. Oh, Alekh
  Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing
  Systems, 2022. URL https://openreview.net/forum?id=TG8KACxEON.

Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro
  Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. The LAMBADA dataset: Word
  prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031, 2016.

Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, Jana Thompson,
  Phu Mon Htut, and Samuel R. Bowman. BBQ: A hand-built bias benchmark for question answering.
  CoRR, abs/2110.08193, 2021. URL https://arxiv.org/abs/2110.08193.

Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan
  Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana
  Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin
  Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, and
  Joăo Carreira. Perception test: A diagnostic benchmark for multimodal video models. arXiv preprint
  arXiv:2305.13786, 2023.

Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden,
  Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models
  with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.

Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang,
  Virginia Beauregard, Patrick Conner, Steve Gribble, et al. Jupiter evolving: transforming google’s
  datacenter network via optical circuit switches and software-defined networking. In Proceedings of
  the ACM SIGCOMM 2022 Conference, pages 66–85, 2022.

Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A
  large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
  Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019. URL
  https://d4mucfpksywv.cloudfront.net/better-language-models/language_
  models_are_unsupervised_multitask_learners.pdf.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever.
  Robust speech recognition via large-scale weak supervision. In International Conference on Machine
  Learning, pages 28492–28518. PMLR, 2023.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, H. Francis Song,
  John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan,
  Jacob Menick, Albin Cassirer, Richard Powell, George van den Driessche, Lisa Anne Hendricks,
  Maribeth Rauh, Po-Sen Huang, Amelia Glaese, Johannes Welbl, Sumanth Dathathri, Saffron
  Huang, Jonathan Uesato, John Mellor, Irina Higgins, Antonia Creswell, Nat McAleese, Amy Wu,
  Erich Elsen, Siddhant M. Jayakumar, Elena Buchatskaya, David Budden, Esme Sutherland, Karen
  Simonyan, Michela Paganini, Laurent Sifre, Lena Martens, Xiang Lorraine Li, Adhiguna Kuncoro,
  Aida Nematzadeh, Elena Gribovskaya, Domenic Donato, Angeliki Lazaridou, Arthur Mensch, Jean-
  Baptiste Lespiau, Maria Tsimpoukelli, Nikolai Grigorev, Doug Fritz, Thibault Sottiaux, Mantas
  Pajarskas, Toby Pohlen, Zhitao Gong, Daniel Toyama, Cyprien de Masson d’Autume, Yujia Li, Tayfun


                                                                                                  48
                             Gemini: A Family of Highly Capable Multimodal Models



  Terzi, Vladimir Mikulik, Igor Babuschkin, Aidan Clark, Diego de Las Casas, Aurelia Guy, Chris Jones,
  James Bradbury, Matthew Johnson, Blake A. Hechtman, Laura Weidinger, Iason Gabriel, William S.
  Isaac, Edward Lockhart, Simon Osindero, Laura Rimell, Chris Dyer, Oriol Vinyals, Kareem Ayoub,
  Jeff Stanway, Lorrayne Bennett, Demis Hassabis, Koray Kavukcuoglu, and Geoffrey Irving. Scaling
  language models: Methods, analysis & insights from training Gopher. CoRR, abs/2112.11446,
  2021.

Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen,
  and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine
  Learning, pages 8821–8831. PMLR, 2021.

Hannah Rashkin, Vitaly Nikolaev, Matthew Lamm, Lora Aroyo, Michael Collins, Dipanjan Das, Slav
 Petrov, Gaurav Singh Tomar, Iulia Turc, and David Reitter. Measuring attribution in natural language
 generation models. Computational Linguistics, pages 1–64, 2023.

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel
  Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake
  Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar
  Bordbar, and Nando de Freitas. A generalist agent. arXiv preprint arXiv:2205.06175, 2022.

Parker Riley, Timothy Dozat, Jan A Botha, Xavier Garcia, Dan Garrette, Jason Riesa, Orhan Firat, and
  Noah Constant. Frmt: A benchmark for few-shot region-aware machine translation. Transactions of
  the Association for Computational Linguistics, 2023.

Hannah Ritchie, Veronika Samborska, and Max Roser. Plastic pollution. Our World in Data, 2023.
 https://ourworldindata.org/plastic-pollution.

Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the
  parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods
  in Natural Language Processing (EMNLP), pages 5418–5426, Online, November 2020. Associ-
  ation for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.437. URL https:
 //aclanthology.org/2020.emnlp-main.437.
William A Gaviria Rojas, Sudnya Diamos, Keertan Ranjan Kini, David Kanter, Vijay Janapa Reddi, and
 Cody Coleman. The dollar street dataset: Images representing the geographic and socioeconomic
 diversity of the world. In Thirty-sixth Conference on Neural Information Processing Systems Datasets
 and Benchmarks Track, 2022.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in
  coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the
  Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers),
  pages 8–14, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi:
 10.18653/v1/N18-2002. URL https://aclanthology.org/N18-2002.

Candice Schumann, Susanna Ricco, Utsav Prabhu, Vittorio Ferrari, and Caroline Pantofaru. A step
  toward more inclusive people annotations for fairness. In Proceedings of the 2021 AAAI/ACM
  Conference on AI, Ethics, and Society, pages 916–925, 2021.

Andrew D. Selbst, Danah Boyd, and Sorelle A. Friedler. Fairness and abstraction in sociotechnical
  systems. In FFAT* ’19: Proceedings of the Conference on Fairness, Accountability, and Transparency,
  pages 59–68, January 2019.




                                                                                                   49
                             Gemini: A Family of Highly Capable Multimodal Models



Thibault Sellam, Dipanjan Das, and Ankur Parikh. BLEURT: Learning robust metrics for text generation.
  In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages
  7881–7892, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.
  acl-main.704. URL https://aclanthology.org/2020.acl-main.704.

Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor
  Geva, Jonathan Berant, and Omer Levy. SCROLLS: Standardized CompaRison over long language
  sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language
  Processing, pages 12007–12021, Abu Dhabi, United Arab Emirates, December 2022. Association for
  Computational Linguistics. URL https://aclanthology.org/2022.emnlp-main.823.

Noam Shazeer. Fast transformer decoding: One write-head is all you need.              arXiv preprint
  arXiv:1911.02150, 2019a.

Noam Shazeer. Fast transformer decoding: One write-head is all you need.              arXiv preprint
  arXiv:1911.02150, 2019b.

Renee Shelby, Shalaleh Rismani, Kathryn Henne, AJung Moon, Negar Rostamzadeh, Paul Nicholas,
  N’Mah Yilla, Jess Gallegos, Andrew Smart, Emilio Garcia, and Gurleen Virk. Identifying so-
  ciotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction, 2023. URL
  https://arxiv.org/abs/2210.05791.
Toby Shevlane, Sebastian Farquhar, Ben Garfinkel, Mary Phuong, Jess Whittlestone, Jade Leung,
  Daniel Kokotajlo, Nahema Marchal, Markus Anderljung, Noam Kolt, Lewis Ho, Divya Siddarth,
  Shahar Avin, Will Hawkins, Been Kim, Iason Gabriel, Vijay Bolina, Jack Clark, Yoshua Bengio, Paul
  Christiano, and Allan Dafoe. Model evaluation for extreme risks. arXiv preprint arXiv:2305.15324,
  2023.

Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won
  Chung, Yi Tay, Sebastian Ruder, Denny Zhou, et al. Language models are multilingual chain-of-
  thought reasoners. ICLR, 2023.

Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and
 Marcus Rohrbach. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference
 on computer vision and pattern recognition, pages 8317–8326, 2019.

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch,
  Adam R. Brown, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities
  of language models. arXiv preprint arXiv:2206.04615, 2022. URL https://arxiv.org/abs/
  2206.04615.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.
   Advances in neural information processing systems, 27, 2014.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung,
 Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and
 whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.

Oyvind Tafjord, Bhavana Dalvi, and Peter Clark. Proof Writer: Generating implications, proofs,
 and abductive statements over natural language. In Findings, 2020. URL https://api.
 semanticscholar.org/CorpusID:229371222.




                                                                                                  50
                               Gemini: A Family of Highly Capable Multimodal Models



NLLB Team, Marta R. Costa-jussà, James Cross, Onur Çelebi, Maha Elbayad, Kenneth Heafield,
  Kevin Heffernan, Elahe Kalbassi, Janice Lam, Daniel Licht, Jean Maillard, Anna Sun, Skyler Wang,
 Guillaume Wenzek, Al Youngblood, Bapi Akula, Loic Barrault, Gabriel Mejia Gonzalez, Prangthip
  Hansanti, John Hoffman, Semarley Jarrett, Kaushik Ram Sadagopan, Dirk Rowe, Shannon Spruit,
 Chau Tran, Pierre Andrews, Necip Fazil Ayan, Shruti Bhosale, Sergey Edunov, Angela Fan, Cynthia
 Gao, Vedanuj Goswami, Francisco Guzmán, Philipp Koehn, Alexandre Mourachko, Christophe
  Ropers, Safiyyah Saleem, Holger Schwenk, and Jeff Wang. No language left behind: Scaling
  human-centered machine translation. 2022.

Ashish V. Thapliyal, Jordi Pont-Tuset, Xi Chen, and Radu Soricut. Crossmodal-3600: A massively
  multilingual multimodal evaluation dataset. In EMNLP, 2022.

Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng,
  Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications.
  arXiv preprint arXiv:2201.08239, 2022. URL https://arxiv.org/abs/2201.08239.

Kocmi Tom, Eleftherios Avramidis, Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian
  Federmann, Mark Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, et al. Findings
  of the 2023 conference on machine translation (wmt23): Llms are here but not quite there yet. In
  WMT23-Eighth Conference on Machine Translation, pages 198–216, 2023.

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée
 Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand
 Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language
 models. arXiv preprint arXiv:2302.13971, 2023a.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay
 Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Can-
 ton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu,
 Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini,
 Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Ko-
 renev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai
 Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew
 Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael
 Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang
 Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan
 Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open
 foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz
  Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wal-
  lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing
  Systems, volume 30. Curran Associates, Inc., 2017a. URL https://proceedings.neurips.cc/
  paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz
  Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017b. URL
  http://arxiv.org/abs/1706.03762.
Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image descrip-
  tion evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition,
  pages 4566–4575, 2015.


                                                                                                       51
                             Gemini: A Family of Highly Capable Multimodal Models



Petar Veličković, Adrià Puigdomènech Badia, David Budden, Razvan Pascanu, Andrea Banino, Misha
  Dashevskiy, Raia Hadsell, and Charles Blundell. The clrs algorithmic reasoning benchmark. arXiv
  preprint arXiv:2205.15659, 2022.
Manoj Vishwanathan, Ronak Shah, Kyung Ki Kim, and Minsu Choi. Silent data corruption (sdc)
 vulnerability of gpu on various gpgpu workloads. In 2015 International SoC Design Conference
 (ISOCC), pages 11–12, 2015. doi: 10.1109/ISOCC.2015.7401681.
Changhan Wang, Anne Wu, and Juan Pino. Covost 2 and massively multilingual speech-to-text
  translation. arXiv preprint arXiv:2007.10310, 2020.
Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary
  Williamson, Juan Pino, and Emmanuel Dupoux. Voxpopuli: A large-scale multilingual speech
  corpus for representation learning, semi-supervised learning and interpretation. arXiv preprint
  arXiv:2101.00390, 2021.
Xin Wang, Jiawei Wu, Junkun Chen, Lei Li, Yuan-Fang Wang, and William Yang Wang. VATEX: A
  large-scale, high-quality multilingual dataset for video-and-language research. In ICCV, 2019.
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, and Denny Zhou. Self-consistency
  improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du,
  Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. Proceedings of the
  International Conference on Learning Representations (ICLR), 2022a. URL https://openreview.
  net/forum?id=gEZrGCozdqR.
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le,
  and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS,
  2022b. URL https://arxiv.org/abs/2201.11903.
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra
  Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom
  Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William S.
  Isaac, Sean Legassick, Geoffrey Irving, and Iason Gabriel. Ethical and social risks of harm from
  language models. CoRR, abs/2112.04359, 2021. URL https://arxiv.org/abs/2112.04359.
David Wetherall, Abdul Kabbani, Van Jacobson, Jim Winget, Yuchung Cheng, Brad Morrey,
  Uma Parthavi Moravapalle, Phillipa Gill, Steven Knight, and Amin Vahdat. Improving network
  availability with protective reroute. In SIGCOMM 2023, 2023. URL https://dl.acm.org/doi/
  10.1145/3603269.3604867.
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. NExT-QA: Next phase of question-answering
  to explaining temporal actions. In CVPR, 2021.
XLA. XLA: Optimizing compiler for TensorFlow. https://www.tensorflow.org/xla, 2019.
  [Online; accessed December-2023].
Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin
  Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint
  arXiv:2304.12244, 2023.
Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim
  Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al. Gspmd: general and scalable paral-
  lelization for ml computation graphs. arXiv preprint arXiv:2105.04663, 2021.


                                                                                                  52
                             Gemini: A Family of Highly Capable Multimodal Models



Chi yao Hong, Subhasree Mandal, Mohammad A. Alfares, Min Zhu, Rich Alimi, Kondapa Naidu
  Bollineni, Chandan Bhagat, Sourabh Jain, Jay Kaimal, Jeffrey Liang, Kirill Mendelev, Steve Padgett,
  Faro Thomas Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn, Jon Zolla, Joon
  Ong, and Amin Vahdat. B4 and after: Managing hierarchy, partitioning, and asymmetry for
  availability and scale in google’s software-defined wan. In SIGCOMM’18, 2018. URL https:
  //conferences.sigcomm.org/sigcomm/2018/program_tuesday.html.
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca:
   Contrastive captioners are image-text foundation models, 2022a.

Jiahui Yu, Yuanzhong Xu, Jing Yu Koh, Thang Luong, Gunjan Baid, Zirui Wang, Vijay Vasudevan,
  Alexander Ku, Yinfei Yang, Burcu Karagol Ayan, Ben Hutchinson, Wei Han, Zarana Parekh, Xin
   Li, Han Zhang, Jason Baldridge, and Yonghui Wu. Scaling autoregressive models for content-rich
   text-to-image generation. arXiv preprint arXiv:2206.10789, 2(3):5, 2022b.

Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for
  video localization and question answering. arXiv preprint arXiv:2305.06988, 2023.

Zhou Yu, Dejing Xu, Jun Yu, Ting Yu, Zhou Zhao, Yueting Zhuang, and Dacheng Tao. ActivityNet-QA:
  A dataset for understanding complex web videos via question answering. In AAAI, 2019.

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu
  Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin,
  Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen.
  Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert
  agi, 2023.

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine
  really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.

Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen,
  Bo Li, Vera Axelrod, Gary Wang, Zhong Meng, Ke Hu, Andrew Rosenberg, Rohit Prabhavalkar,
  Daniel S. Park, Parisa Haghani, Jason Riesa, Ginger Perng, Hagen Soltau, Trevor Strohman, Bhuvana
  Ramabhadran, Tara Sainath, Pedro Moreno, Chung-Cheng Chiu, Johan Schalkwyk, Françoise
  Beaufays, and Yonghui Wu. Google usm: Scaling automatic speech recognition beyond 100
  languages. arXiv preprint arXiv:2303.01037, 2023.

Dora Zhao, Angelina Wang, and Olga Russakovsky. Understanding and evaluating racial biases in
  image captioning. In Proceedings of the IEEE/CVF International Conference on Computer Vision,
  pages 14830–14840, 2021.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in
   coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.

Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and Yu Li. Progressive-hint prompting
  improves reasoning in large language models, 2023.

Luowei Zhou, Chenliang Xu, and Jason J Corso. Towards automatic learning of procedures from web
  instructional videos. In AAAI Conference on Artificial Intelligence, pages 7590–7598, 2018.




                                                                                                  53
Gemini: A Family of Highly Capable Multimodal Models




                                                       54
                            Gemini: A Family of Highly Capable Multimodal Models



9. Contributions and Acknowledgments
Gemini Leads
Rohan Anil, Co-Lead, Text                                      Manaal Faruqui, Co-Lead, Gemini App Core
Sebastian Borgeaud, Co-Lead, Text                              Modeling, Factuality, Instruction Following
Jean-Baptiste Alayrac, Co-Lead, MM Vision                      Aliaksei Severyn, Co-Lead, Gemini App Core
Jiahui Yu, Co-Lead, MM Vision                                  Modeling, Conversationality
Radu Soricut, Co-Lead, MM Vision                               Hanzhao Lin, Co-Lead, Gemini App Fine-Tuning
Johan Schalkwyk, Lead, MM Audio                                YaGuang Li, Co-Lead, Gemini App Fine-Tuning
Andrew M. Dai, Co-Lead, Data                                   Yong Cheng, Co-Lead, Gemini App Fine-Tuning
Anja Hauth, Co-Lead, Data                                      Abe Ittycheriah, Co-Lead, Gemini for Gemini App
Katie Millican, Co-Lead, Data                                  Mahdis Mahdieh, Co-Lead, Gemini for Gemini App
David Silver, Co-Lead, Fine-Tuning                             Mia Chen, Co-Lead, Gemini for Gemini App
Melvin Johnson, Lead, Instruction Tuning                       Pei Sun, Co-Lead, Gemini for Gemini App
Ioannis Antonoglou, Co-Lead, RL Techniques                     Dustin Tran, Co-Lead, Gemini App Eval
Julian Schrittwieser, Co-Lead, RL Techniques                   Sumit Bagri, Co-Lead, Gemini App Eval, Technical
Amelia Glaese, Lead, Human Data                                Program Management
Jilin Chen, Lead, Safety                                       Balaji Lakshminarayanan, Co-Lead, Gemini App
Emily Pitler, Co-Lead, Tool Use                                AutoEval
Timothy Lillicrap, Co-Lead, Tool Use                           Jeremiah Liu, Co-Lead, Gemini App AutoEval
Angeliki Lazaridou, Co-Lead, Eval                              Andras Orban, Co-Lead, Gemini App Factuality,
Orhan Firat, Co-Lead, Eval                                     Multimodality, Safety
James Molloy, Co-Lead, Infra                                   Fabian Güra, Co-Lead, Gemini App Factuality
Michael Isard, Co-Lead, Infra                                  Hao Zhou, Co-Lead, Gemini App Factuality
Paul R. Barham, Co-Lead, Infra                                 Xinying Song, Co-Lead, Gemini App Factuality
Tom Hennigan, Co-Lead, Infra                                   Aurelien Boffy, Co-Lead, Gemini App Safety
Benjamin Lee, Co-Lead, Codebase & Parallelism                  Harish Ganapathy, Co-Lead, Gemini Safety
Fabio Viola, Co-Lead, Codebase & Parallelism                   Steven Zheng, Lead, Gemini App Multilinguality
Malcolm Reynolds, Co-Lead, Codebase & Parallelism              Research
Yuanzhong Xu, Co-Lead, Codebase & Parallelism                  HyunJeong Choe, Lead, Gemini App Multilinguality
Ryan Doherty, Lead, Ecosystem                                  Ágoston Weisz, Co-Lead, Gemini App Multimodality
Eli Collins, Lead, Product                                     Tao Zhu, Co-Lead, Gemini App Multimodality
Clemens Meyer, Co-Lead, Operations                             Yifeng Lu, Co-Lead, Gemini App Multimodality
Eliza Rutherford, Co-Lead, Operations                          Siddharth Gopal, Co-Lead, Gemini App Coding &
Erica Moreira, Co-Lead, Operations                             Tool Use
Kareem Ayoub, Co-Lead, Operations                              Jarrod Kahn, Co-Lead, Gemini App Tool Use
Megha Goel, Co-Lead, Operations                                Research
                                                               Maciej Kula, Co-Lead, Gemini App Tool Use Research
Gemini App Leads                                               Jeff Pitman, Co-Lead, Gemini App Tool Use
Jack Krawczyk, Lead, Gemini App Product                        Rushin Shah, Co-Lead, Gemini App Tool Use
Cosmo Du, Co-Lead, Gemini App Research                         Emanuel Taropa, Co-Lead, Gemini App Serving
Ed Chi, Co-Lead, Gemini App Research                           Majd Al Merey, Co-Lead, Gemini App Serving
Heng-Tze Cheng, Co-Lead, Gemini App Research                   Martin Baeuml, Co-Lead, Gemini App Serving
Eric Ni, Lead, Gemini App Research Technical Program           Zhifeng Chen, Co-Lead, Gemini App Serving
Management                                                     Laurent El Shafey, Co-Lead, Gemini App Fine-Tuning
Purvi Shah, Lead, Gemini App Technical Program                 Infra
Management                                                     Yujing Zhang, Co-Lead, Gemini App Fine-Tuning
Patrick Kane, Co-Lead, Gemini App Core Modeling, Eval,         Infra
Data, Product                                                  Olcan Sercinoglu, Lead, Gemini App Product
Betty Chan, Co-Lead, Gemini App Core Modeling,
Technical Program Management

                                                                                                      55
                      Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                Core Contributors
George Tucker                                    Gaurav Singh Tomar
Enrique Piqueras                                 Evan Senter
Maxim Krikun                                     Martin Chadwick
Iain Barr                                        Ilya Kornakov
Nikolay Savinov                                  Nithya Attaluri
Ivo Danihelka                                    Iñaki Iturrate
Becca Roelofs                                    Ruibo Liu
Anaïs White                                      Yunxuan Li
Anders Andreassen                                Sarah Cogan
Tamara von Glehn                                 Jeremy Chen
Lakshman Yagati                                  Chao Jia
Mehran Kazemi                                    Chenjie Gu
Lucas Gonzalez                                   Qiao Zhang
Misha Khalman                                    Jordan Grimstad
Jakub Sygnowski                                  Ale Jakse Hartman
Alexandre Frechette                              Xavier Garcia
Charlotte Smith                                  Thanumalayan Sankaranarayana Pillai
Laura Culp                                       Jacob Devlin
Lev Proleev                                      Michael Laskin
Yi Luan                                          Diego de Las Casas
Xi Chen                                          Dasha Valter
James Lottes                                     Connie Tao
Nathan Schucher                                  Lorenzo Blanco
Federico Lebron                                  Adrià Puigdomènech Badia
Alban Rrustemi                                   David Reitter
Natalie Clay                                     Mianna Chen
Phil Crone                                       Jenny Brennan
Tomas Kocisky                                    Clara Rivera
Jeffrey Zhao                                     Sergey Brin
Bartek Perz                                      Shariq Iqbal
Dian Yu                                          Gabriela Surita
Heidi Howard                                     Jane Labanowski
Adam Bloniarz                                    Abhi Rao
Jack W. Rae                                      Stephanie Winkler
Han Lu                                           Emilio Parisotto
Laurent Sifre                                    Yiming Gu
Marcello Maggioni                                Kate Olszewska
Fred Alcober                                     Ravi Addanki
Dan Garrette                                     Antoine Miech
Megan Barnes                                     Annie Louis
Shantanu Thakoor                                 Denis Teplyashin
Jacob Austin                                     Geoff Brown
Gabriel Barth-Maron                              Elliot Catt
William Wong                                     Jan Balaguer
Rishabh Joshi                                    Jackie Xiang
Rahma Chaabouni                                  Pidong Wang
Deeni Fatiha                                     Zoe Ashwood
Arun Ahuja                                       Anton Briukhov



                                                                                       56
                         Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                   Core Contributors
Albert Webson                                       Dipanjan Das
Sanjay Ganapathy                                    Dominika Rogozińska
Smit Sanghavi                                       Vitaly Nikolaev
Ajay Kannan                                         Pablo Sprechmann
Ming-Wei Chang                                      Zachary Nado
Axel Stjerngren                                     Lukas Zilka
Josip Djolonga                                      Flavien Prost
Yuting Sun                                          Luheng He
Ankur Bapna                                         Marianne Monteiro
Matthew Aitchison                                   Gaurav Mishra
Pedram Pejman                                       Chris Welty
Henryk Michalewski                                  Josh Newlan
Tianhe Yu                                           Dawei Jia
Cindy Wang                                          Miltiadis Allamanis
Juliette Love                                       Clara Huiyi Hu
Junwhan Ahn                                         Raoul de Liedekerke
Dawn Bloxwich                                       Justin Gilmer
Kehang Han                                          Carl Saroufim
Peter Humphreys                                     Shruti Rijhwani
Thibault Sellam                                     Shaobo Hou
James Bradbury                                      Disha Shrivastava
Varun Godbole                                       Anirudh Baddepudi
Sina Samangooei                                     Alex Goldin
Bogdan Damoc                                        Adnan Ozturel
Alex Kaskasoli                                      Albin Cassirer
Sébastien M. R. Arnold                              Yunhan Xu
Vijay Vasudevan                                     Daniel Sohn
Shubham Agrawal                                     Devendra Sachan
Jason Riesa                                         Reinald Kim Amplayo
Dmitry Lepikhin                                     Craig Swanson
Richard Tanburn                                     Dessie Petrova
Srivatsan Srinivasan                                Shashi Narayan
Hyeontaek Lim                                       Arthur Guez
Sarah Hodkinson                                     Siddhartha Brahma
Pranav Shyam                                        Jessica Landon
Johan Ferret                                        Miteyan Patel
Steven Hand                                         Ruizhe Zhao
Ankush Garg                                         Kevin Villela
Tom Le Paine                                        Luyu Wang
Jian Li                                             Wenhao Jia
Yujia Li                                            Matthew Rahtz
Minh Giang                                          Mai Giménez
Alexander Neitz                                     Legg Yeung
Zaheer Abbas                                        James Keeling
Sarah York                                          Petko Georgiev
Machel Reid                                         Diana Mincu
Elizabeth Cole                                      Boxi Wu
Aakanksha Chowdhery                                 Salem Haykal



                                                                                57
                           Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                     Core Contributors
Rachel Saputro                                        Sebastian Riedel
Kiran Vodrahalli                                      Paige Bailey
James Qin                                             Kefan Xiao
Zeynep Cankara                                        Nimesh Ghelani
Abhanshu Sharma                                       Lora Aroyo
Nick Fernando                                         Ambrose Slone
Will Hawkins                                          Neil Houlsby
Behnam Neyshabur                                      Xuehan Xiong
Solomon Kim                                           Zhen Yang
Adrian Hutter                                         Elena Gribovskaya
Priyanka Agrawal                                      Jonas Adler
Alex Castro-Ros                                       Mateo Wirth
George van den Driessche                              Lisa Lee
Tao Wang                                              Music Li
Fan Yang                                              Thais Kagohara
Shuo-yiin Chang                                       Jay Pavagadhi
Paul Komarek                                          Sophie Bridgers
Ross McIlroy                                          Anna Bortsova
Mario Lučić                                           Sanjay Ghemawat
Guodong Zhang                                         Zafarali Ahmed
Wael Farhan                                           Tianqi Liu
Michael Sharman                                       Richard Powell
Paul Natsev                                           Vijay Bolina
Paul Michel                                           Mariko Iinuma
Yamini Bansal                                         Polina Zablotskaia
Siyuan Qiao                                           James Besley
Kris Cao                                              Da-Woon Chung
Siamak Shakeri                                        Timothy Dozat
Christina Butterfield                                 Ramona Comanescu
Justin Chung                                          Xiance Si
Paul Kishan Rubenstein                                Jeremy Greer
Shivani Agrawal                                       Guolong Su
Arthur Mensch                                         Martin Polacek
Kedar Soparkar                                        Raphaël Lopez Kaufman
Karel Lenc                                            Simon Tokumine
Timothy Chung                                         Hexiang Hu
Aedan Pope                                            Elena Buchatskaya
Loren Maggiore                                        Yingjie Miao
Jackie Kay                                            Mohamed Elhawaty
Priya Jhakra                                          Aditya Siddhant
Shibo Wang                                            Nenad Tomasev
Joshua Maynez                                         Jinwei Xing
Mary Phuong                                           Christina Greer
Taylor Tobin                                          Helen Miller
Andrea Tacchetti                                      Shereen Ashraf
Maja Trebacz                                          Aurko Roy
Kevin Robinson                                        Zizhao Zhang
Yash Katariya                                         Ada Ma



                                                                                  58
                       Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                 Core Contributors
Angelos Filos                                     Ethan Dyer
Milos Besta                                       Víctor Campos
Rory Blevins                                      Alex Tomala
Ted Klimenko                                      Yunhao Tang
Chih-Kuan Yeh                                     Dalia El Badawy
Soravit Changpinyo                                Elspeth White
Jiaqi Mu                                          Basil Mustafa
Oscar Chang                                       Oran Lang
Mantas Pajarskas                                  Abhishek Jindal
Carrie Muir                                       Sharad Vikram
Vered Cohen                                       Zhitao Gong
Charline Le Lan                                   Sergi Caelles
Krishna Haridasan                                 Ross Hemsley
Amit Marathe                                      Gregory Thornton
Steven Hansen                                     Fangxiaoyu Feng
Sholto Douglas                                    Wojciech Stokowiec
Rajkumar Samuel                                   Ce Zheng
Mingqiu Wang                                      Phoebe Thacker
Sophia Austin                                     Çağlar Ünlü
Chang Lan                                         Zhishuai Zhang
Jiepu Jiang                                       Mohammad Saleh
Justin Chiu                                       James Svensson
Jaime Alonso Lorenzo                              Max Bileschi
Lars Lowe Sjösund                                 Piyush Patil
Sébastien Cevey                                   Ankesh Anand
Zach Gleicher                                     Roman Ring
Thi Avrahami                                      Katerina Tsihlas
Anudhyan Boral                                    Arpi Vezer
Hansa Srinivasan                                  Marco Selvi
Vittorio Selo                                     Toby Shevlane
Rhys May                                          Mikel Rodriguez
Konstantinos Aisopos                              Tom Kwiatkowski
Léonard Hussenot                                  Samira Daruki
Livio Baldini Soares                              Keran Rong
Kate Baumli                                       Allan Dafoe
Michael B. Chang                                  Nicholas FitzGerald
Adrià Recasens                                    Keren Gu-Lemberg
Ben Caine                                         Mina Khan
Alexander Pritzel                                 Lisa Anne Hendricks
Filip Pavetic                                     Marie Pellat
Fabio Pardo                                       Vladimir Feinberg
Anita Gergely                                     James Cobon-Kerr
Justin Frye                                       Tara Sainath
Vinay Ramasesh                                    Maribeth Rauh
Dan Horgan                                        Sayed Hadi Hashemi
Kartikeya Badola                                  Richard Ives
Nora Kassner                                      Yana Hasson
Subhrajit Roy                                     Eric Noland



                                                                              59
                        Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                  Core Contributors
Yuan Cao                                           Jonah Joughin
Nathan Byrd                                        Egor Filonov
Le Hou                                             Tomasz Kępa
Qingze Wang                                        Yomna Eldawy
Thibault Sottiaux                                  Jiawern Lim
Michela Paganini                                   Rahul Rishi
Jean-Baptiste Lespiau                              Shirin Badiezadegan
Alexandre Moufarek                                 Taylor Bos
Samer Hassan                                       Jerry Chang
Kaushik Shivakumar                                 Sanil Jain
Joost van Amersfoort                               Sri Gayatri Sundara Padmanabhan
Amol Mandhane                                      Subha Puttagunta
Pratik Joshi                                       Kalpesh Krishna
Anirudh Goyal                                      Leslie Baker
Matthew Tung                                       Norbert Kalb
Andrew Brock                                       Vamsi Bedapudi
Hannah Sheahan                                     Adam Kurzrok
Vedant Misra                                       Shuntong Lei
Cheng Li                                           Anthony Yu
Nemanja Rakićević                                  Oren Litvin
Mostafa Dehghani                                   Xiang Zhou
Fangyu Liu                                         Zhichun Wu
Sid Mittal                                         Sam Sobell
Junhyuk Oh                                         Andrea Siciliano
Seb Noury                                          Alan Papir
Eren Sezener                                       Robby Neale
Fantine Huot                                       Jonas Bragagnolo
Matthew Lamm                                       Tej Toor
Nicola De Cao                                      Tina Chen
Charlie Chen                                       Valentin Anklin
Sidharth Mudgal                                    Feiran Wang
Romina Stella                                      Richie Feng
Kevin Brooks                                       Milad Gholami
Gautam Vasudevan                                   Kevin Ling
Chenxi Liu                                         Lijuan Liu
Mainak Chain                                       Jules Walter
Nivedita Melinkeri                                 Hamid Moghaddam
Aaron Cohen                                        Arun Kishore
Venus Wang                                         Jakub Adamek
Kristie Seymore                                    Tyler Mercado
Sergey Zubkov                                      Jonathan Mallinson
Rahul Goel                                         Siddhinita Wandekar
Summer Yue                                         Stephen Cagle
Sai Krishnakumaran                                 Eran Ofek
Brian Albert                                       Guillermo Garrido
Nate Hurley                                        Clemens Lombriser
Motoki Sano                                        Maksim Mukha
Anhad Mohananey                                    Botu Sun



                                                                                     60
                      Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                Core Contributors
Hafeezul Rahman Mohammad                         Elico Teixeira
Josip Matak                                      Matthew Fritze
Yadi Qian                                        Francesco Bertolini
Vikas Peswani                                    Liana-Eleonora Marinescu
Pawel Janus                                      Martin Bölle
Quan Yuan                                        Dominik Paulus
Leif Schelin                                     Khyatti Gupta
Oana David                                       Tejasi Latkar
Ankur Garg                                       Max Chang
Yifan He                                         Jason Sanders
Oleksii Duzhyi                                   Roopa Wilson
Anton Älgmyr                                     Xuewei Wu
Timothée Lottaz                                  Yi-Xuan Tan
Qi Li                                            Lam Nguyen Thiet
Vikas Yadav                                      Tulsee Doshi
Luyao Xu                                         Sid Lall
Alex Chinien                                     Swaroop Mishra
Rakesh Shivanna                                  Wanming Chen
Aleksandr Chuklin                                Thang Luong
Josie Li                                         Seth Benjamin
Carrie Spadine                                   Jasmine (Sun Jae) Lee
Travis Wolfe                                     Ewa Andrejczuk
Kareem Mohamed                                   Dominik Rabiej
Subhabrata Das                                   Vipul Ranjan
Zihang Dai                                       Krzysztof Styrc
Kyle He                                          Pengcheng Yin
Daniel von Dincklage                             Jon Simon
Shyam Upadhyay                                   Malcolm Rose Harriott
Akanksha Maurya                                  Mudit Bansal
Luyan Chi                                        Alexei Robsky
Sebastian Krause                                 Geoff Bacon
Khalid Salama                                    David Greene
Pam G Rabinovitch                                Daniil Mirylenka
Pavan Kumar Reddy M                              Chen Zhou
Aarush Selvan                                    Obaid Sarvana
Mikhail Dektiarev                                Abhimanyu Goyal
Golnaz Ghiasi                                    Samuel Andermatt
Erdem Guven                                      Patrick Siegler
Himanshu Gupta                                   Ben Horn
Boyi Liu                                         Assaf Israel
Deepak Sharma                                    Francesco Pongetti
Idan Heimlich Shtacher                           Chih-Wei “Louis” Chen
Shachi Paul                                      Marco Selvatici
Oscar Akerlund                                   Pedro Silva
François-Xavier Aubet                            Kathie Wang
Terry Huang                                      Jackson Tolins
Chen Zhu                                         Kelvin Guu
Eric Zhu                                         Roey Yogev



                                                                             61
                             Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                       Core Contributors
Xiaochen Cai                                            Sahitya Potluri
Alessandro Agostini                                     Preethi Lahoti
Maulik Shah                                             Cip Baetu
Hung Nguyen                                             Ali Ghorbani
Noah Ó Donnaile                                         Charles Chen
Sébastien Pereira                                       Andy Crawford
Linda Friso                                             Shalini Pal
Adam Stambler                                           Mukund Sridhar
Adam Kurzrok                                            Petru Gurita
Chenkai Kuang                                           Asier Mujika
Yan Romanikhin                                          Igor Petrovski
Mark Geller                                             Pierre-Louis Cedoz
ZJ Yan                                                  Chenmei Li
Kane Jang                                               Shiyuan Chen
Cheng-Chun Lee                                          Niccolò Dal Santo
Wojciech Fica                                           Siddharth Goyal
Eric Malmi                                              Jitesh Punjabi
Qijun Tan                                               Karthik Kappaganthu
Dan Banica                                              Chester Kwak
Daniel Balle                                            Pallavi LV
Ryan Pham                                               Sarmishta Velury
Yanping Huang                                           Himadri Choudhury
Diana Avram                                             Jamie Hall
Hongzhi Shi                                             Premal Shah
Jasjot Singh                                            Ricardo Figueira
Chris Hidey                                             Matt Thomas
Niharika Ahuja                                          Minjie Lu
Pranab Saxena                                           Ting Zhou
Dan Dooley                                              Chintu Kumar
Srividya Pranavi Potharaju                              Thomas Jurdi
Eileen O’Neill                                          Sharat Chikkerur
Anand Gokulchandran                                     Yenai Ma
Ryan Foley                                              Adams Yu
Kai Zhao                                                Soo Kwak
Mike Dusenberry                                         Victor Ähdel
Yuan Liu                                                Sujeevan Rajayogam
Pulkit Mehta                                            Travis Choma
Ragha Kotikalapudi                                      Fei Liu
Chalence Safranek-Shrader                               Aditya Barua
Andrew Goodman                                          Colin Ji
Joshua Kessinger                                        Ji Ho Park
Eran Globen                                             Vincent Hellendoorn
Prateek Kolhar                                          Alex Bailey
Chris Gorgolewski                                       Taylan Bilal
Ali Ibrahim                                             Huanjie Zhou
Yang Song                                               Mehrdad Khatir
Ali Eichenbaum                                          Charles Sutton
Thomas Brovelli                                         Wojciech Rzadkowski



                                                                                    62
                           Gemini: A Family of Highly Capable Multimodal Models



Core Contributors                                     Contributors
Fiona Macintosh                                       Nan Hua
Roopali Vij                                           Geoffrey Cideron
Konstantin Shagin                                     Edouard Leurent
Paul Medina                                           Mahmoud Alnahlawi
Chen Liang                                            Ionut Georgescu
Jinjing Zhou                                          Nan Wei
Pararth Shah                                          Ivy Zheng
Yingying Bi                                           Dylan Scandinaro
Attila Dankovics                                      Heinrich Jiang
Shipra Banga                                          Jasper Snoek
Sabine Lehmann                                        Mukund Sundararajan
Marissa Bredesen                                      Xuezhi Wang
Zifan Lin                                             Zack Ontiveros
John Eric Hoffmann                                    Itay Karo
Jonathan Lai                                          Jeremy Cole
Raynald Chung                                         Vinu Rajashekhar
Kai Yang                                              Lara Tumeh
Nihal Balani                                          Eyal Ben-David
Arthur Bražinskas                                     Rishub Jain
Andrei Sozanschi                                      Jonathan Uesato
Matthew Hayes                                         Romina Datta
Héctor Fernández Alcalde                              Oskar Bunyan
Peter Makarov                                         Shimu Wu
Will Chen                                             John Zhang
Antonio Stella                                        Piotr Stanczyk
Liselotte Snijders                                    Ye Zhang
Michael Mandl                                         David Steiner
Ante Kärrman                                          Subhajit Naskar
Paweł Nowak                                           Michael Azzam
Xinyi Wu                                              Matthew Johnson
Alex Dyck                                             Adam Paszke
Krishnan Vaidyanathan                                 Chung-Cheng Chiu
Raghavender R                                         Jaume Sanchez Elias
Jessica Mallet                                        Afroz Mohiuddin
Mitch Rudominer                                       Faizan Muhammad
Eric Johnston                                         Jin Miao
Sushil Mittal                                         Andrew Lee
Akhil Udathu                                          Nino Vieillard
Janara Christensen                                    Jane Park
Vishal Verma                                          Jiageng Zhang
Zach Irving                                           Jeff Stanway
Andreas Santucci                                      Drew Garmon
                                                      Abhijit Karmarkar
Contributors                                          Zhe Dong
Gamaleldin Elsayed                                    Jong Lee
Elnaz Davoodi                                         Aviral Kumar
Marin Georgiev                                        Luowei Zhou
Ian Tenney                                            Jonathan Evens



                                                                                  63
                       Gemini: A Family of Highly Capable Multimodal Models



Contributors                                      Contributors
William Isaac                                     Abhishek Chakladar
Geoffrey Irving                                   Ginger Perng
Edward Loper                                      Elena Allica Abellan
Michael Fink                                      Mingyang Zhang
Isha Arkatkar                                     Ishita Dasgupta
Nanxin Chen                                       Nate Kushman
Izhak Shafran                                     Ivo Penchev
Ivan Petrychenko                                  Alena Repina
Zhe Chen                                          Xihui Wu
Johnson Jia                                       Tom van der Weide
Anselm Levskaya                                   Priya Ponnapalli
Zhenkai Zhu                                       Caroline Kaplan
Peter Grabowski                                   Jiri Simsa
Yu Mao                                            Shuangfeng Li
Alberto Magni                                     Olivier Dousse
Kaisheng Yao                                      Fan Yang
Javier Snaider                                    Jeff Piper
Norman Casagrande                                 Nathan Ie
Evan Palmer                                       Rama Pasumarthi
Paul Suganthan                                    Nathan Lintz
Alfonso Castaño                                   Anitha Vijayakumar
Irene Giannoumis                                  Daniel Andor
Wooyeol Kim                                       Pedro Valenzuela
Mikołaj Rybiński                                  Minnie Lui
Ashwin Sreevatsa                                  Cosmin Paduraru
Jennifer Prendki                                  Daiyi Peng
David Soergel                                     Katherine Lee
Adrian Goedeckemeyer                              Shuyuan Zhang
Willi Gierke                                      Somer Greene
Mohsen Jafari                                     Duc Dung Nguyen
Meenu Gaba                                        Paula Kurylowicz
Jeremy Wiesner                                    Cassidy Hardin
Diana Gage Wright                                 Lucas Dixon
Yawen Wei                                         Lili Janzer
Harsha Vashisht                                   Kiam Choo
Yana Kulizhskaya                                  Ziqiang Feng
Jay Hoover                                        Biao Zhang
Maigo Le                                          Achintya Singhal
Lu Li                                             Dayou Du
Chimezie Iwuanyanwu                               Dan McKinnon
Lu Liu                                            Natasha Antropova
Kevin Ramirez                                     Tolga Bolukbasi
Andrey Khorlin                                    Orgad Keller
Albert Cui                                        David Reid
Tian LIN                                          Daniel Finchelstein
Marcus Wu                                         Maria Abi Raad
Ricardo Aguilar                                   Remi Crocker
Keith Pallo                                       Peter Hawkins



                                                                              64
                         Gemini: A Family of Highly Capable Multimodal Models



Contributors                                        Contributors
Robert Dadashi                                      John Mellor
Colin Gaffney                                       Abhishek Sharma
Ken Franko                                          Kathy Wu
Anna Bulanova                                       David Miller
Rémi Leblond                                        Nicolas Sonnerat
Shirley Chung                                       Denis Vnukov
Harry Askham                                        Rory Greig
Luis C. Cobo                                        Jennifer Beattie
Kelvin Xu                                           Emily Caveness
Felix Fischer                                       Libin Bai
Jun Xu                                              Julian Eisenschlos
Christina Sorokin                                   Alex Korchemniy
Chris Alberti                                       Tomy Tsai
Chu-Cheng Lin                                       Mimi Jasarevic
Colin Evans                                         Weize Kong
Alek Dimitriev                                      Phuong Dao
Hannah Forbes                                       Zeyu Zheng
Dylan Banarse                                       Frederick Liu
Zora Tung                                           Fan Yang
Mark Omernick                                       Rui Zhu
Colton Bishop                                       Tian Huey Teh
Rachel Sterneck                                     Jason Sanmiya
Rohan Jain                                          Evgeny Gladchenko
Jiawei Xia                                          Nejc Trdin
Ehsan Amid                                          Daniel Toyama
Francesco Piccinno                                  Evan Rosen
Xingyu Wang                                         Sasan Tavakkol
Praseem Banzal                                      Linting Xue
Daniel J. Mankowitz                                 Chen Elkind
Alex Polozov                                        Oliver Woodman
Victoria Krakovna                                   John Carpenter
Sasha Brown                                         George Papamakarios
MohammadHossein Bateni                              Rupert Kemp
Dennis Duan                                         Sushant Kafle
Vlad Firoiu                                         Tanya Grunina
Meghana Thotakuri                                   Rishika Sinha
Tom Natan                                           Alice Talbert
Matthieu Geist                                      Diane Wu
Sertan Girgin                                       Denese Owusu-Afriyie
Hui Li                                              Cosmo Du
Jiayu Ye                                            Chloe Thornton
Ofir Roval                                          Jordi Pont-Tuset
Reiko Tojo                                          Pradyumna Narayana
Michael Kwong                                       Jing Li
James Lee-Thorp                                     Saaber Fatehi
Christopher Yew                                     John Wieting
Danila Sinopalnikov                                 Omar Ajmeri
Sabela Ramos                                        Benigno Uria



                                                                                65
                           Gemini: A Family of Highly Capable Multimodal Models



Contributors                                          Contributors
Yeongil Ko                                            Prakash Shroff
Laura Knight                                          Mani Varadarajan
Amélie Héliou                                         Sanaz Bahargam
Ning Niu                                              Rob Willoughby
Shane Gu                                              David Gaddy
Chenxi Pang                                           Guillaume Desjardins
Yeqing Li                                             Marco Cornero
Nir Levine                                            Brona Robenek
Ariel Stolovich                                       Bhavishya Mittal
Rebeca Santamaria-Fernandez                           Ben Albrecht
Sonam Goenka                                          Ashish Shenoy
Wenny Yustalim                                        Fedor Moiseev
Robin Strudel                                         Henrik Jacobsson
Ali Elqursh                                           Alireza Ghaffarkhah
Charlie Deck                                          Morgane Rivière
Hyo Lee                                               Alanna Walton
Zonglin Li                                            Clément Crepy
Kyle Levin                                            Alicia Parrish
Raphael Hoffmann                                      Zongwei Zhou
Dan Holtmann-Rice                                     Clement Farabet
Olivier Bachem                                        Carey Radebaugh
Sho Arora                                             Praveen Srinivasan
Christy Koh                                           Claudia van der Salm
Soheil Hassas Yeganeh                                 Andreas Fidjeland
Siim Põder                                            Salvatore Scellato
Mukarram Tariq                                        Eri Latorre-Chimoto
Yanhua Sun                                            Hanna Klimczak-Plucińska
Lucian Ionita                                         David Bridson
Mojtaba Seyedhosseini                                 Dario de Cesare
Pouya Tafti                                           Tom Hudson
Zhiyu Liu                                             Piermaria Mendolicchio
Anmol Gulati                                          Lexi Walker
Jasmine Liu                                           Alex Morris
Xinyu Ye                                              Matthew Mauger
Bart Chrzaszcz                                        Alexey Guseynov
Lily Wang                                             Alison Reid
Nikhil Sethi                                          Seth Odoom
Tianrun Li                                            Lucia Loher
Ben Brown                                             Victor Cotruta
Shreya Singh                                          Madhavi Yenugula
Wei Fan                                               Dominik Grewe
Aaron Parisi                                          Anastasia Petrushkina
Joe Stanton                                           Tom Duerig
Vinod Koverkathu                                      Antonio Sanchez
Christopher A. Choquette-Choo                         Steve Yadlowsky
Yunjie Li                                             Amy Shen
TJ Lu                                                 Amir Globerson
Abe Ittycheriah                                       Lynette Webb



                                                                                  66
                         Gemini: A Family of Highly Capable Multimodal Models



Contributors                                        Contributors
Sahil Dua                                           Jason Gelman
Dong Li                                             Yang Xu
Surya Bhupatiraju                                   George Polovets
Dan Hurt                                            Ji Liu
Haroon Qureshi                                      Honglong Cai
Ananth Agarwal                                      Warren Chen
Tomer Shani                                         XiangHai Sheng
Matan Eyal                                          Emily Xue
Anuj Khare                                          Sherjil Ozair
Shreyas Rammohan Belle                              Christof Angermueller
Lei Wang                                            Xiaowei Li
Chetan Tekur                                        Anoop Sinha
Mihir Sanjay Kale                                   Weiren Wang
Jinliang Wei                                        Julia Wiesinger
Ruoxin Sang                                         Emmanouil Koukoumidis
Brennan Saeta                                       Yuan Tian
Tyler Liechty                                       Anand Iyer
Yi Sun                                              Madhu Gurumurthy
Yao Zhao                                            Mark Goldenson
Stephan Lee                                         Parashar Shah
Pandu Nayak                                         MK Blake
Doug Fritz                                          Hongkun Yu
Manish Reddy Vuyyuru                                Anthony Urbanowicz
John Aslanides                                      Jennimaria Palomaki
Nidhi Vyas                                          Chrisantha Fernando
Martin Wicke                                        Ken Durden
Xiao Ma                                             Harsh Mehta
Evgenii Eltyshev                                    Nikola Momchev
Nina Martin                                         Elahe Rahimtoroghi
Hardie Cate                                         Maria Georgaki
James Manyika                                       Amit Raul
Keyvan Amiri                                        Sebastian Ruder
Yelin Kim                                           Morgan Redshaw
Xi Xiong                                            Jinhyuk Lee
Kai Kang                                            Denny Zhou
Florian Luisier                                     Komal Jalan
Nilesh Tripuraneni                                  Dinghua Li
David Madras                                        Blake Hechtman
Mandy Guo                                           Parker Schuh
Austin Waters                                       Milad Nasr
Oliver Wang                                         Kieran Milan
Joshua Ainslie                                      Vladimir Mikulik
Jason Baldridge                                     Juliana Franco
Han Zhang                                           Tim Green
Garima Pruthi                                       Nam Nguyen
Jakob Bauer                                         Joe Kelley
Feng Yang                                           Aroma Mahendru
Riham Mansour                                       Andrea Hu



                                                                                67
                           Gemini: A Family of Highly Capable Multimodal Models



Contributors                                          Contributors
Joshua Howland                                        Vít Listík
Ben Vargas                                            Mathias Carlen
Jeffrey Hui                                           Jan van de Kerkhof
Kshitij Bansal                                        Marcin Pikus
Vikram Rao                                            Krunoslav Zaher
Rakesh Ghiya                                          Paul Müller
Emma Wang                                             Sasha Zykova
Ke Ye                                                 Richard Stefanec
Jean Michel Sarr                                      Vitaly Gatsko
Melanie Moranski Preston                              Christoph Hirnschall
Madeleine Elish                                       Ashwin Sethi
Steve Li                                              Xingyu Federico Xu
Aakash Kaku                                           Chetan Ahuja
Jigar Gupta                                           Beth Tsai
Ice Pasupat                                           Anca Stefanoiu
Da-Cheng Juan                                         Bo Feng
Milan Someswar                                        Keshav Dhandhania
Tejvi M.                                              Manish Katyal
Xinyun Chen                                           Akshay Gupta
Aida Amini                                            Atharva Parulekar
Alex Fabrikant                                        Divya Pitta
Eric Chu                                              Jing Zhao
Xuanyi Dong                                           Vivaan Bhatia
Amruta Muthal                                         Yashodha Bhavnani
Senaka Buthpitiya                                     Omar Alhadlaq
Sarthak Jauhari                                       Xiaolin Li
Nan Hua                                               Peter Danenberg
Urvashi Khandelwal                                    Dennis Tu
Ayal Hitron                                           Alex Pine
Jie Ren                                               Vera Filippova
Larissa Rinaldi                                       Abhipso Ghosh
Shahar Drath                                          Ben Limonchik
Avigail Dabush                                        Bhargava Urala
Nan-Jiang Jiang                                       Chaitanya Krishna Lanka
Harshal Godhia                                        Derik Clive
Uli Sachs                                             Yi Sun
Anthony Chen                                          Edward Li
Yicheng Fan                                           Hao Wu
Hagai Taitelbaum                                      Kevin Hongtongsak
Hila Noga                                             Ianna Li
Zhuyun Dai                                            Kalind Thakkar
James Wang                                            Kuanysh Omarov
Chen Liang                                            Kushal Majmundar
Jenny Hamer                                           Michael Alverson
Chun-Sung Ferng                                       Michael Kucharski
Chenel Elkind                                         Mohak Patel
Aviel Atias                                           Mudit Jain
Paulina Lee                                           Maksim Zabelin



                                                                                  68
                                   Gemini: A Family of Highly Capable Multimodal Models



Contributors                                                  Gemini App Program Leads
Paolo Pelagatti                                               Amar Subramanya7
Rohan Kohli                                                   Sissie Hsiao
Saurabh Kumar
Joseph Kim                                                    Gemini Program Leads
Swetha Sankar                                                 Demis Hassabis
Vineet Shah                                                   Koray Kavukcuoglu
Lakshmi Ramachandruni
Xiangkai Zeng                                                 Overall Gemini App Technical Leads
Ben Bariach                                                   Adam Sadovsky8
Laura Weidinger                                               Quoc Le
Tu Vu                                                         Trevor Strohman9
Alek Andreev                                                  Yonghui Wu10
Antoine He
Kevin Hui                                                     Overall Gemini Post-Training Lead
Sheleem Kashem                                                Slav Petrov

                                                              Overall Gemini Technical Leads (equal con-
                                                              tribution)
                                                              Jeffrey Dean
                                                              Oriol Vinyals




The roles are defined as below:

   • Lead: Individual(s) responsible for the sub-team throughout the project.
   • Core Contributor: Individual that had significant impact throughout the project.
   • Contributor: Individual that had contributions to the project and was partially involved with the
     effort.
   • Program Lead: Responsible for the organizational aspects of the Gemini effort.
   • Overall Post-Training Lead: Responsible for the technical direction of post-training.
   • Overall Technical Lead: Responsible for the technical direction of the overall Gemini effort.

   Within each role, contributions are equal, and are listed in a randomized order. Ordering within
each role does not indicate ordering of the contributions.
   Gemini is a cross-Google effort, with members from Google DeepMind (GDM), Google Research
(GR), Bard/Assistant, Knowledge and Information (K&I), Core ML, Cloud, Labs, and more.
    We thank Aakanksha Chowdhery, Dustin Tran, Heng-Tze Cheng, Jack W. Rae, Kate Olszewska,
Mariko Iinuma, Peter Humphreys, Shashi Narayan, and Steven Zheng for leading the preparation of
this report. We also thank our reviewers and colleagues for their valuable discussions and feedback
on the report — Alexandra Belias, Ana Ramalho, Anand Rao, Arielle Bier, Danielle Landress, Eleanor
Tomlinson, Emily Hossellman, Gaby Pearl, Helen King, Hollie Dobson, Jaclyn Konzelmann, Jennifer
  7 Lead, Gemini App Engineering
  8 Lead, Gemini App Core Modeling, Eval, Data
  9Co-Lead, Gemini App Serving
 10Co-Lead, Gemini Text



                                                                                                      69
                              Gemini: A Family of Highly Capable Multimodal Models



Beroshi, Joel Moss, Jon Small, Jonathan Fildes, Kathy Meier-Hellstern, Lisa Patel, Oli Gaymond,
Rebecca Bland, Reena Jana, Tessa Lueth, and Tom Lue.
    Our work is made possible by the dedication and efforts of numerous teams at Google. We would
like to acknowledge the support from Abhi Mohan, Adekunle Bello, Aishwarya Nagarajan, Alaa
Saade, Alejandro Lince, Alexander Chen, Alexander Kolbasov, Alexander Schiffhauer, Ameya Shringi,
Amin Vahdat, Anda Rabatić, Anthonie Gross, Antoine Yang, Anthony Green, Anton Ruddock, Art
Khurshudov, Artemis Chen, Arthur Argenson, Avinatan Hassidim, Beiye Liu, Benjamin Schroeder,
Bin Ni, Brett Daw, Bryan Chiang, Burak Gokturk, Carl Crous, Carrie Grimes Bostock, Charbel Kaed,
Charlotte Banks, Che Diaz, Chris Larkin, Christy Lian, Claire Cui, Clare Bycroft, Corentin Tallec,
Daniel Herndon, Dave Burke, David Battle, David Engel, Dipannita Shaw, Donghyun Koo, Doug
Ritchie, Dragos Stefanescu, Elissa Wolf, Emre Sargin, Eric Herren, Estella King, Fatema Alkhanaizi,
Felix Gimeno, Fernando Pereira, Florent Altché, Gabriel Carvajal, Gaurav Gandhi, George Powell,
Goran Pavičić, Harry Richardson, Hassan Wassel, Hongji Li, Idan Szpektor, Igor Ivanisevic, Ivan
Jambrešić, Ivan Jurin, Jade Fowler, James Assiene, Jay Yagnik, Jean-bastien Grill, Jeff Seibert, Jenna
LaPlante, Jessica Austin, Jianxing Lu, Jim O’Keeffe, Jin Huang, Joe Heyward, Johannes Welbl, John
Jumper, Jonathan Caton, Josh Woodward, Joshua Foster, Kathryn Tunyasuvunakool, Katrina Wong,
Kavya Kopparapu, Kelvin Nguyen, Kira Yin, Konstantin Sharlaimov, Kun Li, Lee Hong, Lilly Taylor,
Longfei Shen, Luc Mercier, Maciej Mikuła, Mania Abdi, Manuel Sanchez, Maria Ines Aranguren,
Mario Carlos Cortes III, Matthew Tait, Matthias Lochbrunner, Mehdi Ghissassi, Micah Mosley, Michael
Bendersky, Michael Figurnov, Michael Harris, Michael Mathieu, Michael O’Neill, Michael Vorburger,
Mihir Paradkar, Nandita Dukkipati, Nathan Carter, Nathan Watson, Neil Rabinowitz, Nikhil Dandekar,
Nishant Ranka, Olcan Sercinoglu, Olivier Lacombe, Ottavia Bertolli, Paul Caron, Pranesh Srinivasan,
Praveen Kumar, Rahul Sukthankar, Raia Hadsell, Rajagopal Ananthanarayanan, Roberto Lupi, Rosie
Zou, Sachin Menezes, Sadegh Jazayeri, Sam Cheung, Sameer Bidichandani, Sania Alex, Sanjiv
Kumar, Sara Wiltberger, Sarah Fitzgerald, Saz Basu, Sebastian Nowozin, Shannon Hepburn, Shayne
Cardwell,Srinivasan Venkatachary, Sugato Basu, Sundar Pichai, Sundeep Tirumalareddy, Susannah
Young, Swetha Vijayaraghavan, Tania Bedrax-Weiss, Taylor Applebaum, Teiva Harsanyi, Terry Chen,
Tim Blyth, Ting Liu, Tom Cobley, Tomas Izo, Trystan Upstill, Varun Singhai, Vedrana Klarić Trupčević,
Victor Cai, Vladimir Pudovkin, Vu Dang, Wenbo Zhao, Wesley Crow, Wesley Szeng, Xiaodan Song,
Yazhou Zu, Ye Tian, Yicong Wang, Yixing Wang, Yossi Matias, Yunlong Jiao, Zachary Jessup, Zhenchuan
Pang, Žiga Avsec, Zimeng Yang, and Zoubin Ghahramani. We’d also like to recognize the AlphaCode
team, the Borg Scheduling team, the Facilities team, the Gemini Demo Team, the Global Server Ops
(GSO) team, the JAX team, the the Legal team, ML SRE team, the ML Supercomputer (MLSC) team,
the PartIR team, the Platforms Infrastructure Engineering (PIE) team, and the XLA Compiler team.
   We thank everyone at Google not explicitly mentioned above, who have shared excitement, given
feedback on early Gemini models or created interesting demo uses of Gemini, and worked with or
supported the core Gemini team on many aspects of this project.




                                                                                                   70
                         Gemini: A Family of Highly Capable Multimodal Models



10. Appendix

10.1. Gemini Ultra Model Card

                                        Model summary
   Model architecture        Gemini V1.0 is a new family of state-of-the-art language models,
                             containing variants known as Nano, Pro and Ultra (ordered
                             by parameter count) based on a decoder-only Transformer
                             architecture (Vaswani et al., 2017a). Models are trained to
                             support 32K context length, employing efficient attention
                             mechanisms such as multi-query attention (Shazeer, 2019b).
                             Gemini is trained jointly across image, audio, video and text
                             data for the purpose of building a model with both strong
                             generalist capabilities across modalities alongside cutting-edge
                             understanding and reasoning performance in each respective
                             domain.

                             The post-trained models described in this model card
                             are Gemini API and Gemini Apps model variants (Section 6)
                             built on top of the Gemini Ultra pre-trained model. During the
                             post-training process, additional architectural modifications are
                             also made to support the training of multi-objective reward
                             models for RLHF.
   Input(s)                  Text (e.g. a question, a prompt, a document(s) to be summa-
                             rized), images, video, audio files.
   Output(s)                 Generated text in response to the input (e.g. an answer to
                             the question, a summary of multiple documents, comparing
                             documents/videos).
                                               Usage
   Application               Gemini is designed for accelerating research on language
                             models, for use as a building block in features within Google
                             products, and as a building block for select applications such as
                             Gemini App and Search Generative Experience.

                             Services and products built on top of Gemini Ultra are
                             also being made available to external developers via Google
                             Cloud Vertex API and Google Labs, with additional process and
                             technical safeguards related to safety policies.
   Known Caveats             Gemini should not be made available as part of a general-purpose
                             service or product, or used within a specific downstream appli-
                             cation without a prior assessment and mitigation of the safety
                             and fairness concerns specific to the downstream use.




                                                                                                 71
                       Gemini: A Family of Highly Capable Multimodal Models




                             Implementation Frameworks
Hardware & Software        Hardware: Training was conducted on TPUv4 and TPUv5e
                           (Jouppi et al., 2020, 2023).

                           Software: JAX (Bradbury et al., 2018), ML Pathways
                           (Dean, 2021).

                           JAX allows researchers to leverage the latest generation
                           of hardware, including TPUs, for faster and more efficient
                           training of large models.

                           ML Pathways is infrastructure software to support Google’s
                           efforts to build artificially intelligent systems capable of
                           generalizing across multiple tasks. This is specially suitable for
                           foundation models, including large language models like the
                           Gemini V1.0 models.

                           Together, JAX and ML Pathways are used as described in
                           Section 3. The ’single controller’ programming model of
                           JAX and ML Pathways allows a single Python process to
                           orchestrate the entire training run, dramatically simplifying the
                           development workflow.
Compute Requirements       Not reported.
                                  Model Characteristics
Model initialization       Initial pretraining used random initialization. Post-training was
                           initialized from checkpoints obtained at the later stages of pre-
                           training. These checkpoints were fine-tuned using supervised
                           fine-tuning, and subsequently used to initialize reward model
                           training and RLHF.
Model Status               This is a static model trained on an offline dataset.
Model Stats                Not reported.
                                       Data overview
Training Dataset           Gemini models are trained on a dataset that is both multimodal
                           and multilingual. Our pre-training dataset uses data from web
                           documents, books, and code, and includes image, audio, and
                           video data.

                           Refer to Section 4 (Pre-Training Dataset) for further de-
                           tails.




                                                                                                72
                             Gemini: A Family of Highly Capable Multimodal Models



   Evaluation Dataset            We compare pre- and post-trained Gemini Ultra models to a
                                 suite of external LLMs and our previous best model PaLM 2
                                 across a series of text-based academic benchmarks covering
                                 reasoning, reading comprehension, STEM, and coding.

                                 We also evaluate Gemini models on four different mul-
                                 timodal capabilities: high-level object recognition using
                                 captioning or question-answering tasks such as VQAv2; fine-
                                 grained transcription using tasks such as TextVQA and DocVQA
                                 requiring the model to recognize low-level details; chart
                                 understanding requiring spatial understanding of input layout
                                 using ChartQA and InfographicVQA tasks; and multimodal
                                 reasoning using tasks such as Ai2D, MathVista and MMMU.

                                 Refer to Section 5 (Evaluation) for further details.
   Post-training Dataset         For post-training, we first collect a diverse set of prompts that
                                 are representative of real-world use cases. We then collect
                                 demonstration data of what the model’s output should be for
                                 a given prompt for supervised fine-tuning. We further collect
                                 different possible responses to a given prompt, and collect
                                 feedback data over these to train reward models.

                                 Refer to Section 6.3 (Post-Training Methods and Data)
                                 for further details.
                                          Evaluation Results
   Benchmark Information         See Section 5 (Evaluation).
   Evaluation Results            See Section 5 (Evaluation) and Section 6.4 (Post-Training Hu-
                                 man Evaluation).
                                    Model Usage & Limitations
   Sensitive Use                 For an analysis of risks and sensitive uses associated with the
                                 Gemini models, see Section 7.1 (Impact Assessment).
   Known Limitations             Gemini models can exhibit limitations outlined in Section 7.1
                                 (Impact Assessment). Gemini models should not be used for
                                 downstream applications without further analysis of potential
                                 harm in the proposed downstream application.
   Ethical Considerations &      A reflection on the potential risks and impacts of the Gemini V1.0
   Risks                         models can be found in Section 7 (Responsible Deployment).
                                 For evaluation details for a range of risks, see Section 7.4 (Safety
                                 Evaluations).



10.2. Chain-of-Thought Comparisons on MMLU benchmark

We contrast several chain-of-thought approaches on MMLU and discuss their results in this section. We
proposed a new approach where model produces k chain-of-thought samples, selects the majority vote
if the model is confident above a threshold, and otherwise defers to the greedy sample choice. The
                                                                                                        73
                                                      Gemini: A Family of Highly Capable Multimodal Models



thresholds are optimized for each model based on their validation split performance. The proposed
approach is referred to as uncertainty-routed chain-of-thought. The intuition behind this approach
is that chain-of-thought samples might degrade performance compared to the maximum-likelihood
decision when the model is demonstrably inconsistent. We compare the gains from the proposed
approach on both Gemini Ultra and GPT-4 in Figure 9. We find that Gemini Ultra benefits more from
this approach compared to using only chain-of-thought samples. GPT-4’s performance improves from
84.2% with greedy sampling to 87.3% with uncertainty-routed chain-of-thought approach with 32
samples, but it already achieves these gains from using 32 chain-of-thought samples. In contrast,
Gemini Ultra improves its performance significantly from 84.0% with greedy sampling to 90.0% with
uncertainty-routed chain-of-thought approach with 32 samples while it marginally improves to 85.0%
with the use of 32 chain-of-thought samples only.

                                                                 GPT-4 (gpt-4-0613)     Gemini Ultra

                                         90
                                                                                                                    90.04
                                                                           87.29                        87.29
                                         80   84.21      83.96                        84.99




            MMLU accuracy (test split)
                                         70

                                         60

                                         50

                                         40

                                         30

                                         20

                                         10

                                          0
                                                 Score Eval               Chain-of-Thought@32          Chain-of-Thought@32
                                                                                                       (Uncertainty-Routed)



Figure 9 | Chain-of-Thought with uncertainty routing on MMLU.




                                                                                                                              74
                               Gemini: A Family of Highly Capable Multimodal Models



10.3. Capabilities and Benchmarking Tasks

We use more than 50 benchmarks as a holistic harness to evaluate the Gemini models across text,
image, audio and video. We provide a detailed list of benchmarking tasks for six different capabilities in
text understanding and generation: factuality, long context, math/science, reasoning, summarization,
and multilinguality. We also enumerate the benchmarks used for image understanding, video
understanding, and audio understanding tasks.

   • Factuality: We use 5 benchmarks: BoolQ (Clark et al., 2019), NaturalQuestions-Closed
     (Kwiatkowski et al., 2019a), NaturalQuestions-Retrieved (Kwiatkowski et al., 2019a), Real-
     timeQA (Kasai et al., 2022b), TydiQA-noContext and TydiQA-goldP (Clark et al., 2020).
   • Long Context: We use 6 benchmarks: NarrativeQA (Kočiský et al., 2018), Scrolls-Qasper,
     Scrolls-Quality (Shaham et al., 2022), XLsum (En), XLSum (non-English languages) (Hasan
     et al., 2021), and one other internal benchmark.
   • Math/Science: We use 8 benchmarks: GSM8k (with CoT) (Cobbe et al., 2021), Hendryck’s
     MATH pass@1 (Hendrycks et al., 2021b), MMLU (Hendrycks et al., 2021a), Math-StackExchange,
     Math-AMC 2022-2023 problems, and three other internal benchmarks.
   • Reasoning: We use 7 benchmarks: BigBench Hard (with CoT) (Srivastava et al., 2022; Suzgun
     et al., 2022), CLRS (Veličković et al., 2022), Proof Writer (Tafjord et al., 2020), Reasoning-Fermi
     problems (Kalyan et al., 2021), Lambada (Paperno et al., 2016), HellaSwag (Zellers et al.,
     2019), DROP (Dua et al., 2019).
   • Summarization: We use 5 benchmarks: XL Sum (English), XL Sum (non-English languages)
     (Hasan et al., 2021), WikiLingua (non-English languages), WikiLingua (English) (Ladhak et al.,
     2020), XSum (Narayan et al., 2018).
   • Multilinguality: We use 10 benchmarks: XLSum (Non-English languages) (Hasan et al., 2021),
     WMT22 (Kocmi et al., 2022), WMT23 (Tom et al., 2023), FRMT (Riley et al., 2023), WikiLingua
     (Non-English languages) (Ladhak et al., 2020), TydiQA (no context), TydiQA (GoldP) (Clark
     et al., 2020), MGSM (Shi et al., 2023), translated MMLU (Hendrycks et al., 2021a), NTREX
     (Federmann et al., 2022), FLORES-200 (Team et al., 2022).
   • Image and Video: We use 9 benchmarks for image understanding: MMMU (Yue et al., 2023),
     TextVQA (Singh et al., 2019), DocVQA (Mathew et al., 2021), ChartQA (Masry et al., 2022),
     InfographicVQA (Mathew et al., 2022), MathVista (Lu et al., 2023), AI2D (Kembhavi et al.,
     2016), VQAv2 (Goyal et al., 2017), XM3600 (Thapliyal et al., 2022) for multi-lingual image
     understanding, and 6 benchmarks for video understanding: VATEX (Wang et al., 2019) for
     captioning in two different languages, YouCook2 (Zhou et al., 2018), NextQA (Xiao et al.,
     2021), ActivityNet-QA (Yu et al., 2019), and Perception Test MCQA (Pătrăucean et al., 2023).
   • Audio: We use 5 benchmarks including automatic speech recognition (ASR) tasks such as
     FLEURS (Conneau et al., 2023), VoxPopuli (Wang et al., 2021), Multi-lingual Librispeech (Pratap
     et al., 2020), and automatic speech translation task such as CoVoST 2 (Wang et al., 2020).




                                                                                                       75
                              Gemini: A Family of Highly Capable Multimodal Models



10.4. Qualitative Examples

This section shows sample qualitative examples from prompting the Gemini Ultra model. Some
illustrative examples of multimodal reasoning for image understanding tasks over charts, natural
images and memes are shown in Figures 10, 11, 13, 15, 16, and 17. Figure 12 shows an example of
image generation capabilities of Gemini Ultra where the user generates an interleaved sequence of
image and text to design a blog post. Beyond English, Figure 18 shows model’s capability to understand
images in a multilingual setting. Gemini models also show strong performance on multimodal image
understanding and reasoning in mathematics, as shown in Figures 14, 20 and 21. Figure 22 is an
example of complex multimodal reasoning demonstrating how the model composes complex image
understanding, code generation, and instruction following capabilities for a given user task. In Figure
19, we see another example of the model being able to generate working code and follow complex
user instructions. Finally, Figure 23 shows an example of Gemini Ultra’s capability of understanding
video by reasoning over temporally connected set of frames.




                                                                                                     76
                                   Gemini: A Family of Highly Capable Multimodal Models



10.4.1. Chart understanding and reasoning over data
   Prompt




   Spot a data point that stands out in these charts and what that implicates. Then produce a
   detailed markdown table for all the data shown.

   Model Response (rendered Markdown)

   The data point that stands out is the United States’ landfilled plastic waste, which is 73%. This
   is significantly higher than the other countries and regions shown. This implies that the United
   States is not doing as much to recycle or incinerate its plastic waste as other countries and
   regions.

    Country/Region          Landfilled (%)   Mismanaged (%)      Incinerated (%)    Recycled (%)
    World                        49                 22                  19                9
    United States                73                  4                  19                4
    Europe                       44                  6                  38                12
    Asia (excl. China and        39                 34                  19                8
    India)



Figure 10 | Solving a problem requiring multimodal chart understanding.The model has to read the
text, understand the connections between different data points and reason over them to recommend
an interesting point and follow the instructions to generate a markdown table (shown correctly
rendered).
Source: Our World In Data (Ritchie et al., 2023).




                                                                                                       77
                              Gemini: A Family of Highly Capable Multimodal Models



10.4.2. Multimodal question answering
   Prompt




   Do you know what it this plant? How do I best take care of it?

   Model Response

   This is a Persian shield plant. It is a tropical plant that is native to Southeast Asia. It is a
   popular houseplant because it is easy to care for and has beautiful purple leaves.

   Persian shield plants prefer bright, indirect light. They can tolerate some direct sunlight, but
   too much direct sunlight can scorch their leaves. They also prefer warm temperatures and
   high humidity.

   Persian shield plants need to be watered regularly, but they should not be allowed to sit in
   water. The soil should be kept moist but not soggy. They also benefit from being misted regularly.

   Persian shield plants can be fertilized every two weeks during the growing season. A balanced
   fertilizer is a good choice.

   Persian shield plants are susceptible to pests, such as mealybugs and spider mites. These pests
   can be controlled with insecticidal soap or neem oil.

   Persian shield plants can be propagated by stem cuttings. The cuttings should be taken in the
   spring or summer and should be rooted in water or a moist potting mix.


Figure 11 | Answering a multimodal information-seeking query. The model is able to recognize the
specific plant shown in the image and provide information about it. The model shows robustness to
typos as it is able to understand the user question despite them.
Source: photo taken by an author from the Gemini team.

                                                                                                        78
                             Gemini: A Family of Highly Capable Multimodal Models



10.4.3. Interleaved image and text generation
   Prompt

   Please create a blog post about a trip to New York, where a dog and his owner had lots of fun.
   Include a few pictures of the dog posing happily at different landmarks.

   Model Response

   Hi, I’m Cooper. I’m a 2 year old Goldendoodle who loves to travel. Last month, my family and
   I went to New York City. It was my first time in the Big Apple, and I had a blast!

   We stayed in a hotel near Times Square. It was so much fun to see all the bright lights and
   people. We went to see a Broadway show, and I even got to meet some of the actors.

   One of my favorite parts of New York City was Central Park. It was so big and beautiful. I
   loved running around and playing fetch. We also went to the Statue of Liberty and the Empire
   State Building. I was so excited to see all the famous landmarks.

   I had so much fun in New York City. It’s definitely one of my favorite places to travel. I can’t
   wait to go back someday!




Figure 12 | Generating interleaved text and images. The model is able to follow the instructions of
generating a blog post with images closely related to the text and with dog images showing high
levels of consistency across all images.




                                                                                                      79
                            Gemini: A Family of Highly Capable Multimodal Models



10.4.4. Image understanding and reasoning
    Prompt




    Look at this sequence of three shapes. What shape should come as the fourth shape? Explain
    your reasoning with detailed descriptions of the first shapes.

    Model Response

    The fourth shape should be a hexagon. The first shape is a triangle, the second shape is a
    square, and the third shape is a pentagon. The number of sides in each shape is increasing
    by one


Figure 13 | Solving a multimodal reasoning problem.The model is able to recognize shapes in the
image, understand their properties and reason about the relationship between them to predict the
next object.
Source: photo taken by an author from the Gemini team.




                                                                                                 80
                             Gemini: A Family of Highly Capable Multimodal Models



10.4.5. Geometrical reasoning
    Prompt




    Find the height of the parallelogram given its area with 100 square units.

    Model Response

    The area of the parallelogram is equal to the product of the base and the height. Hence
    100 = ( 𝑥 + 15) 𝑥 . We get 𝑥 2 + 15 𝑥 − 100 = 0. The roots of this equation are 𝑥1 = 5 and
    𝑥2 = −20. The second root is negative and is not acceptable. Final Answer: The final answer
    is 5. I hope it is correct.


Figure 14 | Solving a geometrical reasoning task. The model shows good understanding of the task
and is able to provide meaningful reasoning steps despite slightly unclear instructions.
Source: Lu et al. (2021).


10.4.6. Information seeking about objects
    Prompt




    Find a connection between these. Hint: think about historical events.

    Model Response

    The Moon is the only celestial body in which humans have played golf. In 1971, the Apollo
    14 crew hit two golf balls on the lunar surface.


Figure 15 | Solving a puzzle using multimodal inputs. The model recognizes the objects in the images
and identifies a commonality that connects the two objects.
Source: photo taken by an author from the Gemini team.




                                                                                                  81
                             Gemini: A Family of Highly Capable Multimodal Models



10.4.7. Multimodal reasoning based on visual cues
    Prompt




    Do you know the precise location where this image was taken?

    Model Response

    This image was taken in New York City. The Empire State Building is visible in the background.
    The street in the foreground is 8th Avenue. The cross street is West 34th Street.


Figure 16 | Identifying the objects in the image (the Empire State Building) and recognizing what
those are even with small levels of visual distortion in the image. Based on the image, the model is
also able to correctly identify the precise location of the person taking the photo.
Source: photo taken by an author from the Gemini team.




                                                                                                     82
                             Gemini: A Family of Highly Capable Multimodal Models



10.4.8. Multimodal humor understanding
    Prompt




    Can you explain this meme?

    Model Response

    The meme is about a person who is playing a game at 300 FPS, but their monitor is only
    75Hz. The person is trying to catch the game, but it is too fast for their monitor to display.
    The meme is funny because it is relatable to anyone who has ever played a game on a monitor
    that is not fast enough.


Figure 17 | Explanation of humor in a meme. The model is showing the ability to not only describe
what is happening in the image but also what it means even though the cultural context is not
mentioned explicitly in the image or the prompt.
Source: Hwang and Shwartz (2023).




                                                                                                     83
                            Gemini: A Family of Highly Capable Multimodal Models



10.5. Commonsense reasoning in a multilingual setting
    Prompt




    Explain what i should call my grandparents on each side of my family based on this chart

    Model Response

    On your father’s side, you would call your grandparents Yéye and Nǎinai. On your mother’s
    side, you would call your grandparents Wàigōng and Wàipó.


Figure 18 | Common-sense reasoning in images. The model is able to understand the relationships
represented in the graphs and reason about them in a multilingual setting.
Source: image created by an author from the Gemini team.




                                                                                                 84
                                               Gemini: A Family of Highly Capable Multimodal Models



10.5.1. Reasoning and code generation
  Prompt

  Create a web app called "Opossum Search":

  1. Every time you make a search query, it should redirect you to a google search with the same
  query, but the word opossum before it.
  2. It should be visually similar to Google search,
  3. Instead of the google logo, it should have a picture of an opossum from the internet.
  4. It should be a single html file, no separate js or css files.
  5. It should say "Powered by google search" in the footer.


  Model Response (rendered code)

  Here is the code for Opossum Search:
    1   <!DOCTYPE html>
    2   <html>
    3   <head>
    4    <title>Opossum Search</title>
    5    <style>
    6      body {
    7        font−family: Arial, sans−serif;
    8        margin: 0;
    9        padding: 0;
   10      }
   11
   12      header {
   13        background−color: #333;
   14        color: #fff;
   15        text−align: center;
   16        padding: 1rem 0;
   17      }
   18
   19      .search−bar {
   20        display: flex;
   21        justify−content: center;
   22        align−items: center;
   23        height: 50px;
   24        margin−top: 1rem;
   25      }
   26
   27      .search−input {
   28        width: 50%;
   29        padding: 0.5rem;
   30        border: 1px solid #ccc;
   31        border−radius: 5px;
   32      }
   33
   34      .search−button {
   35        margin−left: 1rem;
   36        padding: 0.5rem 1rem;
   37        border: 1px solid #ccc;
   38        border−radius: 5px;
   39        cursor: pointer;
   40      }
   41
   42      footer {
   43        background−color: #333;
   44        color: #fff;
   45        text−align: center;
   46        padding: 1rem 0;
   47        margin−top: 1rem;
   48      }
   49    </style>
   50   </head>
   51   <body>
   52    <header>
   53      <h1>Opossum Search</h1>
   54      <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Didelphis_virginiana.jpg/
   55      1200px−Didelphis_virginiana.jpg" alt="Opossum" width="100">
   56    </header>
   57
   58    <div class="search−bar">
   59     <input type="text" class="search−input" placeholder="Search the web">
   60     <button class="search−button">Search</button>
   61    </div>




                                                                                                           85
                                             Gemini: A Family of Highly Capable Multimodal Models




   62    <footer>
   63     Powered by Google Search
   64    </footer>
   65
   66    <script>
   67     const searchInput = document.querySelector('.search−input');
   68     const searchButton = document.querySelector('.search−button');
   69
   70      searchButton.addEventListener('click', () => {
   71        const query = searchInput.value;
   72        if (query) {
   73          window.location.href = `https://www.google.com/search?q=opossum+${query}`;
   74        }
   75      });
   76    </script>
   77   </body>


   This code creates a simple web app that looks similar to Google Search, but with an opossum
   logo. When you enter a search query and click the "Search" button, it will redirect you to a
   Google search with the word "opossum" added to the beginning of your query. The app is
   powered by Google Search, as indicated in the footer.

   Rendered Website




Figure 19 | Writing code for a website based on user request. The model follows the instructions and
requirements defined by the user and converts them to functioning code.




                                                                                                    86
                                Gemini: A Family of Highly Capable Multimodal Models



10.5.2. Mathematics: Calculus
   Prompt

   Find the derivative of sinh 𝑥 + cosh 𝑦 = 𝑥 + 𝑦 .

   Model Response (rendered latex)

   We have that 𝑑𝑥
                𝑑
                   sinh 𝑥 = cosh 𝑥 and 𝑑𝑥
                                       𝑑
                                          cosh 𝑦 = sinh 𝑦 𝑑𝑑𝑥𝑦 . Therefore,

                                                           𝑑𝑦     𝑑𝑦
                                        cosh 𝑥 + sinh 𝑦       =1+    .
                                                           𝑑𝑥     𝑑𝑥
   Solving for 𝑑𝑑𝑥𝑦 , we find that

                                               𝑑𝑦   1 − cosh 𝑥
                                                  =            .
                                               𝑑𝑥   sinh 𝑦 − 1
   Final Answer: The final answer is 1−cosh  𝑥
                                     sinh 𝑦 −1
                                               .


Figure 20 | Solving a calculus problem. The model is able to get a solution to a calculus problem with
step-by-step explanation and correctly defined LaTeX equations.
Source: question is provided by Macmillan Learning.




                                                                                                    87
                              Gemini: A Family of Highly Capable Multimodal Models



10.6. Multi-step reasoning and mathematics
   Prompt

   In the year 2018 factory A produced goods worth 500 million forints, while factory B produced
   goods worth 400 million forints. According to their long-term plans, factory A will increase
   production by 5% each year, while factory B will increase production by 6% per year.

   The above plans were published on an economy website. A debate started in the comment
   section of the article: one user claimed that the difference between the volumes of production
   between the two factories would get smaller and smaller in the next couple of years.

   Calculate the missing values of the table below and show that the above claim is not true.

   Header: | 2018 | 2019 | 2020 | 2021 |
   Production volume of factory A (mln) | 500 | | | |
   Production volume of factory B (mln) | 400 | | | |

   Model Response (rendered Markdown)

    Year        Production volume of           Production volume of             Difference
                factory A (mln)                factory B (mln)                  (mln)
    2018        500                            400                              100
    2019        525                            424                              101
    2020        551.25                         449.44                           101.81
    2021        578.81                         476.41                           102.4
   The production volume of factory A in 2019 is 500 * 1.05 = 525 million forints.
   The production volume of factory B in 2019 is 400 * 1.06 = 424 million forints.
   The difference in production volume in 2019 is 525 - 424 = 101 million forints.

   The production volume of factory A in 2020 is 525 * 1.05 = 551.25 million forints.
   The production volume of factory B in 2020 is 424 * 1.06 = 449.44 million forints.
   The difference in production volume in 2020 is 551.25 - 449.44 = 101.81 million forints.

   The production volume of factory A in 2021 is 551.25 * 1.05 = 578.81 million forints.
   The production volume of factory B in 2021 is 449.44 * 1.06 = 476.41 million forints.
   The difference in production volume in 2021 is 578.81 - 476.41 = 102.4 million forints.

   As you can see, the difference in production volume between the two factories is actually
   increasing, not decreasing. Therefore, the claim that the difference between the volumes of
   production between the two factories would get smaller and smaller in the next couple of years
   is not true.

Figure 21 | Solving a multi-step math problem. The model is able to understand the task and generate
a markdown table with correctly calculated values. It also explicitly follows the instructions to show
where the numbers come from and answer the question given in the task.
Source: Oktatási Hivatal (2023, p. 20)




                                                                                                    88
                                          Gemini: A Family of Highly Capable Multimodal Models



10.6.1. Complex image understanding, code generation, and instruction following
   Prompt




   I want you to take the function depicted in the top left subplot, multiply it by 1000, and then
   add it to the function depicted in the bottom left subplot. Generate matplotlib code for the
   single resulting plot.

   Model Response (rendered code)

    1   import numpy as np
    2   import matplotlib.pyplot as plt
    3
    4   x = np.linspace(0, 10, 1000)
    5   y1 = np.sin(x)
    6   y2 = np.exp(x)
    7
    8   plt.plot(x, 1000∗y1 + y2)
    9   plt.show()




   Rendered Graph




Figure 22 | Multimodal reasoning capabilities applied to code generation. Gemini Ultra needs to
perform inverse graphics task to infer the code that would have generated the plots, perform additional
mathematical transformations, and generate relevant code.
Source: figure generated by an author from the Gemini team.



                                                                                                     89
                             Gemini: A Family of Highly Capable Multimodal Models



10.6.2. Video understanding and reasoning
   Prompt (video)




   How could this person improve their technique?

   Model Response

   This soccer player needs to work on their ball-striking mechanics. Their foot is not making
   good contact with the ball, which is causing the ball to fly up and away from the goal. They
   also need to work on their body positioning. They should be leaning into the shot with their
   non-kicking leg in front of their kicking leg, and they should be following through with their
   kicking leg.

Figure 23 | Video understanding and reasoning over the situation presented in the video. Here, we
provide a video as input to the model together with a text prompt (images are provided here only
for visualization purposes). The model is able to analyze what happened in the video and provide
recommendations on how the actions in the video could have been better.
Video source: "Football/Soccer Penalty Miss"
https://www.youtube.com/watch?v=VmWxjmJ3mvs




                                                                                                    90