Sources — Compute-Structure AI History

                                                                                                                               2025-05-15


                                                                  Qwen3 Technical Report
                                                                               Qwen Team

                                                                https://huggingface.co/Qwen
                                                                https://modelscope.cn/organization/qwen
                                                                https://github.com/QwenLM/Qwen3

                                                                                Abstract

                                         In this work, we present Qwen3, the latest version of the Qwen model family. Qwen3
                                         comprises a series of large language models (LLMs) designed to advance performance,
                                         efficiency, and multilingual capabilities. The Qwen3 series includes models of both dense
                                         and Mixture-of-Expert (MoE) architectures, with parameter scales ranging from 0.6 to
                                         235 billion. A key innovation in Qwen3 is the integration of thinking mode (for complex,
                                         multi-step reasoning) and non-thinking mode (for rapid, context-driven responses) into a




arXiv:2505.09388v1 [cs.CL] 14 May 2025
                                         unified framework. This eliminates the need to switch between different models—–such
                                         as chat-optimized models (e.g., GPT-4o) and dedicated reasoning models (e.g., QwQ-
                                         32B)—–and enables dynamic mode switching based on user queries or chat templates.
                                         Meanwhile, Qwen3 introduces a thinking budget mechanism, allowing users to allocate
                                         computational resources adaptively during inference, thereby balancing latency and
                                         performance based on task complexity. Moreover, by leveraging the knowledge from the
                                         flagship models, we significantly reduce the computational resources required to build
                                         smaller-scale models, while ensuring their highly competitive performance. Empirical
                                         evaluations demonstrate that Qwen3 achieves state-of-the-art results across diverse
                                         benchmarks, including tasks in code generation, mathematical reasoning, agent tasks,
                                         etc., competitive against larger MoE models and proprietary models. Compared to its
                                         predecessor Qwen2.5, Qwen3 expands multilingual support from 29 to 119 languages
                                         and dialects, enhancing global accessibility through improved cross-lingual understand-
                                         ing and generation capabilities. To facilitate reproducibility and community-driven
                                         research and development, all Qwen3 models are publicly accessible under Apache 2.0.




                                                                                    1
1   Introduction

The pursuit of artificial general intelligence (AGI) or artificial super intelligence (ASI) has long been a goal
for humanity. Recent advancements in large foundation models, e.g., GPT-4o (OpenAI, 2024), Claude
3.7 (Anthropic, 2025), Gemini 2.5 (DeepMind, 2025), DeepSeek-V3 (Liu et al., 2024a), Llama-4 (Meta-AI,
2025), and Qwen2.5 (Yang et al., 2024b), have demonstrated significant progress toward this objective.
These models are trained on vast datasets spanning trillions of tokens across diverse domains and tasks,
effectively distilling human knowledge and capabilities into their parameters. Furthermore, recent
developments in reasoning models, optimized through reinforcement learning, highlight the potential
for foundation models to enhance inference-time scaling and achieve higher levels of intelligence, e.g.,
o3 (OpenAI, 2025), DeepSeek-R1 (Guo et al., 2025). While most state-of-the-art models remain proprietary,
the rapid growth of open-source communities has substantially reduced the performance gap between
open-weight and closed-source models. Notably, an increasing number of top-tier models (Meta-AI, 2025;
Liu et al., 2024a; Guo et al., 2025; Yang et al., 2024b) are now being released as open-source, fostering
broader research and innovation in artificial intelligence.
In this work, we introduce Qwen3, the latest series in our foundation model family, Qwen. Qwen3 is
a collection of open-weight large language models (LLMs) that achieve state-of-the-art performance
across a wide variety of tasks and domains. We release both dense and Mixture-of-Experts (MoE) models,
with the number of parameters ranging from 0.6 billion to 235 billion, to meet the needs of different
downstream applications. Notably, the flagship model, Qwen3-235B-A22B, is an MoE model with a
total of 235 billion parameters and 22 billion activated ones per token. This design ensures both high
performance and efficient inference.
Qwen3 introduces several key advancements to enhance its functionality and usability. First, it integrates
two distinct operating modes, thinking mode and non-thinking mode, into a single model. This allows
users to switch between these modes without alternating between different models, e.g., switching from
Qwen2.5 to QwQ (Qwen Team, 2024). This flexibility ensures that developers and users can adapt the
model’s behavior to suit specific tasks efficiently. Additionally, Qwen3 incorporates thinking budgets, pro-
viding users with fine-grained control over the level of reasoning effort applied by the model during task
execution. This capability is crucial to the optimization of computational resources and performance, tai-
loring the model’s thinking behavior to meet varying complexity in real-world applications. Furthermore,
Qwen3 has been pre-trained on 36 trillion tokens covering up to 119 languages and dialects, effectively
enhancing its multilingual capabilities. This broadened language support amplifies its potential for
deployment in global use cases and international applications. These advancements together establish
Qwen3 as a cutting-edge open-source large language model family, capable of effectively addressing
complex tasks across various domains and languages.
The pre-training process for Qwen3 utilizes a large-scale dataset consisting of approximately 36 trillion
tokens, curated to ensure linguistic and domain diversity. To efficiently expand the training data, we
employ a multi-modal approach: Qwen2.5-VL (Bai et al., 2025) is finetuned to extract text from extensive
PDF documents. We also generate synthetic data using domain-specific models: Qwen2.5-Math (Yang
et al., 2024c) for mathematical content and Qwen2.5-Coder (Hui et al., 2024) for code-related data. The
pre-training process follows a three-stage strategy. In the first stage, the model is trained on about 30
trillion tokens to build a strong foundation of general knowledge. In the second stage, it is further trained
on knowledge-intensive data to enhance reasoning abilities in areas like science, technology, engineering,
and mathematics (STEM) and coding. Finally, in the third stage, the model is trained on long-context
data to increase its maximum context length from 4,096 to 32,768 tokens.
To better align foundation models with human preferences and downstream applications, we employ a
multi-stage post-training approach that empowers both thinking (reasoning) and non-thinking modes. In
the first two stages, we focus on developing strong reasoning abilities through long chain-of-thought
(CoT) cold-start finetuning and reinforcement learning focusing on mathematics and coding tasks. In the
final two stages, we combine data with and without reasoning paths into a unified dataset for further
fine-tuning, enabling the model to handle both types of input effectively, and we then apply general-
domain reinforcement learning to improve performance across a wide range of downstream tasks. For
smaller models, we use strong-to-weak distillation, leveraging both off-policy and on-policy knowledge
transfer from larger models to enhance their capabilities. Distillation from advanced teacher models
significantly outperforms reinforcement learning in performance and training efficiency.
We evaluate both pre-trained and post-trained versions of our models across a comprehensive set of
benchmarks spanning multiple tasks and domains. Experimental results show that our base pre-trained
models achieve state-of-the-art performance. The post-trained models, whether in thinking or non-
thinking mode, perform competitively against leading proprietary models and large mixture-of-experts
(MoE) models such as o1, o3-mini, and DeepSeek-V3. Notably, our models excel in coding, mathematics,
and agent-related tasks. For example, the flagship model Qwen3-235B-A22B achieves 85.7 on AIME’24


                                                       2
and 81.5 on AIME’25 (AIME, 2025), 70.7 on LiveCodeBench v5 (Jain et al., 2024), 2,056 on CodeForces,
and 70.8 on BFCL v3 (Yan et al., 2024). In addition, other models in the Qwen3 series also show strong
performance relative to their size. Furthermore, we observe that increasing the thinking budget for
thinking tokens leads to a consistent improvement in the model’s performance across various tasks.
In the following sections, we describe the design of the model architecture, provide details on its training
procedures, present the experimental results of pre-trained and post-trained models, and finally, conclude
this technical report by summarizing the key findings and outlining potential directions for future
research.

2     Architecture
The Qwen3 series includes 6 dense models, namely Qwen3-0.6B, Qwen3-1.7B, Qwen3-4B, Qwen3-8B,
Qwen3-14B, and Qwen3-32B, and 2 MoE models, Qwen3-30B-A3B and Qwen3-235B-A22B. The flagship
model, Qwen3-235B-A22B, has a total of 235B parameters with 22B activated ones. Below, we elaborate
on the architecture of the Qwen3 models.
The architecture of the Qwen3 dense models is similar to Qwen2.5 (Yang et al., 2024b), including using
Grouped Query Attention (GQA, Ainslie et al., 2023), SwiGLU (Dauphin et al., 2017), Rotary Positional
Embeddings (RoPE, Su et al., 2024), and RMSNorm (Jiang et al., 2023) with pre-normalization. Besides,
we remove QKV-bias used in Qwen2 (Yang et al., 2024a) and introduce QK-Norm (Dehghani et al., 2023)
to the attention mechanism to ensure stable training for Qwen3. Key information on model architecture
is provided in Table 1.
The Qwen3 MoE models share the same fundamental architecture as the Qwen3 dense models. Key
information on model architecture is provided in Table 2. We follow Qwen2.5-MoE (Yang et al., 2024b)
and implement fine-grained expert segmentation (Dai et al., 2024). The Qwen3 MoE models have 128 total
experts with 8 activated experts per token. Unlike Qwen2.5-MoE, the Qwen3-MoE design excludes shared
experts. Furthermore, we adopt the global-batch load balancing loss (Qiu et al., 2025) to encourage expert
specialization. These architectural and training innovations have yielded substantial improvements in
model performance across downstream tasks.
Qwen3 models utilize Qwen’s tokenizer (Bai et al., 2023), which implements byte-level byte-pair encoding
(BBPE, Brown et al., 2020; Wang et al., 2020; Sennrich et al., 2016) with a vocabulary size of 151,669.

                          Table 1: Model architecture of Qwen3 dense models.

                 Models        Layers     Heads (Q / KV)        Tie Embedding     Context Length
                 Qwen3-0.6B        28          16 / 8                 Yes               32K
                 Qwen3-1.7B        28          16 / 8                 Yes               32K
                 Qwen3-4B          36          32 / 8                 Yes               128K
                 Qwen3-8B          36          32 / 8                 No                128K
                 Qwen3-14B         40          40 / 8                 No                128K
                 Qwen3-32B         64          64 / 8                 No                128K


                           Table 2: Model architecture of Qwen3 MoE models.

        Models              Layers      Heads (Q / KV)       # Experts (Total / Activated)   Context Length
        Qwen3-30B-A3B         48            32 / 4                     128 / 8                   128K
        Qwen3-235B-A22B       94            64 / 4                     128 / 8                   128K



3     Pre-training
In this section, we describe the construction of our pretraining data, the details of our pretraining
approach, and present experimental results from evaluating the base models on standard benchmarks.

3.1   Pre-training Data

Compared with Qwen2.5 (Yang et al., 2024b), we have significantly expanded the scale and diversity of
our training data. Specifically, we collected twice as many pre-training tokens—covering three times
more languages. All Qwen3 models are trained on a large and diverse dataset consisting of 119 languages
and dialects, with a total of 36 trillion tokens. This dataset includes high-quality content in various


                                                         3
domains such as coding, STEM (Science, Technology, Engineering, and Mathematics), reasoning tasks,
books, multilingual texts, and synthetic data.
To further expand the pre-training data corpus, we first employ the Qwen2.5-VL model (Bai et al., 2025)
to perform text recognition on a large volume of PDF-like documents. The recognized text is then refined
using the Qwen2.5 model (Yang et al., 2024b), which helps improve its quality. Through this two-step
process, we are able to obtain an additional set of high-quality text tokens, amounting to trillions in total.
Besides, we employ Qwen2.5 (Yang et al., 2024b), Qwen2.5-Math (Yang et al., 2024c), and Qwen2.5-Coder
(Hui et al., 2024) models to synthesize trillions of text tokens in different formats, including textbooks,
question-answering, instructions, and code snippets, covering dozens of domains. Finally, we further
expand the pre-training corpus by incorporating additional multilingual data and introducing more
languages. Compared to the pre-training data used in Qwen2.5, the number of supported languages has
been significantly increased from 29 to 119, enhancing the model’s linguistic coverage and cross-lingual
capabilities.
We have developed a multilingual data annotation system designed to enhance both the quality and
diversity of training data. This system has been applied to our large-scale pre-training datasets, annotating
over 30 trillion tokens across multiple dimensions such as educational value, fields, domains, and safety.
These detailed annotations support more effective data filtering and combination. Unlike previous
studies (Xie et al., 2023; Fan et al., 2023; Liu et al., 2024b) that optimize the data mixture at the data source
or domain level, our method optimizes the data mixture at the instance-level through extensive ablation
experiments on small proxy models with the fine-grained data labels.

3.2   Pre-training Stage

The Qwen3 models are pre-trained through a three-stage process:

      (1) General Stage (S1): At the first pre-training stage, all Qwen3 models are trained on over 30
          trillion tokens using a sequence length of 4,096 tokens. At this stage, the models have been fully
          pre-trained on language proficiency and general world knowledge, with training data covering
          119 languages and dialects.
      (2) Reasoning Stage (S2): To further improve the reasoning ability, we optimize the pre-training
          corpus of this stage by increasing the proportion of STEM, coding, reasoning, and synthetic data.
          The models are further pre-trained with about 5T higher-quality tokens at a sequence length of
          4,096 tokens. We also accelerate the learning rate decay during this stage.
      (3) Long Context Stage: In the final pre-training stage, we collect high-quality long context corpora
          to extend the context length of Qwen3 models. All models are pre-trained on hundreds of billions
          of tokens with a sequence length of 32,768 tokens. The long context corpus includes 75% of text
          between 16,384 to 32,768 tokens in length, and 25% of text between 4,096 to 16,384 in length.
          Following Qwen2.5 (Yang et al., 2024b), we increase the base frequency of RoPE from 10,000 to
          1,000,000 using the ABF technique (Xiong et al., 2023). Meanwhile, we introduce YARN (Peng
          et al., 2023) and Dual Chunk Attention (DCA, An et al., 2024) to achieve a four-fold increase in
          sequence length capacity during inference.

Similar to Qwen2.5 (Yang et al., 2024b), we develop scaling laws for optimal hyper-parameters (e.g.,
learning rate scheduler, and batch size) predictions based on three pre-training stages mentioned above.
Through extensive experiments, we systematically study the relationship between model architecture,
training data, training stage, and optimal training hyper-parameters. Finally, we set the predicted optimal
learning rate and batch size strategy for each dense or MoE model.

3.3   Pre-training Evaluation

We conduct comprehensive evaluations of the base language models of the Qwen3 series. The evaluation
of base models mainly focuses on their performance in general knowledge, reasoning, mathematics,
scientific knowledge, coding, and multilingual capabilities. The evaluation datasets for pre-trained base
models include 15 benchmarks:

       • General Tasks: MMLU (Hendrycks et al., 2021a) (5-shot), MMLU-Pro (Wang et al., 2024) (5-
         shot, CoT), MMLU-redux (Gema et al., 2024) (5-shot), BBH (Suzgun et al., 2023) (3-shot, CoT),
         SuperGPQA (Du et al., 2025)(5-shot, CoT).
       • Math & STEM Tasks: GPQA (Rein et al., 2023) (5-shot, CoT), GSM8K (Cobbe et al., 2021) (4-shot,
         CoT), MATH (Hendrycks et al., 2021b) (4-shot, CoT).



                                                       4
     • Coding Tasks: EvalPlus (Liu et al., 2023a) (0-shot) (Average of HumanEval (Chen et al., 2021),
       MBPP (Austin et al., 2021), Humaneval+, MBPP+) (Liu et al., 2023a), MultiPL-E (Cassano et al.,
       2023) (0-shot) (Python, C++, JAVA, PHP, TypeScript, C#, Bash, JavaScript), MBPP-3shot (Austin
       et al., 2021), CRUX-O of CRUXEval (1-shot) (Gu et al., 2024).
     • Multilingual Tasks: MGSM (Shi et al., 2023) (8-shot, CoT), MMMLU (OpenAI, 2024) (5-shot),
       INCLUDE (Romanou et al., 2024) (5-shot).

For the base model baselines, we compare the Qwen3 series base models with the Qwen2.5 base models
(Yang et al., 2024b) and other leading open-source base models, including DeepSeek-V3 Base (Liu et al.,
2024a), Gemma-3 (Team et al., 2025), Llama-3 (Dubey et al., 2024), and Llama-4 (Meta-AI, 2025) series
base models, in terms of scale of parameters. All models are evaluated using the same evaluation pipeline
and the widely-used evaluation settings to ensure fair comparison.

Summary of Evaluation Results Based on the overall evaluation results, we highlight some key
conclusions of Qwen3 base models.

    (1) Compared with the previously open-source SOTA dense and MoE base models (such as DeepSeek-
        V3 Base, Llama-4-Maverick Base, and Qwen2.5-72B-Base), Qwen3-235B-A22B-Base outperforms
        these models in most tasks with significantly fewer total parameters or activated parameters.
    (2) For the Qwen3 MoE base models, our experimental results indicate that: (a) Using the same
        pre-training data, Qwen3 MoE base models can achieve similar performance to Qwen3 dense
        base models with only 1/5 activated parameters. (b) Due to the improvements of the Qwen3
        MoE architecture, the scale-up of the training tokens, and more advanced training strategies,
        the Qwen3 MoE base models can outperform the Qwen2.5 MoE base models with less than 1/2
        activated parameters and fewer total parameters. (c) Even with 1/10 of the activated parameters of
        the Qwen2.5 dense base model, the Qwen3 MoE base model can achieve comparable performance,
        which brings us significant advantages in inference and training costs.
    (3) The overall performance of the Qwen3 dense base models is comparable to the Qwen2.5 base
        models at higher parameter scales. For example, Qwen3-1.7B/4B/8B/14B/32B-Base achieve
        comparable performance to Qwen2.5-3B/7B/14B/32B/72B-Base, respectively. Especially in
        STEM, coding, and reasoning benchmarks, the performance of Qwen3 dense base models even
        surpasses Qwen2.5 base models at higher parameter scales.

The detailed results are as follows.

Qwen3-235B-A22B-Base We compare Qwen3-235B-A22B-Base to our previous similar-sized MoE
Qwen2.5-Plus-Base (Yang et al., 2024b) and other leading open-source base models: Llama-4-Maverick
(Meta-AI, 2025), Qwen2.5-72B-Base (Yang et al., 2024b), DeepSeek-V3 Base (Liu et al., 2024a). From
the results in Table 3, the Qwen3-235B-A22B-Base model attains the highest performance scores across
most of the evaluated benchmarks. We further compare Qwen3-235B-A22B-Base with other baselines
separately for the detailed analysis.

    (1) Compared with the recently open-source model Llama-4-Maverick-Base, which has about twice
        the number of parameters, Qwen3-235B-A22B-Base still performs better on most benchmarks.
    (2) Compared with the previously state-of-the-art open-source model DeepSeek-V3-Base, Qwen3-
        235B-A22B-Base outperforms DeepSeek-V3-Base on 14 out of 15 evaluation benchmarks with
        only about 1/3 the total number of parameters and 2/3 activated parameters, demonstrating the
        powerful and cost-effectiveness of our models.
    (3) Compared with our previous MoE Qwen2.5-Plus of similar size, Qwen3-235B-A22B-Base sig-
        nificantly outperforms it with fewer parameters and activated parameters, which shows the
        remarkable advantages of Qwen3 in pre-training data, training strategy, and model architecture.
    (4) Compared with our previous flagship open-source dense model Qwen2.5-72B-Base, Qwen3-
        235B-A22B-Base surpasses the latter in all benchmarks and uses fewer than 1/3 of the activated
        parameters. Meanwhile, due to the advantage of the model architecture, the inference costs and
        training costs on each trillion tokens of Qwen3-235B-A22B-Base are much cheaper than those of
        Qwen2.5-72B-Base.

Qwen3-32B-Base Qwen3-32B-Base is our largest dense model among the Qwen3 series. We compare
it to the baselines of similar sizes, including Gemma-3-27B (Team et al., 2025) and Qwen2.5-32B (Yang
et al., 2024b). In addition, we introduce two strong baselines: the recently open-source MoE model Llama-
4-Scout, which has three times the parameters of Qwen3-32B-Base but half the activated parameters;


                                                    5
Table 3: Comparison among Qwen3-235B-A22B-Base and other representative strong open-source
baselines. The highest, the second-best scores are shown in bold and underlined, respectively.

                       Qwen2.5-72B Qwen2.5-Plus Llama-4-Maverick DeepSeek-V3 Qwen3-235B-A22B
                          Base        Base            Base          Base          Base
  Architecture           Dense         MoE                 MoE           MoE             MoE
  # Total Params          72B          271B                402B          671B            235B
  # Activated Params      72B           37B                 17B           37B             22B
                                               General Tasks
  MMLU                    86.06        85.02              85.16          87.19           87.81
  MMLU-Redux              83.91        82.69              84.05          86.14           87.40
  MMLU-Pro                58.07        63.52              63.91          59.84           68.18
  SuperGPQA               36.20        37.18              40.85          41.53           44.06
  BBH                     86.30        85.60              83.62          86.22           88.87
                                          Math & STEM Tasks
  GPQA                    45.88        41.92              43.94          41.92           47.47
  GSM8K                   91.50        91.89              87.72          87.57           94.39
  MATH                    62.12        62.78              63.32          62.62           71.84
                                               Coding Tasks
  EvalPlus                65.93        61.43              68.38          63.75           77.60
  MultiPL-E               58.70        62.16              57.28          62.26           65.94
  MBPP                    76.00        74.60              75.40          74.20           81.40
  CRUX-O                  66.20        68.50              77.00          76.60           79.00
                                           Multilingual Tasks
  MGSM                    82.40        82.21              79.69          82.68           83.53
  MMMLU                   84.40        83.49              83.09          85.88           86.70
  INCLUDE                 69.05        66.97              73.47          75.17           73.46




Table 4: Comparison among Qwen3-32B-Base and other strong open-source baselines. The highest
and second-best scores are shown in bold and underlined, respectively.

                         Qwen2.5-32B   Qwen2.5-72B       Gemma-3-27B    Llama-4-Scout   Qwen3-32B
                            Base          Base              Base            Base          Base
    Architecture             Dense         Dense               Dense        MoE           Dense
    # Total Params            32B           72B                 27B         109B           32B
    # Activated Params        32B           72B                 27B          17B           32B
                                               General Tasks
    MMLU                     83.32         86.06                78.69       78.27         83.61
    MMLU-Redux               81.97         83.91                76.53       71.09         83.41
    MMLU-Pro                 55.10         58.07                52.88       56.13         65.54
    SuperGPQA                33.55         36.20                29.87       26.51         39.78
    BBH                      84.48         86.30                79.95       82.40         87.38
                                          Math & STEM Tasks
    GPQA                     47.97         45.88                26.26       40.40         49.49
    GSM8K                    92.87         91.50                81.20       85.37         93.40
    MATH                     57.70         62.12                51.78       51.66         61.62
                                               Coding Tasks
    EvalPlus                 66.25         65.93                55.78       59.90         72.05
    MultiPL-E                58.30         58.70                45.03       47.38         67.06
    MBPP                     73.60         76.00                68.40       68.60         78.20
    CRUX-O                   67.80         66.20                60.00       61.90         72.50
                                           Multilingual Tasks
    MGSM                     78.12         82.40                73.74       79.93         83.06
    MMMLU                    82.40         84.40                77.62       74.83         83.83
    INCLUDE                  64.35         69.05                68.94       68.09         67.87




                                                    6
Table 5: Comparison among Qwen3-14B-Base, Qwen3-30B-A3B-Base, and other strong open-source
baselines. The highest and second-best scores are shown in bold and underlined, respectively.

                     Gemma-3-12B Qwen2.5-14B Qwen2.5-32B Qwen2.5-Turbo Qwen3-14B Qwen3-30B-A3B
                        Base        Base        Base         Base        Base        Base
Architecture            Dense       Dense            Dense           MoE         Dense      MoE
# Total Params           12B         14B              32B            42B          14B       30B
# Activated Params       12B         14B              32B             6B          14B        3B
                                                General Tasks
MMLU                    73.87        79.66            83.32          79.50        81.05     81.38
MMLU-Redux              70.70        76.64            81.97          77.11        79.88     81.17
MMLU-Pro                44.91        51.16            55.10          55.60        61.03     61.49
SuperGPQA               24.61        30.68            33.55          31.19        34.27     35.72
BBH                     74.28        78.18            84.48          76.10        81.07     81.54
                                             Math & STEM Tasks
GPQA                    31.31        32.83            47.97          41.41        39.90     43.94
GSM8K                   78.01        90.22            92.87          88.32        92.49     91.81
MATH                    44.43        55.64            57.70          55.60        62.02     59.04
                                                 Coding Tasks
EvalPlus                52.65        60.70            66.25          61.23        72.23     71.45
MultiPL-E               43.03        54.79            58.30          53.24        61.69     66.53
MBPP                    60.60        69.00            73.60          67.60        73.40     74.40
CRUX-O                  52.00        61.10            67.80          60.20        68.60     67.20
                                              Multilingual Tasks
MGSM                    64.35        74.68            78.12          70.45        79.20     79.11
MMMLU                   72.50        78.34            82.40          79.76        79.69     81.46
INCLUDE                 63.34        60.26            64.35          59.25        64.55     67.00




Table 6: Comparison among Qwen8B-Base and other strong open-source baselines. The highest and
second-best scores are shown in bold and underlined, respectively.

                                    Llama-3-8B      Qwen2.5-7B     Qwen2.5-14B   Qwen3-8B
                                       Base           Base            Base         Base
               Architecture           Dense            Dense         Dense        Dense
               # Total Params          8B               7B            14B          8B
               # Activated Params      8B               7B            14B          8B
                                                General Tasks
               MMLU                   66.60               74.16       79.66        76.89
               MMLU-Redux             61.59               71.06       76.64        76.17
               MMLU-Pro               35.36               45.00       51.16        56.73
               SuperGPQA              20.54               26.34       30.68        31.64
               BBH                    57.70               70.40       78.18        78.40
                                             Math & STEM Tasks
               GPQA                   25.80               36.36       32.83        44.44
               GSM8K                  55.30               85.36       90.22        89.84
               MATH                   20.50               49.80       55.64        60.80
                                                Coding Tasks
               EvalPlus               44.13               62.18       60.70        67.65
               MultiPL-E              31.45               50.73       54.79        58.75
               MBPP                   48.40               63.40       69.00        69.80
               CRUX-O                 36.80               48.50       61.10        62.00
                                              Multilingual Tasks
               MGSM                   38.92               63.60       74.68        76.02
               MMMLU                  59.65               71.34       78.34        75.72
               IINCLUDE               44.94               53.98       60.26        59.40




                                                      7
Table 7: Comparison among Qwen3-4B-Base and other strong open-source baselines. The highest and
second-best scores are shown in bold and underlined, respectively.

                                     Gemma-3-4B          Qwen2.5-3B          Qwen2.5-7B   Qwen3-4B
                                        Base               Base                Base         Base
              Architecture             Dense                 Dense             Dense           Dense
              # Total Params            4B                    3B                7B              4B
              # Activated Params        4B                    3B                7B              4B
                                                  General Tasks
              MMLU                       59.51               65.62             74.16           72.99
              MMLU-Redux                 56.91               63.68             71.06           72.79
              MMLU-Pro                   29.23               34.61             45.00           50.58
              SuperGPQA                  17.68               20.31             26.34           28.43
              BBH                        51.70               56.30             70.40           72.59
                                            Math & STEM Tasks
              GPQA                       24.24               26.26             36.36           36.87
              GSM8K                      43.97               79.08             85.36           87.79
              MATH                       26.10               42.64             49.80           54.10
                                                  Coding Tasks
              EvalPlus                   43.23               46.28             62.18           63.53
              MultiPL-E                  28.06               39.65             50.73           53.13
              MBPP                       46.40               54.60             63.40           67.00
              CRUX-O                     34.00               36.50             48.50           55.00
                                             Multilingual Tasks
              MGSM                       33.11               47.53             63.60           67.74
              MMMLU                      59.62               65.55             71.34           71.42
              INCLUDE                    49.06               45.90             53.98           56.29




Table 8: Comparison among Qwen3-1.7B-Base, Qwen3-0.6B-Base, and other strong open-source base-
lines. The highest and second-best scores are shown in bold and underlined, respectively.

                          Qwen2.5-0.5B     Qwen3-0.6B          Gemma-3-1B         Qwen2.5-1.5B         Qwen3-1.7B
                             Base            Base                 Base               Base                Base
     Architecture            Dense               Dense               Dense             Dense             Dense
     # Total Params           0.5B                0.6B                1B                1.5B              1.7B
     # Activated Params       0.5B                0.6B                1B                1.5B              1.7B
                                                  General Tasks
     MMLU                    47.50               52.81               26.26             60.90             62.63
     MMLU-Redux              45.10               51.26               25.99             58.46             61.66
     MMLU-Pro                15.69               24.74                9.72             28.53             36.76
     SuperGPQA               11.30               15.03                7.19             17.64             20.92
     BBH                     20.30               41.47               28.13             45.10             54.47
                                            Math & STEM Tasks
     GPQA                    24.75               26.77               24.75             24.24             28.28
     GSM8K                   41.62               59.59                2.20             68.54             75.44
     MATH                    19.48               32.44                3.66             35.00             43.50
                                                  Coding Tasks
     EvalPlus                31.85               36.23               8.98              44.80             52.70
     MultiPL-E               18.70               24.58               5.15              33.10             42.71
     MBPP                    29.80               36.60               9.20              43.60             55.40
     CRUX-O                  12.10               27.00               3.80              29.60             36.40
                                             Multilingual Tasks
     MGSM                    12.07               30.99                1.74             32.82             50.71
     MMMLU                   31.53               50.16               26.57             60.27             63.27
     INCLUDE                 24.74               34.26               25.62             39.55             45.57




                                                         8
and our previous flagship open-source dense model Qwen2.5-72B-Base, which has more than twice the
number of parameters compared to Qwen3-32B-Base. The results are shown in Table 4, which support
three key conclusions:

     (1) Compared with the similar-sized models, Qwen3-32B-Base outperforms Qwen2.5-32B-Base and
         Gemma-3-27B Base on most benchmarks. Notably, Qwen3-32B-Base achieves 65.54 on MMLU-
         Pro and 39.78 on SuperGPQA, significantly outperforming its predecessor Qwen2.5-32B-Base.
         In addition, Qwen3-32B-Base achieves significantly higher encoding benchmark scores than all
         baseline models.
     (2) Surprisingly, we find that Qwen3-32B-Base achieves competitive results compared to Qwen2.5-
         72B-Base. Although Qwen3-32B-Base has less than half the number of parameters of Qwen2.5-
         72B-Base, it outperforms Qwen2.5-72B-Base in 10 of the 15 evaluation benchmarks. On coding,
         mathematics, and reasoning benchmarks, Qwen3-32B-Base has remarkable advantages.
     (3) Compared to Llama-4-Scout-Base, Qwen3-32B-Base significantly outperforms it on all 15 bench-
         marks, with only one-third of the number of parameters of Llama-4-Scout-Base, but twice the
         number of activated parameters.

Qwen3-14B-Base & Qwen3-30B-A3B-Base The evaluation of the Qwen3-14B-Base and Qwen3-30B-
A3B-Base is compared against baselines of similar sizes, including Gemma-3-12B Base, Qwen2.5-14B
Base. Similarly, we also introduce two strong baselines: (1) Qwen2.5-Turbo (Yang et al., 2024b), which
has 42B parameters and 6B activated parameters. Note that its activated parameters are twice those of
Qwen3-30B-A3B-Base. (2) Qwen2.5-32B-Base, which has 11 times the activated parameters of Qwen3-
30B-A3B and more than twice that of Qwen3-14B. The results are shown in Table 5, where we can draw
the following conclusions.

     (1) Compared with the similar-sized models, Qwen3-14B-Base significantly performs better than
         Qwen2.5-14B-Base and Gemma-3-12B-Base on all 15 benchmarks.
     (2) Similarly, Qwen3-14B-Base also achieves very competitive results compared to Qwen2.5-32B-Base
         with less than half of the parameters.
     (3) With only 1/5 activated non-embedding parameters, Qwen3-30B-A3B significantly outperforms
         Qwen2.5-14B-Base on all tasks, and achieves comparable performance to Qwen3-14B-Base and
         Qwen2.5-32B-Base, which brings us significant advantages in inference and training costs.

Qwen3-8B / 4B / 1.7B / 0.6B-Base For edge-side models, we take similar-sized Qwen2.5, Llama-3, and
Gemma-3 base models as the baselines. The results can be seen in Table 6, Table 7, and Table 8. All Qwen3
8B / 4B / 1.7B / 0.6B-Base models continue to maintain strong performance across nearly all benchmarks.
Notably, Qwen3-8B / 4B / 1.7B-Base models even outperform larger size Qwen2.5-14B / 7B / 3B Base
models on over half of the benchmarks, especially on STEM-related and coding benchmarks, reflecting
the significant improvement of the Qwen3 models.

4    Post-training



                                  Stage 1                         Stage 2
       Base Models
                            Long-CoT Cold Start                Reasoning RL




                                  Stage 3                        Stage 4              Qwen3-235B-A22B
                           Thinking Mode Fusion                 General RL              Qwen3-32B
     Flagship Models
    Lightweight Models

                                                                                    Qwen3-30B-A3B
       Base Models                      Strong-to-Weak Distillation
                                                                                 14B/8B/4B/1.7B/0.6B



                       Figure 1: Post-training pipeline of the Qwen3 series models.


                                                        9
The post-training pipeline of Qwen3 is strategically designed with two core objectives:

      (1) Thinking Control: This involves the integration of two distinct modes, namely the “non-thinking”
          and “thinking” modes, providing users with the flexibility to choose whether the model should
          engage in reasoning or not, and to control the depth of thinking by specifying a token budget for
          the thinking process.
      (2) Strong-to-Weak Distillation: This aims to streamline and optimize the post-training process
          for lightweight models. By leveraging the knowledge from large-scale models, we substantially
          reduce both the computational costs and the development efforts required for building smaller-
          scale models.

As illustrated in Figure 1, the flagship models in the Qwen3 series follow a sophisticated four-stage
training process. The first two stages focus on developing the models’ “thinking” abilities. The next two
stages aim to integrate strong “non-thinking” functionalities into the models.
Preliminary experiments suggest that directly distilling the output logits from teacher models into
lightweight student models can effectively enhance their performance while maintaining fine-grained
control over their reasoning processes. This approach eliminates the necessity of performing an exhaustive
four-stage training process individually for every small-scale model. It leads to better immediate
performance, as indicated by higher Pass@1 scores, and also improves the model’s ability of exploration,
as reflected in improved Pass@64 results. In addition, it achieves these gains with much greater training
efficiency, requiring only 1/10 of the GPU hours compared to the four-stage training method.
In the following sections, we present the four-stage training process and provide a detailed explanation
of the Strong-to-Weak Distillation approach.

4.1   Long-CoT Cold Start

We begin by curating a comprehensive dataset that spans a wide range of categories, including math,
code, logical reasoning, and general STEM problems. Each problem in the dataset is paired with verified
reference answers or code-based test cases. This dataset serves as the foundation for the “cold start”
phase of long Chain-of-Thought (long-CoT) training.
The dataset construction involves a rigorous two-phase filtering process: query filtering and response
filtering. In the query filtering phase, we use Qwen2.5-72B-Instruct to identify and remove queries that
are not easily verifiable. This includes queries containing multiple sub-questions or those asking for
general text generation. Furthermore, we exclude queries that Qwen2.5-72B-Instruct can answer correctly
without using CoT reasoning. This helps prevent the model from relying on superficial guessing and
ensures that only complex problems requiring deeper reasoning are included. Additionally, we annotate
each query’s domain using Qwen2.5-72B-Instruct to maintain balanced domain representation across the
dataset.
After reserving a validation query set, we generate N candidate responses for each remaining query
using QwQ-32B (Qwen Team, 2025). When QwQ-32B consistently fails to generate correct solutions,
human annotators manually assess the accuracy of the responses. For queries with positive Pass@N,
further stringent filtering criteria are applied to remove responses that (1) yield incorrect final answers,
(2) contain substantial repetition, (3) clearly indicate guesswork without adequate reasoning, (4) exhibit
inconsistencies between the thinking and summary contents, (5) involve inappropriate language mixing or
stylistic shifts, or (6) are suspected of being overly similar to potential validation set items. Subsequently,
a carefully selected subset of the refined dataset is used for the initial cold-start training of the reasoning
patterns. The objective at this stage is to instill foundational reasoning patterns in the model without
overly emphasizing immediate reasoning performance. This approach ensures that the model’s potential
is not limited, allowing for greater flexibility and improvement during the subsequent reinforcement
learning (RL) phase. To achieve this objective effectively, it is preferable to minimize both the number of
training samples and the training steps during this preparatory phase.

4.2   Reasoning RL

The query-verifier pairs used in the Reasoning RL stage must satisfy the following four criteria: (1) They
were not used during the cold-start phase. (2) They are learnable for the cold-start model. (3) They are
as challenging as possible. (4) They cover a broad range of sub-domains. We ultimately collect a total
of 3,995 query-verifier pairs, and employed GRPO (Shao et al., 2024) to update the model parameters.
We observe that using a large batch size and a high number of rollouts per query, along with off-policy
training to improve sample efficiency, is beneficial to the training process. We have also addressed how
to balance exploration and exploitation by controlling the model’s entropy to increase steadily or remain


                                                      10
Table 9: Examples of SFT data for thinking and non-thinking modes during the thinking mode fusion
stage. For the thinking mode, the /think flag can be omitted since it represents the default behavior. This
feature has been implemented in the chat template1 supported by the Hugging Face’s tokenizer, where
the thinking mode can be disabled using an additional parameter enable thinking=False.

                        Thinking Mode                Non-Thinking Mode
                        <|im start|>user             <|im start|>user
                        {query} /think<|im end|>     {query} /no think<|im end|>
                        <|im start|>assistant        <|im start|>assistant
                        <think>                      <think>
                        {thinking content}
                        </think>                     </think>

                        {response}<|im end|>         {response}<|im end|>


stable, which is crucial for maintaining stable training. As a result, we achieve consistent improvements
in both training reward and validation performance over the course of a single RL run, without any
manual intervention on hyperparameters. For instance, the AIME’24 score of the Qwen3-235B-A22B
model increases from 70.1 to 85.1 over a total of 170 RL training steps.

4.3   Thinking Mode Fusion

The goal of the Thinking Mode Fusion stage is to integrate the “non-thinking” capabilities into the
previously developed “thinking” model. This approach allows developers to manage and control
reasoning behaviors, while also reducing the cost and complexity of deploying separate models for
thinking and non-thinking tasks. To achieve this, we conduct continual supervised fine-tuning (SFT)
on the Reasoning RL model and design a chat template to fuse the two modes. Moreover, we find that
models capable of handling both modes proficiently perform consistently well under different thinking
budgets.

Construction of SFT data. The SFT dataset combines both the “thinking” and “non-thinking” data.
To ensure that the performance of the Stage 2 model is not compromised by the additional SFT, the
“thinking” data is generated via rejection sampling on Stage 1 queries using the Stage 2 model itself. The
“non-thinking” data, on the other hand, is carefully curated to cover a diverse range of tasks, including
coding, mathematics, instruction-following, multilingual tasks, creative writing, question answering,
and role-playing. Additionally, we employ automatically generated checklists for assessing the response
quality of “non-thinking” data. To enhance the performance on tasks with low-resource languages, we
particularly increase the proportion of translation tasks.

Chat Template Design. To better integrate the two modes and enable users to dynamically switch the
model’s thinking process, we design chat templates for Qwen3, as shown in Table 9. Specifically, for
samples in thinking mode and non-thinking mode, we introduce /think and /no think flags in the user
query or system message, respectively. This allows the model to follow the user’s input and select the
appropriate thinking mode accordingly. For non-thinking mode samples, we retain an empty thinking
block in the assistant’s response. This design ensures internal format consistency within the model and
allows developers to prevent the model from engaging in thinking behavior by concatenating an empty
think block in the chat template. By default, the model operates in thinking mode; therefore, we add
some thinking mode training samples where the user queries do not include /think flags. For more
complex multi-turn dialogs, we randomly insert multiple /think and /no think flags into users’ queries,
with the model response adhering to the last flag encountered.

Thinking Budget. An additional advantage of Thinking Mode Fusion is that, once the model learns to
respond in both non-thinking and thinking modes, it naturally develops the ability to handle intermediate
cases—generating responses based on incomplete thinking. This capability lays the foundation for
implementing budget control over the model’s thinking process. Specifically, when the length of the
model’s thinking reaches a user-defined threshold, we manually halt the thinking process and insert
the stop-thinking instruction: “Considering the limited time by the user, I have to give the
solution based on the thinking directly now.\n</think>.\n\n”. After this instruction is inserted,
the model proceeds to generate a final response based on its accumulated reasoning up to that point. It
is worth noting that this ability is not explicitly trained but emerges naturally as a result of applying
Thinking Mode Fusion.



                                                    11
4.4   General RL

The General RL stage aims to broadly enhance the models’ capabilities and stability across diverse
scenarios. To facilitate this, we have established a sophisticated reward system covering over 20 distinct
tasks, each with customized scoring criteria. These tasks specifically target enhancements in the following
core capabilities:

       • Instruction Following: This capability ensures that models accurately interpret and follow user
         instructions, including requirements related to content, format, length, and the use of structured
         output, delivering responses that align with user expectations.
       • Format Following: In addition to explicit instructions, we expect the model to adhere to specific
         formatting conventions. For instance, it should respond appropriately to the /think and /
         no think flags by switching between thinking and non-thinking modes, and consistently use
         designated tokens (e.g., <think> and </think>) to separate the thinking and response parts in
         the final output.
       • Preference Alignment: For open-ended queries, preference alignment focuses on improving the
         model’s helpfulness, engagement, and style, ultimately delivering a more natural and satisfying
         user experience.
       • Agent Ability: This involves training the model to correctly invoke tools via designated interfaces.
         During the RL rollout, the model is allowed to perform complete multi-turn interaction cycles
         with real environment execution feedback, thereby improving its performance and stability in
         long-horizon decision-making tasks.
       • Abilities for Specialized Scenarios: In more specialized scenarios, we design tasks tailored to the
         specific context. For example, in Retrieval-Augmented Generation (RAG) tasks, we incorporate
         reward signals to guide the model toward generating accurate and contextually appropriate
         responses, thereby minimizing the risk of hallucination.

To provide feedback for the aforementioned tasks, we utilized three distinct types of rewards:

      (1) Rule-based Reward: The rule-based reward has been widely used in the reasoning RL stage,
          and is also useful for general tasks such as instruction following (Lambert et al., 2024) and format
          adherence. Well-designed rule-based rewards can assess the correctness of model outputs with
          high precision, preventing issues like reward hacking.
      (2) Model-based Reward with Reference Answer: In this approach, we provide a reference answer
          for each query and prompt Qwen2.5-72B-Instruct to score the model’s response based on this
          reference. This method allows for more flexible handling of diverse tasks without requiring strict
          formatting, avoiding false negatives that can occur with purely rule-based rewards.
      (3) Model-based Reward without Reference Answer: Leveraging human preference data, we train
          a reward model to assign scalar scores to model responses. This approach, which does not
          depend on a reference answer, can handle a broader range of queries while effectively enhancing
          the model’s engagement and helpfulness.

4.5   Strong-to-Weak Distillation

The Strong-to-Weak Distillation pipeline is specifically designed to optimize lightweight models, encom-
passing 5 dense models (Qwen3-0.6B, 1.7B, 4B, 8B, and 14B) and one MoE model (Qwen3-30B-A3B). This
approach enhances model performance while effectively imparting robust mode-switching capabilities.
The distillation process is divided into two primary phases:

      (1) Off-policy Distillation: At this initial phase, we combine the outputs of teacher models generated
          with both /think and /no think modes for response distillation. This helps lightweight student
          models develop basic reasoning skills and the ability to switch between different modes of
          thinking, laying a solid foundation for the next on-policy training phase.
      (2) On-policy Distillation: In this phase, the student model generates on-policy sequences for
          fine-tuning. Specifically, prompts are sampled, and the student model produces responses in
          either /think or /no think mode. The student model is then fine-tuned by aligning its logits
          with those of a teacher model (Qwen3-32B or Qwen3-235B-A22B) to minimize the KL divergence.

4.6   Post-training Evaluation

To comprehensively evaluate the quality of instruction-tuned models, we adopted automatic benchmarks
to assess model performance under both thinking and non-thinking modes. These benchmarks are


                                                      12
Table 10: Multilingual benchmarks and the included languages. The languages are identified in IETF
language tags.

 Benchmark        # Langs    Languages
 Multi-IF             8      en, es, fr, hi, it, pt, ru, zh
 INCLUDE             44      ar, az, be, bg, bn, de, el, es, et, eu, fa, fi, fr, he, hi, hr, hu, hy, id, it, ja, ka,
                             kk, ko, lt, mk, ml, ms, ne, nl, pl, pt, ru, sq, sr, ta, te, tl, tr, uk, ur, uz, vi, zh
 MMMLU               14      ar, bn, de, en, es, fr, hi, id, it, ja, ko, pt, sw, zh
 MT-AIME2024         55      af, ar, bg, bn, ca, cs, cy, da, de, el, en, es, et, fa, fi, fr, gu, he, hi, hr, hu, id,
                             it, ja, kn, ko, lt, lv, mk, ml, mr, ne, nl, no, pa, pl, pt, ro, ru, sk, sl, so, sq, sv,
                             sw, ta, te, th, tl, tr, uk, ur, vi, zh-Hans, zh-Hant
 PolyMath            18      ar, bn, de, en, es, fr, id, it, ja, ko, ms, pt, ru, sw, te, th, vi, zh
 MLogiQA             10      ar, en, es, fr, ja, ko, pt, th, vi, zh


categorized into several dimensions:

     • General Tasks: We utilize benchmarks including MMLU-Redux (Gema et al., 2024), GPQA-
       Diamond (Rein et al., 2023), C-Eval (Huang et al., 2023), and LiveBench (2024-11-25) (White et al.,
       2024). For GPQA-Diamond, we sample 10 times for each query and report the averaged accuracy.
     • Alignment Tasks: To evaluate how well the model aligns with human preferences, we employ
       a suite of specialized benchmarks. For instruction-following performance, we report the strict-
       prompt accuracy of IFEval (Zhou et al., 2023). To assess alignment with human preferences on
       general topics, we utilize Arena-Hard (Li et al., 2024) and AlignBench v1.1 (Liu et al., 2023b). For
       writing tasks, we rely on Creative Writing V3 (Paech, 2024) and WritingBench (Wu et al., 2025) to
       evaluate the model’s proficiency and creativity.
     • Math & Text Reasoning: For evaluating mathematical and logical reasoning skills, we employ
       high-level math benchmarks including MATH-500 (Lightman et al., 2023), AIME’24 and AIME’25
       (AIME, 2025), and text reasoning tasks including ZebraLogic (Lin et al., 2025) and AutoLogi
       (Zhu et al., 2025). For AIME problems, each year’s questions include Part I and Part II, totaling
       30 questions. For each question, we sample 64 times and take the average accuracy as the final
       score.
     • Agent & Coding: To test the model’s proficiency in coding and agent-based tasks, we use BFCL
       v3 (Yan et al., 2024), LiveCodeBench (v5, 2024.10-2025.02) (Jain et al., 2024), and Codeforces
       Ratings from CodeElo (Quan et al., 2025). For BFCL, all Qwen3 models are evaluated using the
       FC format, and yarn was used to deploy the models to a context length of 64k for Multi-Turn
       evaluation. Some baselines are derived from the BFCL leaderboard, taking the higher scores
       between FC and Prompt formats. For models not reported on the leaderboard, the Prompt
       formats are evaluated. For LiveCodeBench, for the non-thinking mode, we use the officially
       recommended prompt, while for the thinking mode, we adjust the prompt template to allow
       the model to think more freely, by removing the restriction You will not return anything
       except for the program. To evaluate the performance gap between models and competitive
       programming experts, we use CodeForces to calculate Elo ratings. In our benchmark, each
       problem is solved by generating up to eight independent reasoning attempts.
     • Multilingual Tasks: For multilingual capabilities, we evaluate four kinds of tasks: instruction
       following, knowledge, mathematics, and logical reasoning. Instruction following is assessed
       using Multi-IF (He et al., 2024), which focuses on 8 key languages. Knowledge assessment
       consisted of two types: regional knowledge evaluated through INCLUDE (Romanou et al.,
       2024), covering 44 languages, and general knowledge assessed with MMMLU (OpenAI, 2024)
       across 14 languages, excluding the unoptimized Yoruba language; for these two benchmarks,
       we sample only 10% of the original data to improve evaluation efficiency. The mathematics task
       employ MT-AIME2024 (Son et al., 2025), encompassing 55 languages, and PolyMath (Wang et al.,
       2025), which includes 18 languages. Logical reasoning is evaluated using MlogiQA, covering 10
       languages, sourced from Zhang et al. (2024).

For all Qwen3 models in the thinking mode, we utilize a sampling temperature of 0.6, a top-p value
of 0.95, and a top-k value of 20. Additionally, for Creative Writing v3 and WritingBench, we apply a
presence penalty of 1.5 to encourage the generation of more diverse content. For Qwen3 models in the
non-thinking mode, we configure the sampling hyperparameters with temperature = 0.7, top-p = 0.8,
top-k = 20, and presence penalty = 1.5. For both the thinking and non-thinking modes, we set the max
output length to 32,768 tokens, except AIME’24 and AIME’25 where we extend this length to 38,912
tokens to provide sufficient thinking space.


                                                      13
  Table 11: Comparison among Qwen3-235B-A22B (Thinking) and other reasoning baselines. The
  highest and second-best scores are shown in bold and underlined, respectively.

                                                                           Grok-3-Beta
                                           OpenAI-o1 DeepSeek-R1                       Gemini2.5-Pro Qwen3-235B-A22B
                                                                             (Think)
              Architecture                       -            MoE               -              -              MoE
              # Activated Params                 -             37B              -              -              22B
              # Total Params                     -            671B              -              -              235B
              MMLU-Redux                       92.8            92.9            -              93.7            92.7
  General     GPQA-Diamond                     78.0            71.5           80.2            84.0            71.1
   Tasks      C-Eval                           85.5            91.8            -              82.9            89.6
              LiveBench 2024-11-25             75.7            71.6            -              82.4            77.1
              IFEval strict prompt             92.6            83.3             -             89.5            83.4
              Arena-Hard                       92.1            92.3             -             96.4            95.6
Alignment
              AlignBench v1.1                  8.86            8.76             -             9.03            8.94
  Tasks
              Creative Writing v3              81.7            85.5             -             86.0            84.6
              WritingBench                     7.69            7.71             -             8.09            8.03
            MATH-500                           96.4            97.3                           98.8            98.0
            AIME’24                            74.3            79.8           83.9            92.0            85.7
Math & Text
            AIME’25                            79.2            70.0           77.3            86.7            81.5
 Reasoning
            ZebraLogic                         81.0            78.7            -              87.4            80.3
            AutoLogi                           79.8            86.1            -              85.4            89.0
              BFCL v3                              67.8         56.9           -              62.9            70.8
 Agent &
              LiveCodeBench v5                     63.9         64.3          70.6            70.4            70.7
 Coding
              CodeForces (Rating / Percentile) 1891 / 96.7% 2029 / 98.1%       -          2001 / 97.9%    2056 / 98.2%
             Multi-IF                          48.8            67.7             -             77.8            71.9
             INCLUDE                           84.6            82.7             -             85.1            78.7
Multilingual MMMLU 14 languages                88.4            86.4             -             86.9            84.3
  Tasks      MT-AIME2024                       67.4            73.5             -             76.9            80.8
             PolyMath                          38.9            47.1             -             52.2            54.7
             MLogiQA                           75.5            73.8             -             75.6            77.1

  Table 12: Comparison among Qwen3-235B-A22B (Non-thinking) and other non-reasoning baselines.
  The highest and second-best scores are shown in bold and underlined, respectively.

                                              GPT-4o                Qwen2.5-72B LLaMA-4
                                                        DeepSeek-V3                       Qwen3-235B-A22B
                                            -2024-11-20              -Instruct  -Maverick
               Architecture                          -         MoE            Dense          MoE            MoE
               # Activated Params                    -         37B             72B           17B            22B
               # Total Params                        -         671B            72B           402B           235B
               MMLU-Redux                       87.0           89.1            86.8           91.8           89.2
   General     GPQA-Diamond                     46.0           59.1            49.0           69.8           62.9
    Tasks      C-Eval                           75.5           86.5            84.7           83.5           86.1
               LiveBench 2024-11-25             52.2           60.5            51.4           59.5           62.5
               IFEval strict prompt             86.5           86.1            84.1           86.7           83.2
               Arena-Hard                       85.3           85.5            81.2           82.7           96.1
  Alignment
               AlignBench v1.1                  8.42           8.64            7.89           7.97           8.91
    Tasks
               Creative Writing v3              81.1           74.0            61.8           61.3           80.4
               WritingBench                     7.11           6.49            7.06           5.46           7.70
             MATH-500                           77.2           90.2            83.6           90.6           91.2
             AIME’24                            11.1           39.2            18.9           38.5           40.1
 Math & Text
             AIME’25                             7.6           28.8            15.0           15.9           24.7
  Reasoning
             ZebraLogic                         27.4           42.1            26.6           40.0           37.7
             AutoLogi                           65.9           76.1            66.1           75.2           83.3
               BFCL v3                              72.5        57.6            63.4          52.9           68.0
   Agent &
               LiveCodeBench v5                     32.7        33.1            30.7          37.2           35.3
   Coding
               CodeForces (Rating / Percentile) 864 / 35.4% 1134 / 54.1%    859 / 35.0%   712 / 24.3%    1387 / 75.7%
              Multi-IF                          65.6           55.6            65.3           75.5           70.2
              INCLUDE                           78.8           76.7            69.6           80.9           75.6
 Multilingual MMMLU 14 languages                80.3           81.1            76.9           82.5           79.8
   Tasks      MT-AIME2024                        9.2           20.9            12.7           27.0           32.4
              PolyMath                          13.7           20.4            16.9           26.1           27.0
              MLogiQA                           57.4           58.9            59.3           59.9           67.6



                                                             14
Summary of Evaluation Results From the evaluation results, we summarize several key conclusions of
the finalized Qwen3 models as follows:
    (1) Our flagship model, Qwen3-235B-A22B, demonstrates the state-of-the-art overall performance
        among open-source models in both the thinking and non-thinking modes, surpassing strong
        baselines such as DeepSeek-R1 and DeepSeek-V3. Qwen3-235B-A22B is also highly competitive
        to closed-source leading models, such as OpenAI-o1, Gemini2.5-Pro, and GPT-4o, showcasing its
        profound reasoning capabilities and comprehensive general abilities.
    (2) Our flagship dense model, Qwen3-32B, outperforms our previous strongest reasoning model,
        QwQ-32B, in most of the benchmarks, and performs comparably to the closed-source OpenAI-o3-
        mini, indicating its compelling reasoning capabilities. Qwen3-32B is also remarkably performant
        in the non-thinking mode and surpasses our previous flagship non-reasoning dense model,
        Qwen2.5-72B-Instruct.
    (3) Our lightweight models, including Qwen3-30B-A3B, Qwen3-14B, and other smaller dense ones,
        possess consistently superior performance to the open-source models with a close or larger
        amount of parameters, proving the success of our Strong-to-Weak Distillation approach.
The detailed results are as follows.

Qwen3-235B-A22B For our flagship model Qwen3-235B-A22B, we compare it with the leading reason-
ing and non-reasoning models. For the thinking mode, we take OpenAI-o1 (OpenAI, 2024), DeepSeek-R1
(Guo et al., 2025), Grok-3-Beta (Think) (xAI, 2025), and Gemini2.5-Pro (DeepMind, 2025) as the reasoning
baselines. For the non-thinking mode, we take GPT-4o-2024-11-20 (OpenAI, 2024), DeepSeek-V3 (Liu
et al., 2024a), Qwen2.5-72B-Instruct (Yang et al., 2024b), and LLaMA-4-Maverick (Meta-AI, 2025) as the
non-reasoning baselines. We present the evaluation results in Table 11 and 12.

    (1) From Table 11, with only 60% activated and 35% total parameters, Qwen3-235B-A22B (Thinking)
        outperforms DeepSeek-R1 on 17/23 the benchmarks, particularly on the reasoning-demanded
        tasks (e.g., mathematics, agent, and coding), demonstrating the state-of-the-art reasoning capabil-
        ities of Qwen3-235B-A22B among open-source models. Moreover, Qwen3-235B-A22B (Thinking)
        is also highly competitive to the closed-source OpenAI-o1, Grok-3-Beta (Think), and Gemini2.5-
        Pro, substantially narrowing the gap in the reasoning capabilities between open-source and
        close-source models.
    (2) From Table 12, Qwen3-235B-A22B (Non-thinking) exceeds the other leading open-source models,
        including DeepSeek-V3, LLaMA-4-Maverick, and our previous flagship model Qwen2.5-72B-
        Instruct, and also surpasses the closed-source GPT-4o-2024-11-20 in 18/23 the benchmarks,
        indicating its inherent strong capabilities even when not enhanced with the deliberate thinking
        process.

Qwen3-32B For our flagship dense model, Qwen3-32B, we take DeepSeek-R1-Distill-Llama-70B, OpenAI-
o3-mini (medium), and our previous strongest reasoning model, QwQ-32B (Qwen Team, 2025), as the
baselines in the thinking mode. We also take GPT-4o-mini-2024-07-18, LLaMA-4-Scout, and our previ-
ous flagship model, Qwen2.5-72B-Instruct, as the baselines in the non-thinking mode. We present the
evaluation results in Table 13 and 14.

    (1) From Table 13, Qwen3-32B (Thinking) outperforms QwQ-32B on 17/23 the benchmarks, making
        it the new state-of-the-art reasoning model at the sweet size of 32B. Moreover, Qwen3-32B (Think-
        ing) also competes with the closed-source OpenAI-o3-mini (medium) with better alignment and
        multilingual performance.
    (2) From Table 14, Qwen3-32B (Non-thinking) exhibits superior performance to all the baselines
        on almost all the benchmarks. Particularly, Qwen3-32B (Non-thinking) performs on par with
        Qwen2.5-72B-Instruct on the general tasks with significant advantages on the alignment, multi-
        lingual, and reasoning-related tasks, again proving the fundamental improvements of Qwen3
        over our previous Qwen2.5 series models.

Qwen3-30B-A3B & Qwen3-14B For Qwen3-30B-A3B and Qwen3-14B, we compare them with DeepSeek-
R1-Distill-Qwen-32B and QwQ-32B in the thinking mode, and Phi-4 (Abdin et al., 2024), Gemma-3-27B-IT
(Team et al., 2025), and Qwen2.5-32B-Instruct in the non-thinking mode, respectively. We present the
evaluation results in Table 15 and 16.

    (1) From Table 15, Qwen3-30B-A3B and Qwen3-14B (Thinking) are both highly competitive to
        QwQ-32B, especially on the reasoning-related benchmarks. It is noteworthy that Qwen3-30B-
        A3B achieves comparable performance to QwQ-32B with a smaller model size and less than


                                                   15
Table 13: Comparison among Qwen3-32B (Thinking) and other reasoning baselines. The highest and
second-best scores are shown in bold and underlined, respectively.

                                                    DeepSeek-R1                          OpenAI-o3-mini
                                                                          QwQ-32B                              Qwen3-32B
                                                  -Distill-Llama-70B                       (medium)
               Architecture                                Dense            Dense                  -                 Dense
               # Activated Params                           70B              32B                   -                  32B
               # Total Params                               70B              32B                   -                  32B
               MMLU-Redux                                   89.3             90.0              90.0                  90.9
  General      GPQA-Diamond                                 65.2             65.6              76.8                  68.4
   Tasks       C-Eval                                       71.8             88.4              75.1                  87.3
               LiveBench 2024-11-25                         54.5             72.0              70.0                  74.9
               IFEval strict prompt                         79.3             83.9              91.5                  85.0
               Arena-Hard                                   60.6             89.5              89.0                  93.8
 Alignment
               AlignBench v1.1                              6.74             8.70              8.38                  8.72
   Tasks
               Creative Writing v3                          62.1             82.4              74.8                  81.0
               WritingBench                                 6.08             7.86              7.52                  7.90
               MATH-500                                     94.5             98.0              98.0                  97.2
               AIME’24                                      70.0             79.5              79.6                  81.4
Math & Text
               AIME’25                                      56.3             69.5              74.8                  72.9
 Reasoning
               ZebraLogic                                   71.3             76.8              88.9                  88.8
               AutoLogi                                     83.5             88.1              86.3                  87.3
               BFCL v3                                      49.3             66.4              64.6               70.3
 Agent &
               LiveCodeBench v5                             54.5             62.7              66.3               65.7
 Coding
               CodeForces (Rating / Percentile)         1633 / 91.4%     1982 / 97.7%      2036 / 98.1%       1977 / 97.7%
               Multi-IF                                     57.6             68.3              48.4                  73.0
               INCLUDE                                      62.1             69.7              73.1                  73.7
Multilingual   MMMLU 14 languages                           69.6             80.9              79.3                  80.6
  Tasks        MT-AIME2024                                  29.3             68.0              73.9                  75.0
               PolyMath                                     29.4             45.9              38.6                  47.4
               MLogiQA                                      60.3             75.5              71.1                  76.3

Table 14: Comparison among Qwen3-32B (Non-thinking) and other non-reasoning baselines. The
highest and second-best scores are shown in bold and underlined, respectively.

                                                          GPT-4o-mini    LLaMA-4        Qwen2.5-72B
                                                                                                          Qwen3-32B
                                                           -2024-07-18    -Scout         -Instruct
                     Architecture                                  -        MoE            Dense            Dense
                     # Activated Params                            -        17B             72B              32B
                     # Total Params                                -        109B            72B              32B
                     MMLU-Redux                               81.5          86.3            86.8              85.7
       General       GPQA-Diamond                             40.2          57.2            49.0              54.6
        Tasks        C-Eval                                   66.3          78.2            84.7              83.3
                     LiveBench 2024-11-25                     41.3          47.6            51.4              59.8
                     IFEval strict prompt                     80.4          84.7            84.1              83.2
                     Arena-Hard                               74.9          70.5            81.2              92.8
      Alignment
                     AlignBench v1.1                          7.81          7.49            7.89              8.58
        Tasks
                     Creative Writing v3                      70.3          55.0            61.8              78.3
                     WritingBench                             5.98          5.49            7.06              7.54
                     MATH-500                                 78.2          82.6            83.6              88.6
                     AIME’24                                  8.1           28.6            18.9              31.0
     Math & Text
                     AIME’25                                  8.8           10.0            15.0              20.2
      Reasoning
                     ZebraLogic                               20.1          24.2            26.6              29.2
                     AutoLogi                                 52.6          56.8            66.1              78.5
                     BFCL v3                                  64.0           45.4           63.4              63.0
       Agent &
                     LiveCodeBench v5                         27.9           29.8           30.7              31.3
       Coding
                     CodeForces (Rating / Percentile)     1113 / 52.6%   981 / 43.7%    859 / 35.0%       1353 / 71.0%
                     Multi-IF                                 62.4          64.2            65.3              70.7
                     INCLUDE                                  66.0          74.1            69.6              70.9
     Multilingual    MMMLU 14 languages                       72.1          77.5            76.9              76.5
       Tasks         MT-AIME2024                               6.0          19.1            12.7              24.1
                     PolyMath                                 12.0          20.9            16.9              22.5
                     MLogiQA                                  42.6          53.9            59.3              62.9


                                                                   16
Table 15: Comparison among Qwen3-30B-A3B / Qwen3-14B (Thinking) and other reasoning baselines.
The highest and second-best scores are shown in bold and underlined, respectively.

                                                      DeepSeek-R1
                                                                                QwQ-32B         Qwen3-14B       Qwen3-30B-A3B
                                                    -Distill-Qwen-32B
                 Architecture                               Dense                Dense            Dense             MoE
                 # Activated Params                          32B                  32B              14B               3B
                 # Total Params                              32B                  32B              14B              30B
                 MMLU-Redux                                 88.2                   90.0            88.6              89.5
  General        GPQA-Diamond                               62.1                   65.6            64.0              65.8
   Tasks         C-Eval                                     82.2                   88.4            86.2              86.6
                 LiveBench 2024-11-25                       45.6                   72.0            71.3              74.3
                 IFEval strict prompt                       72.5                   83.9            85.4              86.5
                 Arena-Hard                                 60.8                   89.5            91.7              91.0
 Alignment
                 AlignBench v1.1                            7.25                   8.70            8.56              8.70
   Tasks
                 Creative Writing v3                        55.0                   82.4            80.3              79.1
                 WritingBench                               6.13                   7.86            7.80              7.70
                 MATH-500                                   94.3                   98.0            96.8              98.0
                 AIME’24                                    72.6                   79.5            79.3              80.4
Math & Text
                 AIME’25                                    49.6                   69.5            70.4              70.9
 Reasoning
                 ZebraLogic                                 69.6                   76.8            88.5              89.5
                 AutoLogi                                   74.6                   88.1            89.2              88.7
                 BFCL v3                                  53.5                     66.4           70.4               69.1
  Agent &
                 LiveCodeBench v5                         54.5                     62.7           63.5               62.6
  Coding
                 CodeForces (Rating / Percentile)     1691 / 93.4%             1982 / 97.7%   1766 / 95.3%       1974 / 97.7%
                 Multi-IF                                   31.3                   68.3            74.8              72.2
                 INCLUDE                                    68.0                   69.7            71.7              71.9
Multilingual     MMMLU 14 languages                         78.6                   80.9            77.9              78.4
  Tasks          MT-AIME2024                                44.6                   68.0            73.3              73.9
                 PolyMath                                   35.1                   45.9            45.8              46.1
                 MLogiQA                                    63.3                   75.5            71.1              70.1

Table 16: Comparison among Qwen3-30B-A3B / Qwen3-14B (Non-thinking) and other non-reasoning
baselines. The highest and second-best scores are shown in bold and underlined, respectively.

                                                                   Gemma-3       Qwen2.5-32B
                                                    Phi-4                                    Qwen3-14B Qwen3-30B-A3B
                                                                    -27B-IT       -Instruct
               Architecture                         Dense           Dense            Dense          Dense            MoE
               # Activated Params                    14B             27B              32B            14B              3B
               # Total Params                        14B             27B              32B            14B             30B
               MMLU-Redux                           85.3                82.6          83.9           82.0             84.1
  General      GPQA-Diamond                         56.1                42.4          49.5           54.8             54.8
   Tasks       C-Eval                               66.9                66.6          80.6           81.0             82.9
               LiveBench 2024-11-25                 41.6                49.2          50.0           59.6             59.4
               IFEval strict prompt                 62.1                80.6          79.5           84.8             83.7
               Arena-Hard                           75.4                86.8          74.5           86.3             88.0
Alignment
               AlignBench v1.1                      7.61                7.80          7.71           8.52             8.55
  Tasks
               Creative Writing v3                  51.2                82.0          54.6           73.1             68.1
               WritingBench                         5.73                7.22          5.90           7.24             7.22
            MATH-500                                80.8                90.0          84.6           90.0             89.8
            AIME’24                                 22.9                32.6          18.8           31.7             32.8
Math & Text
            AIME’25                                 17.3                24.0          12.8           23.3             21.6
 Reasoning
            ZebraLogic                              32.3                24.6          26.1           33.0             33.2
            AutoLogi                                66.2                64.2          65.5           82.0             81.5
               BFCL v3                              47.0         59.1                 62.8           61.5             58.6
 Agent &
               LiveCodeBench v5                     25.2         26.9                 26.4           29.0             29.8
 Coding
               CodeForces (Rating / Percentile) 1280 / 65.3% 1063 / 49.3%         903 / 38.2%    1200 / 58.6%     1267 / 64.1%
             Multi-IF                               49.5                69.8          63.2           72.9             70.8
             INCLUDE                                65.3                71.4          67.5           67.8             67.8
Multilingual MMMLU 14 languages                     74.7                76.1          74.2           72.6             73.8
  Tasks      MT-AIME2024                            13.1                23.0          15.3           23.2             24.6
             PolyMath                               17.4                20.3          18.3           22.0             23.3
             MLogiQA                                53.1                58.5          58.0           58.9             53.3


                                                                   17
    Table 17: Comparison among Qwen3-8B / Qwen3-4B (Thinking) and other reasoning baselines. The
    highest and second-best scores are shown in bold and underlined, respectively.

                                                           DeepSeek-R1             DeepSeek-R1
                                                                                                        Qwen3-4B            Qwen3-8B
                                                         -Distill-Qwen-14B       -Distill-Qwen-32B
                    Architecture                                 Dense                Dense                 Dense            Dense
                    # Activated Params                            14B                  32B                   4B               8B
                    # Total Params                                14B                  32B                   4B               8B
                    MMLU-Redux                                    84.1                 88.2                 83.7               87.5
      General       GPQA-Diamond                                  59.1                 62.1                 55.9               62.0
       Tasks        C-Eval                                        78.1                 82.2                 77.5               83.4
                    LiveBench 2024-11-25                          52.3                 45.6                 63.6               67.1
                    IFEval strict prompt                          72.6                 72.5                 81.9               85.0
                    Arena-Hard                                    48.0                 60.8                 76.6               85.8
     Alignment
                    AlignBench v1.1                               7.43                 7.25                 8.30               8.46
       Tasks
                    Creative Writing v3                           54.2                 55.0                 61.1               75.0
                    WritingBench                                  6.03                 6.13                 7.35               7.59
                    MATH-500                                      93.9                 94.3                 97.0               97.4
                    AIME’24                                       69.7                 72.6                 73.8               76.0
    Math & Text
                    AIME’25                                       44.5                 49.6                 65.6               67.3
     Reasoning
                    ZebraLogic                                    59.1                 69.6                 81.0               84.8
                    AutoLogi                                      78.6                 74.6                 87.9               89.1
                    BFCL v3                                       49.5                 53.5                65.9                68.1
      Agent &
                    LiveCodeBench v5                              45.5                 54.5                54.2                57.5
      Coding
                    CodeForces (Rating / Percentile)          1574 / 89.1%         1691 / 93.4%        1671 / 92.8%        1785 / 95.6%
                    Multi-IF                                      29.8                 31.3                 66.3               71.2
                    INCLUDE                                       59.7                 68.0                 61.8               67.8
    Multilingual    MMMLU 14 languages                            73.8                 78.6                 69.8               74.4
      Tasks         MT-AIME2024                                   33.7                 44.6                 60.7               65.4
                    PolyMath                                      28.6                 35.1                 40.0               42.7
                    MLogiQA                                       53.6                 63.3                 65.9               69.0

    Table 18: Comparison among Qwen3-8B / Qwen3-4B (Non-thinking) and other non-reasoning baselines.
    The highest and second-best scores are shown in bold and underlined, respectively.

                                                LLaMA-3.1-8B        Gemma-3      Qwen2.5-7B Qwen2.5-14B
                                                                                                        Qwen3-4B                Qwen3-8B
                                                  -Instruct          -12B-IT      -Instruct  -Instruct
             Architecture                              Dense             Dense      Dense         Dense             Dense         Dense
             # Activated Params                         8B                12B        7B            14B               4B            8B
             # Total Params                             8B                12B        7B            14B               4B            8B
             MMLU-Redux                                61.7              77.8        75.4            80.0           77.3              79.5
  General    GPQA-Diamond                              32.8              40.9        36.4            45.5           41.7              39.3
   Tasks     C-Eval                                    52.0              61.1        76.2            78.0           72.2              77.9
             LiveBench 2024-11-25                      26.0              43.7        34.9            42.2           48.4              53.5
             IFEval strict prompt                      75.0              80.2        71.2            81.0           81.2              83.0
             Arena-Hard                                30.1              82.6        52.0            68.3           66.2              79.6
Alignment
             AlignBench v1.1                           6.01              7.77        7.27            7.67           8.10              8.38
  Tasks
             Creative Writing v3                       52.8              79.9        49.8            55.8           53.6              64.5
             WritingBench                              4.57              7.05        5.82            5.93           6.85              7.15
            MATH-500                                   54.8              85.6        77.6            83.4           84.8              87.4
            AIME’24                                     6.3              22.4         9.1            15.2           25.0              29.1
Math & Text
            AIME’25                                    2.7               18.8        12.1            13.6           19.1              20.9
 Reasoning
            ZebraLogic                                 12.8              17.8        12.0            19.7           35.2              26.7
            AutoLogi                                   30.9              58.9        42.9            57.4           76.3              76.5
             BFCL v3                                 49.6              50.6          55.8          58.7            57.6        60.2
 Agent &
             LiveCodeBench v5                        10.8              25.7          14.4          21.9            21.3        22.8
 Coding
             CodeForces (Rating / Percentile)    473 / 14.9%       462 / 14.7%    191 / 0.0%   904 / 38.3%     842 / 33.7% 1110 / 52.4%
             Multi-IF                                  52.1              65.6        47.7            55.5           61.3              69.2
             INCLUDE                                   34.0              65.3        53.6            63.5           53.8              62.5
Multilingual MMMLU 14 languages                        44.4              70.0        61.4            70.3           61.7              66.9
  Tasks      MT-AIME2024                                0.4              16.7        5.5              8.5           13.9              16.6
             PolyMath                                   5.8              17.6        11.9            15.0           16.6              18.8
             MLogiQA                                   41.9              54.5        49.5            51.3           49.9              51.4


                                                                         18
Table 19: Comparison among Qwen3-1.7B / Qwen3-0.6B (Thinking) and other reasoning baselines.
The highest and second-best scores are shown in bold and underlined, respectively.

                                               DeepSeek-R1          DeepSeek-R1
                                                                                          Qwen3-0.6B       Qwen3-1.7B
                                             -Distill-Qwen-1.5B   -Distill-Llama-8B
                   Architecture                    Dense                  Dense             Dense            Dense
                   # Activated Params               1.5B                   8B                0.6B             1.7B
                   # Total Params                   1.5B                   8B                0.6B             1.7B
                   MMLU-Redux                       45.4                  66.4               55.6             73.9
     General       GPQA-Diamond                     33.8                  49.0               27.9             40.1
      Tasks        C-Eval                           27.1                  50.4               50.4             68.1
                   LiveBench 2024-11-25             24.9                  40.6               30.3             51.1
                   IFEval strict prompt             39.9                  59.0               59.2             72.5
                   Arena-Hard                        4.5                  17.6                8.5             43.1
   Alignment
                   AlignBench v1.1                  5.00                  6.24               6.10             7.60
     Tasks
                   Creative Writing v3              16.4                  51.1               30.6             48.0
                   WritingBench                     4.03                  5.42               5.61             7.02
                   MATH-500                         83.9                  89.1               77.6             93.4
                   AIME’24                          28.9                  50.4               10.7             48.3
  Math & Text
                   AIME’25                          22.8                  27.8               15.1             36.8
   Reasoning
                   ZebraLogic                        4.9                  37.1               30.3             63.2
                   AutoLogi                         19.1                  63.4               61.6             83.2
    Agent &        BFCL v3                          14.0                  21.5               46.4             56.6
    Coding         LiveCodeBench v5                 13.2                  42.5               12.3             33.2
                   Multi-IF                         13.3                  27.0               36.1             51.2
                   INCLUDE                          21.9                  34.5               35.9             51.8
  Multilingual     MMMLU 14 languages               27.3                  40.1               43.1             59.1
    Tasks          MT-AIME2024                      12.4                  13.2                7.8             36.1
                   PolyMath                         14.5                  10.8               11.4             25.2
                   MLogiQA                          29.0                  32.8               40.9             56.0


Table 20: Comparison among Qwen3-1.7B / Qwen3-0.6B (Non-thinking) and other non-reasoning
baselines. The highest and second-best scores are shown in bold and underlined, respectively.

                                      Gemma-3            Qwen2.5-1.5B Qwen2.5-3B
                                              Phi-4-mini                         Qwen3-0.6B Qwen3-1.7B
                                       -1B-IT             -Instruct    -Instruct
               Architecture               Dense     Dense         Dense           Dense         Dense          Dense
               # Activated Params          1.0B      3.8B          1.5B            3.1B          0.6B           1.7B
               # Total Params              1.0B      3.8B          1.5B            3.1B          0.6B           1.7B
               MMLU-Redux                 33.3       67.9         50.7            64.4              44.6        64.4
  General      GPQA-Diamond               19.2       25.2         29.8            30.3              22.9        28.6
   Tasks       C-Eval                     28.5       40.0         53.3            68.2              42.6        61.0
               LiveBench 2024-11-25       14.4       25.3         18.0            23.8              21.8        35.6
               IFEval strict prompt       54.5       68.6         42.5            58.2              54.5        68.2
               Arena-Hard                 17.8       32.8          9.0            23.7               6.5        36.9
Alignment
               AlignBench v1.1             5.3       6.00         5.60            6.49              5.60        7.20
  Tasks
               Creative Writing v3        52.8       10.3         31.5            42.8              28.4        43.6
               WritingBench               5.18       4.05         4.67            5.55              5.13        6.54
            MATH-500                      46.4       67.6         55.0            67.2              55.2        73.0
            AIME’24                        0.9        8.1          0.9             6.7               3.4        13.4
Math & Text
            AIME’25                        0.8        5.3          0.4             4.2               2.6         9.8
 Reasoning
            ZebraLogic                     1.9        2.7          3.4             4.8               4.2        12.8
            AutoLogi                      16.4       28.8         22.5            29.9              37.4        59.8
 Agent &       BFCL v3                    16.3       31.3         47.8            50.4              44.1        52.2
 Coding        LiveCodeBench v5            1.8       10.4          5.3             9.2              3.6         11.6
             Multi-IF                     32.8       40.5         20.2            32.3              33.3        44.7
             INCLUDE                      32.7       43.8         33.1            43.8              34.4        42.6
Multilingual MMMLU 14 languages           32.5       51.4         40.4            51.8              37.1        48.3
  Tasks      MT-AIME2024                   0.2        0.9         0.7              1.6               1.5         4.9
             PolyMath                      3.5        6.7          5.0             7.3               4.6        10.3
             MLogiQA                      31.8       39.5         40.9            39.5              37.3        41.1



                                                            19
          1/10 activated parameters, demonstrating the effectiveness of our Strong-to-Weak Distillation
          approach in endowing lightweight models with profound reasoning capabilities.
      (2) From Table 16, Qwen3-30B-A3B and Qwen3-14B (Non-thinking) surpass the non-reasoning
          baselines in most of the benchmarks. They exceed our previous Qwen2.5-32B-Instruct model
          with significantly fewer activated and total parameters, allowing for more efficient and cost-
          effective performance.

Qwen3-8B / 4B / 1.7B / 0.6B For Qwen3-8B and Qwen3-4B, we compare them with DeepSeek-R1-Distill-
Qwen-14B and DeepSeek-R1-Distill-Qwen-32B in the thinking mode, and LLaMA-3.1-8B-Instruct (Dubey
et al., 2024), Gemma-3-12B-IT (Team et al., 2025), Qwen2.5-7B-Instruct, and Qwen2.5-14B-Instruct in the
non-thinking mode, respectively. For Qwen3-1.7B and Qwen3-0.6B, we compare them with DeepSeek-
R1-Distill-Qwen-1.5B and DeepSeek-R1-Distill-Llama-8B in the thinking mode, and Gemma-3-1B-IT,
Phi-4-mini, Qwen2.5-1.5B-Instruct, and Qwen2.5-3B-Instruct in the non-thinking mode, respectively. We
present the evaluation results of Qwen3-8B and Qwen3-4B in Table 17 and 18 and those of Qwen3-1.7B
and Qwen3-0.6B in Table 19 and 20, respectively. Overall, these edge-side models exhibit impressive
performance and outperform baselines even with more parameters, including our previous Qwen2.5
models, in either the thinking or the non-thinking mode. These results, once again, demonstrate the
efficacy of our Strong-to-Weak Distillation approach, making it possible for us to build the lightweight
Qwen3 models with remarkably reduced costs and efforts.

4.7   Discussion

The Effectiveness of Thinking Budget To verify that Qwen3 can enhance its intelligence level by
leveraging an increased thinking budget, we adjust the allocated thinking budget on four benchmarks
across Mathematics, Coding, and STEM domains. The resulting scaling curves are presented in Figure 2,
Qwen3 demonstrates scalable and smooth performance improvements correlated to the allocated thinking
budget. Moreover, we observe that if we further extend the output length beyond 32K, the model’s
performance is expected to improve further in the future. We leave this exploration as future work.




            Figure 2: Performance of Qwen3-235B-A22B with respect to the thinking budget.

The Effectiveness and Efficiency of On-Policy Distillation We evaluate the effectiveness and efficiency
of on-policy distillation by comparing the performance and computational cost—measured in GPU
hours—after undergoing distillation versus direct reinforcement learning, both starting from the same
off-policy distilled 8B checkpoint. For simplicity, we focus solely on math and code-related queries in


                                                   20
this comparison. The results, summarized in Table 21, show that distillation achieves significantly better
performance than reinforcement learning while requiring approximately only 1/10 of the GPU hours.
Furthermore, distillation from teacher logits enables the student model to expand its exploration space
and enhance its reasoning potential, as evidenced by the improved pass@64 scores on the AIME’24
and AIME’25 benchmarks after distillation, compared to the initial checkpoint. In contrast, reinforce-
ment learning does not lead to any improvement in pass@64 scores. These observations highlight the
advantages of leveraging a stronger teacher model in guiding student model learning.

Table 21: Comparison of reinforcement learning and on-policy distillation on Qwen3-8B. Numbers in
parentheses indicate pass@64 scores.

                                                                       LiveCodeBench MMLU GPQA       GPU
   Method                       AIME’24     AIME’25 MATH500
                                                                             v5      -Redux -Diamond Hours
   Off-policy Distillation  55.0 (90.0) 42.8 (83.3)        92.4               42.0            86.4         55.6         -
   + Reinforcement Learning 67.6 (90.0) 55.5 (83.3)        94.8               52.9            86.9         61.3      17,920
   + On-policy Distillation 74.4 (93.3) 65.5 (86.7)        97.0               60.3            88.3         63.3      1,800


The Effects of Thinking Mode Fusion and General RL To evaluate the effectiveness of Thinking Mode
Fusion and General Reinforcement Learning (RL) during the post-training, we conduct evaluations on
various stages of the Qwen-32B model. In addition to the datasets mentioned earlier, we introduce several
in-house benchmarks to monitor other capabilities. These benchmarks include:

     • CounterFactQA: Contains counterfactual questions where the model needs to identify that the
       questions are not factual and avoid generating hallucinatory answers.
     • LengthCtrl: Includes creative writing tasks with length requirements; the final score is based on
       the difference between the generated content length and the target length.
     • ThinkFollow: Involves multi-turn dialogues with randomly inserted /think and /no think
       flags to test whether the model can correctly switch thinking modes based on user queries.
     • ToolUse: Evaluates the stability of the model in single-turn, multi-turn, and multi-step tool calling
       processes. The score includes accuracy in intent recognition, format accuracy, and parameter
       accuracy during the tool calling process.

Table 22: Performance of Qwen3-32B after Reasoning RL (Stage 2), Thinking Mode Fusion (Stage 3), and
General RL (Stage 4). Benchmarks with * are in-house datasets.

                                               Stage 2                Stage 3                         Stage 4
                                            Reasoning RL       Thinking Mode Fusion                  General RL
                     Benchmark                 Thinking           Thinking     Non-Thinking     Thinking     Non-Thinking

                     LiveBench 2024-11-25       68.6           70.9+2.3             57.1        74.9+4.0      59.8+2.8
        General
                     Arena-Hard                 86.8           89.4+2.6             88.5        93.8+4.4      92.8+4.3
         Tasks
                     CounterFactQA*             50.4           61.3+10.9            64.3        68.1+6.8      66.4+2.1
                     IFEval strict prompt       73.0              78.4+5.4          78.4        85.0+6.6    83.2+4.8
       Instruction
                     Multi-IF                   61.4              64.6+3.2          65.2        73.0+8.4    70.7+5.5
        & Format
                     LengthCtrl*                62.6              70.6+8.0          84.9        73.5+2.9    87.3+2.4
        Following
                     ThinkFollow*                -                           88.7                    98.9+10.2
                     BFCL v3                    69.0              68.4-0.6          61.5       70.3+1.9       63.0+1.5
         Agent
                     ToolUse*                   63.3              70.4+7.1          73.2       85.5+15.1      86.5+13.3
      Knowledge &    MMLU-Redux                 91.4              91.0-0.4          86.7        90.9-0.1      85.7-1.0
        STEM         GPQA-Diamond               68.8              69.0+0.2          50.4        68.4-0.6      54.6+4.3
        Math &       AIME’24                    83.8              81.9-1.9          28.5        81.4-0.5      31.0+2.5
        Coding       LiveCodeBench v5           68.4              67.2-1.2          31.1        65.7-1.5      31.3+0.2


The results are shown in Table 22, where we can draw the following conclusions:

    (1) Stage 3 integrates the non-thinking mode into the model, which already possesses thinking
        capabilities after the first two stages of training. The ThinkFollow benchmark score of 88.7
        indicates that the model has developed an initial ability to switch between modes, though it still
        occasionally makes errors. Stage 3 also enhances the model’s general and instruction-following
        capabilities in thinking mode, with CounterFactQA improving by 10.9 points and LengthCtrl by
        8.0 points.


                                                          21
    (2) Stage 4 further strengthens the model’s general, instruction-following, and agent capabilities
        in both thinking and non-thinking modes. Notably, the ThinkFollow score improves to 98.9,
        ensuring accurate mode switching.
    (3) For Knowledge, STEM, Math, and Coding tasks, Thinking Mode Fusion and General RL do
        not bring significant improvements. In contrast, for challenging tasks like AIME’24 and Live-
        CodeBench, the performance in thinking mode actually decreases after these two training stages.
        We conjecture this degradation is due to the model being trained on a broader range of general
        tasks, which may compromise its specialized capabilities in handling complex problems. During
        the development of Qwen3, we choose to accept this performance trade-off to enhance the
        model’s overall versatility.

5   Conclusion

In this technical report, we introduce Qwen3, the latest version of the Qwen series. Qwen3 features
both thinking mode and non-thinking mode, allowing users to dynamically manage the number of
tokens used for complex thinking tasks. The model was pre-trained on an extensive dataset containing
36 trillion tokens, enabling it to understand and generate text in 119 languages and dialects. Through a
series of comprehensive evaluations, Qwen3 has shown strong performance across a range of standard
benchmarks for both pre-trained and post-trained models, including tasks related to code generation,
mathematics, reasoning, and agents.
In the near future, our research will focus on several key areas. We will continue to scale up pretraining by
using data that is both higher in quality and more diverse in content. At the same time, we will work on
improving model architecture and training methods for the purposes of effective compression, scaling to
extremely long contexts, etc. In addition, we plan to increase computational resources for reinforcement
learning, with a particular emphasis on agent-based RL systems that learn from environmental feedback.
This will allow us to build agents capable of tackling complex tasks that require inference time scaling.

6   Authors

Core Contributors: An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen
Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng
Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang,
Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, Le Yu, Lianghao
Deng, Mei Li, Mingfeng Xue, Mingze Li, Pei Zhang, Peng Wang, Qin Zhu, Rui Men, Ruize Gao, Shixuan
Liu, Shuang Luo, Tianhao Li, Tianyi Tang, Wenbiao Yin, Xingzhang Ren, Xinyu Wang, Xinyu Zhang,
Xuancheng Ren, Yang Fan, Yang Su, Yichang Zhang, Yinger Zhang, Yu Wan, Yuqiong Liu, Zekun Wang,
Zeyu Cui, Zhenru Zhang, Zhipeng Zhou, Zihan Qiu
Contributors: Bei Chen, Biao Sun, Bin Luo, Bin Zhang, Binghai Wang, Bowen Ping, Boyi Deng, Chang
Si, Chaojie Yang, Chen Cheng, Chenfei Wu, Chengpeng Li, Chengyuan Li, Fan Hong, Guobin Zhao,
Hang Zhang, Hangrui Hu, Hanyu Zhao, Hao Lin, Hao Xiang, Haoyan Huang, Hongkun Hao, Humen
Zhong, Jialin Wang, Jiandong Jiang, Jianqiang Wan, Jianyuan Zeng, Jiawei Chen, Jie Zhang, Jin Xu, Jinkai
Wang, Jinyang Zhang, Jinzheng He, Jun Tang, Kai Zhang, Ke Yi, Keming Lu, Keqin Chen, Langshi Chen,
Le Jiang, Lei Zhang, Linjuan Wu, Man Yuan, Mingkun Yang, Minmin Sun, Mouxiang Chen, Na Ni,
Nuo Chen, Peng Liu, Peng Wang, Peng Zhu, Pengcheng Zhang, Pengfei Wang, Qiaoyu Tang, Qing Fu,
Qiuyue Wang, Rong Zhang, Rui Hu, Runji Lin, Shen Huang, Shuai Bai, Shutong Jiang, Sibo Song, Siqi
Zhang, Song Chen, Tao He, Ting He, Tingfeng Hui, Wei Ding, Wei Liao, Wei Lin, Wei Zhang, Weijia
Xu, Wenbin Ge, Wenmeng Zhou, Wenyuan Yu, Xianyan Jia, Xianzhong Shi, Xiaodong Deng, Xiaoming
Huang, Xiaoyuan Li, Ximing Zhou, Xinyao Niu, Xipin Wei, Xuejing Liu, Yang Liu, Yang Yao, Yang Zhang,
Yanpeng Li, Yantao Liu, Yidan Zhang, Yikai Zhu, Yiming Wang, Yiwen Hu, Yong Jiang, Yong Li, Yongan
Yue, Yu Guan, Yuanzhi Zhu, Yunfei Chu, Yunlong Feng, Yuxin Zhou, Yuxuan Cai, Zeyao Ma, Zhaohai Li,
Zheng Li, Zhengyang Tang, Zheren Fu, Zhi Li, Zhibo Yang, Zhifang Guo, Zhipeng Zhang, Zhiying Xu,
Zhiyu Yin, Zhongshen Zeng, Zile Qiao, Ziye Meng, Zongmeng Zhang




                                                     22
A     Appendix

A.1     Additional Evaluation Results

A.1.1     Long-Context Ability

                   Table 23: Performance of Qwen3 Models on the RULER benchmark.

                                                                   RULER
                             Model
                                                    Avg.   4K     8K     16K    32K    64K    128K
                             Qwen2.5-7B-Instruct    85.4   96.7   95.1   93.7   89.4   82.3   55.1
                             Qwen2.5-14B-Instruct   91.4   97.7   96.8   95.9   93.4   86.7   78.1
                             Qwen2.5-32B-Instruct   92.9   96.9   97.1   95.5   95.5   90.3   82.0
                             Qwen2.5-72B-Instruct   95.1   97.7   97.2   97.7   96.5   93.0   88.4
                             Qwen3-4B               85.2   95.1   93.6   91.0   87.8   77.8   66.0
                             Qwen3-8B               89.1   96.3   96.0   91.8   91.2   82.1   77.4
              Non-thinking   Qwen3-14B              94.6   98.0   97.8   96.4   96.1   94.0   85.1
                 Mode        Qwen3-32B              93.7   98.4   96.0   96.2   94.4   91.8   85.6
                             Qwen3-30B-A3B          91.6   96.5   97.0   95.3   92.4   89.1   79.2
                             Qwen3-235B-A22B        95.0   97.7   97.2   96.4   95.1   93.3   90.6
                             Qwen3-4B               83.5   92.7   88.7   86.5   83.2   83.0   67.2
                             Qwen3-8B               84.4   94.7   94.4   86.1   80.8   78.3   72.0
                Thinking     Qwen3-14B              90.1   95.4   93.6   89.8   91.9   90.6   79.0
                 Mode        Qwen3-32B              91.0   94.7   93.7   91.6   92.5   90.0   83.5
                             Qwen3-30B-A3B          86.6   94.1   92.7   89.0   86.6   82.1   75.0
                             Qwen3-235B-A22B        92.2   95.1   94.8   93.0   92.3   92.0   86.0

For evaluating long-context processing capabilities, we report the results on the RULER benchmark (Hsieh
et al., 2024) in Table 23. To enable length extrapolation, we utilize YARN (Peng et al., 2023) with a
scaling factor=4. In thinking mode, we set the thinking budget to 8192 tokens to mitigate overly
verbose reasoning on the extremely long inputs.
The results show that:

        1. In non-thinking mode, Qwen3 outperforms Qwen2.5 models of a similar size in long-context
           processing tasks.
        2. In thinking mode, the model’s performance slightly degrades. We hypothesize that the thinking
           content does not provide significant benefits for these retrieval tasks, which do not rely on
           reasoning and may instead interfere with the retrieval process. We are committed to enhancing
           the long-context capability in the thinking mode in future versions.

A.1.2     Multilingual Ability
Table 24-35 presents the detailed benchmark scores across various languages, including Spanish, French,
Portuguese, Italian, Arabic, Japanese, Korean, Indonesian, Russian, Vietnamese, German, and Thai. The
results of these tables demonstrate that the Qwen3 series models achieve competitive performance across
all evaluated benchmarks, showcasing their strong multilingual capabilities.
To evaluate the performance of Qwen3 across a broader range of languages, we utilize Belebele (Bandarkar
et al., 2023), a benchmark for natural language understanding. We conduct evaluations on 80 supported
languages from the benchmark, excluding 42 unoptimized languages, as shown in Table 36 (organized
by language family). The performance comparison between Qwen3 and other baseline models on
the Belebele benchmark is presented in Table 37. The results show that Qwen3 achieves comparable
performance to similarly-sized Gemma models while outperforming Qwen2.5 significantly.




                                                      23
Table 24: Benchmark scores for language: Spanish (es). The highest and second-best scores are shown
in bold and underlined, respectively.

             Model                  Multi-IF MLogiQA INCLUDE MMMLU MT-AIME24 PolyMath Average
             Gemini2.5-Pro           80.1    70.0        96.4   88.7       90.0       54.4     79.9
             QwQ-32B                 70.0    75.0        81.8   84.5       76.7       52.2     73.4
             Qwen3-235B-A22B         74.2    76.2        89.1   86.7       86.7       57.3     78.4
             Qwen3-32B               74.7    68.8        90.9   82.8       76.7       51.8     74.3
  Thinking   Qwen3-30B-A3B           74.9    71.2        80.0   81.9       76.7       48.5     72.2
   Mode      Qwen3-14B               76.2    67.5        83.6   81.1       73.3       50.3     72.0
             Qwen3-8B                74.1    70.0        78.2   79.2       70.0       43.7     69.2
             Qwen3-4B                69.1    68.8        72.7   75.7       66.7       42.3     65.9
             Qwen3-1.7B              56.0    55.0        72.7   64.5       46.7       30.2     54.2
             Qwen3-0.6B              39.2    42.5        54.5   48.8       13.3       14.3     35.4
             GPT-4o-2024-1120        67.5    52.5        89.1   80.6       10.0       15.5     52.5
             Gemma-3-27b-IT          73.5    57.5        89.1   77.7       30.0       22.4     58.4
             Qwen2.5-72B-Instruct    66.7    61.3        80.0   80.1       20.0       18.8     54.5
             Qwen3-235B-A22B         71.7    66.2        83.6   83.7       33.3       29.5     61.3
             Qwen3-32B               72.1    65.0        83.6   80.4       26.7       24.7     58.8
Non-thinking
             Qwen3-30B-A3B           72.1    53.8        85.5   78.3       33.3       25.0     58.0
   Mode
             Qwen3-14B               76.2    63.7        78.2   77.4       40.0       25.0     60.1
             Qwen3-8B                73.1    50.0        80.0   73.7       16.7       21.3     52.5
             Qwen3-4B                65.8    50.0        60.0   68.3       13.3       17.3     45.8
             Qwen3-1.7B              47.9    43.8        50.9   54.3       10.0       11.6     36.4
             Qwen3-0.6B              35.5    37.5        43.6   39.5        3.3        5.8     27.5




Table 25: Benchmark scores for language: French (fr). The highest and second-best scores are shown in
bold and underlined, respectively.

             Model                  Multi-IF MLogiQA INCLUDE MMMLU MT-AIME24 PolyMath Average
             Gemini2.5-Pro           80.5    73.8        85.7   88.3       80.0       52.8     76.8
             QwQ-32B                 72.4    78.8        76.2   84.0       80.0       49.4     73.5
             Qwen3-235B-A22B         77.3    78.8        85.7   86.6       86.7       57.4     78.8
             Qwen3-32B               76.7    81.2        76.2   82.1       83.3       47.1     74.4
  Thinking   Qwen3-30B-A3B           75.2    67.5        83.3   81.0       76.7       46.9     71.8
   Mode      Qwen3-14B               77.6    71.2        73.8   80.4       73.3       44.2     70.1
             Qwen3-8B                73.8    66.2        85.7   77.9       70.0       45.3     69.8
             Qwen3-4B                71.3    63.7        71.4   74.5       66.7       40.2     64.6
             Qwen3-1.7B              52.6    56.2        54.8   64.8       60.0       28.7     52.8
             Qwen3-0.6B              36.1    48.8        47.6   48.4        6.7       14.0     33.6
             GPT-4o-2024-1120        67.8    56.2        85.7   81.8       10.0       15.3     52.8
             Gemma-3-27b-IT          73.9    57.5        73.8   78.3       23.3       21.5     54.7
             Qwen2.5-72B-Instruct    72.1    55.0        81.0   80.2       26.7       15.7     55.1
             Qwen3-235B-A22B         73.2    65.0        88.1   81.1       36.7       28.1     62.0
             Qwen3-32B               75.8    60.0        73.8   79.5       30.0       23.0     57.0
Non-thinking
             Qwen3-30B-A3B           75.6    52.5        69.0   77.9       26.7       27.3     54.8
   Mode
             Qwen3-14B               78.4    63.7        73.8   75.1       33.3       24.4     58.1
             Qwen3-8B                71.9    52.5        71.4   71.7       20.0       21.4     51.5
             Qwen3-4B                64.2    47.5        61.9   67.6       20.0       19.2     46.7
             Qwen3-1.7B              46.1    43.8        64.3   53.2       3.3        11.6     37.0
             Qwen3-0.6B              32.8    35.0        38.1   39.4       6.7         4.6     26.1




                                                    24
Table 26: Benchmark scores for language: Portuguese (pt). The highest and second-best scores are
shown in bold and underlined, respectively.

             Model                  Multi-IF MLogiQA INCLUDE MMMLU MT-AIME24 PolyMath Average
             Gemini2.5-Pro           80.5          73.8           83.9          88.9          73.3          52.2          75.4
             QwQ-32B                 70.5          70.0           80.4          84.0          80.0          48.7          72.3
             Qwen3-235B-A22B         73.6          78.8           78.6          86.2          86.7          58.3          77.0
             Qwen3-32B               74.1          76.2           76.8          82.6          80.0          52.4          73.7
  Thinking   Qwen3-30B-A3B           76.1          71.2           71.4          81.0          76.7          49.3          71.0
   Mode      Qwen3-14B               77.3          68.8           75.0          81.6          83.3          46.7          72.1
             Qwen3-8B                73.9          67.5           75.0          78.6          56.7          44.8          66.1
             Qwen3-4B                70.6          62.5           71.4          75.1          73.3          44.2          66.2
             Qwen3-1.7B              55.6          60.0           53.6          64.6          46.7          28.2          51.4
             Qwen3-0.6B              38.7          33.8           42.9          47.5          10.0          12.7          30.9
             GPT-4o-2024-1120        66.8          57.5           78.6          80.7          10.0          15.0          51.4
             Gemma-3-27b-IT          72.9          55.0           75.0          77.1          33.3          20.9          55.7
             Qwen2.5-72B-Instruct    68.8          55.0           71.4          82.2          23.3          11.3          52.0
             Qwen3-235B-A22B         72.5          67.5           82.1          83.5          33.3          28.3          61.2
             Qwen3-32B               71.1          61.3           73.2          80.6          30.0          23.9          56.7
Non-thinking
             Qwen3-30B-A3B           72.3          47.5           67.9          77.8          26.7          24.0          52.7
   Mode
             Qwen3-14B               75.5          58.8           75.0          76.5          26.7          25.8          56.4
             Qwen3-8B                71.9          56.2           71.4          72.9          20.0          19.7          52.0
             Qwen3-4B                66.1          50.0           73.2          66.7          10.0          18.1          47.4
             Qwen3-1.7B              49.5          33.8           39.3          52.9           6.7          12.8          32.5
             Qwen3-0.6B              36.6          37.5           42.9          37.5           3.3           5.7          27.2




Table 27: Benchmark scores for language: Italian (it). The highest and second-best scores are shown in
bold and underlined, respectively.

                  Model                  Multi-IF INCLUDE MMMLU MT-AIME24 PolyMath Average
                  Gemini2.5-Pro             80.9          100.0          87.2          90.0          54.1          82.4
                  QwQ-32B                   71.2           96.4          84.9          76.7          49.3          75.7
                  Qwen3-235B-A22B           73.7          96.4           85.7          80.0          57.4          78.6
                  Qwen3-32B                 76.6           90.9          81.6          80.0          49.7          75.8
       Thinking   Qwen3-30B-A3B             75.9           94.5          81.9          80.0          48.1          76.1
        Mode      Qwen3-14B                 79.0           94.5          80.2          70.0          47.0          74.1
                  Qwen3-8B                  74.6          89.1           77.5          76.7          46.1          72.8
                  Qwen3-4B                  69.8          83.6           74.4          76.7          44.5          69.8
                  Qwen3-1.7B                54.6          74.5           64.2          53.3          29.6          55.2
                  Qwen3-0.6B                37.8          45.5           45.9          6.7           13.3          29.8
                  GPT-4o-2024-1120          67.6          98.2           80.7          13.3          15.2          55.0
                  Gemma-3-27b-IT            74.6          90.9           78.4          23.3          20.5          57.5
                  Qwen2.5-72B-Instruct      67.2          94.5           80.7          16.7          16.7          55.2
                  Qwen3-235B-A22B           72.9          92.7           82.6          33.3          28.6          62.0
                  Qwen3-32B                 71.4          92.7           79.5          30.0          23.0          59.3
     Non-thinking
                  Qwen3-30B-A3B             73.9          87.3           77.7          33.3          24.8          59.4
        Mode
                  Qwen3-14B                 75.8          89.1           75.7          26.7          27.6          59.0
                  Qwen3-8B                  72.1          85.5           72.9          13.3          23.8          53.5
                  Qwen3-4B                  63.0          78.2           67.8          23.3          19.3          50.3
                  Qwen3-1.7B                46.1          70.9           53.4           6.7          11.9          37.8
                  Qwen3-0.6B                35.1          43.6           39.0           0.0           4.5          24.4




                                                          25
Table 28: Benchmark scores for language: Arabic (ar). The highest and second-best scores are shown in
bold and underlined, respectively.

                 Model                  MLogiQA INCLUDE MMMLU MT-AIME24 PolyMath Average
                 Gemini2.5-Pro            75.0    89.3      87.8       76.7      52.6      76.3
                 QwQ-32B                  75.0    67.9      81.8       80.0      41.3      69.2
                 Qwen3-235B-A22B          80.0    71.4      83.6       76.7      53.7      73.1
                 Qwen3-32B                66.2    73.2      80.1       86.7      47.0      70.6
      Thinking   Qwen3-30B-A3B            66.2    66.1      77.2       83.3      47.3      68.0
       Mode      Qwen3-14B                71.2    67.9      77.4       83.3      46.6      69.3
                 Qwen3-8B                 65.0    67.9      74.4       76.7      44.9      65.8
                 Qwen3-4B                 62.5    55.4      67.7       66.7      41.2      58.7
                 Qwen3-1.7B               55.0    44.6      53.2       36.7      25.8      43.1
                 Qwen3-0.6B               40.0    41.1      38.9       10.0      11.7      28.3
                 GPT-4o-2024-1120         51.2    78.6      80.9       13.3      12.9      47.4
                 Gemma-3-27b-IT           56.2    62.5      74.4       26.7      22.8      48.5
                 Qwen2.5-72B-Instruct     56.2    66.1      77.2        6.7      14.7      44.2
                 Qwen3-235B-A22B          66.2    67.9      79.5       40.0      28.2      56.4
                 Qwen3-32B                55.0    69.6      75.7       23.3      25.4      49.8
    Non-thinking
                 Qwen3-30B-A3B            48.8    64.3      71.6       30.0      22.6      47.5
       Mode
                 Qwen3-14B                52.5    60.7      69.5       23.3      23.5      45.9
                 Qwen3-8B                 45.0    58.9      64.6       13.3      16.4      39.6
                 Qwen3-4B                 52.5    42.9      56.7       13.3      15.3      36.1
                 Qwen3-1.7B               31.2    37.5      43.6        3.3       9.4      25.0
                 Qwen3-0.6B               40.0    39.3      35.4        0.0       3.8      23.7




Table 29: Benchmark scores for language: Japanese (ja). The highest and second-best scores are shown
in bold and underlined, respectively.

                 Model                  MLogiQA INCLUDE MMMLU MT-AIME24 PolyMath Average
                 Gemini2.5-Pro            72.5    74.5      83.8       83.3      55.4      73.9
                 QwQ-32B                  73.8    86.3      82.3       53.3      39.9      67.1
                 Qwen3-235B-A22B          75.0    94.1      84.8       73.3      52.7      76.0
                 Qwen3-32B                70.0    90.2      80.2       76.7      47.7      73.0
      Thinking   Qwen3-30B-A3B            66.2    88.2      79.9       73.3      47.4      71.0
       Mode      Qwen3-14B                68.8    88.2      79.4       66.7      45.7      69.8
                 Qwen3-8B                 71.2    86.3      74.9       73.3      44.7      70.1
                 Qwen3-4B                 63.7    80.4      72.5       53.3      40.7      62.1
                 Qwen3-1.7B               53.8    74.5      61.8       36.7      28.5      51.1
                 Qwen3-0.6B               47.5    47.1      45.1       13.3      14.5      33.5
                 GPT-4o-2024-1120         60.0    92.2      81.9       10.0      12.5      51.3
                 Gemma-3-27b-IT           66.2    86.3      76.5       20.0      17.3      53.3
                 Qwen2.5-72B-Instruct     55.0    94.1      77.7       16.7      17.7      52.2
                 Qwen3-235B-A22B          67.5    92.2      80.9       26.7      26.9      58.8
                 Qwen3-32B                58.8    92.2      78.0       20.0      20.5      53.9
    Non-thinking
                 Qwen3-30B-A3B            51.2    82.4      74.9       30.0      20.6      51.8
       Mode
                 Qwen3-14B                55.0    84.3      73.8       33.3      19.8      53.2
                 Qwen3-8B                 47.5    82.4      69.9       20.0      18.5      47.7
                 Qwen3-4B                 46.2    76.5      64.8       13.3      15.1      43.2
                 Qwen3-1.7B               40.0    68.6      46.3        3.3      11.6      34.0
                 Qwen3-0.6B               37.5    37.3      37.9        3.3       3.7      23.9




                                                 26
Table 30: Benchmark scores for language: Korean (ko). The highest and second-best scores are shown in
bold and underlined, respectively.

                 Model                  MLogiQA INCLUDE MMMLU MT-AIME24 PolyMath Average
                 Gemini2.5-Pro            75.0          88.0          85.9          76.7          50.0          75.1
                 QwQ-32B                  76.2          72.0          81.8          60.0          40.0          66.0
                 Qwen3-235B-A22B          71.2          80.0          84.7          80.0          55.7          74.3
                 Qwen3-32B                71.2          74.0          79.2          80.0          48.5          70.6
      Thinking   Qwen3-30B-A3B            68.8          72.0          78.6          76.7          46.6          68.5
       Mode      Qwen3-14B                67.5          74.0          79.6          76.7          46.0          68.8
                 Qwen3-8B                 60.0          80.0          74.7          76.7          42.3          66.7
                 Qwen3-4B                 66.2          74.0          68.8          70.0          40.6          63.9
                 Qwen3-1.7B               53.8          66.0          57.8          43.3          25.2          49.2
                 Qwen3-0.6B               33.8          52.0          41.5          13.3          11.8          30.5
                 GPT-4o-2024-1120         63.7          80.0          80.5          13.3          12.9          50.1
                 Gemma-3-27b-IT           58.8          76.0          75.9          20.0          18.3          49.8
                 Qwen2.5-72B-Instruct     58.8          68.0          76.7           6.7          17.7          45.6
                 Qwen3-235B-A22B          63.7          76.0          79.8          33.3          27.9          56.1
                 Qwen3-32B                60.0          74.0          77.2          26.7          21.2          51.8
    Non-thinking
                 Qwen3-30B-A3B            52.5          72.0          72.5          16.7          20.7          46.9
       Mode
                 Qwen3-14B                52.5          68.0          73.3          20.0          18.7          46.5
                 Qwen3-8B                 52.5          76.0          66.5          23.3          16.3          46.9
                 Qwen3-4B                 46.2          74.0          59.9          13.3          16.6          42.0
                 Qwen3-1.7B               48.8          58.0          46.0           6.7           9.0          33.7
                 Qwen3-0.6B               40.0          52.0          36.9           0.0           5.5          26.9




Table 31: Benchmark scores for language: Indonesian (id). The highest and second-best scores are
shown in bold and underlined, respectively.

                      Model                  INCLUDE MMMLU MT-AIME24 PolyMath Average
                      Gemini2.5-Pro              80.0          86.3          83.3          51.3          75.2
                      QwQ-32B                    76.4          83.7          73.3          47.3          70.2
                      Qwen3-235B-A22B            80.0          87.2          80.0          53.5          75.2
                      Qwen3-32B                  80.0          82.0          76.7          45.6          71.1
           Thinking   Qwen3-30B-A3B              81.8          80.4          80.0          44.9          71.8
            Mode      Qwen3-14B                  78.2          79.6          70.0          45.3          68.3
                      Qwen3-8B                   72.7          77.7          70.0          43.8          66.0
                      Qwen3-4B                   70.9          72.3          66.7          41.2          62.8
                      Qwen3-1.7B                 63.6          61.2          36.7          26.8          47.1
                      Qwen3-0.6B                 36.4          46.6          10.0          12.6          26.4
                      GPT-4o-2024-1120           80.0          81.1          10.0          14.7          46.4
                      Gemma-3-27b-IT             76.4          75.9          13.3          22.6          47.0
                      Qwen2.5-72B-Instruct       74.5          78.8          10.0          16.6          45.0
                      Qwen3-235B-A22B            81.8          81.9          33.3          27.5          56.1
                      Qwen3-32B                  81.8          77.2          23.3          24.3          51.6
         Non-thinking
                      Qwen3-30B-A3B              70.9          76.4          30.0          25.9          50.8
            Mode
                      Qwen3-14B                  70.9          74.1          26.7          24.6          49.1
                      Qwen3-8B                   78.2          69.6          20.0          21.6          47.4
                      Qwen3-4B                   67.3          66.5          13.3          19.0          41.5
                      Qwen3-1.7B                 52.7          49.0          3.3           10.8          29.0
                      Qwen3-0.6B                 52.7          40.0          3.3            5.1          25.3




                                                        27
Table 32: Benchmark scores for language: Russian (ru). The highest and second-best scores are shown
in bold and underlined, respectively.

                      Model                   Multi-IF INCLUDE MT-AIME24 PolyMath Average
                      Gemini2.5-Pro             68.1        80.4   70.0     52.3    67.7
                      QwQ-32B                   61.2        73.2   76.7     43.6    63.7
                      Qwen3-235B-A22B           62.2        80.4   80.0     53.1    68.9
                      Qwen3-32B                 62.5        73.2   63.3     46.5    61.4
           Thinking   Qwen3-30B-A3B             60.7        76.8   73.3     45.4    64.0
            Mode      Qwen3-14B                 63.6        80.4   66.7     46.4    64.3
                      Qwen3-8B                  62.9        69.6   63.3     37.7    58.4
                      Qwen3-4B                  52.8        69.6   56.7     36.6    53.9
                      Qwen3-1.7B                37.8        46.4   20.0     22.8    31.8
                      Qwen3-0.6B                26.4        46.4    3.3      7.0    20.8
                       GPT-4o-2024-1120         52.0        80.4   20.0     13.7    41.5
                       Gemma-3-27b-IT           57.3        71.4   23.3     21.6    43.4
                       Qwen2.5-72B-Instruct     54.1        67.9   20.0     13.3    38.8
                       Qwen3-235B-A22B          56.7        75.0   40.0     26.1    49.4
                       Qwen3-32B                58.6        71.4   30.0     23.3    45.8
          Non-thinking
                       Qwen3-30B-A3B            58.0        73.2   30.0     21.1    45.6
             Mode
                       Qwen3-14B                60.3        71.4   26.7     24.2    45.6
                       Qwen3-8B                 59.3        58.9   20.0     22.8    40.2
                       Qwen3-4B                 46.1        58.9   13.3     17.8    34.0
                       Qwen3-1.7B               34.8        41.1    3.3     13.2    23.1
                       Qwen3-0.6B               25.5        46.4    0.0      5.8    19.4




Table 33: Benchmark scores for language: Vietnamese (vi). The highest and second-best scores are
shown in bold and underlined, respectively.

                      Model                   MLogiQA INCLUDE MT-AIME24 PolyMath Average
                      Gemini2.5-Pro             72.5        89.1   70.0     52.1    70.9
                      QwQ-32B                   71.2        69.1   70.0     49.2    64.9
                      Qwen3-235B-A22B           75.0        87.3   83.3     55.1    75.2
                      Qwen3-32B                 67.5        81.8   83.3     44.0    69.2
           Thinking   Qwen3-30B-A3B             68.8        78.2   76.7     46.1    67.4
            Mode      Qwen3-14B                 72.5        72.7   73.3     45.8    66.1
                      Qwen3-8B                  65.0        72.7   73.3     42.9    63.5
                      Qwen3-4B                  68.8        63.6   60.0     42.2    58.6
                      Qwen3-1.7B                52.5        61.8   30.0     26.9    42.8
                      Qwen3-0.6B                33.8        38.2    6.7      9.8    22.1
                      GPT-4o-2024-1120          57.5        81.8   10.0     13.0    40.6
                      Gemma-3-27b-IT            52.5        74.5   33.3     20.6    45.2
                      Qwen2.5-72B-Instruct      61.3        72.7   26.7     18.6    44.8
                      Qwen3-235B-A22B           70.0        83.6   36.7     27.1    54.4
                      Qwen3-32B                 60.0        81.8   23.3     21.8    46.7
         Non-thinking
                      Qwen3-30B-A3B             52.5        81.8   20.0     24.7    44.8
            Mode
                      Qwen3-14B                 63.7        67.3   20.0     21.6    43.2
                      Qwen3-8B                  48.8        65.5   20.0     19.1    38.4
                      Qwen3-4B                  48.8        65.5   20.0     19.0    38.3
                      Qwen3-1.7B                36.2        60.0    3.3     10.9    27.6
                      Qwen3-0.6B                30.0        36.4    3.3      3.9    18.4




                                                       28
Table 34: Benchmark scores for language: German (de). The highest and second-best scores are shown
in bold and underlined, respectively.

                      Model                  INCLUDE MMMLU MT-AIME24 PolyMath Average
                      Gemini2.5-Pro            50.0          85.6          86.7          53.8          69.0
                      QwQ-32B                  57.1          83.8          76.7          51.0          67.2
                      Qwen3-235B-A22B          71.4          86.0          83.3          55.4          74.0
                      Qwen3-32B                64.3          81.9          86.7          48.1          70.2
           Thinking   Qwen3-30B-A3B            64.3          81.9          80.0          46.6          68.2
            Mode      Qwen3-14B                57.1          80.9          70.0          48.1          64.0
                      Qwen3-8B                 64.3          78.1          66.7          43.6          63.2
                      Qwen3-4B                 57.1          74.0          73.3          43.1          61.9
                      Qwen3-1.7B               64.3          63.4          36.7          26.8          47.8
                      Qwen3-0.6B               57.1          47.6          10.0          13.7          32.1
                      GPT-4o-2024-1120         57.1          80.4          10.0          13.5          40.2
                      Gemma-3-27b-IT           57.1          76.1          26.7          20.2          45.0
                      Qwen2.5-72B-Instruct     64.3          79.9          16.7          19.3          45.0
                      Qwen3-235B-A22B          71.4          81.7          40.0          25.9          54.8
                      Qwen3-32B                57.1          77.2          30.0          21.9          46.6
         Non-thinking
                      Qwen3-30B-A3B            57.1          77.7          23.3          25.2          45.8
            Mode
                      Qwen3-14B                57.1          76.0          30.0          24.5          46.9
                      Qwen3-8B                 64.3          70.8          20.0          19.9          43.8
                      Qwen3-4B                 64.3          66.0          26.7          16.4          43.4
                      Qwen3-1.7B               42.9          53.2          10.0          10.6          29.2
                      Qwen3-0.6B               42.9          37.8          3.3            5.7          22.4




Table 35: Benchmark scores for language: Thai (th). The highest and second-best scores are shown in
bold and underlined, respectively.

                           Model                  MLogiQA MT-AIME24 PolyMath Average
                           Gemini2.5-Pro              73.8          80.0          50.7          68.2
                           QwQ-32B                    75.0          60.0          41.3          58.8
                           Qwen3-235B-A22B            73.8          86.7          53.6          71.4
                           Qwen3-32B                  73.8          76.7          46.9          65.8
                Thinking   Qwen3-30B-A3B              63.7          80.0          45.2          63.0
                 Mode      Qwen3-14B                  65.0          76.7          44.4          62.0
                           Qwen3-8B                   68.8          70.0          41.3          60.0
                           Qwen3-4B                   60.0          60.0          39.4          53.1
                           Qwen3-1.7B                 48.8          33.3          23.7          35.3
                           Qwen3-0.6B                 33.8          13.3          11.4          19.5
                           GPT-4o-2024-1120           52.5          10.0          11.9          24.8
                           Gemma-3-27b-IT             50.0          16.7          19.0          28.6
                           Qwen2.5-72B-Instruct       58.8           6.7          17.4          27.6
                           Qwen3-235B-A22B            61.3          23.3          27.6          37.4
                           Qwen3-32B                  61.3          13.3          22.2          32.3
              Non-thinking
                           Qwen3-30B-A3B              50.0          30.0          22.3          34.1
                 Mode
                           Qwen3-14B                  47.5          23.3          22.1          31.0
                           Qwen3-8B                   42.5          10.0          17.2          23.2
                           Qwen3-4B                   43.8          13.3          16.1          24.4
                           Qwen3-1.7B                 42.5           6.7           9.5          19.6
                           Qwen3-0.6B                 37.5           0.0           3.6          13.7




                                                      29
        Table 36: Language families and language codes supported by Qwen3 in Belebele Benchmark

   Language family # Langs Language code (ISO 639-3 ISO 15924)
                                  por Latn, deu Latn, tgk Cyrl, ces Latn, nob Latn, dan Latn, snd Arab, spa Latn,
                                  isl Latn, slv Latn, eng Latn, ory Orya, hrv Latn, ell Grek, ukr Cyrl, pan Guru,
   Indo-European          40      srp Cyrl, npi Deva, mkd Cyrl, guj Gujr, nld Latn, swe Latn, hin Deva, rus Cyrl,
                                  asm Beng, cat Latn, als Latn, sin Sinh, urd Arab, mar Deva, lit Latn, slk Latn,
                                  ita Latn, pol Latn, bul Cyrl, afr Latn, ron Latn, fra Latn, ben Beng, hye Armn
   Sino-Tibetan           3       zho Hans, mya Mymr, zho Hant
   Afro-Asiatic           8       heb Hebr, apc Arab, acm Arab, ary Arab, ars Arab, arb Arab, mlt Latn, erz Arab
   Austronesian           7       ilo Latn, ceb Latn, tgl Latn, sun Latn, jav Latn, war Latn, ind Latn
   Dravidian              4       mal Mlym, kan Knda, tel Telu, tam Taml
   Turkic                 4       kaz Cyrl, azj Latn, tur Latn, uzn Latn
   Tai-Kadai              2       tha Thai, lao Laoo
   Uralic                 3       fin Latn, hun Latn, est Latn
   Austroasiatic          2       vie Latn, khm Khmr
   Other                  7       eus Latn, kor Hang, hat Latn, swh Latn, kea Latn, jpn Jpan, kat Geor




Table 37: Comparison of Belebele Benchmark performance between Qwen3 and other baseline models.
Scores are highlighted with the highest in bold and the second-best underlined.
                                 Indo-   Sino- Afro-                                   Tai-
Model                                                   Austronesian Dravidian Turkic       Uralic Austroasiatic Other
                               European Tibetan Asiatic                               Kadai
Gemma-3-27B-IT                   89.2     86.3    85.9      84.1        83.5    86.8   81.0   91.0     86.5      87.0
Qwen2.5-32B-Instruct             85.5     82.3    80.4      70.6        67.8    80.8   74.5   87.0     79.0      72.6
QwQ-32B                          86.1     83.7    81.9      71.3        69.3    80.3   77.0   88.0     83.0      74.0
Qwen3-32B (Thinking)             90.7     89.7    84.8      86.7        84.5    89.3   83.5   91.3     88.0      83.1
Qwen3-32B (Non-thinking)         89.1     88.0    82.3      83.7        84.0    85.0   85.0   88.7     88.0      81.3
Gemma-3-12B-IT                   85.8     83.3    83.4      79.3        79.0    82.8   77.5   89.0     83.0      81.6
Qwen2.5-14B-Instruct             82.7     78.9    80.4      69.1        66.2    74.2   72.2   83.9     77.9      70.4
Qwen3-14B (Thinking)             88.6     87.3    82.4      82.4        81.0    83.8   83.5   91.0     82.5      81.7
Qwen3-14B (Non-thinking)         87.4     82.7    80.1      80.7        78.0    81.8   80.5   87.7     81.5      77.0
Gemma-3-4B-IT                    71.8     72.0    63.5      61.7        64.8    64.0   61.5   70.7     71.0      62.6
Qwen2.5-3B-Instruct              58.0     62.3    57.2      47.9        36.9    45.1   49.8   50.6     56.8      48.4
Qwen3-4B (Thinking)              82.2     77.7    74.1      73.0        74.3    76.3   68.5   83.0     74.5      67.9
Qwen3-4B (Non-thinking)          76.0     77.0    65.6      65.6        65.5    64.0   60.5   74.0     74.0      61.0
Gemma-3-1B-IT                    36.5     36.0    30.0      29.1        28.8    27.3   28.0   32.7     33.0      30.9
Qwen2.5-1.5B-Instruct            41.5     43.0    39.6      34.8        28.6    29.7   39.4   33.8     42.0      36.0
Qwen3-1.7B (Thinking)            69.7     66.0    59.4      58.6        52.8    57.8   53.5   70.3     63.5      53.4
Qwen3-1.7B (Non-thinking)        58.8     62.7    50.8      53.0        43.3    48.0   46.0   54.3     54.0      43.9




                                                           30
References
Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael
 Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report. arXiv
 preprint arXiv:2412.08905, 2024.
AIME. AIME problems and solutions, 2025. URL https://artofproblemsolving.com/wiki/index.p
  hp/AIME Problems and Solutions.
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit
  Sanghai. GQA: Training generalized multi-query Transformer models from multi-head checkpoints. In
  EMNLP, pp. 4895–4901. Association for Computational Linguistics, 2023.
Chenxin An, Fei Huang, Jun Zhang, Shansan Gong, Xipeng Qiu, Chang Zhou, and Lingpeng Kong.
 Training-free long-context scaling of large language models. CoRR, abs/2402.17463, 2024.
Anthropic. Claude 3.7 Sonnet, 2025. URL https://www.anthropic.com/news/claude-3-7-sonnet.
Jacob Austin, Augustus Odena, Maxwell I. Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
  Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, and Charles Sutton. Program synthesis with large
  language models. CoRR, abs/2108.07732, 2021.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei
   Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu,
   Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong
   Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang,
   Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan
   Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang
   Zhu. Qwen technical report. CoRR, abs/2309.16609, 2023.
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie
  Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa,
  Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. The Belebele benchmark:
  A parallel reading comprehension dataset in 122 language variants. CoRR, abs/2308.16884, 2023.
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind
  Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss,
  Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
  Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin
  Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario
  Amodei. Language models are few-shot learners. In NeurIPS, 2020.
Federico Cassano, John Gouwar, Daniel Nguyen, Sydney Nguyen, Luna Phipps-Costin, Donald Pinckney,
  Ming-Ho Yee, Yangtian Zi, Carolyn Jane Anderson, Molly Q. Feldman, Arjun Guha, Michael Greenberg,
  and Abhinav Jangda. MultiPL-E: A scalable and polyglot approach to benchmarking neural code
  generation. IEEE Trans. Software Eng., 49(7):3675–3691, 2023.
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pondé de Oliveira Pinto, Jared Kaplan,
 Harrison Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen
 Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick
 Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe
 Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes,
 Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor
 Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr,
 Jan Leike, Joshua Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles
 Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish,
 Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code. CoRR,
 abs/2107.03374, 2021.
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
  Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman.
 Training verifiers to solve math word problems. CoRR, abs/2110.14168, 2021.
Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding
 Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and
 Wenfeng Liang. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language
 models. CoRR, abs/2401.06066, 2024.


                                                   31
Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated
  convolutional networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pp. 933–941.
  PMLR, 2017.
Google DeepMind. Gemini 2.5, 2025. URL https://blog.google/technology/google-deepmind/gemi
 ni-model-thinking-updates-march-2025/.
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, An-
 dreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton,
 Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer,
 Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh
 Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko,
 Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip
 Pavetic, Dustin Tran, Thomas Kipf, Mario Lucic, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen,
 and Neil Houlsby. Scaling vision transformers to 22 billion parameters. In ICML, volume 202 of
 Proceedings of Machine Learning Research, pp. 7480–7512. PMLR, 2023.
Xinrun Du, Yifan Yao, Kaijing Ma, Bingli Wang, Tianyu Zheng, King Zhu, Minghao Liu, Yiming Liang,
  Xiaolong Jin, Zhenlin Wei, et al. SuperGPQA: Scaling LLM evaluation across 285 graduate disciplines.
  arXiv preprint arXiv:2502.14739, 2025.
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha
 Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn,
 Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston
 Zhang, Aurélien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Rozière, Bethany Biron, Binh
 Tang, Bobbie Chern, Charlotte Caucheteux, Chaya Nayak, Chloe Bi, Chris Marra, Chris McConnell,
 Christian Keller, Christophe Touret, Chunyang Wu, Corinne Wong, Cristian Canton Ferrer, Cyrus
 Nikolaidis, Damien Allonsius, Daniel Song, Danielle Pintz, Danny Livshits, David Esiobu, Dhruv
 Choudhary, Dhruv Mahajan, Diego Garcia-Olano, Diego Perino, Dieuwke Hupkes, Egor Lakomkin,
 Ehab AlBadawy, Elina Lobanova, Emily Dinan, Eric Michael Smith, Filip Radenovic, Frank Zhang,
 Gabriel Synnaeve, Gabrielle Lee, Georgia Lewis Anderson, Graeme Nail, Grégoire Mialon, Guan
 Pang, Guillem Cucurell, Hailey Nguyen, Hannah Korevaar, Hu Xu, Hugo Touvron, Iliyan Zarov,
 Imanol Arrieta Ibarra, Isabel M. Kloumann, Ishan Misra, Ivan Evtimov, Jade Copet, Jaewon Lee, Jan
 Geffert, Jana Vranes, Jason Park, Jay Mahadeokar, Jeet Shah, Jelmer van der Linde, Jennifer Billock,
 Jenny Hong, Jenya Lee, Jeremy Fu, Jianfeng Chi, Jianyu Huang, Jiawen Liu, Jie Wang, Jiecao Yu,
 Joanna Bitton, Joe Spisak, Jongsoo Park, Joseph Rocca, Joshua Johnstun, Joshua Saxe, Junteng Jia,
 Kalyan Vasuden Alwala, Kartikeya Upasani, Kate Plawiak, Ke Li, Kenneth Heafield, Kevin Stone, and
 et al. The Llama 3 herd of models. CoRR, abs/2407.21783, 2024.
Simin Fan, Matteo Pagliardini, and Martin Jaggi. DoGE: Domain reweighting with generalization
  estimation. arXiv preprint arXiv:2310.15393, 2023.
Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino,
  Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, et al. Are we
  done with MMLU? CoRR, abs/2406.04127, 2024.
Alex Gu, Baptiste Rozière, Hugh Leather, Armando Solar-Lezama, Gabriel Synnaeve, and Sida I.
  Wang. CRUXEval: A benchmark for code reasoning, understanding and execution. arXiv preprint
  arXiv:2401.03065, 2024.
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma,
 Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement
 learning. arXiv preprint arXiv:2501.12948, 2025.
Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu
  Xu, Hongjiang Lv, et al. Multi-IF: Benchmarking LLMs on multi-turn and multilingual instructions
  following. arXiv preprint arXiv:2410.15553, 2024.
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
 Steinhardt. Measuring massive multitask language understanding. In ICLR. OpenReview.net, 2021a.
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
 and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In NeurIPS
 Datasets and Benchmarks, 2021b.
Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang,
  and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? CoRR,
  abs/2404.06654, 2024.


                                                  32
Yuzhen Huang, Yuzhuo Bai, Zhihao Zhu, Junlei Zhang, Jinghan Zhang, Tangjun Su, Junteng Liu,
  Chuancheng Lv, Yikai Zhang, Jiayi Lei, Yao Fu, Maosong Sun, and Junxian He. C-Eval: A multi-
  level multi-discipline chinese evaluation suite for foundation models. In NeurIPS, 2023.
Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen
  Yu, Keming Lu, et al. Qwen2.5-Coder technical report. CoRR, abs/2409.12186, 2024.
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-
 Lezama, Koushik Sen, and Ion Stoica. LiveCodeBench: Holistic and contamination free evaluation of
 large language models for code. CoRR, abs/2403.07974, 2024.
Zixuan Jiang, Jiaqi Gu, Hanqing Zhu, and David Z. Pan. Pre-RMSNorm and Pre-CRMSNorm Transform-
  ers: Equivalent and efficient pre-LN Transformers. CoRR, abs/2305.14858, 2023.
Nathan Lambert, Jacob Morrison, Valentina Pyatkin, Shengyi Huang, Hamish Ivison, Faeze Brahman,
 Lester James V. Miranda, Alisa Liu, Nouha Dziri, Shane Lyu, Yuling Gu, Saumya Malik, Victoria Graf,
 Jena D. Hwang, Jiangjiang Yang, Ronan Le Bras, Oyvind Tafjord, Chris Wilhelm, Luca Soldaini, Noah A.
 Smith, Yizhong Wang, Pradeep Dasigi, and Hannaneh Hajishirzi. Tülu 3: Pushing frontiers in open
 language model post-training. CoRR, abs/2411.15124, 2024.
Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez,
  and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-Hard and BenchBuilder
  pipeline. CoRR, abs/2406.11939, 2024.
Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John
 Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. CoRR, abs/2305.20050, 2023.
Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark,
  and Yejin Choi. ZebraLogic: On the scaling limits of LLMs for logical reasoning. CoRR, abs/2502.01100,
  2025.
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi
  Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437,
  2024a.
Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT
   really correct? Rigorous evaluation of large language models for code generation. In NeurIPS, 2023a.
Qian Liu, Xiaosen Zheng, Niklas Muennighoff, Guangtao Zeng, Longxu Dou, Tianyu Pang, Jing Jiang,
  and Min Lin. RegMix: Data mixture as regression for language model pre-training. arXiv preprint
  arXiv:2407.01492, 2024b.
Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan
  Xu, Weng Lam Tam, Xiaohan Zhang, Lichao Sun, Hongning Wang, Jing Zhang, Minlie Huang, Yuxiao
  Dong, and Jie Tang. AlignBench: Benchmarking Chinese alignment of large language models. CoRR,
  abs/2311.18743, 2023b.
Meta-AI. The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation, 2025.
 URL https://ai.meta.com/blog/llama-4-multimodal-intelligence/.
OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/.
OpenAI. Multilingual massive multitask language understanding, 2024. URL https://huggingface.co
 /datasets/openai/MMMLU.
OpenAI. Learning to reason with LLMs, 2024. URL https://openai.com/index/learning-to-reaso
 n-with-llms/.
OpenAI. Introducing openai o3 and o4-mini, 2025. URL https://openai.com/index/introducing-o
 3-and-o4-mini/.
Samuel J. Paech. Creative writing v3, 2024. URL https://eqbench.com/creative writing.html.
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window
  extension of large language models. CoRR, abs/2309.00071, 2023.
Zihan Qiu, Zeyu Huang, Bo Zheng, Kaiyue Wen, Zekun Wang, Rui Men, Ivan Titov, Dayiheng Liu,
  Jingren Zhou, and Junyang Lin. Demons in the detail: On implementing load balancing loss for
  training specialized mixture-of-expert models. CoRR, abs/2501.11873, 2025.


                                                  33
Shanghaoran Quan, Jiaxi Yang, Bowen Yu, Bo Zheng, Dayiheng Liu, An Yang, Xuancheng Ren, Bofei
  Gao, Yibo Miao, Yunlong Feng, Zekun Wang, Jian Yang, Zeyu Cui, Yang Fan, Yichang Zhang, Binyuan
  Hui, and Junyang Lin. CodeElo: Benchmarking competition-level code generation of LLMs with
  human-comparable Elo ratings. CoRR, abs/2501.01257, 2025.
Qwen Team. QwQ: Reflect deeply on the boundaries of the unknown, November 2024. URL https:
 //qwenlm.github.io/blog/qwq-32b-preview/.
Qwen Team. QwQ-32B: Embracing the power of reinforcement learning, March 2025. URL https:
 //qwenlm.github.io/blog/qwq-32b/.
David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani,
 Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark.
 CoRR, abs/2311.12022, 2023.
Angelika Romanou, Negar Foroutan, Anna Sotnikova, Zeming Chen, Sree Harsha Nelaturu, Shivalika
 Singh, Rishabh Maheshwary, Micol Altomare, Mohamed A. Haggag, Snegha A, Alfonso Amayuelas,
 Azril Hafizi Amirudin, Viraat Aryabumi, Danylo Boiko, Michael Chang, Jenny Chim, Gal Cohen,
 Aditya Kumar Dalmia, Abraham Diress, Sharad Duwal, Daniil Dzenhaliou, Daniel Fernando Erazo Flo-
 rez, Fabian Farestam, Joseph Marvin Imperial, Shayekh Bin Islam, Perttu Isotalo, Maral Jabbarishiviari,
 Börje F. Karlsson, Eldar Khalilov, Christopher Klamm, Fajri Koto, Dominik Krzeminski, Gabriel Adri-
 ano de Melo, Syrielle Montariol, Yiyang Nan, Joel Niklaus, Jekaterina Novikova, Johan Samir Obando
 Ceron, Debjit Paul, Esther Ploeger, Jebish Purbey, Swati Rajwal, Selvan Sunitha Ravi, Sara Rydell,
 Roshan Santhosh, Drishti Sharma, Marjana Prifti Skenduli, Arshia Soltani Moakhar, Bardia Soltani
 Moakhar, Ran Tamir, Ayush Kumar Tarun, Azmine Toushik Wasi, Thenuka Ovin Weerasinghe, Serhan
 Yilmaz, Mike Zhang, Imanol Schlag, Marzieh Fadaee, Sara Hooker, and Antoine Bosselut. INCLUDE:
 evaluating multilingual language understanding with regional knowledge. CoRR, abs/2411.19799,
 2024.
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with
  subword units. In ACL (1). The Association for Computer Linguistics, 2016.
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and
  Daya Guo. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models.
  CoRR, abs/2402.03300, 2024.
Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won
  Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are
  multilingual chain-of-thought reasoners. In ICLR. OpenReview.net, 2023.
Guijin Son, Jiwoo Hong, Hyunwoo Ko, and James Thorne. Linguistic generalizability of test-time scaling
 in mathematical reasoning. CoRR, abs/2502.17407, 2025.
Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer:
   Enhanced Transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung,
 Aakanksha Chowdhery, Quoc V. Le, Ed H. Chi, Denny Zhou, and Jason Wei. Challenging BIG-Bench
 tasks and whether chain-of-thought can solve them. In ACL (Findings), pp. 13003–13051. Association
 for Computational Linguistics, 2023.
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah
 Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report.
 arXiv preprint arXiv:2503.19786, 2025.
Changhan Wang, Kyunghyun Cho, and Jiatao Gu. Neural machine translation with byte-level subwords.
  In AAAI, pp. 9154–9160. AAAI Press, 2020.
Yiming Wang, Pei Zhang, Jialong Tang, Haoran Wei, Baosong Yang, Rui Wang, Chenshu Sun, Feitong
  Sun, Jiran Zhang, Junxuan Wu, Qiqian Cang, Yichang Zhang, Fei Huang, Junyang Lin, Fei Huang, and
  Jingren Zhou. PolyMath: Evaluating mathematical reasoning in multilingual contexts, 2025.
Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren,
  Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang
  Yue, and Wenhu Chen. MMLU-Pro: A more robust and challenging multi-task language understanding
  benchmark. CoRR, abs/2406.01574, 2024.



                                                  34
Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Benjamin Feuer, Siddhartha Jain, Ravid Shwartz-
  Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, Chinmay Hegde, Yann LeCun, Tom Goldstein, Willie
  Neiswanger, and Micah Goldblum. LiveBench: A challenging, contamination-free LLM benchmark.
  CoRR, abs/2406.19314, 2024.
Yuning Wu, Jiahao Mei, Ming Yan, Chenliang Li, Shaopeng Lai, Yuran Ren, Zijia Wang, Ji Zhang, Mengyue
  Wu, Qin Jin, and Fei Huang. WritingBench: A comprehensive benchmark for generative writing. CoRR,
  abs/2503.05244, 2025.
xAI. Grok 3 beta — the age of reasoning agents, 2025. URL https://x.ai/news/grok-3.
Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy S Liang, Quoc V
  Le, Tengyu Ma, and Adams Wei Yu. Doremi: Optimizing data mixtures speeds up language model
  pretraining. Advances in Neural Information Processing Systems, 36:69798–69818, 2023.
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi
 Rungta, Karthik Abinav Sankararaman, Barlas Oguz, Madian Khabsa, Han Fang, Yashar Mehdad,
 Sharan Narang, Kshitiz Malik, Angela Fan, Shruti Bhosale, Sergey Edunov, Mike Lewis, Sinong Wang,
 and Hao Ma. Effective long-context scaling of foundation models. CoRR, abs/2309.16039, 2023.
Fanjia Yan, Huanzhi Mao, Charlie Cheng-Jie Ji, Tianjun Zhang, Shishir G. Patil, Ion Stoica, and Joseph E.
  Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8 ber
  keley function calling leaderboard.html, 2024.
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li,
 Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang,
 Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jianxin Yang, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He,
 Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang,
 Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu,
 Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang,
 Xipin Wei, Xuancheng Ren, Xuejing Liu, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu,
 Yuqiong Liu, Zeyu Cui, Zhenru Zhang, Zhifang Guo, and Zhihao Fan. Qwen2 technical report. CoRR,
 abs/2407.10671, 2024a.
An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng
 Liu, Fei Huang, Haoran Wei, et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115, 2024b.
An Yang, Beichen Zhang, Binyuan Hui, Bofei Gao, Bowen Yu, Chengpeng Li, Dayiheng Liu, Jianhong Tu,
 Jingren Zhou, Junyang Lin, et al. Qwen2.5-Math technical report: Toward mathematical expert model
 via self-improvement. CoRR, abs/2409.12122, 2024c.
Yidan Zhang, Boyi Deng, Yu Wan, Baosong Yang, Haoran Wei, Fei Huang, Bowen Yu, Junyang Lin, and
  Jingren Zhou. P-MMEval: A parallel multilingual multitask benchmark for consistent evaluation of
  LLMs. CoRR, abs/2411.09116, 2024.
Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and
   Le Hou. Instruction-following evaluation for large language models. CoRR, abs/2311.07911, 2023.
Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang,
  and Junyang Lin. AutoLogi: Automated generation of logic puzzles for evaluating reasoning abilities
  of large language models. CoRR, abs/2502.16906, 2025.




                                                   35