Qwen3 Technical Report

Metadata

Reading status: read complete
Year: 2025
Compute regime: Hyperscale dense LLM training (hyperscale_dense_llm_training)
PDF: 2025-qwen3_2025.pdf
Extracted text: 2025-qwen3_2025.txt
PDF URL: https://arxiv.org/pdf/2505.09388.pdf
OpenAlex:
Citation count source/date: Frontier provisional 2026-06-15
Citation count:
Reading card created: 2026-06-15

Compute Setup

The Qwen3 report does not state the training hardware, accelerator count, memory size, or interconnect. Under the project rule, the hardware setup is therefore inferred from the frontier accelerator context about one year before the May 2025 publication: H100/H200-class GPU clusters or comparable TPU v5p/v6e pod-scale systems, with high-bandwidth HBM and scale-up fabric as the practical constraint. The local accelerator-era map lists 2024 H200 SXM as a memory-expanded Hopper device with 141 GB HBM3e and 4.8 TB/s bandwidth, and 2024 TPU v6e as a TPU training/serving chip with 32 GB HBM and 256-chip pods. This is inferred, not paper-stated hardware.

The paper's stated compute shape is the model and data scale. Qwen3 includes six dense models from 0.6B to 32B parameters and two MoE models: Qwen3-30B-A3B and Qwen3-235B-A22B. The flagship has 235B total parameters but activates 22B per token. Pretraining uses 36T tokens over 119 languages and dialects. The three stages are over 30T tokens at sequence length 4096, about 5T higher-quality STEM/coding/reasoning tokens at 4096, and hundreds of billions of long-context tokens at sequence length 32768. Most released models advertise 128K context at inference, with YaRN and Dual Chunk Attention used for extension.
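
The scale figures above can be turned into a few back-of-envelope numbers. The sketch below is illustrative only: the parameter and token counts come from this card, but the 6*N*D FLOPs heuristic is an outside rule of thumb that the report does not use, and the function names are hypothetical.

```python
# Figures from the card: flagship MoE size and pretraining token count.
TOTAL_PARAMS = 235e9      # total parameters, Qwen3-235B-A22B
ACTIVE_PARAMS = 22e9      # parameters activated per token
TOTAL_TOKENS = 36e12      # pretraining tokens


def active_fraction(active_params, total_params):
    """Share of the model's weights touched by a single token."""
    return active_params / total_params


def approx_train_flops(active_params, tokens):
    """Rule-of-thumb estimate: about 6 FLOPs per active parameter per token.

    ASSUMPTION: the 6*N*D heuristic is not from the report. It ignores
    attention FLOPs, router overhead, and any recomputation.
    """
    return 6.0 * active_params * tokens


def sequences_needed(tokens, seq_len):
    """How many fixed-length sequences a token budget corresponds to."""
    return tokens / seq_len


if __name__ == "__main__":
    print(f"active fraction: {active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS):.3f}")
    print(f"approx FLOPs: {approx_train_flops(ACTIVE_PARAMS, TOTAL_TOKENS):.2e}")
    print(f"4096-token sequences in 30T tokens: {sequences_needed(30e12, 4096):.2e}")
```

The resulting FLOPs figure is an order-of-magnitude estimate under the stated heuristic, not a number reported by the paper.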

Bottleneck

The bottleneck is no longer just dense pretraining FLOPs. Qwen3 has to fit three pressures into one release: huge multilingual token scale, sparse MoE capacity with tolerable per-token cost, and inference-time reasoning that can be dialed up or down without deploying separate models. The 235B MoE model reduces active parameters to 22B per token, but routing 128 experts with 8 active experts still creates load-balancing, communication, and serving-memory problems. Long context adds KV-cache and attention pressure, especially when the same model must support both fast non-thinking responses and long thinking traces.
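
To make the KV-cache pressure concrete, a minimal sketch of the standard cache-size formula follows. The layer count, head counts, and head dimension in the demo are illustrative assumptions and are not stated in this card; the function names are hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `batch` sequences.

    The leading 2 counts one K tensor and one V tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch


def gqa_reduction(query_heads, kv_heads):
    """Factor by which GQA shrinks the cache versus one KV head per query head."""
    return query_heads / kv_heads


if __name__ == "__main__":
    # ASSUMPTION: illustrative configuration, not taken from the card.
    cfg = dict(layers=94, kv_heads=4, head_dim=128)
    for seq_len in (4096, 32768, 131072):
        gib = kv_cache_bytes(seq_len=seq_len, **cfg) / 2**30
        print(f"{seq_len:>7} tokens: {gib:.2f} GiB per sequence")
    print("GQA reduction:", gqa_reduction(64, 4))
```

The cache grows linearly with sequence length, so a long thinking trace costs the same memory per token as long input context, which is why the two pressures compound.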

Post-training also becomes a compute bottleneck. The report explicitly treats thinking tokens as an inference resource: users can allocate a thinking budget to trade latency for performance. It also says building every smaller model through the full four-stage reasoning pipeline would be expensive, so strong-to-weak distillation is used to reduce post-training GPU hours.
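
One way to picture a thinking budget is as a cap on the reasoning span before the model is forced to answer. The sketch below is a toy under stated assumptions: whitespace splitting stands in for real tokenization, the `<think>` tags and the stop note wording are placeholders, and `apply_thinking_budget` is a hypothetical helper rather than an API from the report.

```python
# ASSUMPTION: placeholder wording, not the text used by the Qwen3 authors.
STOP_NOTE = "Thinking budget reached; answering from the reasoning so far."


def apply_thinking_budget(thinking_text, budget_tokens):
    """Cap a reasoning trace at `budget_tokens` whitespace-separated tokens.

    If the trace fits, it is wrapped unchanged. If it is too long, it is
    truncated and a stop note is appended before the thinking span closes,
    so the answer is conditioned on partial reasoning.
    """
    tokens = thinking_text.split()
    if len(tokens) <= budget_tokens:
        return "<think>\n" + thinking_text + "\n</think>\n"
    kept = " ".join(tokens[:budget_tokens])
    return "<think>\n" + kept + "\n" + STOP_NOTE + "\n</think>\n"


if __name__ == "__main__":
    trace = "first try small cases then generalize then verify the bound"
    print(apply_thinking_budget(trace, budget_tokens=4))
```

A larger budget keeps more of the trace and costs more decode latency, which is the trade the card describes.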

Method Adaptation

Qwen3 adapts the Transformer stack to this accelerator setting by combining dense-model stabilizers, sparse activation, and inference-budget controls. Dense models use GQA, SwiGLU, RoPE, RMSNorm, removal of QKV bias, and QK-Norm; these choices mainly buy stable attention and trainability at large scale rather than only lower device pressure. The MoE models segment capacity into 128 experts, activate 8 per token, remove the shared experts used in Qwen2.5-MoE, and add a global-batch load-balancing loss to encourage expert specialization. This is also a compute adaptation: total parameter memory scales up, while the active matmul work per token stays closer to that of a smaller dense model.
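
The 128-expert, top-8 routing described above can be sketched in plain Python. This is a minimal illustration, not the Qwen3 implementation: renormalizing the top-k gate weights and the Switch-Transformer-style form of the balance loss are assumptions, and all function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def route_token(router_logits, k):
    """Pick the top-k experts for one token.

    Returns ({expert_index: gate_weight}, full_probabilities).
    ASSUMPTION: gate weights are renormalized over the chosen experts.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}, probs


def load_balance_loss(all_probs, all_choices, num_experts, k):
    """Auxiliary loss: num_experts * sum_i f_i * p_i.

    f_i is the fraction of routed slots sent to expert i and p_i is the mean
    router probability of expert i. A "global-batch" variant computes both
    statistics over every token in the global batch, not per micro-batch;
    here that just means passing all tokens in one call.
    """
    n = len(all_probs)
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for probs, chosen in zip(all_probs, all_choices):
        for i in chosen:
            f[i] += 1.0 / (n * k)
        for i, pr in enumerate(probs):
            p[i] += pr / n
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    logits = [math.sin(i) for i in range(128)]  # stand-in router output
    weights, probs = route_token(logits, k=8)
    print(sorted(weights))
    print(load_balance_loss([probs], [list(weights)], num_experts=128, k=8))
```

Under this form the loss equals 1.0 when routing is perfectly uniform and grows as tokens concentrate on a few experts.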

The training schedule is also device-shaped. Qwen3 spends most tokens at a sequence length of 4096, then reserves a final long-context stage for 32768-token training data. The RoPE base frequency is raised from 10,000 to 1,000,000, and YaRN and DCA then provide a four-fold extension at inference, from 32768 to 131072 tokens. This avoids training the whole 36T-token run at the maximum advertised context length.
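
The effect of raising the RoPE base can be seen by comparing rotation wavelengths. The sketch below uses the standard RoPE frequency formula; the head dimension of 128 is an assumption not stated in this card, and the function names are hypothetical.

```python
import math


def rope_inv_freq(head_dim, base):
    """Standard RoPE inverse frequencies: base ** (-2i / d) for each pair i."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Wavelength in tokens of the slowest-rotating dimension pair."""
    return 2.0 * math.pi / rope_inv_freq(head_dim, base)[-1]


def extended_context(trained_len, factor):
    """Context length after an inference-time extension by `factor`."""
    return trained_len * factor


if __name__ == "__main__":
    head_dim = 128  # ASSUMPTION: illustrative head dimension
    for base in (10_000.0, 1_000_000.0):
        print(f"base {base:>11,.0f}: {longest_wavelength(head_dim, base):,.0f} tokens")
    print("extended context:", extended_context(32768, 4))
```

A larger base stretches the slowest rotations over many more positions, which gives distant tokens distinguishable phases before any inference-time scaling is applied.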

Post-training turns reasoning into a controllable serving mode. The flagship models use long-CoT cold start, reasoning RL, thinking-mode fusion, and general RL. The /think and /no_think flags teach one model to produce both deliberative and direct response styles, while thinking budgets make partial reasoning usable. Smaller models use strong-to-weak distillation: off-policy response distillation teaches both modes, and on-policy logit distillation transfers teacher distributions from Qwen3-32B or Qwen3-235B-A22B.
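
On-policy logit distillation can be sketched as a per-token KL divergence between student and teacher distributions, evaluated on sequences the student itself sampled. The code below is a minimal illustration under assumptions: the KL direction (student relative to teacher) and the mean reduction are choices made here, not details confirmed by this card, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(p_logits, q_logits):
    """KL(p || q) for two categorical distributions given as logits."""
    lp = log_softmax(p_logits)
    lq = log_softmax(q_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def on_policy_distill_loss(student_logits_seq, teacher_logits_seq):
    """Mean per-token KL(student || teacher) over one sampled sequence.

    ASSUMPTION: both logit sequences are scored on the same student-sampled
    tokens and share one vocabulary.
    """
    per_token = [
        kl_divergence(s, t)
        for s, t in zip(student_logits_seq, teacher_logits_seq)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    student = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
    teacher = [[1.5, 1.0, -2.0], [0.0, 0.0, 1.0]]
    print(on_policy_distill_loss(student, teacher))
```

Because the teacher only has to score student samples, this avoids the reward computation and long exploration of an RL loop, which fits the GPU-hour savings reported in the Evidence section.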

Evidence

The report's cost evidence is mostly relative, but concrete. For base models, it states that Qwen3 MoE base models can match Qwen3 dense base models with only 1/5 activated parameters, and that Qwen3-235B-A22B-Base outperforms DeepSeek-V3-Base on 14 of 15 benchmarks with about 1/3 the total parameters and 2/3 activated parameters. It also says Qwen3-235B-A22B-Base beats Qwen2.5-72B-Base across all listed base benchmarks while using fewer than 1/3 of the activated parameters.

For post-training, Qwen3-235B-A22B reaches 85.7 on AIME'24, 81.5 on AIME'25, 70.7 on LiveCodeBench v5, a CodeForces rating of 2056, and 70.8 on BFCL v3. In the reasoning RL stage, AIME'24 for Qwen3-235B-A22B rises from 70.1 to 85.1 over 170 RL steps. The distillation comparison is the clearest compute result: starting from the same off-policy-distilled 8B checkpoint, reinforcement learning uses 17,920 GPU hours and reaches 67.6 on AIME'24 and 52.9 on LiveCodeBench v5, while on-policy distillation uses 1,800 GPU hours and reaches 74.4 and 60.3. The paper summarizes this as roughly 1/10 the GPU hours with better performance.
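
The "roughly 1/10" summary can be checked directly from the figures quoted above. The snippet only restates numbers from this card; the dictionary layout and function names are hypothetical.

```python
# Figures quoted in the card for the 8B student after off-policy distillation.
RL = {"gpu_hours": 17920, "aime24": 67.6, "livecodebench_v5": 52.9}
ON_POLICY = {"gpu_hours": 1800, "aime24": 74.4, "livecodebench_v5": 60.3}


def cost_ratio(cheap, expensive):
    """GPU-hour ratio of the cheaper recipe to the more expensive one."""
    return cheap["gpu_hours"] / expensive["gpu_hours"]


def score_gain(a, b, metric):
    """Score difference a - b on one benchmark, rounded to one decimal."""
    return round(a[metric] - b[metric], 1)


if __name__ == "__main__":
    print(f"GPU-hour ratio: {cost_ratio(ON_POLICY, RL):.3f}")
    for metric in ("aime24", "livecodebench_v5"):
        print(metric, "gain:", score_gain(ON_POLICY, RL, metric))
```

The ratio comes out just above 0.10, so on-policy distillation is both cheaper and higher scoring on the two benchmarks listed.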

Long-context evidence is mixed but useful. On RULER, non-thinking Qwen3-235B-A22B averages 95.0 and scores 90.6 at 128K, while thinking mode averages 92.2 and scores 86.0 at 128K. The authors argue that thinking content can interfere with retrieval-style long-context tasks.

Historical Effect

Qwen3 is a 2025 example of open LLM engineering after the field had separated raw pretraining, post-training, and inference-time compute into distinct budgets. Its historical role is not a single new primitive; it is the packaging of several compute adaptations and technical capabilities into one model family. MoE adds specialized capacity, long-context staging extends usable context, and thinking budgets make reasoning depth an explicit behavior knob.

The card therefore belongs mainly to hyperscale LLM training, with overlaps into sparse scaling and inference-time reasoning. It shows how a frontier-style open model family tries to preserve broad benchmark quality while making deployment choices explicit: dense or MoE size, thinking or non-thinking mode, and smaller distilled students when full post-training would be too expensive.

Limits

The biggest limit for this history is hardware opacity. The source text gives no device count, accelerator model, interconnect, optimizer sharding setup, training wall time, or achieved utilization. Any hardware discussion must remain an inference from the repo's accelerator-era map.

The report also gives little absolute cost accounting for the 36T-token run or for MoE serving. It claims cheaper inference and training per trillion tokens, but does not expose enough data to compute those savings. Long-context reasoning is not uniformly beneficial: the appendix says thinking mode degrades on RULER, probably because retrieval tasks do not need extra reasoning tokens. Future work explicitly points to more pretraining scale, compression, extremely long contexts, and more compute for agent RL, so the model remains compute-limited rather than a finished efficiency solution.