Qwen3.5-Omni Technical Report

Compute Setup

Qwen3.5-Omni does not name a specific training accelerator in the extracted text. Under the project rule, the hardware setup is therefore inferred from the frontier accelerator setup about one year before its April 2026 publication: 2025-era Blackwell HGX B200 or TPU7x-class clusters. The paper itself only confirms internal GPU-style serving through vLLM, torch.compile, and CUDA Graph acceleration for the MTP module and codec decoder. The safe claim is not "trained on B200," but "designed for frontier long-context, high-concurrency accelerator serving."

The model is a hundreds-of-billions-parameter omnimodal system with 256K-token context. Thinker consumes text, image, video, and audio representations and emits text; Talker conditions on Thinker outputs and context to stream speech tokens. Both use Hybrid-Attention MoE designs. AuT is trained from scratch on 40 million hours of audio-text data and produces 6.25 Hz audio tokens after 16x downsampling. Pretraining includes early omnimodal data, a 4T-token general stage, and a long-context stage that raises maximum length from 32,768 to 262,144 tokens. The broader pool contains more than 100 million hours of audio-visual content; Talker uses more than 20 million hours of multilingual speech.
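The rate and context figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, the 100 Hz input frame rate is merely what 6.25 Hz and 16x would imply if the factor applies to a uniform frame sequence, and the hours-of-audio estimate assumes the entire context is spent on audio tokens, which the report does not state.

```python
AUDIO_TOKEN_RATE_HZ = 6.25       # from the report: AuT output token rate
DOWNSAMPLE_FACTOR = 16           # from the report: AuT downsampling
SHORT_CONTEXT_TOKENS = 32_768    # from the report: length before the long-context stage
MAX_CONTEXT_TOKENS = 262_144     # from the report: length after the long-context stage


def implied_input_frame_rate_hz() -> float:
    """Frame rate before downsampling, ASSUMING a uniform 16x reduction."""
    return AUDIO_TOKEN_RATE_HZ * DOWNSAMPLE_FACTOR


def audio_tokens(duration_s: float) -> int:
    """Number of audio tokens produced for a clip of the given duration."""
    return int(duration_s * AUDIO_TOKEN_RATE_HZ)


def max_audio_hours() -> float:
    """Upper bound on audio length, ASSUMING the context holds only audio tokens."""
    return MAX_CONTEXT_TOKENS / AUDIO_TOKEN_RATE_HZ / 3600.0


def context_extension_factor() -> float:
    """How much the long-context stage multiplies the maximum length."""
    return MAX_CONTEXT_TOKENS / SHORT_CONTEXT_TOKENS
```

Under these assumptions one hour of audio costs 22,500 tokens, the full window holds a little under 12 hours of audio, and the long-context stage is an 8x extension.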

Bottleneck

The bottleneck is the combination of media bandwidth, long-context memory, and real-time latency. An omni model must handle dense audio streams, video frames, text, tool calls, and speech generation without turning interaction into a slow batch job. Earlier temporal-position schemes create sparse absolute time IDs on long audio-video inputs, weakening long-range temporal modeling and increasing data-construction cost. Speech generation adds a different systems issue: text tokenizers and speech tokenizers have mismatched rates, causing skipped words, bad pronunciations, unstable prosody, and awkward number rendering if the two streams are synchronized naively.

Deployment is a first-packet-latency problem. In streaming audio-visual interaction, user experience depends on the time to first text token, first audio chunk, steady-state output-token latency, codec decode time, and generation real-time factor. Long audio and video also stress KV-cache I/O, so supporting 10 hours of audio or 400 seconds of 720P video at 1 FPS requires more than scaling model size.
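The metrics named above have conventional definitions that can be computed from a timestamped trace. What follows is a minimal sketch, assuming a hypothetical StreamTrace record; the field names and any numbers fed to it are illustrative and are not taken from the paper's measurement setup. Real-time factor is used in its usual sense of generation time divided by the duration of audio produced, so values below 1 mean faster than real time.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StreamTrace:
    """Hypothetical per-request trace; all times are in seconds."""
    request_t: float             # when the request arrived
    text_token_ts: List[float]   # arrival time of each text token
    audio_chunk_ts: List[float]  # arrival time of each audio chunk
    generation_seconds: float    # wall-clock time spent generating audio
    audio_seconds_out: float     # duration of the audio that was produced


def time_to_first_text_token(tr: StreamTrace) -> float:
    return tr.text_token_ts[0] - tr.request_t


def first_audio_packet_latency(tr: StreamTrace) -> float:
    return tr.audio_chunk_ts[0] - tr.request_t


def mean_inter_token_latency(tr: StreamTrace) -> float:
    """Average gap between consecutive text tokens (steady-state latency)."""
    ts = tr.text_token_ts
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return sum(gaps) / len(gaps)


def real_time_factor(tr: StreamTrace) -> float:
    """Generation time per second of audio; below 1.0 keeps up with playback."""
    return tr.generation_seconds / tr.audio_seconds_out
```

Keeping these as separate functions reflects the point of the paragraph: a system can have a good real-time factor and still feel slow if the first packet arrives late.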

Method Adaptation

Qwen3.5-Omni adapts the architecture to long streaming media. AuT compresses audio to low-rate tokens. Video and audio-video patches receive explicit timestamp strings rather than relying only on sparse absolute temporal IDs, and audio receives random timestamps during training. This slightly lengthens context, but gives the model a more natural timecode representation. Chunk-wise streaming input lets encoders output temporal chunks and reduces time-to-first-token.
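The idea of explicit timestamp strings can be sketched as follows. This is an illustration under stated assumptions, not the paper's format: the `<HH:MM:SS>` timecode, the `<video_patch>` placeholder, and both function names are hypothetical, and the random audio timestamps used in training are not modeled.

```python
from typing import List


def format_timecode(seconds: float) -> str:
    """Render a time offset as a textual timecode (hypothetical format)."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"<{hours:02d}:{minutes:02d}:{secs:02d}>"


def interleave_timestamps(num_frames: int, fps: float = 1.0,
                          placeholder: str = "<video_patch>") -> List[str]:
    """Place a timecode string before each frame's patch placeholder."""
    sequence: List[str] = []
    for i in range(num_frames):
        sequence.append(format_timecode(i / fps))
        sequence.append(placeholder)
    return sequence
```

In this sketch a 400-frame clip at 1 FPS gains 400 extra timecode entries, which is the context-length cost the paragraph refers to.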

The Hybrid MoE backbone includes a Gated Delta Net module that the paper identifies as useful for accelerating long audio-video sequence modeling and reducing KV-cache I/O. Talker uses RVQ codec tokens, a lightweight multi-token prediction module for residual codebooks, and a causal streaming ConvNet Code2Wav renderer. ARIA converts the previous dual-track text/speech generation pattern into a single interleaved stream with an adaptive speech-to-text rate constraint. Its technical value is alignment as well as latency: the constraint reduces skipped words, pronunciation errors, and prosody instability while making token scheduling more incremental.
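The difference between dual-track generation and a single interleaved stream can be made concrete with a toy scheduler. The sketch below is not ARIA: it is a greedy interleaver in which a fixed bound, `max_speech_per_text`, stands in for the adaptive speech-to-text rate constraint, and the function and parameter names are hypothetical. Speech tokens may be emitted only while their count stays within the bound relative to the text tokens already emitted, so speech cannot run ahead of the text it is supposed to voice.

```python
from typing import List, Sequence, Tuple


def interleave(text_tokens: Sequence[str],
               speech_tokens: Sequence[int],
               max_speech_per_text: float = 4.0) -> List[Tuple[str, object]]:
    """Merge two token sequences into one stream under a ratio bound."""
    stream: List[Tuple[str, object]] = []
    t = 0  # text tokens emitted so far
    s = 0  # speech tokens emitted so far
    while t < len(text_tokens) or s < len(speech_tokens):
        text_left = t < len(text_tokens)
        speech_left = s < len(speech_tokens)
        # Speech is allowed only if one more token keeps it within the bound.
        within_bound = (s + 1) <= max_speech_per_text * t
        if speech_left and (within_bound or not text_left):
            stream.append(("speech", speech_tokens[s]))
            s += 1
        else:
            stream.append(("text", text_tokens[t]))
            t += 1
    return stream
```

Once the text is exhausted the remaining speech tokens are flushed, which keeps the toy scheduler from stalling; a real system would decide this from the model's own alignment rather than a fixed ratio.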

Post-training is also modality-specific. Thinker uses specialist distillation from domain teachers, then on-policy distillation to transfer stronger text-conditioned responses to audio-conditioned queries, then interaction-aligned RL for code-switching, persona consistency, and long-context instruction following. Talker uses general speech pretraining, long-context continual pretraining to 64K, DPO plus rule-based rewards and GSPO, and speaker fine-tuning for controllable voices. For agentic use, the model supports autonomous WebSearch, complex FunctionCall, and Audio-Visual Vibe Coding, in which executable code is generated from audio-visual instructions without external orchestration.

Evidence

The paper reports concrete latency numbers. Table 1 gives first-packet latency for audio input of 435 ms for Plus and 235 ms for Flash, and for video input of 651 ms for Plus and 426 ms for Flash. The concurrency table shows latency rising with load while staying under two seconds: at concurrency 1, Flash reports 235/426 ms overall latency for audio/video input and Plus reports 435/651 ms; at concurrency 8, Flash reports 352/1625 ms and Plus 955/1980 ms. The authors caution that Plus and Flash use different resource allocation and parallelization strategies, so these are deployment measurements rather than a direct model-size contest. Generation RTF remains below 0.34 in the reported settings.
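How much latency grows between concurrency 1 and concurrency 8 can be read directly off the reported figures. The sketch below only divides the numbers quoted above; the dictionary layout and the function name are hypothetical conveniences.

```python
# Overall latency in milliseconds, keyed by (model, input modality),
# then by concurrency level. Values are the ones quoted in the text above.
LATENCY_MS = {
    ("Flash", "audio"): {1: 235, 8: 352},
    ("Flash", "video"): {1: 426, 8: 1625},
    ("Plus", "audio"): {1: 435, 8: 955},
    ("Plus", "video"): {1: 651, 8: 1980},
}


def slowdown(model: str, modality: str) -> float:
    """Ratio of latency at concurrency 8 to latency at concurrency 1."""
    row = LATENCY_MS[(model, modality)]
    return row[8] / row[1]


def worst_case_ms() -> int:
    """Largest latency anywhere in the table."""
    return max(v for row in LATENCY_MS.values() for v in row.values())
```

The ratios come out at roughly 1.5x and 3.8x for Flash audio and video, and roughly 2.2x and 3.0x for Plus audio and video, with a worst case of 1980 ms. Video input is therefore the more load-sensitive path in both configurations.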

Capability evidence spans modalities. Qwen3.5-Omni-Plus stays near the text-only Qwen3.5-Plus-Instruct on text tasks: 85.9 on MMLU-Pro, 83.9 on GPQA, 59.6 on LongBench v2, 65.6 on LiveCodeBench v6, and 81.0 on TAU2Bench. On audio, it beats Gemini-3.1 Pro on MMAU, MMSU, RUL-MuchoMusic, VoiceBench, many ASR tests, and speech-to-text translation. On video and visual tasks, Plus scores 81.9 on VideoMME, 86.8 on MLVU, 79.0 on MVBench, 71.2 on LVBench, and 77.0 on MME-VideoOCR. For audio-visual tasks, it reaches 84.6 on DailyOmni, 68.5 on Qualcomm IVD, 64.8 on Omni-Cloze, and 57.2 on OmniGAIA. Zero-shot SEED-TTS WER is 0.99 for Chinese and 1.26 for English.

Historical Effect

Qwen3.5-Omni is a 2026 example of generative media compute becoming an interactive agent runtime. Earlier media systems often separated perception, speech recognition, text reasoning, TTS, and tool use. This report treats them as one streaming compute graph with latency targets, codec-token scheduling, temporal alignment, long-context memory, and interaction quality. Its methodological contribution is unifying temporal grounding, speech-token alignment, and tool-capable reasoning in one model. Historically, the "media" bottleneck is no longer just high-resolution generation or diffusion sampling; it also covers real-time latency, temporal alignment, and long-context memory.

Limits

The main limitation for this card is hardware opacity: the paper gives large-scale data and latency results but not the exact training accelerator. Some latency figures are theoretical internal-serving measurements, and the authors explicitly warn that Plus and Flash use different deployment-time resource allocation and parallelization strategies. The timestamp approach improves long temporal perception but increases context length. Speech output covers 36 languages and dialects, fewer than the 113 speech-input varieties. The post-training section reveals remaining failure modes that required extra stages: audio-conditioned answers lagged text-conditioned answers, and multi-turn conversations showed code-switching, persona inconsistency, and degraded instruction following over long contexts. On OmniGAIA, Qwen3.5-Omni-Plus scores 57.2 while Gemini-3.1 Pro is listed at 68.9, so native omni tool use is strong but not uniformly frontier-leading.