Qwen3-235B-A22B

Total Parameters

235B

Context Length

131K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

29 Apr 2025

Knowledge Cutoff

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

64

Key-Value Heads

Attention Head Dimension

128

Position Embedding

RoPE

RoPE Theta

1,000,000

Sliding Window Attention

Sliding Window Size

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

4,096

Number of Layers

94

FFN Intermediate Size (Dense)

1,536

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

151,936

Mixture of Experts

Active Parameters

22.0B

Number of Experts

128

Active Experts

Shared Experts

FFN Intermediate Size (per Expert)

1,536

Dense Layers Before MoE

Qwen3-235B-A22B

Qwen3-235B-A22B is the flagship Mixture-of-Experts (MoE) large language model in Alibaba Cloud's Qwen3 series. It targets tasks that demand advanced reasoning and broad knowledge, such as code generation, mathematical problem-solving, and multi-step logical deduction. It is also designed for applications that process long documents, sustain multi-turn conversations, or analyze enterprise-scale datasets.

Qwen3-235B-A22B uses a unified framework that integrates a 'thinking mode' and a 'non-thinking mode'. In thinking mode the model works through complex, multi-step problems and exposes its intermediate reasoning; in non-thinking mode it returns rapid, direct responses. Switching between the two by task complexity or user request lets the model spend inference compute where it is needed.

The MoE layers are sparsely activated: a router sends each input token to its 8 highest-scoring experts out of 128. As a result, only about 22 billion of the model's 235 billion parameters are used for any given token. The model was pre-trained on approximately 36 trillion tokens covering 119 languages and dialects. Other architectural components include Grouped-Query Attention (GQA), which shares key-value heads across groups of query heads to reduce memory use; Rotary Positional Embedding (RoPE) for position encoding; and Flash Attention for faster attention computation. Normalization uses pre-norm RMSNorm, and the activation function is SwiGLU.
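The sparse routing described above can be illustrated with a minimal pure-Python sketch. It is not the model's actual implementation: the function names, the toy experts, and the toy sizes (8 experts, 2 active) are hypothetical, and renormalizing the selected weights so they sum to 1 is an assumption of this sketch.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_token(router_logits, top_k):
    """Pick the top_k highest-scoring experts for one token.

    Returns (expert_index, weight) pairs. The weights are renormalized
    over the selected experts (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_forward(x, router_logits, experts, top_k):
    """Weighted sum of the selected experts' outputs for one token vector x.

    Only top_k experts are evaluated; the rest are skipped entirely,
    which is where the compute saving comes from.
    """
    out = [0.0] * len(x)
    for idx, weight in route_token(router_logits, top_k):
        y = experts[idx](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out

# Toy setup (hypothetical sizes, not the model's): expert i scales its input by i.
NUM_EXPERTS = 8
TOP_K = 2
toy_experts = [lambda x, s=i: [s * v for v in x] for i in range(NUM_EXPERTS)]

# Experts 3 and 5 receive the highest router scores for this token.
toy_logits = [0.0, 0.0, 0.0, 5.0, 0.0, 5.0, 0.0, 0.0]
selected = route_token(toy_logits, TOP_K)
output = moe_forward([1.0, 2.0], toy_logits, toy_experts, TOP_K)
```

In this toy run the two selected experts receive equal weight, so the output is the average of their outputs. A real router produces its logits from a learned linear layer applied to the token's hidden state.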

Qwen3-235B-A22B performs well at instruction following, logical reasoning, and text understanding, and across mathematics, science, and coding tasks. Because the MoE architecture activates only a fraction of the parameters for each token, it needs less compute per inference step than a dense model of the same total size, which reduces energy use and operating cost. The 131K-token context length helps the model stay coherent and retrieve relevant information over long sequences. The weights are publicly available under the Apache 2.0 license, which permits broad adoption and further research. They can be deployed across a range of frameworks and platforms, including local tools such as Ollama, LM Studio, and llama.cpp.
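To put the efficiency claim in numbers, the following rough sketch computes the share of parameters used per token and a lower bound on weight memory at several precisions. The parameter counts come from this page; the bytes-per-parameter values are idealized assumptions (real quantization formats add overhead for scales and metadata), and the estimate ignores the KV cache and activations. Note that although only about 22B parameters are used per token, all 235B must still be held in memory, because any expert may be selected.

```python
# Parameter counts as stated on this page.
TOTAL_PARAMS = 235e9
ACTIVE_PARAMS = 22e9

# Idealized storage cost per parameter (assumption; real formats add overhead).
BYTES_PER_PARAM = {
    "fp16/bf16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def active_fraction(total_params, active_params):
    """Share of parameters that participate in processing one token."""
    return active_params / total_params

def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to hold the weights alone, in GiB (2**30 bytes)."""
    return num_params * bytes_per_param / 2**30

fraction = active_fraction(TOTAL_PARAMS, ACTIVE_PARAMS)
print(f"Active per token: {fraction:.1%} of all parameters")

for name, width in BYTES_PER_PARAM.items():
    gib = weight_memory_gib(TOTAL_PARAMS, width)
    print(f"{name:>10}: about {gib:,.0f} GiB for weights only")
```

The per-token compute scales with the active 22B parameters, while the memory footprint scales with the full 235B, which is why quantization matters for local deployment.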

About Qwen 3

The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) models ranging from 0.6B to 235B parameters. Key features include a hybrid reasoning system with 'thinking' and 'non-thinking' modes for adaptive processing, and support for long context windows.
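When a response is generated in thinking mode, an application usually needs to separate the intermediate reasoning from the final answer. The sketch below assumes the reasoning is wrapped in `<think>...</think>` tags in the generated text; this page does not state the output format, so treat the tag name and the helper `split_thinking` as assumptions for illustration.

```python
import re

# Assumed delimiter for the reasoning block (not stated on this page).
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Split generated text into (reasoning, answer).

    If no reasoning block is found, as in non-thinking mode,
    the reasoning part is an empty string.
    """
    match = THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the reasoning block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

thinking_output = "<think>\n2 + 2 equals 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(thinking_output)
print("Reasoning:", reasoning)
print("Answer:", answer)
```

Keeping the two parts separate lets an interface show or hide the reasoning without altering the answer.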

Evaluation Benchmarks

Rank

#98

Category	Benchmark	Score	Rank
General Knowledge	MMLU	0.878	7
Coding	Aider Coding	0.60	15
Professional Knowledge	MMLU Pro	0.84	22
Web Development	WebDev Arena	1422	28
Graduate-Level QA	GPQA	0.775	32
Coding	LiveBench Coding	0.70	38
Reasoning	LiveBench Reasoning	0.58	41
Mathematics	LiveBench Mathematics	0.68	41
General Text	Text Arena	1423	43
Agentic Coding	LiveBench Agentic	0.13	52
Data Analysis	LiveBench Data Analysis	0.45	53

Rankings

Overall Rank

#98

Coding Rank

#55

Model Integrity

Total Score

B+

73 / 100

URL: https://apxml.com/models/qwen3-235b-a22b
