URL: https://apxml.com/models/qwen3-coder-480b-a35b


Qwen3 Coder 480B A35B

Total Parameters

480B

Context Length

262K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

Apache 2.0

Release Date

22 Jul 2025

Knowledge Cutoff

Dec 2024

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention (GQA)

Attention Heads

96

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

10,000,000

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

6,144

Number of Layers

62

FFN Intermediate Size (Dense)

2,560

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

151,936

Mixture of Experts

Active Parameters per Token

35.0B

Number of Experts

160

Active Experts

8

Shared Experts

-

FFN Intermediate Size (per Expert)

2,560

Dense Layers Before MoE

-
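
The expert counts above (160 experts, 8 active per token) describe top-k routing. The following is a minimal sketch of how such a router could pick experts for one token. It assumes a softmax over all router logits followed by renormalization of the top 8 probabilities, which is a common MoE scheme and not confirmed by this page; `route_token` and the random logits are hypothetical stand-ins for a learned router.

```python
import math
import random

NUM_EXPERTS = 160  # "Number of Experts" from the table above
TOP_K = 8          # "Active Experts" from the table above


def route_token(router_logits, top_k=TOP_K):
    """Return (expert_index, weight) pairs for one token.

    Assumed scheme: softmax over all experts, keep the top_k
    probabilities, then renormalize them so they sum to 1.
    """
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the top_k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


rng = random.Random(0)
# Random logits stand in for the output of a learned router layer.
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
assignment = route_token(logits)

print(len(assignment), "experts selected out of", NUM_EXPERTS)
print("fraction of experts used per token:", TOP_K / NUM_EXPERTS)
```

Only the selected experts run their feed-forward block for that token, which is why the active parameter count is a small fraction of the total.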

Qwen3 Coder 480B A35B

Qwen3 Coder 480B A35B is Alibaba's agentic coding model, built for software development and autonomous coding workflows. A specialized variant of the Qwen 3 family, it targets multi-turn programming tasks such as repository analysis, cross-file reasoning, and automated pull request generation. It is intended to drive developer tools and terminal-based agents such as Qwen Code.

Architecturally, the model is a sparse Mixture-of-Experts (MoE) decoder-only transformer. It has 480 billion parameters in total but activates only about 35 billion per token. The configuration uses 160 experts, of which a gating mechanism selects 8 for each token. The network has 62 transformer layers and uses Grouped Query Attention (GQA) with 96 query heads and 8 key-value heads, which shrinks the key-value cache and reduces memory bandwidth during inference. Positions are encoded with Rotary Position Embeddings (RoPE). The native context window is 262,144 tokens and can be extended to about one million tokens with extrapolation techniques such as YaRN.
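
To make the effect of GQA concrete, here is a rough key-value cache estimate built from the figures above (62 layers, 8 key-value heads, head dimension 128, 262,144 tokens). It is a back-of-the-envelope sketch for a single sequence: the 2-byte element size assumes a BF16 or FP16 cache, and `kv_cache_bytes` is a hypothetical helper, not part of any library.

```python
# Values taken from the specification above; BYTES_PER_ELEMENT is an
# assumption (2 bytes = BF16/FP16 cache entries).
NUM_LAYERS = 62
NUM_QUERY_HEADS = 96
NUM_KV_HEADS = 8
HEAD_DIM = 128
NATIVE_CONTEXT = 262_144
BYTES_PER_ELEMENT = 2
GIB = 1024 ** 3


def kv_cache_bytes(tokens, kv_heads, layers=NUM_LAYERS,
                   head_dim=HEAD_DIM, elem_bytes=BYTES_PER_ELEMENT):
    # Factor 2: each layer stores one key and one value vector per KV head.
    return 2 * layers * kv_heads * head_dim * elem_bytes * tokens


gqa = kv_cache_bytes(NATIVE_CONTEXT, NUM_KV_HEADS)
# Hypothetical comparison: one KV head per query head (plain multi-head attention).
mha = kv_cache_bytes(NATIVE_CONTEXT, NUM_QUERY_HEADS)

print(f"GQA cache at full context: {gqa / GIB:.0f} GiB")  # 62 GiB
print(f"MHA cache at full context: {mha / GIB:.0f} GiB")  # 744 GiB
print(f"reduction factor: {mha // gqa}x")                 # 12x
```

Under these assumptions, sharing 8 key-value heads across 96 query heads cuts the cache by a factor of 12. Weights, activations, and any cache quantization are outside this estimate.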

The model is trained on 7.5 trillion tokens, about 70% of which are source code and technical content in languages including Python, JavaScript, C++, and Rust. Post-training uses long-horizon reinforcement learning, specifically Agent RL and Code RL, to improve multi-step planning and interaction with external tools such as browsers and CLI environments. This specialization is meant to let the model act as a coding agent that carries out multi-step engineering tasks across whole codebases.

About Qwen 3

The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key features include a hybrid reasoning system with 'thinking' and 'non-thinking' modes for adaptive processing, and support for long context windows.


Evaluation Benchmarks

Category          Benchmark      Score   Rank
General Text      Text Arena     1388    60
Web Development   WebDev Arena   1282    83

Rankings

Overall Rank

#91

Coding Rank

#92

Model Integrity

Total Score

B

68 / 100
