Total Parameters
80B
Active Parameters
3B
Context Length
262K
Modality
Reasoning
Architecture
Mixture of Experts (MoE)
License
Apache-2.0
Release Date
1 Feb 2026
Knowledge Cutoff
Jun 2025
Attention
Attention Structure
Grouped-Query Attention
Attention Heads
16
Key-Value Heads
2
Attention Head Dimension
256
Position Embedding
Rotary Position Embedding (RoPE)
RoPE Theta
10,000,000
Sliding Window Attention
No
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
2,048
Number of Layers
48
FFN Intermediate Size (Dense)
512
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
151,936
Mixture of Experts
Total Expert Parameters
79.0B
Number of Experts
512
Active Experts
10
Shared Experts
-
FFN Intermediate Size (per Expert)
512
Dense Layers Before MoE
-
Qwen3-Next-80B-A3B is a sparse Mixture-of-Experts (MoE) foundation model developed by Alibaba's Qwen team. It belongs to the Qwen3-Next series, which targets long-context sequence modeling and parameter efficiency at large scale. The model uses a hybrid attention mechanism that combines Gated DeltaNet with Gated Attention, which lets it maintain quality over long token sequences while reducing the quadratic attention cost of standard Transformer architectures.
The architecture is a high-sparsity MoE layout with 48 layers and a hidden dimension of 2,048. The model contains 80 billion total parameters, but its gating mechanism routes each token to only 10 of the 512 experts, so approximately 3 billion parameters are active per token during inference. This sparse activation, combined with a multi-token prediction (MTP) objective, improves throughput and reduces FLOPs per token. The model also incorporates stability-focused refinements, such as zero-centered and weight-decayed layer normalization, to support robust convergence during both pre-training on 15 trillion tokens and the subsequent reinforcement learning stages.
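The relationship between total and active expert weights can be checked with a little arithmetic on the spec-sheet values. The sketch below is only an estimate under stated assumptions: each expert is taken to be a SwiGLU feed-forward block with three bias-free projection matrices (gate, up, down), every one of the 48 layers is taken to be an MoE layer, and router, attention, embedding, and any shared-expert weights are ignored. The function names are illustrative and do not come from any Qwen code.

```python
# Spec-sheet values from this page.
HIDDEN_SIZE = 2048
EXPERT_FFN_SIZE = 512
NUM_LAYERS = 48
NUM_EXPERTS = 512
ACTIVE_EXPERTS = 10


def expert_params(hidden: int, intermediate: int) -> int:
    """Weights in one expert, ASSUMING a bias-free SwiGLU block
    with gate, up, and down projections (3 matrices)."""
    return 3 * hidden * intermediate


def moe_expert_totals(hidden, intermediate, layers, experts, active):
    """Return (per-expert, all-experts, routed-per-token) weight counts.
    ASSUMES every layer is an MoE layer and ignores router weights."""
    per_expert = expert_params(hidden, intermediate)
    total = per_expert * experts * layers
    routed = per_expert * active * layers
    return per_expert, total, routed


per_expert, total, routed = moe_expert_totals(
    HIDDEN_SIZE, EXPERT_FFN_SIZE, NUM_LAYERS, NUM_EXPERTS, ACTIVE_EXPERTS
)
print(f"per expert:       {per_expert:,}")
print(f"all experts:      {total / 1e9:.1f}B")
print(f"routed per token: {routed / 1e9:.2f}B")
```

Under these assumptions the experts hold roughly 77.3B weights, close to but not identical to the 79.0B listed in the spec sheet, so the listed figure presumably counts weights this sketch leaves out. The routed experts contribute about 1.5B weights per token; the rest of the roughly 3B active parameters would come from components the estimate ignores.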
Optimized for complex reasoning and agentic workflows, Qwen3-Next-80B-A3B is capable of processing a native context window of 262,144 tokens, which can be extended to over 1 million tokens using specialized scaling techniques like YaRN. Its primary use cases include multi-step logical analysis, mathematical proofs, and code synthesis. By separating the 'Thinking' variant, which outputs structured reasoning traces, from the standard 'Instruct' variant, the model provides specialized paths for either high-efficiency general-purpose interaction or intensive, transparent problem-solving tasks.
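To get a feel for what a 262,144-token window costs in memory, the key-value cache can be estimated from the attention values in the spec sheet. This is a rough sketch, not a measurement: it assumes every one of the 48 layers keeps a conventional key-value cache and that values are stored in 16 bits. Because the model mixes Gated DeltaNet layers with attention layers, the real footprint is likely lower, so the result is best read as an upper bound. The function name `kv_cache_bytes` is hypothetical.

```python
# Spec-sheet values from this page.
NUM_LAYERS = 48
NUM_KV_HEADS = 2
HEAD_DIM = 256
NATIVE_CONTEXT = 262_144


def kv_cache_bytes(tokens: int,
                   layers: int = NUM_LAYERS,
                   kv_heads: int = NUM_KV_HEADS,
                   head_dim: int = HEAD_DIM,
                   bytes_per_value: int = 2) -> int:
    """Upper-bound KV cache size in bytes for one sequence.
    ASSUMES all layers cache keys and values, stored in 16 bits."""
    # Leading 2 = one key tensor plus one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


per_token = kv_cache_bytes(1)
full_window = kv_cache_bytes(NATIVE_CONTEXT)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{full_window / 1024**3:.1f} GiB at {NATIVE_CONTEXT:,} tokens")
```

With only 2 key-value heads the per-token cost stays at 96 KiB under these assumptions, which works out to 24 GiB for one sequence at the full native window.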
The Alibaba Qwen 3 model family comprises dense and Mixture-of-Experts (MoE) architectures, with parameter counts from 0.6B to 235B. Key innovations include a hybrid reasoning system, offering 'thinking' and 'non-thinking' modes for adaptive processing, and support for extensive context windows, enhancing efficiency and scalability.
Rank
#132
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Mathematics | LiveBench Mathematics | 0.74 | 31 |
| Graduate-Level QA | GPQA | 0.772 | 33 |
| Web Development | WebDev Arena | 1402 | 35 |
| Data Analysis | LiveBench Data Analysis | 0.50 | 36 |
| Coding | LiveBench Coding | 0.68 | 41 |
| Reasoning | LiveBench Reasoning | 0.58 | 42 |
| General Text | Text Arena | 1402 | 51 |
| Agentic Coding | LiveBench Agentic | 0.10 | 53 |
| Professional Knowledge | MMLU Pro | 0.83 | 56 |
Overall Rank
#132
Coding Rank
#77
Total Score
B+
72 / 100
©2025 ApX Machine Learning