| Specification | Value |
|---|---|
| Total Parameters | 106B |
| Context Length | 128K |
| Modality | Multimodal |
| Architecture | Mixture of Experts (MoE) |
| License | MIT License |
| Release Date | 28 Jul 2025 |
| Knowledge Cutoff | Mar 2025 |

Attention

| Specification | Value |
|---|---|
| Attention Structure | Grouped-Query Attention |
| Attention Heads | 96 |
| Key-Value Heads | 8 |
| Attention Head Dimension | 128 |
| Position Embedding | Rotary Position Embedding (RoPE) |
| RoPE Theta | 1,000,000 |
| Sliding Window Attention | No |
| Sliding Window Size | - |
| Normalization | RMS Normalization |
| Activation Function | Swish |

Dimensions

| Specification | Value |
|---|---|
| Hidden Dimension Size | 4,096 |
| Number of Layers | 46 |
| FFN Intermediate Size (Dense) | 1,408 |
| Multi-Token Prediction Heads | 1 |

Tokenizer

| Specification | Value |
|---|---|
| Vocabulary Size | 151,552 |

Mixture of Experts

| Specification | Value |
|---|---|
| Active Parameters per Forward Pass | 12.0B |
| Number of Experts | 129 (128 routed + 1 shared) |
| Active Experts | 9 |
| Shared Experts | 1 |
| FFN Intermediate Size (per Expert) | 1,408 |
| Dense Layers Before MoE | 1 |
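As a rough sanity check on the figures above, the expert parameter count can be estimated from the listed dimensions. The sketch below assumes each expert is a gated feed-forward block with three bias-free weight matrices (gate, up, down); that layout is an assumption, not something the specification states. It also ignores attention, embeddings, the dense layer, and the Multi-Token Prediction head, so it covers only the expert weights.

```python
# Values copied from the specification listing above.
HIDDEN = 4096            # hidden dimension size
EXPERT_FFN = 1408        # FFN intermediate size per expert
NUM_LAYERS = 46
DENSE_LAYERS = 1         # dense layers before the MoE layers start
ROUTED_EXPERTS = 128
ACTIVE_EXPERTS = 9       # active experts per token, as listed


def expert_params(hidden: int, intermediate: int) -> int:
    """Weights in one expert, ASSUMING a gated FFN: gate, up, and down matrices, no biases."""
    return 3 * hidden * intermediate


moe_layers = NUM_LAYERS - DENSE_LAYERS
per_expert = expert_params(HIDDEN, EXPERT_FFN)            # 17,301,504
routed_total = per_expert * ROUTED_EXPERTS * moe_layers   # all routed expert weights
active_total = per_expert * ACTIVE_EXPERTS * moe_layers   # expert weights used per token

print(f"per expert:           {per_expert:,}")
print(f"routed experts total: {routed_total / 1e9:.1f}B")
print(f"active expert share:  {active_total / 1e9:.1f}B")
```

Under these assumptions the routed experts account for most of the 106B total, while the expert weights touched per token are a small fraction of that. The remainder of the roughly 12B active parameters would come from attention, the dense layer, and other components not counted here.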
GLM-4.5-Air is a large language model developed by Z.ai as the efficiency-focused member of the GLM-4.5 series. It is designed to bridge the gap between massive-scale foundation models and the practical constraints of on-device or mid-range cloud deployments. The model is optimized primarily for agent-oriented workflows and prioritizes reasoning, complex instruction following, and code generation. It can act as the engine for autonomous agents that plan over multiple steps and invoke tools, which suits it to developers building digital assistants and automated software engineering pipelines.
Architecturally, the model uses a sparse Mixture-of-Experts (MoE) design with 106 billion total parameters, of which only about 12 billion are active per forward pass. Each MoE layer contains 128 routed experts plus one shared expert, and 9 experts are active per token, so the model keeps a large representational capacity while computing with a small fraction of its weights. The model also includes a Multi-Token Prediction (MTP) layer, which predicts tokens beyond the immediate next one. This layer can serve as a built-in draft model for speculative decoding, which raises inference throughput and improves responsiveness in real-time interactive applications.
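The routing idea described above can be illustrated with a minimal pure-Python sketch. It is not the GLM-4.5-Air implementation: the softmax over the selected scores, the toy "scaler" experts, and the helper names `top_k_route`, `moe_forward`, and `make_scaler` are all assumptions made for illustration. It also assumes the 9 active experts consist of 8 routed experts plus the always-on shared expert.

```python
import math
import random


def top_k_route(scores, k):
    """Pick the k highest-scoring experts; weights are a softmax over the selected scores (an assumption)."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - peak) for i in top]  # subtract peak for numerical stability
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, router_scores, routed_experts, shared_expert, k):
    """Shared expert output plus the weighted sum of the k selected routed experts."""
    chosen, weights = top_k_route(router_scores, k)
    out = shared_expert(x)  # the shared expert runs for every token
    for idx, w in zip(chosen, weights):
        y = routed_experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, chosen


def make_scaler(scale):
    """Hypothetical stand-in for an expert FFN: multiplies the input by a constant."""
    return lambda x: [scale * v for v in x]


# Toy run: 128 routed experts, 8 selected per token, plus 1 shared expert.
rng = random.Random(0)
routed = [make_scaler(float(i)) for i in range(128)]
shared = make_scaler(1.0)
scores = [rng.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for router logits
output, chosen = moe_forward([1.0, 2.0], scores, routed, shared, k=8)
print("selected routed experts:", sorted(chosen))
```

Only the selected experts are evaluated for a given token, which is why compute scales with the number of active experts rather than with the total number of experts.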
GLM-4.5-Air uses Grouped-Query Attention (GQA), in which 96 query heads share 8 key-value heads. This shrinks the key-value cache and reduces the memory bandwidth needed during long-context processing. The model supports a 128,000-token context window using Rotary Positional Embeddings (RoPE) and offers hybrid reasoning with two modes: a thinking mode, which works through a chain of thought before answering and is intended for analytical problem-solving, and a standard mode for immediate output. Native support for function calling, web browsing, and code execution lets the model interact with external tools and environments.
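To make the GQA memory saving concrete, the key-value cache size can be estimated from the numbers quoted on this page: 46 layers, 8 key-value heads, a head dimension of 128, and a 128,000-token context. The sketch assumes a 16-bit cache (2 bytes per value), a single sequence, and no cache quantization or paging overhead. The helper name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value=2):
    """Size of the key-value cache for one sequence; the leading 2 counts keys and values."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value


LAYERS = 46
QUERY_HEADS = 96
KV_HEADS = 8
HEAD_DIM = 128
TOKENS = 128_000  # full context window

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, TOKENS)
# Hypothetical comparison: the same model if every query head kept its own K/V head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, TOKENS)

print(f"GQA cache at full context: {gqa / 1e9:.1f} GB")
print(f"MHA cache at full context: {mha / 1e9:.1f} GB")
print(f"reduction factor: {mha // gqa}x")
```

With 8 key-value heads instead of 96, the cache is 12 times smaller under these assumptions, which is the main reason GQA helps with long contexts.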
| Benchmark | Score | Rank |
|---|---|---|
| Web Development (WebDev Arena) | 1372 | 49 |
| Professional Knowledge (MMLU Pro) | 0.81 | 58 |
| General Text (Text Arena) | 1373 | 64 |
Overall Rank: #110
Coding Rank: #59
Total Score: B (70 / 100)
©2025 ApX Machine Learning