GLM-4.5-Air

Total Parameters

106B

Active Parameters

12B

Context Length

128K

Modality

Text

Architecture

Mixture of Experts (MoE)

License

MIT License

Release Date

28 Jul 2025

Knowledge Cutoff

Mar 2025

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

96

Key-Value Heads

8

Attention Head Dimension

128

Position Embedding

Rotary Position Embedding (RoPE)

RoPE Theta

1,000,000

Sliding Window Attention

Sliding Window Size

Normalization

RMS Normalization

Activation Function

Swish

Dimensions

Hidden Dimension Size

4,096

Number of Layers

FFN Intermediate Size (Dense)

1,408

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

151,552

Mixture of Experts

Total Expert Parameters

12.0B

Number of Experts

129 (128 routed + 1 shared)

Active Experts

9 per token

Shared Experts

1

FFN Intermediate Size (per Expert)

1,408

Dense Layers Before MoE

GLM-4.5-Air

GLM-4.5-Air is an efficiency-focused large language model developed by Z.ai as part of the GLM-4.5 series. It is designed to bridge the gap between massive-scale foundation models and the practical constraints of on-device or mid-range cloud deployments. Optimized primarily for agent-oriented workflows, the model prioritizes reasoning, complex instruction following, and code generation. It can serve as the engine for autonomous agents that perform multi-step planning and tool invocation, which makes it a practical choice for developers building digital assistants and automated software engineering pipelines.

Architecturally, the model utilizes a sparse Mixture-of-Experts (MoE) framework, featuring 106 billion total parameters with only 12 billion active per forward pass. This design incorporates 128 routed experts and a specialized shared expert layer, activating 9 experts per token to maintain representational capacity while significantly reducing computational overhead. The transformer block is further enhanced by a Multi-Token Prediction (MTP) layer, which allows the model to predict several future tokens simultaneously. This implementation facilitates speculative decoding, which increases inference throughput and provides a responsive experience for real-time interactive applications.
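The routing described above can be illustrated with a minimal sketch. It assumes that the 9 active experts consist of 8 routed experts chosen by top-k selection plus 1 always-on shared expert, and that gate weights come from a softmax over the selected router logits; neither detail is stated on this page, and the model's actual gating function may differ. The names `route` and `moe_layer` are hypothetical, and scalars stand in for hidden-state vectors.

```python
import math
import random

NUM_ROUTED = 128  # routed experts, as stated in the description
TOP_K = 8         # assumption: 9 active experts = 8 routed + 1 shared


def route(router_logits, top_k=TOP_K):
    """Pick the top_k routed experts and return (index, weight) pairs.

    Weights are a softmax over the selected logits only (an assumption).
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)  # subtract for numerical stability
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_layer(x, router_logits, routed_experts, shared_expert):
    """Combine the shared expert with the weighted top-k routed experts."""
    out = shared_expert(x)  # the shared expert runs for every token
    for idx, weight in route(router_logits):
        out += weight * routed_experts[idx](x)
    return out


# Toy usage: each "expert" is just a scalar multiplier.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
experts = [(lambda x, s=i / NUM_ROUTED: s * x) for i in range(NUM_ROUTED)]
y = moe_layer(1.0, logits, experts, shared_expert=lambda x: x)
```

Only the selected experts are evaluated, which is why the per-token compute tracks the active parameter count rather than the total.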

Technical innovations in GLM-4.5-Air include the adoption of Grouped-Query Attention (GQA) with 96 attention heads and 8 key-value groups, reducing memory bandwidth requirements during long-context processing. The model supports a 128,000-token context window using Rotary Positional Embeddings (RoPE) and features a hybrid reasoning system. This system allows for a deliberate thinking mode, which executes a latent chain-of-thought process for analytical problem-solving, and a standard mode for immediate output. Native integration for function calling, web browsing, and code execution ensures the model can interact with external environments with high reliability.
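To make the memory effect of GQA concrete, the following sketch estimates key-value cache size from the figures quoted above (96 query heads, 8 key-value heads, head dimension 128). It assumes 16-bit cache values and treats the layer count as a caller-supplied parameter, since this page does not state it; `kv_cache_bytes` and `queries_per_kv_head` are hypothetical helper names.

```python
NUM_Q_HEADS = 96    # query heads, from the description
NUM_KV_HEADS = 8    # key-value groups, from the description
HEAD_DIM = 128      # attention head dimension, from the specifications


def queries_per_kv_head(num_q_heads, num_kv_heads):
    """Number of query heads that share each key-value head."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly across KV heads")
    return num_q_heads // num_kv_heads


def kv_cache_bytes(num_kv_heads, head_dim, context_tokens,
                   num_layers=1, bytes_per_value=2):
    """Bytes needed to cache keys and values.

    The factor 2 covers one K and one V tensor per layer;
    bytes_per_value=2 assumes a 16-bit cache (an assumption).
    """
    return (2 * num_kv_heads * head_dim * bytes_per_value
            * context_tokens * num_layers)


# Per-layer comparison at a 128,000-token context.
gqa = kv_cache_bytes(NUM_KV_HEADS, HEAD_DIM, context_tokens=128_000)
mha = kv_cache_bytes(NUM_Q_HEADS, HEAD_DIM, context_tokens=128_000)
savings = mha // gqa
```

Under these assumptions each layer caches roughly half a gigabyte at full context with GQA, and a hypothetical variant with one key-value head per query head would need 12 times as much.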

About GLM Family

General Language Models from Z.ai

Evaluation Benchmarks

Category	Benchmark	Score	Rank
Web Development	WebDev Arena	1372	49
Professional Knowledge	MMLU Pro	0.81	58
General Text	Text Arena	1373	64

Rankings

Overall Rank

#110

Coding Rank

#59

Model Integrity

Total Score

70 / 100

URL: https://apxml.com/models/glm-45-air
