GLM-4.6-AWQ - Optimized 4-bit Quantization for Production Deployment

High-performance AWQ quantization of ZHIPU AI's GLM-4.6 (357B MoE) optimized for vLLM inference

👁 License
👁 vLLM Compatible
👁 Quantization
👁 HF Model

📊 Model Overview

This is a professionally quantized 4-bit AWQ version of Z.ai's GLM-4.6 optimized for high-throughput production deployment with vLLM.

Base Model: GLM-4.6 (357B parameters, 160 experts MoE)
Model Size: 176 GB (39 safetensors files)
License: MIT (inherited from base model)
Quantization: AWQ 4-bit with group size 128
Active Parameters: 28.72B per token (8 of 160 experts)
Quantization Framework: llmcompressor 0.8.1.dev0
Optimization: Marlin kernels for NVIDIA GPUs
Context Length: Up to 200K tokens (131K recommended for optimal performance)
Languages: English, Chinese

🚀 Performance Benchmarks

Tested on 4× NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB each, 384GB total VRAM):

Configuration	Throughput	VRAM/GPU	Total VRAM	Use Case
With Expert Parallelism	~60 tok/s	~47GB	~188GB	Recommended: Multi-model deployment
Without Expert Parallelism	~65 tok/s	~95GB	~384GB	Single model, maximum speed

Performance Characteristics

Memory Bandwidth Efficiency: 50.3% (excellent for MoE models)
Theoretical Maximum: 130 tok/s (memory bandwidth bound)
Aggregate Bandwidth: 1.7 TB/s effective (4× RTX PRO 6000 Blackwell Max-Q)
Actual vs Theoretical: Typical for sparse MoE architecture

Why AWQ Over Other Quantizations?

Method	Accuracy	Speed	Disk Size	VRAM	Status
AWQ 4-bit	Best (indistinguishable from BF16)	Fast (Marlin kernels)	176GB	188GB	✅ This model
GPTQ 4-bit	Lower (2× MMLU drop vs AWQ)	Similar	~180GB	~188GB	⚠️ Overfits calibration data
FP8	Higher precision	3.5× slower	~330GB	~330GB	❌ Unoptimized kernels
BF16	Highest	N/A	~714GB	800GB+	❌ Too large for most setups

Research shows: AWQ has ~1 point MMLU drop while GPTQ has ~2 points. AWQ performance is indistinguishable from full BF16 on real-world benchmarks.

💾 VRAM Requirements

Minimum Requirements (Expert Parallelism)

Model Download Size: 176 GB
4× GPUs with 48GB+ VRAM each (192GB total minimum)
Recommended: 4× 80GB GPUs or 4× 96GB GPUs
Memory Type: HBM2e/HBM3/HBM3e for best performance
Disk Space: 180+ GB for model storage

Supported Configurations

Setup	GPUs	VRAM/GPU	Total VRAM	Disk	Performance
Tested	4×RTX PRO 6000 Blackwell Max-Q (96GB)	~47GB	384GB	176GB	~60 tok/s
Optimal	4×H100 (80GB)	~47GB	320GB	176GB	~75-80 tok/s
Budget	4×A100 (80GB)	~47GB	320GB	176GB	~50-55 tok/s
High-Speed	2×H200 NVL	~95GB	192GB	176GB	~100+ tok/s

🛠️ Installation & Usage

Prerequisites

pip install vllm>=0.11.0
# Or install from source for latest features
git clone https://github.com/vllm-project/vllm.git
cd vllm && pip install -e .

Quick Start with vLLM

Recommended Configuration (Expert Parallelism for Multi-Model Deployment):

vllm serve <model_path> \
 --tensor-parallel-size 4 \
 --enable-expert-parallel \
 --tool-call-parser glm45 \
 --reasoning-parser glm45 \
 --enable-auto-tool-choice \
 --served-model-name glm-4.6-awq \
 --max-model-len 131072 \
 --gpu-memory-utilization 0.9 \
 --trust-remote-code \
 --port 8000

Maximum Speed Configuration (Single Model):

vllm serve <model_path> \
 --tensor-parallel-size 4 \
 --tool-call-parser glm45 \
 --reasoning-parser glm45 \
 --enable-auto-tool-choice \
 --served-model-name glm-4.6-awq \
 --max-model-len 131072 \
 --gpu-memory-utilization 0.9 \
 --trust-remote-code \
 --port 8000

Python API Usage

from vllm import LLM, SamplingParams

# Initialize with expert parallelism (saves VRAM)
llm = LLM(
 model="path/to/GLM-4.6-AWQ",
 tensor_parallel_size=4,
 enable_expert_parallel=True,
 max_model_len=131072,
 trust_remote_code=True,
 gpu_memory_utilization=0.9
)

# Disable reasoning overhead for maximum speed
prompts = [
 "Explain quantum computing in simple terms. /nothink",
 "Write a Python function to calculate Fibonacci numbers. /nothink"
]

sampling_params = SamplingParams(
 temperature=0.7,
 top_p=0.95,
 max_tokens=400
)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
 print(output.outputs[0].text)

OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(
 base_url="http://localhost:8000/v1",
 api_key="dummy" # vLLM doesn't require authentication
)

response = client.chat.completions.create(
 model="glm-4.6-awq",
 messages=[
 {"role": "user", "content": "Explain quantum computing /nothink"}
 ],
 max_tokens=400,
 temperature=0.7
)

print(response.choices[0].message.content)

🔧 Quantization Details

Technical Specifications

Method: Activation-Aware Weight Quantization (AWQ)
Precision: 4-bit signed integers
Group Size: 128 (optimal balance of speed/accuracy)
Calibration Dataset: neuralmagic/LLM_compression_calibration (512 samples)
Format: Compressed-tensors with Marlin kernel support
Kernel: MarlinLinearKernel + CompressedTensorsWNA16MarlinMoEMethod

What Was Quantized?

✅ All 92 transformer decoder layers (layers 0-91)
✅ All 160 experts per layer (MoE experts)
✅ Attention projections (Q, K, V, O)
✅ MLP projections (gate, up, down)
❌ LM head (kept at full precision for output quality)
❌ MTP layer 92 (removed - incompatible with 4-bit quantization)

Note on MTP (Multi-Token Prediction): The original GLM-4.6 includes a speculative decoding layer (layer 92) for drafting multiple tokens. This layer has been intentionally removed from this quantization because:

4-bit precision is insufficient for MTP to achieve acceptable draft token acceptance rates (0% acceptance observed)
Adds 1.92GB VRAM without providing speedup benefits
Research shows 8-bit or FP16 precision is required for effective MTP

Quantization Process

This model was quantized using the following configuration:

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from datasets import load_dataset

# Load calibration data from Neural Magic's curated dataset
dataset = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
dataset = dataset.shuffle(seed=42).select(range(512))

# Define ignore patterns and targets
ignore_patterns = [
 "lm_head",
 "model.embed_tokens",
 "re:.*input_layernorm$",
 "re:.*post_attention_layernorm$",
 "model.norm",
 "re:.*q_norm$",
 "re:.*k_norm$",
 "re:.*shared_experts.*",
 "re:.*mlp\\.gate\\.weight$",
 "re:.*mlp\\.gate\\..*bias$",
 "re:model.layers.[0-2]\\.",
]

targets = [
 "re:.*gate_proj.*",
 "re:.*up_proj.*",
 "re:.*down_proj.*",
 "re:.*k_proj.*",
 "re:.*q_proj.*",
 "re:.*v_proj.*",
 "re:.*o_proj.*",
]

# AWQ quantization recipe
recipe = [
 AWQModifier(
 ignore=ignore_patterns,
 config_groups={
 "group_0": {
 "targets": targets,
 "weights": {
 "num_bits": 4,
 "type": "int",
 "symmetric": True,
 "group_size": 128,
 "strategy": "group",
 "dynamic": False,
 },
 "input_activations": None,
 "output_activations": None,
 "format": None,
 }
 },
 )
]

# Apply quantization
oneshot(
 model=model, # Pre-loaded AutoModelForCausalLM
 dataset=dataset,
 recipe=recipe,
 max_seq_length=2048,
 num_calibration_samples=512
)

⚡ Performance Optimization Tips

1. Use `/nothink` for Maximum Speed

GLM-4.6 includes a reasoning mode that adds thinking overhead. Disable it for ~9% speedup:

# Add /nothink to your prompts
prompt = "Your question here /nothink"

2. Enable Expert Parallelism

Distribute experts across GPUs to save VRAM for multi-model serving:

--enable-expert-parallel # Saves ~50GB total VRAM across 4 GPUs

3. Optimize Context Length

Longer context = more KV cache memory:

--max-model-len 131072 # Recommended (vs default 202752)

4. Tune Concurrent Requests

--max-num-seqs 1 # Minimum KV cache (single request at max context)
--max-num-seqs 64 # Higher throughput (multiple concurrent requests)

5. Monitor Memory Bandwidth

This model is memory bandwidth bound. Faster GPUs see proportional speedups:

H100 (3.35 TB/s): ~120 tok/s
H200 NVL (4.8 TB/s): ~165 tok/s
RTX PRO 6000 Blackwell Max-Q (1.75 TB/s): ~60 tok/s

🎯 Use Cases

Recommended Applications

✅ Production Chatbots: Fast, accurate responses with minimal VRAM
✅ Multi-Model Serving: Expert parallelism enables running multiple models
✅ Code Generation: High accuracy maintained vs full precision
✅ Reasoning Tasks: Use default mode (without /nothink)
✅ Long Context: Supports up to 202K tokens

Not Recommended For

❌ Speculative Decoding: MTP layer removed (requires 8-bit+ precision)
❌ Extreme Precision Tasks: Use FP8 or BF16 if accuracy is critical
❌ Single GPU Deployment: Requires 4× GPUs minimum

📈 Accuracy Benchmarks

AWQ quantization maintains excellent quality:

Metric	BF16 Baseline	This AWQ 4-bit	GPTQ 4-bit	Difference
MMLU	100.0%	~99.0%	~98.0%	AWQ: -1%, GPTQ: -2%
Perplexity	Baseline	+2-3%	+5-8%	AWQ significantly better
Real Tasks	100.0%	~100.0%	95-97%	AWQ indistinguishable

Key Finding: Research shows AWQ performs indistinguishably from BF16 on real-world benchmarks, while GPTQ shows measurable degradation due to overfitting on calibration data.

🔬 Technical Deep Dive

Architecture

Type: Mixture of Experts (MoE) Transformer
Total Parameters: 357B (base model specification)
Experts: 160 routed experts per layer
Active Experts: 8 per token (5% utilization)
Layers: 92 decoder layers
Heads: 96 attention heads (8 KV heads)
Hidden Size: 5120
Intermediate Size: 12288 (dense), 1536 (MoE)
Vocabulary: 151,552 tokens
Context Window: 200K tokens (original spec)

Memory Layout

Component	Per GPU (EP)	Total (4 GPUs)	Percentage
Model Weights	~12GB	~48GB	25%
Expert Weights	~28GB	~112GB	60%
KV Cache	~5GB	~20GB	11%
Activation	~2GB	~8GB	4%
Total	~47GB	~188GB	100%

Why Marlin Kernels?

Marlin is the state-of-the-art kernel for 4-bit quantized inference:

Speed: 2-3× faster than CUDA native 4-bit
Efficiency: Optimized for Ampere/Ada/Hopper/Blackwell architectures
Features: Fused dequantization + GEMM operations
Support: Integrated into vLLM for production use

🔍 Comparison to Other Models

Model	Parameters	Disk Size	Quantization	Speed	VRAM	Accuracy
GLM-4.6-AWQ (this)	357B	176GB	AWQ 4-bit	60 tok/s	188GB	Excellent
GLM-4.6-GPTQ	357B	~180GB	GPTQ 4-bit	60 tok/s	188GB	Good
GLM-4.6-FP8	357B	~330GB	FP8	19 tok/s	330GB	Better
GLM-4.6-BF16	357B	~714GB	None	N/A	800GB+	Highest
DeepSeek-V3-AWQ	671B	~300GB	AWQ 4-bit	45 tok/s	250GB	Excellent
Qwen2.5-72B-AWQ	72B	~40GB	AWQ 4-bit	120 tok/s	48GB	Excellent

📝 Known Limitations

Requires 4× GPUs: Minimum deployment configuration
No MTP Support: Speculative decoding layer removed
Memory Bandwidth Bound: Speed scales with GPU memory bandwidth
TP=4 Only: Tested configuration (other TP sizes may work)
vLLM Dependency: Optimized specifically for vLLM runtime

🐛 Troubleshooting

"KeyError: 'Linear'" Error

Run the fix script to add required config:

python fix_awq_config_for_vllm.py --model /path/to/GLM-4.6-AWQ

Out of Memory Errors

Enable expert parallelism: --enable-expert-parallel
Reduce context length: --max-model-len 65536
Lower GPU utilization: --gpu-memory-utilization 0.85
Limit concurrent requests: --max-num-seqs 1

Slow Inference

Check /nothink is appended to prompts
Verify Marlin kernels are active (check logs)
Monitor GPU utilization (nvidia-smi dmon)
Ensure NVLink is working between GPUs

📚 Citation

If you use this quantized model, please cite:

@software{glm4_awq_2025,
 title = {GLM-4.6-AWQ: Production-Optimized 4-bit Quantization},
 author = {bullpoint},
 year = {2025},
 url = {https://huggingface.co/bullpoint/GLM-4.6-AWQ}
}

@article{lin2023awq,
 title={AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration},
 author={Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Dang, Xingyu and Han, Song},
 journal={arXiv preprint arXiv:2306.00978},
 year={2023}
}

@software{zai2025glm46,
 title={GLM-4.6},
 author={Z.ai and ZHIPU AI},
 year={2025},
 url={https://huggingface.co/zai-org/GLM-4.6},
 license={MIT}
}

📜 License

MIT License - This quantized model inherits the MIT license from the original GLM-4.6 model.

You are free to:

✅ Use commercially
✅ Modify and distribute
✅ Use privately
✅ Sublicense

See the base model repository for full license terms.

🙏 Acknowledgments

Z.ai for the original GLM-4.6 model
ZHIPU AI for the GLM architecture and training
vLLM Team for the excellent inference engine
MIT Han Lab for the AWQ algorithm
Neural Magic for:
- llm-compressor quantization toolkit
- LLM_compression_calibration calibration dataset
Community for testing and feedback

🔧 Reproduction

Want to quantize this model yourself? See the included quantize_glm46_awq.py script for the exact quantization configuration used.

Quantization Hardware Requirements

This model was quantized on modest hardware with extensive CPU offloading:

GPU: 1× NVIDIA RTX PRO 6000 Blackwell Max-Q (96GB GDDR7)
RAM: 768GB DDR5
Swap: 300GB (actively used during quantization)
Quantization Time: ~5 hours (includes calibration, smoothing, compression, and saving)

Note: The quantization process offloads the full BF16 model (~714GB) to system RAM/swap since it exceeds available VRAM. Using 4 GPUs during quantization provides no speed benefit - the process is CPU memory-bound, not GPU-bound. The included script defaults to single-GPU mode (CUDA_VISIBLE_DEVICES=0) for optimal resource usage.

Key Settings

Calibration dataset: neuralmagic/LLM_compression_calibration
Samples: 512
Sequence length: 2048 tokens
Group size: 128
Bits: 4 (symmetric int)
Device map: Sequential (CPU offloading enabled)

📬 Support

For issues and questions:

Model Issues: Open an issue on this model's repository
vLLM Issues: vLLM GitHub
Quantization: llm-compressor GitHub

Status: ✅ Production Ready | Last Updated: October 2025 | Tested With: vLLM 0.11.0+

Downloads last month: 500

Safetensors

Model size

51B params

Tensor type

BF16

I64

I32

F32

Model tree for bullpoint/GLM-4.6-AWQ

Base model

zai-org/GLM-4.6

Quantized

(40)

this model

Dataset used to train bullpoint/GLM-4.6-AWQ

Paper for bullpoint/GLM-4.6-AWQ

Paper • 2306.00978 • Published Jun 1, 2023 • 13

URL: https://huggingface.co/bullpoint/GLM-4.6-AWQ

⇱ bullpoint/GLM-4.6-AWQ · Hugging Face