MiniMax M2.5 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)

This strives to be the highest quality quant that can run on 192GiB VRAM

💡 A non-FP8 version is available at mratsim/MiniMax-M2.5-BF16-INT4-AWQ
That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet) for an extra 3GiB in VRAM.
This FP8+INT4 AWQ was build by merging the original FP8 self-attention weights and mratsim/MiniMax-M2.5-BF16-INT4-AWQ experts.

It features:

That model has ensured that all experts are calibrated, not doing so is extremely detrimental, PR: https://github.com/vllm-project/llm-compressor/pull/2171
Mixed precision with:
- self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
- experts weights quantized using AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, scaling factor per group of 32 weights)
High-quality large and diverse dataset with programming and devops focus as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations because we never code in a vacuum. And we want to make sure all experts are calibrated to the full range of their activations.
Calibration explicitly tests multilingual capabilities:
- Asia: Chinese, Hindi, Korean, Japanese
- Europe: French, German, Portuguese, Russian, Spanish
- Middle-East: Arabic, Hebrew, Turkish
Calibration explicitly tests 60 programming languages and not just Python:
- Imperative programming: C, C++, Go, Zig, ...
- Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
- Web-focused: HTML/CSS, Typescript, PHP, ...
- Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
- Theorem provers: Coq, Lean
- Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
- GPU Programming: Cuda, Vulkan, Apple Metal
- Game Programming: GDScript, GLSL
- Domain-specific: MATLAB, Julia, Solidity, R
Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)
Built by a dev, for devs (and it looks very good for STEM as well)

It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: calibrate_software_engineer.yaml

📥 Usage & Running Instructions

The model was tested with vLLM + 2x RTX Pro 6000, here is a script suitable for such configuration with the maximum 196,608 context length. This uses 92.5GiB of VRAM with the flashinfer backend.

⚠️ Due to rope_parameters change, at the moment this model is incompatible with transformers V5.
This makes it incompatible with GLM-4.6V which requires transformers V5. Use different Docker images.

⚠️ SGLang does not support this model due to missing mixed precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.

Please use mratsim/MiniMax-M2.5-BF16-INT4-AWQ in the meantime.

Running script

--trust-remote-code is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028

You have 2 reasoning parsers;

minimax_m2, puts the reasoning content in a special field like DeepSeek models that is usually rendered in a specific manner in frontends.
minimax_m2_append_think, puts the reasoning into <think>reasoning_content</think> and that is sent as normal text. Few frontends properly render that, I'm aware of Cherry Studio on Desktop and ChatterUI on Android.

The reason why minimax_m2_append_think was introduced was Interleaved Thinking and having the model build upon it's previous thinking (usually frontends discard the thinking trace)

💡In MiniMax-M2.1 with the recommended parameters the model tended to get stuck in repetition loops in vLLM
It seemed like repetition_penalty: 1.10, frequency_penalty: 0.40 avoided that.

You may want to try recommended settings without repetition_penalty first (and it slows down token generation)

# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.5-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.5"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve "${MODEL}" \
 --served-model-name "${MODELNAME}" \
 --trust-remote-code \
 --gpu-memory-utilization ${GPU_UTIL} \
 --tp 2 \
 --override-generation-config "${SAMPLER_OVERRIDE}" \
 --enable-auto-tool-choice \
 --tool-call-parser minimax_m2 \
 --reasoning-parser minimax_m2
 # --reasoning-parser minimax_m2_append_think

Performance

On dual RTX Pro 6000, I can reach over 5500 prefill/prompt/context processing and over 100 tok/s token generation for a single request.

👁 image

With PagedAttention in action you can reach over 25000 tok/s in prompt processing speed.

👁 image

When batching, with default config, you can reach over 6000 even 8000 tok/s and 1200 tok/s generation speed.
Tune prefill vs decode prioritization with --max_num_batched_tokens see Performance & Tuning | vLLM

👁 image

In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and 800 tok/s generation

👁 image

Note: vLLM supports prefill-decode disaggregation for high throughput serving if you have double the minimum hardware:

https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
https://github.com/vllm-project/production-stack
- Prefill/decode disaggregation
- Multi-Tier KV-cache via LMCache (GPU > CPU > Local Disk)
- Cache aware router
- Multi-model dispatch via single interface

🔬 Quantization method

Quantization was quite complex for this model and was done in 3 steps:

Original weights are in FP8, they were dequantized to FP16 due to llm-compressor not being able to process FP8.
llm-compressor was used to quantize the MLP experts projection using AWQ, with PR #2171 to ensure they were all activated.
Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.

The llmcompressor library was used with the following recipe:

default_stage:
 default_modifiers:
 AWQModifier:
 config_groups:
 mlp_experts_projections:
 # Include only MLP expert weights for 4-bit quantization
 targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
 weights:
 num_bits: 4
 type: int
 symmetric: true
 group_size: 32
 strategy: group
 dynamic: false
 # actorder: group
 observer: memoryless_minmax

 mappings:
 - smooth_layer: re:.*post_attention_layernorm$
 balance_layers: ["re:.*w1$", "re:.*w3$"]
 - smooth_layer: re:.*w3$
 balance_layers: ["re:.*w2$"]
 duo_scaling: true

The calibration set had 590 examples, 8192 sequence length, 60 programming languages, 12 spoken languages and is detailed at calibrate_software_engineer.yaml