MiniMax M2.5 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)
This strives to be the highest quality quant that can run on 192GiB VRAM
π‘ A non-FP8 version is available at mratsim/MiniMax-M2.5-BF16-INT4-AWQ
That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet) for an extra 3GiB in VRAM.
This FP8+INT4 AWQ was build by merging the original FP8 self-attention weights and mratsim/MiniMax-M2.5-BF16-INT4-AWQ experts.
It features:
That model has ensured that all experts are calibrated, not doing so is extremely detrimental, PR: https://github.com/vllm-project/llm-compressor/pull/2171
Mixed precision with:
- self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
- experts weights quantized using AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, scaling factor per group of 32 weights)
High-quality large and diverse dataset with programming and devops focus as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations because we never code in a vacuum. And we want to make sure all experts are calibrated to the full range of their activations.
Calibration explicitly tests multilingual capabilities:
- Asia: Chinese, Hindi, Korean, Japanese
- Europe: French, German, Portuguese, Russian, Spanish
- Middle-East: Arabic, Hebrew, Turkish
Calibration explicitly tests 60 programming languages and not just Python:
- Imperative programming: C, C++, Go, Zig, ...
- Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
- Web-focused: HTML/CSS, Typescript, PHP, ...
- Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
- Theorem provers: Coq, Lean
- Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
- GPU Programming: Cuda, Vulkan, Apple Metal
- Game Programming: GDScript, GLSL
- Domain-specific: MATLAB, Julia, Solidity, R
Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)
Built by a dev, for devs (and it looks very good for STEM as well)
It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: calibrate_software_engineer.yaml
π₯ Usage & Running Instructions
The model was tested with vLLM + 2x RTX Pro 6000, here is a script suitable for such configuration with the maximum 196,608 context length. This uses 92.5GiB of VRAM with the flashinfer backend.
β οΈ Due to rope_parameters change, at the moment this model is incompatible with transformers V5.
This makes it incompatible with GLM-4.6V which requires transformers V5. Use different Docker images.
β οΈ SGLang does not support this model due to missing mixed precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.
Please use mratsim/MiniMax-M2.5-BF16-INT4-AWQ in the meantime.
Running script
--trust-remote-code is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028
You have 2 reasoning parsers;
minimax_m2, puts the reasoning content in a special field like DeepSeek models that is usually rendered in a specific manner in frontends.minimax_m2_append_think, puts the reasoning into<think>reasoning_content</think>and that is sent as normal text. Few frontends properly render that, I'm aware of Cherry Studio on Desktop and ChatterUI on Android.
The reason why minimax_m2_append_think was introduced was Interleaved Thinking and having the model build upon it's previous thinking (usually frontends discard the thinking trace)
π‘In MiniMax-M2.1 with the recommended parameters the model tended to get stuck in repetition loops in vLLM
It seemed like repetition_penalty: 1.10, frequency_penalty: 0.40 avoided that.You may want to try recommended settings without repetition_penalty first (and it slows down token generation)
# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.5-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.5"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'
# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1
vllm serve "${MODEL}" \
--served-model-name "${MODELNAME}" \
--trust-remote-code \
--gpu-memory-utilization ${GPU_UTIL} \
--tp 2 \
--override-generation-config "${SAMPLER_OVERRIDE}" \
--enable-auto-tool-choice \
--tool-call-parser minimax_m2 \
--reasoning-parser minimax_m2
# --reasoning-parser minimax_m2_append_think
Performance
On dual RTX Pro 6000, I can reach over 5500 prefill/prompt/context processing and over 100 tok/s token generation for a single request.
With PagedAttention in action you can reach over 25000 tok/s in prompt processing speed.
When batching, with default config, you can reach over 6000 even 8000 tok/s and 1200 tok/s generation speed.
Tune prefill vs decode prioritization with --max_num_batched_tokens see Performance & Tuning | vLLM
In a steady state with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s context processing and 800 tok/s generation
Note: vLLM supports prefill-decode disaggregation for high throughput serving if you have double the minimum hardware:
- https://pytorch.org/blog/disaggregated-inference-at-scale-with-pytorch-vllm/
- https://github.com/vllm-project/production-stack
- Prefill/decode disaggregation
- Multi-Tier KV-cache via LMCache (GPU > CPU > Local Disk)
- Cache aware router
- Multi-model dispatch via single interface
π¬ Quantization method
Quantization was quite complex for this model and was done in 3 steps:
- Original weights are in FP8, they were dequantized to FP16 due to llm-compressor not being able to process FP8.
- llm-compressor was used to quantize the MLP experts projection using AWQ, with PR #2171 to ensure they were all activated.
- Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.
The llmcompressor library was used with the following recipe:
default_stage:
default_modifiers:
AWQModifier:
config_groups:
mlp_experts_projections:
# Include only MLP expert weights for 4-bit quantization
targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
weights:
num_bits: 4
type: int
symmetric: true
group_size: 32
strategy: group
dynamic: false
# actorder: group
observer: memoryless_minmax
mappings:
- smooth_layer: re:.*post_attention_layernorm$
balance_layers: ["re:.*w1$", "re:.*w3$"]
- smooth_layer: re:.*w3$
balance_layers: ["re:.*w2$"]
duo_scaling: true
The calibration set had 590 examples, 8192 sequence length, 60 programming languages, 12 spoken languages and is detailed at calibrate_software_engineer.yaml
Quantization theory and heuristics for manual tuning
- Downloads last month
- 232
Model tree for mratsim/MiniMax-M2.5-FP8-INT4-AWQ
Base model
MiniMaxAI/MiniMax-M2.5