MiniMax M2.1 (Mixed-Precision BF16 + INT4 AWQ)

This strives to be the highest quality quant that can run on 192GiB VRAM

💡This is a sister model to mratsim/MiniMax-M2.1-FP8-INT4-AWQ with the original model FP8 weights pre-dequantized to BF16.

This makes it compatible with 8x3090 systems (which don't have hardware FP8) and also compatible with SGLang for an extra 3 GiB in VRAM.

It features:

That model has ensured that all experts are calibrated, not doing so is extremely detrimental, PR: https://github.com/vllm-project/llm-compressor/pull/2171
Mixed precision with:
- self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
- experts weights quantized using AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, scaling factor per group of 32 weights)
High-quality large and diverse dataset with programming and devops focus as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations because we never code in a vacuum. And we want to make sure all experts are calibrated to the full range of their activations.
Calibration explicitly tests multilingual capabilities:
- Asia: Chinese, Hindi, Korean, Japanese
- Europe: French, German, Portuguese, Russian, Spanish
- Middle-East: Arabic, Hebrew, Turkish
Calibration explicitly tests 60 programming languages and not just Python:
- Imperative programming: C, C++, Go, Zig, ...
- Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
- Web-focused: HTML/CSS, Typescript, PHP, ...
- Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
- Theorem provers: Coq, Lean
- Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
- GPU Programming: Cuda, Vulkan, Apple Metal
- Game Programming: GDScript, GLSL
- Domain-specific: MATLAB, Julia, Solidity, R
Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)
Built by a dev, for devs (and it looks very good for STEM as well)

It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: calibrate_software_engineer.yaml

📥 Usage & Running Instructions

The model was tested with SGLang + 2x RTX Pro 6000, here is a script suitable for such configuration with the maximum 196,608 context length. This uses 92.5GiB of VRAM with the flashinfer backend.

Please refer to mratsim/MiniMax-M2.1-FP8-INT4-AWQ#running-script for running it in vLLM

Running script

--trust-remote-code is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028

You have 2 reasoning parsers;

minimax, puts the reasoning content in a special field like DeepSeek models that is usually rendered in a specific manner in frontends.
minimax_append_think, puts the reasoning into <think>reasoning_content</think> and that is sent as normal text. Few frontends properly render that, I'm aware of Cherry Studio on Desktop and ChatterUI on Android.

The reason why minimax_append_think was introduced was Interleaved Thinking and having the model build upon it's previous thinking (usually frontends discard the thinking trace)

💡In the sister model, I mentioned that with the recommended parameters the model tends to get stuck in repetition loops.
This does not seem to happen with SGLang hence "repetition_penalty: 1.10, frequency_penalty: 0.40" are not used.
There is no way to override such settings without editing generation_config.json anyway: https://github.com/sgl-project/sglang/issues/15487

# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.1-BF16-INT4-AWQ"
MODELNAME="MiniMax-M2.1"
GPU_UTIL=0.93
SGLANG_PORT=8000

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

python3 -m sglang.launch_server \
 --host 0.0.0.0 \
 --port "${SGLANG_PORT}" \
 --sleep-on-idle \
 --disable-custom-all-reduce \
 --max-running-requests 64 \
 --cuda-graph-max-bs 64 \
 --attention-backend flashinfer \
 --served-model-name "${MODELNAME}" \
 --model-path "${MODEL}" \
 --tool-call-parser minimax-m2 \
 --reasoning-parser minimax \
 --trust-remote-code \
 --tp 2 \
 --mem-fraction-static ${GPU_UTIL} \
 "$@"

 # --reasoning-parser minimax-append-think

🔬 Quantization method

The llmcompressor library was used with the following recipe:

default_stage:
 default_modifiers:
 AWQModifier:
 config_groups:
 mlp_experts_projections:
 # Include only MLP expert weights for 4-bit quantization
 targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
 weights:
 num_bits: 4
 type: int
 symmetric: true
 group_size: 32
 strategy: group
 dynamic: false
 # actorder: group
 observer: memoryless_minmax

 mappings:
 - smooth_layer: re:.*post_attention_layernorm$
 balance_layers: ["re:.*w1$", "re:.*w3$"]
 - smooth_layer: re:.*w3$
 balance_layers: ["re:.*w2$"]
 duo_scaling: true

The calibration set had 590 examples, 8192 sequence length, 60 programming languages, 12 spoken languages and is detailed at calibrate_software_engineer.yaml