MiniMax M2.1 (Mixed-Precision BF16 + INT4 AWQ)
This strives to be the highest quality quant that can run on 192GiB VRAM
💡This is a sister model to mratsim/MiniMax-M2.1-FP8-INT4-AWQ with the original model FP8 weights pre-dequantized to BF16.
This makes it compatible with 8x3090 systems (which don't have hardware FP8) and also compatible with SGLang for an extra 3 GiB in VRAM.
It features:
That model has ensured that all experts are calibrated, not doing so is extremely detrimental, PR: https://github.com/vllm-project/llm-compressor/pull/2171
Mixed precision with:
- self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
- experts weights quantized using AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, scaling factor per group of 32 weights)
High-quality large and diverse dataset with programming and devops focus as well as domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations because we never code in a vacuum. And we want to make sure all experts are calibrated to the full range of their activations.
Calibration explicitly tests multilingual capabilities:
- Asia: Chinese, Hindi, Korean, Japanese
- Europe: French, German, Portuguese, Russian, Spanish
- Middle-East: Arabic, Hebrew, Turkish
Calibration explicitly tests 60 programming languages and not just Python:
- Imperative programming: C, C++, Go, Zig, ...
- Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
- Web-focused: HTML/CSS, Typescript, PHP, ...
- Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
- Theorem provers: Coq, Lean
- Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
- GPU Programming: Cuda, Vulkan, Apple Metal
- Game Programming: GDScript, GLSL
- Domain-specific: MATLAB, Julia, Solidity, R
Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)
Built by a dev, for devs (and it looks very good for STEM as well)
It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: calibrate_software_engineer.yaml
📥 Usage & Running Instructions
The model was tested with SGLang + 2x RTX Pro 6000, here is a script suitable for such configuration with the maximum 196,608 context length. This uses 92.5GiB of VRAM with the flashinfer backend.
Please refer to mratsim/MiniMax-M2.1-FP8-INT4-AWQ#running-script for running it in vLLM
Running script
--trust-remote-code is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028
You have 2 reasoning parsers;
minimax, puts the reasoning content in a special field like DeepSeek models that is usually rendered in a specific manner in frontends.minimax_append_think, puts the reasoning into<think>reasoning_content</think>and that is sent as normal text. Few frontends properly render that, I'm aware of Cherry Studio on Desktop and ChatterUI on Android.
The reason why minimax_append_think was introduced was Interleaved Thinking and having the model build upon it's previous thinking (usually frontends discard the thinking trace)
💡In the sister model, I mentioned that with the recommended parameters the model tends to get stuck in repetition loops.
This does not seem to happen with SGLang hence "repetition_penalty: 1.10, frequency_penalty: 0.40" are not used.
There is no way to override such settings without editing generation_config.json anyway: https://github.com/sgl-project/sglang/issues/15487
# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.1-BF16-INT4-AWQ"
MODELNAME="MiniMax-M2.1"
GPU_UTIL=0.93
SGLANG_PORT=8000
# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
python3 -m sglang.launch_server \
--host 0.0.0.0 \
--port "${SGLANG_PORT}" \
--sleep-on-idle \
--disable-custom-all-reduce \
--max-running-requests 64 \
--cuda-graph-max-bs 64 \
--attention-backend flashinfer \
--served-model-name "${MODELNAME}" \
--model-path "${MODEL}" \
--tool-call-parser minimax-m2 \
--reasoning-parser minimax \
--trust-remote-code \
--tp 2 \
--mem-fraction-static ${GPU_UTIL} \
"$@"
# --reasoning-parser minimax-append-think
🔬 Quantization method
The llmcompressor library was used with the following recipe:
default_stage:
default_modifiers:
AWQModifier:
config_groups:
mlp_experts_projections:
# Include only MLP expert weights for 4-bit quantization
targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
weights:
num_bits: 4
type: int
symmetric: true
group_size: 32
strategy: group
dynamic: false
# actorder: group
observer: memoryless_minmax
mappings:
- smooth_layer: re:.*post_attention_layernorm$
balance_layers: ["re:.*w1$", "re:.*w3$"]
- smooth_layer: re:.*w3$
balance_layers: ["re:.*w2$"]
duo_scaling: true
The calibration set had 590 examples, 8192 sequence length, 60 programming languages, 12 spoken languages and is detailed at calibrate_software_engineer.yaml
Quantization theory and heuristics for manual tuning
- Downloads last month
- 9
Model tree for mratsim/MiniMax-M2.1-BF16-INT4-AWQ
Base model
MiniMaxAI/MiniMax-M2.1