URL: https://huggingface.co/lovethayo/MiniMax-M2.5-FP8-INT4-AWQ

MiniMax M2.5 (Mixed-Precision FP8 + INT4 AWQ FrankenQuant)

This strives to be the highest-quality quant that can run on 192GiB of VRAM.

💡 A non-FP8 version is available at mratsim/MiniMax-M2.5-BF16-INT4-AWQ.
That version is compatible with 8x RTX 3090s and with SGLang (which doesn't support mixed quantization yet), at the cost of an extra 3GiB of VRAM.
This FP8+INT4 AWQ quant was built by merging the original FP8 self-attention weights with the experts from mratsim/MiniMax-M2.5-BF16-INT4-AWQ.

It features:

  • All experts are guaranteed to be calibrated; not doing so is extremely detrimental. PR: https://github.com/vllm-project/llm-compressor/pull/2171

  • Mixed precision with:

    • self-attention weights copied directly from the official version (default FP8 with 2D-blocks)
    • expert weights quantized using the AWQ W4A16G32 scheme (4-bit weights, 16-bit activations, one scaling factor per group of 32 weights)
  • High-quality, large and diverse calibration dataset with a programming and devops focus, plus domain-specific knowledge (math, sciences, medical, finance, business, humanities, philosophy, creative writing), general knowledge, pop culture and behavioral situations, because we never code in a vacuum and we want all experts calibrated over the full range of their activations.

  • Calibration explicitly tests multilingual capabilities:

    • Asia: Chinese, Hindi, Korean, Japanese
    • Europe: French, German, Portuguese, Russian, Spanish
    • Middle-East: Arabic, Hebrew, Turkish
  • Calibration explicitly tests 60 programming languages and not just Python:

    • Imperative programming: C, C++, Go, Zig, ...
    • Functional programming: Haskell, F#, OCaml, Erlang, Lisp, Clojure ...
    • Web-focused: HTML/CSS, TypeScript, PHP, ...
    • Mixed paradigm: D, Kotlin, Nim, Rust, Swift, ...
    • Theorem provers: Coq, Lean
    • Low-level: ARM64 assembly, x86-64 assembly, LLVM IR
    • GPU Programming: CUDA, Vulkan, Apple Metal
    • Game Programming: GDScript, GLSL
    • Domain-specific: MATLAB, Julia, Solidity, R
  • Calibration tries to ensure coverage for a wide variety of experience (from explaining concepts to your grandmother to debugging Kubernetes logs)

  • Built by a dev, for devs (and it looks very good for STEM as well)
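
To make the W4A16G32 scheme in the list above concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization with one scale per group of 32 weights. It is an illustration only, not llm-compressor's implementation: the function names are hypothetical, the scale rule (largest magnitude maps to 7) is an assumption, and AWQ's activation-aware rescaling is left out entirely.

```python
from typing import List, Tuple

GROUP_SIZE = 32      # "G32": one scaling factor per group of 32 weights
QMIN, QMAX = -8, 7   # range of a signed 4-bit integer ("W4")


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one group: returns (int4 values, scale)."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: the largest magnitude maps to QMAX; real observers may differ.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    quantized = [min(QMAX, max(QMIN, round(w / scale))) for w in weights]
    return quantized, scale


def quantize_row(row: List[float]) -> List[Tuple[List[int], float]]:
    """Quantize a row of weights group by group."""
    if len(row) % GROUP_SIZE != 0:
        raise ValueError("row length must be a multiple of GROUP_SIZE")
    return [quantize_group(row[i:i + GROUP_SIZE])
            for i in range(0, len(row), GROUP_SIZE)]


def dequantize_row(groups: List[Tuple[List[int], float]]) -> List[float]:
    """Recover approximate weights: each int4 value times its group's scale."""
    return [q * scale for values, scale in groups for q in values]


def bits_per_weight(scale_bits: int = 16) -> float:
    """Storage cost: 4 bits per weight plus one scale shared by the group."""
    return 4 + scale_bits / GROUP_SIZE
```

Assuming a 16-bit scale and no zero point (the scheme is symmetric), `bits_per_weight()` works out to 4.5 bits per expert weight. Smaller groups track local weight ranges more closely but add scale overhead.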

It uses my new declarative quantization framework https://github.com/mratsim/quantizers which facilitates highly-tuned calibration sets: calibrate_software_engineer.yaml

📥 Usage & Running Instructions

The model was tested with vLLM on 2x RTX Pro 6000. The script below suits that configuration with the maximum context length of 196,608 tokens. This uses 92.5GiB of VRAM with the flashinfer backend.

⚠️ Due to a rope_parameters change, this model is currently incompatible with transformers V5.
As a result, it cannot share an environment with GLM-4.6V, which requires transformers V5. Use separate Docker images.

⚠️ SGLang does not support this model due to missing mixed precision support. Feature request raised at https://github.com/sgl-project/sglang/issues/16276.

Please use mratsim/MiniMax-M2.5-BF16-INT4-AWQ in the meantime.

Running script

--trust-remote-code is necessary until the transformers team merges github.com/huggingface/transformers/pull/42028

You can choose between 2 reasoning parsers:

  • minimax_m2 puts the reasoning content in a special field, like DeepSeek models do; frontends usually render that field in a specific manner.
  • minimax_m2_append_think puts the reasoning into <think>reasoning_content</think>, which is sent as normal text. Few frontends render that properly; I'm aware of Cherry Studio on desktop and ChatterUI on Android.

minimax_m2_append_think was introduced for Interleaved Thinking, so that the model can build upon its previous thinking (frontends usually discard the thinking trace).
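
If your frontend does not render the minimax_m2_append_think output, the thinking can be separated from the answer on the client side. The sketch below assumes only what is stated above, namely that the reasoning arrives as literal <think>...</think> spans inside the normal text. The function name is hypothetical, and it handles several interleaved spans.

```python
import re
from typing import List, Tuple

# Non-greedy match so that several interleaved <think> spans are kept separate.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_think(text: str) -> Tuple[List[str], str]:
    """Return (list of reasoning spans, answer text with the spans removed)."""
    reasoning = [span.strip() for span in THINK_RE.findall(text)]
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


if __name__ == "__main__":
    raw = "<think>The user wants a greeting.</think>Hello!"
    thoughts, answer = split_think(raw)
    print(thoughts)
    print(answer)
```

To preserve Interleaved Thinking across turns, send the unmodified text (including the <think> spans) back as the assistant message and use the split only for display.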

💡 In MiniMax-M2.1, with the recommended parameters, the model tended to get stuck in repetition loops in vLLM.
Setting repetition_penalty: 1.10 and frequency_penalty: 0.40 seemed to avoid that.

repetition_penalty slows down token generation, so you may want to try the recommended settings without it first.
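
The SAMPLER_OVERRIDE value in the script below is plain JSON, so variants are easy to generate. This is a small sketch with a hypothetical helper name. The parameter values are copied from that script, and the switch only drops repetition_penalty, as suggested above.

```python
import json

# Values copied from the SAMPLER_OVERRIDE line of the running script.
BASE_SAMPLER = {
    "temperature": 1,
    "top_p": 0.95,
    "top_k": 40,
    "repetition_penalty": 1.1,
    "frequency_penalty": 0.40,
}


def sampler_override(use_repetition_penalty: bool = True) -> str:
    """Build the JSON string passed to --override-generation-config."""
    params = dict(BASE_SAMPLER)
    if not use_repetition_penalty:
        # Dropping it avoids the token-generation slowdown noted above.
        params.pop("repetition_penalty")
    return json.dumps(params)


if __name__ == "__main__":
    print(sampler_override(use_repetition_penalty=False))
```

The printed string can be pasted into SAMPLER_OVERRIDE as is. If repetition loops come back, switch the flag back on.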

# Model configuration (Mandatory)
MODEL="mratsim/MiniMax-M2.5-FP8-INT4-AWQ"
MODELNAME="MiniMax-M2.5"
GPU_UTIL=0.93
SAMPLER_OVERRIDE='{"temperature": 1, "top_p": 0.95, "top_k": 40, "repetition_penalty": 1.1, "frequency_penalty": 0.40}'

# Prevent memory fragmentation
export PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Prevent vLLM from using 100% CPU when idle (Very Recommended)
export VLLM_SLEEP_WHEN_IDLE=1

vllm serve "${MODEL}" \
 --served-model-name "${MODELNAME}" \
 --trust-remote-code \
 --gpu-memory-utilization ${GPU_UTIL} \
 --tensor-parallel-size 2 \
 --override-generation-config "${SAMPLER_OVERRIDE}" \
 --enable-auto-tool-choice \
 --tool-call-parser minimax_m2 \
 --reasoning-parser minimax_m2
 # --reasoning-parser minimax_m2_append_think

Performance

On dual RTX Pro 6000, I can reach over 5500 tok/s in prefill (prompt/context processing) and over 100 tok/s in token generation for a single request.
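
As a back-of-the-envelope reading of those single-request figures, the sketch below estimates end-to-end latency from prompt and output lengths. It is deliberately crude: it assumes constant throughput at the quoted rates and ignores scheduling, caching and the drop in speed at long context. The function name and defaults are illustrative.

```python
def estimate_latency_s(prompt_tokens: int,
                       output_tokens: int,
                       prefill_tps: float = 5500.0,
                       decode_tps: float = 100.0) -> float:
    """Seconds to process the prompt plus seconds to generate the output."""
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughputs must be positive")
    prefill_s = prompt_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    return prefill_s + decode_s


if __name__ == "__main__":
    # A 55,000-token prompt followed by a 1,000-token answer.
    print(f"{estimate_latency_s(55_000, 1_000):.1f} s")
```

With these defaults, a 1,000-token answer takes as long to generate as a 55,000-token prompt takes to ingest, so for agentic coding workloads generation speed tends to dominate the wait.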

With PagedAttention in action you can reach over 25000 tok/s in prompt processing speed.

When batching with the default config, you can reach over 6000, and even 8000, tok/s in prompt processing and 1200 tok/s in generation.
Tune prefill vs decode prioritization with --max_num_batched_tokens; see "Performance & Tuning" in the vLLM documentation.

In a steady state, with interleaved prefill and decode requests that interrupt each other, you can get ~2400 tok/s in context processing and 800 tok/s in generation.

Note: vLLM supports prefill-decode disaggregation for high-throughput serving if you have double the minimum hardware.

🔬 Quantization method

Quantization was quite complex for this model and was done in 3 steps:

  1. The original weights are in FP8; they were dequantized to FP16 because llm-compressor cannot process FP8.
  2. llm-compressor was used to quantize the MLP expert projections using AWQ, with PR #2171 to ensure all experts were activated.
  3. Stitching the FrankenQuant: I combined the original weights, including the 2D-block FP8, with the experts-only AWQ weights.
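
The stitching in step 3 can be pictured as a merge of two name-to-tensor mappings. The sketch below is a simplified illustration, not the actual script: tensors are stood in for by strings, the tensor names and the `is_expert` rule are hypothetical, and a real merge would also have to handle safetensors shards and the quantization config.

```python
from typing import Callable, Dict


def stitch(original: Dict[str, object],
           awq: Dict[str, object],
           is_expert: Callable[[str], bool]) -> Dict[str, object]:
    """Combine non-expert tensors from `original` with expert tensors from `awq`."""
    # Keep everything that is not an expert (e.g. FP8 self-attention) as shipped.
    merged = {name: t for name, t in original.items() if not is_expert(name)}
    # Take the expert tensors from the AWQ-quantized checkpoint.
    merged.update({name: t for name, t in awq.items() if is_expert(name)})
    return merged


def looks_like_expert(name: str) -> bool:
    # Hypothetical naming rule, modelled on the recipe's target pattern.
    return ".block_sparse_moe.experts." in name


if __name__ == "__main__":
    original = {
        "model.layers.0.self_attn.q_proj.weight": "fp8",
        "model.layers.0.block_sparse_moe.experts.0.w1.weight": "fp8",
    }
    awq = {
        "model.layers.0.self_attn.q_proj.weight": "16-bit",
        "model.layers.0.block_sparse_moe.experts.0.w1.weight_packed": "int4",
        "model.layers.0.block_sparse_moe.experts.0.w1.weight_scale": "16-bit",
    }
    for name, kind in sorted(stitch(original, awq, looks_like_expert).items()):
        print(name, kind)
```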

The llmcompressor library was used with the following recipe:

default_stage:
  default_modifiers:
    AWQModifier:
      config_groups:
        mlp_experts_projections:
          # Include only MLP expert weights for 4-bit quantization
          targets: ["re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"]
          weights:
            num_bits: 4
            type: int
            symmetric: true
            group_size: 32
            strategy: group
            dynamic: false
            # actorder: group
            observer: memoryless_minmax

      mappings:
        - smooth_layer: re:.*post_attention_layernorm$
          balance_layers: ["re:.*w1$", "re:.*w3$"]
        - smooth_layer: re:.*w3$
          balance_layers: ["re:.*w2$"]
      duo_scaling: true
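
To see which modules the `targets` pattern in the recipe selects, the regular expression can be tried against a few module names. In this sketch the names are hypothetical examples, and matching with `re.match` after stripping the `re:` prefix is my assumption about how the pattern is applied.

```python
import re

# The YAML string "re:.*block_sparse_moe\\.experts\\.\\d+\\.(w1|w2|w3)$"
# unescapes to the pattern below once the "re:" prefix is removed.
EXPERT_TARGET = re.compile(r".*block_sparse_moe\.experts\.\d+\.(w1|w2|w3)$")


def is_quantization_target(module_name: str) -> bool:
    """True if the module would be quantized to 4 bits under this pattern."""
    return EXPERT_TARGET.match(module_name) is not None


if __name__ == "__main__":
    # Hypothetical module names, for illustration only.
    for name in (
        "model.layers.3.block_sparse_moe.experts.17.w2",
        "model.layers.3.block_sparse_moe.gate",
        "model.layers.3.self_attn.q_proj",
    ):
        print(name, is_quantization_target(name))
```

Only the per-expert w1, w2 and w3 projections match. The router and self-attention modules fall outside the pattern, which is what lets them keep their original precision.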

The calibration set had 590 examples with a sequence length of 8192, covering 60 programming languages and 12 spoken languages. It is detailed in calibrate_software_engineer.yaml.

Quantization theory and heuristics for manual tuning
