Voozh

TL;DR

MiniMax M3 launched June 1, 2026: an open-weight model with a 1M-token context window, MSA sparse attention with a claimed 15.6x decoding speedup over M2, and input pricing of $0.60 per million tokens. This guide covers the benchmarks and API access.

MiniMax M3 launched June 1, 2026 with a headline that’s hard to ignore: 59.0% on SWE-Bench Pro at $0.60 per million input tokens. According to pricing data at launch, that is 5–10% of the per-token price of GPT-5.5 and Gemini 3.1 Pro, the models MiniMax compares it against on that benchmark. If those numbers survive independent verification, M3 is the first open-weight model to put genuine pressure on the economics of proprietary frontier models.

The caveat: every performance number in this article comes from MiniMax’s own benchmark runs. Third-party evaluations were not available at launch. Weights and a full technical report are scheduled for Hugging Face and GitHub around June 10–11 — that’s when the ML community will confirm or challenge the claims in detail. Until then, this guide covers what’s technically verifiable about the architecture and how to access the API today.

The Architecture: What MSA Actually Does

Standard transformer attention scales quadratically with sequence length. At 1M tokens, that math becomes the primary barrier to both speed and cost — not parameter count. MiniMax Sparse Attention (MSA) attacks this constraint directly, and the approach differs from both mainstream alternatives.
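To make the quadratic scaling concrete, here is a small arithmetic sketch. It counts query-key score entries for dense self-attention in a single head of a single layer. The function names are illustrative, and the count ignores causal masking, which roughly halves it without changing the growth rate.

```python
def attention_score_count(seq_len: int) -> int:
    """Number of query-key score entries in dense self-attention.

    Every token attends to every token, so the count is seq_len squared
    (per head, per layer; causal masking is ignored here).
    """
    return seq_len * seq_len


def relative_cost(seq_len: int, baseline_len: int) -> float:
    """How many times more score entries seq_len needs than baseline_len."""
    return attention_score_count(seq_len) / attention_score_count(baseline_len)


if __name__ == "__main__":
    for n in (8_000, 128_000, 1_000_000):
        print(f"{n:>9} tokens: {attention_score_count(n):.2e} score entries")
    # 1M tokens is about 7.8x longer than 128K but needs about 61x the entries.
    print(f"1M vs 128K: {relative_cost(1_000_000, 128_000):.1f}x")
```

Going from 128K to 1M tokens multiplies the sequence length by about 7.8 but the score entries by about 61, which is why long-context cost is dominated by attention rather than parameter count.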

DeepSeek’s Multi-head Latent Attention (MLA) compresses key-value caches before attention computation, trading precision for dramatically smaller KV footprints. FlashAttention and its variants optimize memory access patterns but don’t reduce the fundamental O(n²) compute. MSA takes a third path: it keeps key-values uncompressed and at full floating-point precision, but adds block-level selection on top of a standard Grouped-Query Attention backbone.

The mechanism: for each query, a lightweight routing layer identifies which blocks of the KV cache are actually relevant and discards the rest before computing attention. No precision loss from compression. No wasted compute on irrelevant context. The selection routing adds minimal overhead because it operates at block granularity — large chunks, not individual tokens.
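The idea of block-level selection can be sketched in a few lines of plain Python. This is not MiniMax’s implementation: the technical report had not been published at launch, so the routing rule used here (score each block by the dot product of the query with the mean of its keys, then keep the top-k blocks) is an assumption chosen for illustration, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean_vector(rows: Sequence[Vector]) -> Vector:
    """Average a block of key vectors into one summary vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def select_blocks(query: Vector, keys: List[Vector],
                  block_size: int, top_k: int) -> List[int]:
    """Assumed routing rule: rank blocks by query . mean(keys), keep top_k."""
    starts = range(0, len(keys), block_size)
    scores = [dot(query, mean_vector(keys[s:s + block_size])) for s in starts]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def block_sparse_attention(query: Vector, keys: List[Vector],
                           values: List[Vector],
                           block_size: int, top_k: int) -> Vector:
    """Softmax attention for one query over only the selected KV blocks."""
    chosen = select_blocks(query, keys, block_size, top_k)
    # Expand the chosen block indices into token positions.
    idx = [i for b in chosen
           for i in range(b * block_size,
                          min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx)) / total
            for d in range(dim)]


if __name__ == "__main__":
    keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    values = [[1.0, 1.0], [1.0, 1.0], [2.0, 3.0], [2.0, 3.0]]
    query = [0.0, 5.0]
    print(select_blocks(query, keys, block_size=2, top_k=1))
    print(block_sparse_attention(query, keys, values, block_size=2, top_k=1))
```

The selected keys and values are used exactly as stored, which matches the article’s point that nothing is compressed. Routing cost grows with the number of blocks rather than the number of tokens, and setting top_k to the total block count recovers ordinary dense attention.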

Published results at 1M context length versus MiniMax M2 on the same hardware:

Prefill speed: 9.7x faster (reading the full 1M-token prompt)
Decoding speed: 15.6x faster (generating each output token)
Per-token compute: approximately 1/20th of M2 at maximum context
KV precision: full floating-point maintained (no lossy compression, unlike DeepSeek MLA)

Whether MSA generalizes beyond MiniMax’s internal workloads is an open question the weights release will answer. The full technical report will let independent researchers verify the routing mechanism and measure efficiency across diverse input distributions — including adversarial cases where sparse selection might degrade quality.

Benchmarks: What MiniMax Claims

Four benchmark scores published at launch:

SWE-Bench Pro: 59.0% — MiniMax claims this surpasses both GPT-5.5 and Gemini 3.1 Pro
Terminal-Bench 2.1: 66.0%
SWE-fficiency: 34.8%
BrowseComp: 83.5 — MiniMax claims this edges past Claude Opus 4.7 on autonomous browsing tasks

These numbers come exclusively from MiniMax’s internal evaluation runs. Vendor-run benchmarks tell you the ceiling under optimal conditions, not typical production performance. Two models with the same SWE-Bench score can perform very differently on your actual task distribution.

The SWE-Bench Pro claim deserves particular context. Competing frontier models cluster around 55–65% on Pro. If M3 is genuinely at 59% at $0.60 per million input tokens, it is competing within that cluster rather than leading it, yet sits significantly above other models in its price range. The BrowseComp score is the bolder claim. Autonomous browsing is a task class where agent scaffolding matters as much as raw model capability, so the benchmark methodology deserves scrutiny.

The practical move: build a 50-task evaluation suite from your actual production backlog. Run M3, your current model, and one alternative. Vendor benchmarks are a screening filter, not a deployment decision.
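A task suite like this needs very little machinery. The sketch below is one possible shape, not a prescribed harness: `Task`, `pass_rate`, and `compare` are hypothetical names, and each model is represented as a plain function from prompt to output. In practice you would wrap your own API client in such a function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Task:
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def pass_rate(model: Callable[[str], str], tasks: List[Task]) -> float:
    """Fraction of tasks whose output passes that task's own check."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = sum(1 for task in tasks if task.check(model(task.prompt)))
    return passed / len(tasks)


def compare(models: Dict[str, Callable[[str], str]],
            tasks: List[Task]) -> Dict[str, float]:
    """Run every model over the same tasks and report pass rates by name."""
    return {name: pass_rate(fn, tasks) for name, fn in models.items()}


if __name__ == "__main__":
    # Toy stand-ins; real tasks would come from your production backlog.
    tasks = [
        Task("2+2", lambda out: out.strip() == "4"),
        Task("capital of France", lambda out: "Paris" in out),
    ]
    answers = {"2+2": "4", "capital of France": "Paris"}
    models = {
        "candidate": lambda prompt: answers.get(prompt, ""),
        "baseline": lambda prompt: "4",
    }
    print(compare(models, tasks))
```

Keeping the check function on each task means the pass criterion is your own (tests pass, diff applies, reviewer accepts) rather than the vendor’s.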

Model	Input ($/M tokens)	Output ($/M tokens)	Max Context
MiniMax M3	$0.60 ($0.30 promo)	$2.40 ($1.20 promo)	1M tokens
Gemini 3.5 Flash	$1.50	$9.00	128K tokens
Claude Sonnet 4.6	~$3.00	~$15.00	200K tokens
GPT-5.5	significantly higher	significantly higher	256K tokens
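The per-million prices in the table translate into per-request cost with simple arithmetic. The sketch below copies the table’s figures. The Claude Sonnet 4.6 prices are listed there as approximate, GPT-5.5 is left out because the table gives no number for it, and the 100K-input, 5K-output workload is an arbitrary example chosen to fit every listed context window.

```python
# (input $/M tokens, output $/M tokens), copied from the table above.
PRICES = {
    "MiniMax M3": (0.60, 2.40),
    "MiniMax M3 (promo)": (0.30, 1.20),
    "Gemini 3.5 Flash": (1.50, 9.00),
    "Claude Sonnet 4.6": (3.00, 15.00),  # approximate in the table
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one request at the listed per-million-token prices."""
    in_price, out_price = PRICES[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 100K tokens in, 5K tokens out.
    for name in PRICES:
        print(f"{name:<20} ${request_cost(name, 100_000, 5_000):.3f}")
```

Because output tokens cost several times more than input tokens for every model listed, the ratio between models shifts with how much a workload generates relative to what it reads.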

URL: https://wowhow.cloud/blogs/minimax-m3-open-weight-developer-guide-msa-architecture-june-2026
