
Microsoft MAI-Thinking-1 is the company's highest-performing reasoning model to date. It scores 97.0% on AIME 2025 and matches Claude Opus 4.8 on SWE-Bench Pro. The catch: it is not open-weight. Access is through Microsoft Foundry (private preview as of June 2026; broader access has not been officially confirmed). There are no downloadable weights and no HuggingFace release.

This guide covers what MAI-Thinking-1 actually is, how to access it via the Foundry API, and which open-weight reasoning models you can self-host on GPU cloud when you need full inference control. For context on what reasoning tokens actually cost at scale, see the reasoning model inference cost guide.

What Is Microsoft MAI-Thinking-1

MAI-Thinking-1 is Microsoft's first reasoning model trained entirely in-house. Unlike earlier Microsoft open releases (Phi series, MAI-DS-R1, which used distillation), this model was trained end-to-end without borrowing outputs from third-party models. That gives it a clean training lineage for organizations with strict enterprise licensing requirements.

Architecture: sparse Mixture-of-Experts (MoE). The model has 35B active parameters: MoE routing sends each token through only a subset of experts, so only about 35B parameters take part in each forward pass. The total parameter count, including every expert's weights, is approximately 1T, and all of those weights must reside in VRAM because any expert may be selected for the next token. At FP8 (1 byte per parameter), that is roughly 1 TB of weight storage before KV cache or framework overhead. This is why Microsoft distributes MAI-Thinking-1 as an API service rather than open weights.
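To make the active-versus-total distinction concrete, here is a minimal sketch of the counting behind a sparse MoE model. The function name and every size in it are hypothetical toy values chosen for readability; Microsoft has not published MAI-Thinking-1's expert count, expert size, or routing top-k.

```python
# Toy illustration of active vs. total parameters in a sparse MoE model.
# All sizes are hypothetical; they are NOT MAI-Thinking-1's real configuration.

def moe_param_counts(shared_b, num_experts, expert_b, top_k):
    """Return (active, total) parameter counts in billions.

    shared_b: parameters every token uses (attention, embeddings, etc.)
    expert_b: parameters in one expert
    top_k:    number of experts the router selects per token
    """
    total = shared_b + num_experts * expert_b   # must all sit in memory
    active = shared_b + top_k * expert_b        # actually computed per token
    return active, total


active, total = moe_param_counts(shared_b=2.0, num_experts=8, expert_b=1.0, top_k=2)
print(f"active per token: {active}B, resident in memory: {total}B")
```

Compute cost tracks the active count, while memory sizing tracks the total count, which is the figure that drives GPU selection.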

Key specs:

Spec	Value
Architecture	Sparse MoE
Active Parameters	35B per token
Total Parameters	~1T
Context Window	256K tokens
Reasoning Mode	Extended chain-of-thought
Availability	Microsoft Foundry (private preview); broader access not yet officially confirmed
Fine-Tuning	Not announced as of Build 2026
Announced	Build 2026, June 2, 2026

For a comparable open-model deployment walkthrough, see deploying GPT-OSS on GPU cloud.

Benchmark Results: MAI-Thinking-1 vs Open Reasoning Models

Published scores from Microsoft's Build 2026 announcement:

Benchmark	MAI-Thinking-1	Nemotron Ultra 253B	Notes
AIME 2025	97.0%	72.5%	Microsoft official, Build 2026
AIME 2026	94.5%	Not published	Microsoft official, Build 2026
SWE-Bench Pro	Matches Claude Opus 4.8	Not published	Microsoft official comparison
GPQA Diamond	Not yet published	76.01%	NVIDIA official
MATH-500	Not yet published	97.0%	NVIDIA official

On pure math reasoning, the gap is large: MAI-Thinking-1 scores 97.0% on AIME 2025 against 72.5% for Nemotron Ultra 253B, a difference of 24.5 percentage points. For teams that can accept API latency and do not require on-premises deployment, MAI-Thinking-1 via Foundry is significantly stronger on reasoning tasks.

For a broader comparison across the reasoning model landscape, see DeepSeek vs Llama 4 vs Qwen 3.

Why MAI-Thinking-1 Is Not Self-Hostable

The ~1T total parameter count is the core constraint. At FP8, weights alone require approximately 1 TB of VRAM. With 15% framework overhead, the minimum before KV cache is roughly 1.15 TB.

To put that in context:

Model	Total Params	FP8 Weight VRAM	Min Single-Node Config
MAI-Thinking-1	~1T	~1,000 GB	Beyond single 8x H200 node
DeepSeek R2	~685B (unconfirmed)	~685 GB (est.)	4x H200 SXM5 (564 GB) at FP8 + selective INT4 quantization
Nemotron Ultra 253B	253B (dense)	~253 GB	2x H200 SXM5 at FP8

At ~1T parameters, MAI-Thinking-1's FP8 weights (~1,000 GB) plus 15% overhead (~1,150 GB) exceed the 1,128 GB of total VRAM on a single 8x H200 SXM5 node, so it could not be served without a multi-node NVLink fabric. Microsoft has not announced a timeline for open-weight release. If that changes, the self-hosting math would need to be redone from the actual confirmed parameter count and architecture details.
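The arithmetic in this section can be sketched in a few lines. This is a rough estimate under the article's own assumptions (1 byte per parameter at FP8, 15% framework overhead, KV cache excluded); the function names are illustrative, and the 141 GB per-GPU figure is derived from the 1,128 GB quoted for an 8x H200 SXM5 node.

```python
# Rough weight-footprint estimate; KV cache is deliberately excluded.
H200_GB = 141  # per-GPU VRAM implied by 8x H200 SXM5 = 1,128 GB

def serving_footprint_gb(total_params_b, bytes_per_param=1.0, overhead=0.15):
    """Weights plus framework overhead, in GB (1B params at 1 byte = 1 GB)."""
    weights_gb = total_params_b * bytes_per_param
    return weights_gb * (1 + overhead)

def fits_on_node(footprint_gb, gpus, gpu_gb=H200_GB):
    """True if the footprint fits in the node's combined VRAM."""
    return footprint_gb <= gpus * gpu_gb


mai = serving_footprint_gb(1000)  # ~1T parameters at FP8
print(f"MAI-Thinking-1 estimate: {mai:.0f} GB")
print("fits on 8x H200 SXM5:", fits_on_node(mai, gpus=8))
```

The same functions can be rerun with a confirmed parameter count if Microsoft ever publishes one.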

For the theory behind why MoE total parameter count drives GPU selection, see the MoE inference optimization guide.

Accessing MAI-Thinking-1 via Microsoft Foundry

Step 1: Request Access

Go to microsoft.ai/models/mai-thinking-1 and apply for private preview access through Microsoft Foundry. Access is currently limited to the private preview, and Microsoft has not officially confirmed broader availability since Build 2026. Enterprise access through Azure AI is expected to follow any future public release.

Step 2: Call the API

MAI-Thinking-1 is served via a Chat Completions-compatible endpoint. Once you have Foundry credentials:

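```python
import openai

# Point the OpenAI client at your Foundry endpoint (placeholders below).
client = openai.OpenAI(
    base_url="<your-microsoft-foundry-endpoint>",
    api_key="<your-foundry-api-key>",
)

response = client.chat.completions.create(
    model="mai-thinking-1",
    messages=[
        {
            "role": "user",
            "content": "Solve: A train travels 120 miles at 60 mph, then 80 miles at 40 mph. What is the average speed for the entire trip?",
        }
    ],
    max_tokens=4096,   # reasoning traces are long; raise this if answers are cut off
    temperature=0.1,   # low temperature for math and coding tasks
)

# Print the text of the first choice.
print(response.choices[0].message.content)
```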

The model generates an extended reasoning trace before the final answer. If responses do not include thinking tokens, check the model card for any required system prompt or template format once it becomes available.
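Because the reasoning trace counts against max_tokens, a long trace can use up the budget before the final answer appears. The sketch below checks for that using the standard Chat Completions fields finish_reason and usage.completion_tokens. It assumes the Foundry endpoint populates those fields the way other Chat Completions-compatible services do, which has not been confirmed for MAI-Thinking-1; the helper names and the stand-in response object are illustrative.

```python
from types import SimpleNamespace

def was_truncated(response):
    """True if the first choice stopped because it hit the max_tokens limit."""
    return response.choices[0].finish_reason == "length"

def output_tokens(response):
    """Completion tokens billed for this response (trace plus answer)."""
    return response.usage.completion_tokens


# Stand-in object with the same shape as a Chat Completions response,
# so the helpers can be exercised without network access.
fake = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="length")],
    usage=SimpleNamespace(completion_tokens=4096),
)

if was_truncated(fake):
    print(f"Hit the token limit after {output_tokens(fake)} tokens; raise max_tokens and retry.")
```

With a real client, pass the object returned by client.chat.completions.create directly to both helpers.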

What About Fine-Tuning?

As of Build 2026, Microsoft has not announced fine-tuning access to MAI-Thinking-1 weights. Access is inference-only via the Foundry API. If fine-tuning support is added, Microsoft will announce it through official Foundry documentation.

Self-Hostable Open Reasoning Alternatives

For teams that need full inference control or on-premises deployment, or that have data residency requirements, the current top open-weight reasoning models are DeepSeek R2 and Nemotron Ultra 253B.

DeepSeek R2

DeepSeek R2 is a large sparse MoE model expected to have open weights available for self-hosting. At FP8 with selective INT4 quantization on some layers, it fits on a 4x H200 SXM5 node (564 GB combined VRAM). It is the closest open alternative to MAI-Thinking-1 in terms of reasoning tier.

For the full deployment setup, see the DeepSeek R2 deployment guide.

Quick start for 4x H200 SXM5:

```bash
vllm serve /models/deepseek-r2 \
  --dtype auto \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.90 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000
```

Important: --enable-expert-parallel requires 2+ GPUs. A single-GPU setup only applies to models small enough to fit on one GPU; in that case, drop --tensor-parallel-size to 1 and also remove --enable-expert-parallel, because running expert parallelism on a single GPU causes an error. For the full vLLM setup on Spheron, including tensor parallel configuration options, see the vLLM server guide.
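The GPU-count rule above is easy to encode if you generate launch commands from a script. This is a minimal sketch: parallel_flags is a hypothetical helper, not part of vLLM, and it only assembles the two parallelism flags discussed here.

```python
def parallel_flags(num_gpus):
    """Build the vLLM parallelism flags for a given GPU count.

    Expert parallelism is only added when there are 2+ GPUs,
    following the rule described in the text.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    flags = ["--tensor-parallel-size", str(num_gpus)]
    if num_gpus >= 2:
        flags.append("--enable-expert-parallel")
    return flags


print(" ".join(parallel_flags(4)))
print(" ".join(parallel_flags(1)))
```

The returned list can be appended to the rest of the vllm serve arguments before passing them to subprocess.run.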

Nemotron Ultra 253B

NVIDIA's Nemotron Ultra 253B is a dense transformer (not MoE) with 253B parameters. Scores: 76.01% GPQA Diamond, 72.5% AIME 2025. As a dense model, it does not have MoE routing complexity and is simpler to serve.

At FP8, 253B parameters require about 253 GB of VRAM. A 2x H200 SXM5 node (282 GB) fits the weights with some KV cache headroom for practical context lengths.

For the full setup, see the Nemotron Ultra 253B deployment guide.

GPU Hardware and Pricing for Open Alternatives

Pricing fetched from the Spheron API on 16 Jun 2026.

H200 SXM5 for DeepSeek R2

DeepSeek R2 needs a minimum of 4x H200 SXM5 at FP8 with selective INT4 quantization. On-demand pricing is $4.96/hr per GPU:

Config	On-Demand/hr	Spot/hr	Spot Savings
1x H200 SXM5	$4.96	$3.31	~33%
2x H200 SXM5	$9.92	$6.62	~33%
4x H200 SXM5	$19.84	$13.24	~33%

H200 SXM5 configurations on Spheron include NVLink by default, which reduces per-layer all-reduce latency compared to a PCIe interconnect. This matters for tensor parallel efficiency at TP=4.

B200 SXM6 for Smaller Open Models

Nemotron Ultra 253B at FP8 fits on 2x B200 SXM6 (384 GB combined) with good KV cache budget. A single B200 SXM6 (192 GB) is only viable via MXFP4 (Blackwell FP4, ~130 GB), not FP8, since 253B at FP8 requires ~253 GB of weight storage alone.

Config	On-Demand/hr	Spot/hr	Spot Savings
1x B200 SXM6	$9.30	$2.74	~71%
2x B200 SXM6	$18.60	$5.48	~71%
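The Spot Savings column in both pricing tables follows from the per-GPU rates. A small sketch of that calculation, using the 16 Jun 2026 prices quoted above (the function name is illustrative):

```python
def spot_savings_pct(on_demand_hr, spot_hr):
    """Percentage saved by running on spot instead of on-demand."""
    return (1 - spot_hr / on_demand_hr) * 100


# Per-GPU rates from the tables in this section.
for gpu, on_demand, spot in [("H200 SXM5", 4.96, 3.31), ("B200 SXM6", 9.30, 2.74)]:
    print(f"{gpu}: {spot_savings_pct(on_demand, spot):.0f}% cheaper on spot")
```

Because the multi-GPU rows are straight multiples of the per-GPU rate, the percentage is the same for every row of a given GPU type.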

MXFP4 stores 0.5 bytes per parameter and is available only on Blackwell GPUs. It halves the FP8 weight footprint, bringing Nemotron Ultra 253B to ~130 GB, which fits on a single B200 SXM6.

H100 SXM5 for Cost-Optimized Deployment

At $3.98/hr per GPU, 4x H100 SXM5 ($15.92/hr combined) is a cost-efficient option for Nemotron Ultra 253B at FP8 (320 GB combined VRAM covers 253B FP8 weights with headroom). DeepSeek R2 at FP8 with selective INT4 quantization needs 8x H100 SXM5 minimum.
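The GPU counts quoted in this section can be sanity-checked with a ceiling division over per-GPU VRAM. The sketch below counts weights only, so treat the result as a floor: KV cache and framework overhead need headroom on top. Per-GPU capacities are the ones implied by the combined figures in this article (80 GB H100, 141 GB H200, 192 GB B200); the function name is illustrative.

```python
import math

# Per-GPU VRAM in GB, as implied by the combined figures in this article.
GPU_VRAM_GB = {"H100 SXM5": 80, "H200 SXM5": 141, "B200 SXM6": 192}

def min_gpus_for_weights(weight_gb, gpu):
    """Smallest GPU count whose combined VRAM holds the weights alone."""
    return math.ceil(weight_gb / GPU_VRAM_GB[gpu])


nemotron_fp8 = 253.0          # 253B params at 1 byte each
nemotron_mxfp4 = 253.0 * 0.5  # 0.5 bytes per parameter

for gpu in GPU_VRAM_GB:
    print(f"Nemotron Ultra 253B FP8 on {gpu}: {min_gpus_for_weights(nemotron_fp8, gpu)} GPUs")
print(f"Nemotron Ultra 253B MXFP4 on B200 SXM6: {min_gpus_for_weights(nemotron_mxfp4, 'B200 SXM6')} GPU")
```

Running the same check on an unconfirmed ~685 GB FP8 estimate for DeepSeek R2 gives 5 H200s or 9 H100s, which is why the 4x H200 and 8x H100 configurations depend on selective INT4 quantization.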

Pricing fluctuates based on GPU availability. The prices above are as of 16 Jun 2026 and may have changed. Check current GPU pricing for live rates.

Cost Comparison: MAI-Thinking-1 API vs Self-Hosting an Open Alternative

This is a trade-off between capability and control. MAI-Thinking-1 at 97% AIME 2025 leads on reasoning accuracy, but it is API-only. The open alternatives score lower on benchmarks but offer full control.

Option	Platform	Cost/hr	Cost/month (24/7)
MAI-Thinking-1 inference	Microsoft Foundry	Per-token (pricing TBD)	N/A
DeepSeek R2	Spheron, 4x H200 on-demand	$19.84	~$14,285
DeepSeek R2	Spheron, 4x H200 spot	$13.24	~$9,533
Nemotron Ultra 253B	Spheron, 2x H200 on-demand	$9.92	~$7,142
Nemotron Ultra 253B (MXFP4)	Spheron, 1x B200 spot	$2.74	~$1,973
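The monthly figures in the table are the hourly rate multiplied by a 30-day month of continuous use (720 hours). A short sketch that reproduces them (the names are illustrative):

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month, running 24/7

def monthly_cost(hourly_rate):
    """Cost of running one configuration continuously for a month."""
    return hourly_rate * HOURS_PER_MONTH


configs = {
    "DeepSeek R2, 4x H200 on-demand": 19.84,
    "DeepSeek R2, 4x H200 spot": 13.24,
    "Nemotron Ultra 253B, 2x H200 on-demand": 9.92,
    "Nemotron Ultra 253B (MXFP4), 1x B200 spot": 2.74,
}
for name, rate in configs.items():
    print(f"{name}: ${monthly_cost(rate):,.0f}/month")
```

Workloads that do not run around the clock scale down proportionally, since billing is per minute.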

For sustained high-volume inference where DeepSeek R2 or Nemotron Ultra benchmark scores are sufficient, self-hosting on Spheron typically beats per-token API pricing once sustained output reaches roughly 100-300 million tokens per month. The exact break-even depends on Foundry pricing, which Microsoft has not yet published.
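Once a per-token price exists, the break-even volume is the fixed monthly GPU cost divided by the API price per token. The sketch below shows the calculation only. The $50 per million output tokens used in the example is a made-up placeholder, not a Microsoft price, and the function name is illustrative.

```python
def break_even_tokens_per_month(monthly_gpu_cost, api_price_per_million):
    """Monthly token volume at which self-hosting costs the same as the API."""
    if api_price_per_million <= 0:
        raise ValueError("api_price_per_million must be positive")
    return monthly_gpu_cost / api_price_per_million * 1_000_000


# PLACEHOLDER price: Foundry pricing for MAI-Thinking-1 has not been published.
hypothetical_price = 50.0   # dollars per 1M output tokens (assumption)
gpu_cost = 14_285           # 4x H200 on-demand, 24/7, from the table above

tokens = break_even_tokens_per_month(gpu_cost, hypothetical_price)
print(f"Break-even at about {tokens / 1e6:.0f}M tokens/month")
```

This ignores the throughput ceiling of the self-hosted node: a comparison is only meaningful if the GPUs can actually generate the break-even volume within the month.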

When to Use MAI-Thinking-1 API vs Self-Host an Open Alternative

Scenario	Best Choice	Why
Maximum math/reasoning accuracy	MAI-Thinking-1 via Foundry	97% AIME 2025, top-tier reasoning
On-premises or data residency requirements	DeepSeek R2 or Nemotron Ultra	Open weights, full control
Tightest budget, batch workloads	Nemotron Ultra 253B on spot B200	$2.74/hr, fits on single B200 via MXFP4 (Blackwell FP4)
High-volume sustained inference	DeepSeek R2 on H200 spot	Open weights, lower per-token cost at scale
Clean enterprise licensing, self-hosted	Nemotron Ultra 253B	NVIDIA origin, no distillation concerns, open weights
Cost-efficient 256K context serving	DeepSeek R2 on 4x H200	Full 256K at batch-4 on 564 GB combined VRAM

MAI-Thinking-1 sets a new bar for reasoning model accuracy at 97.0% AIME 2025, but it runs as an API service for now. For self-hosted inference with open weights, H200 SXM5 instances on Spheron support both DeepSeek R2 and Nemotron Ultra 253B with per-minute billing and no contracts.


Quick Setup Guide

Request access to MAI-Thinking-1 via Microsoft Foundry
Visit microsoft.ai/models/mai-thinking-1 and apply for private preview access through Microsoft Foundry. Once approved, you receive API credentials for the Chat Completions-compatible endpoint. Access remains in private preview as of June 2026; broader availability has not been officially confirmed.
Call the MAI-Thinking-1 API via Chat Completions
MAI-Thinking-1 is served via a Chat Completions-compatible API endpoint. Configure your base URL to the Microsoft Foundry endpoint and set your API key. Send requests with the standard messages array format. Use a low temperature such as 0.1 for more consistent output on math and coding tasks; this reduces sampling variation but does not make output fully deterministic. The model generates an extended reasoning trace before its final answer.
Provision Spheron H200 SXM5 for an open reasoning model alternative
Log in to app.spheron.ai and select H200 SXM5 or H100 SXM5. For DeepSeek R2 (roughly R1/V3-class, ~685B total parameters, unconfirmed), provision 4x H200 SXM5 minimum (564 GB combined VRAM). For Nemotron Ultra 253B (dense model, 253B params), 2x H200 SXM5 or 4x H100 SXM5 at FP8 covers the weights with KV cache headroom. Provision at least 500 GB of persistent storage for model weights.
Deploy an open reasoning model with vLLM expert-parallel configuration
For DeepSeek R2 on 4x H200: vllm serve /models/deepseek-r2 --dtype auto --quantization compressed-tensors --tensor-parallel-size 4 --max-model-len 65536 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --enable-expert-parallel --trust-remote-code --port 8000. The --quantization compressed-tensors flag enables the mixed FP8/INT4 layer quantization required to fit the ~685B model within 564 GB of combined VRAM. For 2-GPU configurations, drop --tensor-parallel-size to 2. For single-GPU setups, drop it to 1 and also remove --enable-expert-parallel entirely, since expert parallelism requires 2+ GPUs and will error on a single GPU.
Test reasoning output and validate chain-of-thought generation
Send a multi-step math or logic problem to /v1/chat/completions. Verify the model generates an extended reasoning trace before the final answer. Check GPU utilization with nvidia-smi dmon -s u. Monitor vLLM's /metrics endpoint for vllm:gpu_cache_usage_perc and alert above 85% to prevent KV eviction cascades.
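The KV cache alert in the last step can be sketched as a small parser over the Prometheus text that vLLM serves at /metrics. This is a minimal sketch under two assumptions: the gauge is reported as a fraction between 0 and 1 (so 85% corresponds to 0.85), and metric lines carry no trailing timestamp. The helper names and the sample text are illustrative; in practice you would fetch the text from your own server.

```python
def kv_cache_usage(metrics_text, metric="vllm:gpu_cache_usage_perc"):
    """Return the highest value reported for the metric, or None if absent."""
    values = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name == metric:
            values.append(float(line.rsplit(" ", 1)[1]))
    return max(values) if values else None

def should_alert(usage, threshold=0.85):
    """True when KV cache usage is above the alert threshold."""
    return usage is not None and usage > threshold


# Sample text in Prometheus exposition format (values are made up).
sample = """\
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="/models/deepseek-r2"} 0.91
"""

usage = kv_cache_usage(sample)
print(f"KV cache usage: {usage:.0%}, alert: {should_alert(usage)}")
```

For production monitoring, the same threshold is usually expressed as a Prometheus alerting rule rather than a polling script.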


Frequently Asked Questions

Not currently. As of June 2026, MAI-Thinking-1 is available only via Microsoft Foundry (private preview). There are no open weights and no HuggingFace release. Broader access has not been officially confirmed. For self-hosted inference, DeepSeek R2 and Nemotron Ultra 253B are the closest open-weight alternatives, though both trail MAI-Thinking-1 on published reasoning benchmarks.

MAI-Thinking-1 has 35B active parameters but approximately 1T total parameters as a sparse MoE model. At FP8, model weights alone require about 1 TB of VRAM, putting it firmly in multi-node territory. This is why Microsoft distributes it as an API service rather than open weights. Comparable open alternatives like DeepSeek R2 (roughly R1/V3-class, ~685B total, unconfirmed) need 4x H200 SXM5 or 8x H100 SXM5 minimum, at FP8 with selective INT4 quantization on some layers.

MAI-Thinking-1 scores 97.0% on AIME 2025 and 94.5% on AIME 2026, placing it at the top of public reasoning benchmarks. Nemotron Ultra 253B scores 72.5% on AIME 2025. The gap is substantial on pure math. MAI-Thinking-1 is stronger on reasoning benchmarks, but it is API-only. DeepSeek R2 and Nemotron Ultra are self-hostable with open weights.

Self-hosting DeepSeek R2 on 4x H200 SXM5 at Spheron on-demand costs $19.84/hr (~$14,285/month at 24/7 usage). Nemotron Ultra 253B on 2x H200 SXM5 costs $9.92/hr (~$7,142/month). Microsoft has not yet published per-token pricing for MAI-Thinking-1 Foundry access. At high sustained volumes, on the order of hundreds of millions of tokens per month, self-hosting an open alternative typically beats per-token API pricing.

DeepSeek R2 is the closest open-weight alternative in the top-tier reasoning space, with open weights expected to be available for self-hosting. Nemotron Ultra 253B by NVIDIA is the top open dense model. Both run on Spheron H200 SXM5 instances with NVLink for multi-GPU inference.

URL: https://www.spheron.network/blog/deploy-mai-thinking-1-gpu-cloud/

Microsoft MAI-Thinking-1: API Access and Open Self-Host Alternatives | Spheron Blog