Microsoft MAI-Thinking-1 is the company's highest-performing reasoning model to date. It scores 97.0% on AIME 2025 and matches Claude Opus 4.8 on SWE-Bench Pro. The catch: it is not open-weight. Access is through Microsoft Foundry (private preview as of June 2026; broader access has not been officially confirmed). There are no downloadable weights and no HuggingFace release.
This guide covers what MAI-Thinking-1 actually is, how to access it via the Foundry API, and which open-weight reasoning models you can self-host on GPU cloud when you need full inference control. For context on what reasoning tokens actually cost at scale, see the reasoning model inference cost guide.
What Is Microsoft MAI-Thinking-1
MAI-Thinking-1 is Microsoft's first reasoning model trained entirely in-house. Unlike earlier Microsoft open releases (Phi series, MAI-DS-R1, which used distillation), this model was trained end-to-end without borrowing outputs from third-party models. That gives it a clean training lineage for organizations with strict enterprise licensing requirements.
Architecture: sparse Mixture-of-Experts. The 35B active parameter figure reflects how MoE routing works: each token activates only a subset of experts, so only 35B parameters compute per forward pass. The total parameter count, including all expert weights that must reside in VRAM, is approximately 1T. At FP8 (1 byte per parameter), that is roughly 1 TB of weight storage before KV cache or framework overhead. This is why Microsoft distributes MAI-Thinking-1 as an API service rather than open weights.
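The weight-storage arithmetic above can be sketched in a few lines. This is a back-of-envelope estimate rather than a measurement: it assumes the approximate 1T total and 35B active figures quoted in this guide, uses decimal gigabytes, and `weight_gb` is an illustrative helper name, not part of any library.

```python
# Back-of-envelope weight storage for a sparse MoE model.
# Assumption: parameter counts are the approximate figures quoted in this guide.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "fp4": 0.5}


def weight_gb(params: float, precision: str) -> float:
    """Return weight storage in decimal GB (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[precision] / 1e9


TOTAL_PARAMS = 1e12   # ~1T total: every expert must be resident in VRAM
ACTIVE_PARAMS = 35e9  # 35B active: the subset that computes for each token

print(weight_gb(TOTAL_PARAMS, "fp8"))   # memory is sized by the total count
print(weight_gb(ACTIVE_PARAMS, "fp8"))  # compute is sized by the active count
```

The point of the two calls is the asymmetry: routing reduces compute per token, but it does not reduce how much memory the weights occupy.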
Key specs:
| Spec | Value |
|---|---|
| Architecture | Sparse MoE |
| Active Parameters | 35B per token |
| Total Parameters | ~1T |
| Context Window | 256K tokens |
| Reasoning Mode | Extended chain-of-thought |
| Availability | Microsoft Foundry (private preview); broader access not yet officially confirmed |
| Fine-Tuning | Not announced as of Build 2026 |
| Announced | Build 2026, June 2, 2026 |
For a comparable open-model deployment walkthrough, see deploying GPT-OSS on GPU cloud.
Benchmark Results: MAI-Thinking-1 vs Open Reasoning Models
Published scores from Microsoft's Build 2026 announcement:
| Benchmark | MAI-Thinking-1 | Nemotron Ultra 253B | Notes |
|---|---|---|---|
| AIME 2025 | 97.0% | 72.5% | Microsoft official, Build 2026 |
| AIME 2026 | 94.5% | Not published | Microsoft official, Build 2026 |
| SWE-Bench Pro | Matches Claude Opus 4.8 | Not published | Microsoft official comparison |
| GPQA Diamond | Not yet published | 76.01% | NVIDIA official |
| MATH-500 | Not yet published | 97.0% | NVIDIA official |
The gap between MAI-Thinking-1 (97.0% on AIME 2025) and the strongest self-hostable model in this comparison (72.5% for Nemotron Ultra) is 24.5 percentage points on pure math reasoning. For teams that can accept API latency and do not require on-premises deployment, MAI-Thinking-1 via Foundry is the stronger choice for math-heavy reasoning tasks.
For a broader comparison across the reasoning model landscape, see DeepSeek vs Llama 4 vs Qwen 3.
Why MAI-Thinking-1 Is Not Self-Hostable
The ~1T total parameter count is the core constraint. At FP8, weights alone require approximately 1 TB of VRAM. With 15% framework overhead, the minimum before KV cache is roughly 1.15 TB.
To put that in context:
| Model | Total Params | FP8 Weight VRAM | Min Single-Node Config |
|---|---|---|---|
| MAI-Thinking-1 | ~1T | ~1,000 GB | Beyond single 8x H200 node |
| DeepSeek R2 | ~685B (unconfirmed) | ~685 GB (est.) | 4x H200 SXM5 (564 GB) at FP8 + selective INT4 quantization |
| Nemotron Ultra 253B | 253B (dense) | ~253 GB | 2x H200 SXM5 at FP8 |
At ~1T parameters, MAI-Thinking-1 needs roughly 1,150 GB for FP8 weights plus framework overhead. That exceeds the 1,128 GB of total VRAM on a single 8x H200 SXM5 node before any KV cache is allocated, so serving it would require a multi-node deployment. Microsoft has not announced a timeline for open-weight release. If that changes, the self-hosting math would need to be redone from the confirmed parameter count and architecture details.
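The single-node comparison can be expressed as a rough fit check. This sketch assumes FP8 weights at 1 byte per parameter, the 15% framework overhead used above, and 141 GB per H200 SXM5 (the per-GPU figure implied by the 564 GB and 1,128 GB totals in this guide). `fits_on_node` is a hypothetical helper, and it deliberately ignores KV cache.

```python
# Rough single-node fit check, before any KV cache is allocated.
# Assumptions: FP8 weights, 15% framework overhead, 141 GB per H200 SXM5.
H200_GB = 141
OVERHEAD = 0.15


def fits_on_node(weight_gb: float, gpus: int, gpu_gb: float = H200_GB) -> bool:
    """True if weights plus overhead fit in the node's combined VRAM."""
    return weight_gb * (1 + OVERHEAD) <= gpus * gpu_gb


print(fits_on_node(1000, gpus=8))  # MAI-Thinking-1: ~1,150 GB needed vs 1,128 GB
print(fits_on_node(685, gpus=4))   # DeepSeek R2 at pure FP8 vs 564 GB
print(fits_on_node(685, gpus=8))   # the same weights on a full 8-GPU node
```

The second call shows why the 4x H200 configuration in the table depends on selective INT4 quantization: at pure FP8, the estimated ~685 GB of weights does not fit in 564 GB.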
For the theory behind why MoE total parameter count drives GPU selection, see the MoE inference optimization guide.
Accessing MAI-Thinking-1 via Microsoft Foundry
Step 1: Request Access
Go to microsoft.ai/models/mai-thinking-1 and apply for private preview access through Microsoft Foundry. The model is currently in private preview, and broader availability has not been officially confirmed. Enterprise access through Azure AI is expected to follow any future public release.
Step 2: Call the API
MAI-Thinking-1 is served via a Chat Completions-compatible endpoint. Once you have Foundry credentials:
```python
import openai

# Point the OpenAI SDK at your Foundry endpoint; both values come from your Foundry account.
client = openai.OpenAI(
    base_url="<your-microsoft-foundry-endpoint>",
    api_key="<your-foundry-api-key>",
)

response = client.chat.completions.create(
    model="mai-thinking-1",
    messages=[
        {
            "role": "user",
            "content": "Solve: A train travels 120 miles at 60 mph, then 80 miles at 40 mph. What is the average speed for the entire trip?",
        }
    ],
    max_tokens=4096,  # room for the reasoning trace plus the final answer
    temperature=0.1,  # low temperature for more repeatable output
)

print(response.choices[0].message.content)
```

The model generates an extended reasoning trace before the final answer. If responses do not include thinking tokens, check the model card for any required system prompt or template format once it becomes available.
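When validating responses, it helps to know the expected answer for the example prompt. A quick check that is independent of the model, using only the numbers in the prompt:

```python
# Expected answer for the example prompt:
# average speed = total distance / total time.
legs = [(120, 60), (80, 40)]  # (miles, mph) for each leg of the trip

total_miles = sum(miles for miles, _ in legs)
total_hours = sum(miles / mph for miles, mph in legs)
average_mph = total_miles / total_hours

print(average_mph)  # 50.0
```

Each leg takes two hours, so the trip covers 200 miles in 4 hours. A final answer other than 50 mph indicates a reasoning or formatting problem worth investigating.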
What About Fine-Tuning?
As of Build 2026, Microsoft has not announced fine-tuning access to MAI-Thinking-1 weights. Access is inference-only via the Foundry API. If fine-tuning support is added, Microsoft will announce it through official Foundry documentation.
Self-Hostable Open Reasoning Alternatives
For teams that need full inference control or on-premises deployment, or that have data residency requirements, the current top open-weight reasoning models are DeepSeek R2 and Nemotron Ultra 253B.
DeepSeek R2
DeepSeek R2 is a large sparse MoE model expected to have open weights available for self-hosting. At FP8 with selective INT4 quantization on some layers, it fits on a 4x H200 SXM5 node (564 GB combined VRAM). It is the closest open alternative to MAI-Thinking-1 in terms of reasoning tier.
For the full deployment setup, see the DeepSeek R2 deployment guide.
Quick start for 4x H200 SXM5:
```bash
# --dtype auto takes the compute dtype from the checkpoint config;
# the FP8/INT4 layer quantization is selected by --quantization.
vllm serve /models/deepseek-r2 \
  --dtype auto \
  --quantization compressed-tensors \
  --tensor-parallel-size 4 \
  --max-model-len 65536 \
  --kv-cache-dtype fp8_e5m2 \
  --gpu-memory-utilization 0.90 \
  --enable-expert-parallel \
  --trust-remote-code \
  --port 8000
```

Important: --enable-expert-parallel requires 2+ GPUs. For a single-GPU setup, drop --tensor-parallel-size to 1 and also remove --enable-expert-parallel. Running expert parallelism on a single GPU causes an error. For the full vLLM setup on Spheron, including tensor parallel configuration options, see the vLLM server guide.
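The GPU-count rule above can be encoded so that a launch script never passes the expert-parallel flag on a single GPU. This is a minimal sketch: `build_vllm_args` is a hypothetical helper, and only the flags relevant to the rule are included (the dtype, quantization, and KV cache flags from the quick start are omitted for brevity).

```python
# Sketch of the GPU-count rule described above.
# build_vllm_args is a hypothetical helper, not part of vLLM.
def build_vllm_args(model_path: str, num_gpus: int) -> list[str]:
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    args = [
        "vllm", "serve", model_path,
        "--tensor-parallel-size", str(num_gpus),
        "--max-model-len", "65536",
        "--gpu-memory-utilization", "0.90",
        "--port", "8000",
    ]
    # Expert parallelism needs 2+ GPUs, so the flag is added conditionally.
    if num_gpus >= 2:
        args.append("--enable-expert-parallel")
    return args


print(" ".join(build_vllm_args("/models/deepseek-r2", 4)))
print(" ".join(build_vllm_args("/models/deepseek-r2", 1)))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues when the model path contains spaces.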
Nemotron Ultra 253B
NVIDIA's Nemotron Ultra 253B is a dense transformer (not MoE) with 253B parameters. Scores: 76.01% GPQA Diamond, 72.5% AIME 2025. As a dense model, it does not have MoE routing complexity and is simpler to serve.
At FP8, 253B parameters require about 253 GB of VRAM. A 2x H200 SXM5 node (282 GB) fits the weights and leaves roughly 29 GB of headroom for KV cache and runtime overhead.
For the full setup, see the Nemotron Ultra 253B deployment guide.
GPU Hardware and Pricing for Open Alternatives
Pricing fetched from the Spheron API on 16 Jun 2026.
H200 SXM5 for DeepSeek R2
DeepSeek R2 needs a minimum of 4x H200 SXM5, running FP8 with selective INT4 quantization. On-demand pricing is $4.96/hr per GPU:
| Config | On-Demand/hr | Spot/hr | Spot Savings |
|---|---|---|---|
| 1x H200 SXM5 | $4.96 | $3.31 | ~33% |
| 2x H200 SXM5 | $9.92 | $6.62 | ~33% |
| 4x H200 SXM5 | $19.84 | $13.24 | ~33% |
H200 SXM5 configurations on Spheron include NVLink by default, which reduces per-layer all-reduce latency compared to a PCIe interconnect. This matters for tensor parallel efficiency at TP=4.
B200 SXM6 for Smaller Open Models
Nemotron Ultra 253B at FP8 fits on 2x B200 SXM6 (384 GB combined) with good KV cache budget. A single B200 SXM6 (192 GB) is only viable via MXFP4 (Blackwell FP4, ~130 GB), not FP8, since 253B at FP8 requires ~253 GB of weight storage alone.
| Config | On-Demand/hr | Spot/hr | Spot Savings |
|---|---|---|---|
| 1x B200 SXM6 | $9.30 | $2.74 | ~71% |
| 2x B200 SXM6 | $18.60 | $5.48 | ~71% |
B200 SXM6 on Spheron also supports MXFP4 (0.5 bytes per parameter), which is Blackwell-only and can halve the weight footprint again, enabling Nemotron Ultra 253B on a single B200 at ~130 GB.
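The spot savings column in the two pricing tables is one minus the ratio of spot to on-demand price. A short sketch using the per-GPU rates quoted above (a snapshot from 16 Jun 2026, so live rates may differ):

```python
# Spot savings = 1 - spot / on-demand, using the per-GPU rates from the tables.
RATES = {  # GPU: (on-demand $/hr, spot $/hr)
    "H200 SXM5": (4.96, 3.31),
    "B200 SXM6": (9.30, 2.74),
}


def spot_savings_pct(on_demand: float, spot: float) -> float:
    """Percentage saved by running on spot instead of on-demand."""
    return (1 - spot / on_demand) * 100


for gpu, (on_demand, spot) in RATES.items():
    print(f"{gpu}: ~{spot_savings_pct(on_demand, spot):.0f}% cheaper on spot")
```

Because pricing is per GPU, the percentage is the same for 1x, 2x, and 4x configurations of the same GPU type.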
H100 SXM5 for Cost-Optimized Deployment
At $3.98/hr per GPU, 4x H100 SXM5 ($15.92/hr combined) is a cost-efficient option for Nemotron Ultra 253B at FP8 (320 GB combined VRAM covers 253B FP8 weights with headroom). DeepSeek R2 at FP8 with selective INT4 quantization needs 8x H100 SXM5 minimum.
Pricing fluctuates based on GPU availability. The prices above are based on 16 Jun 2026 and may have changed. Check current GPU pricing for live rates.
Cost Comparison: MAI-Thinking-1 API vs Self-Hosting an Open Alternative
This is a trade-off between capability and control. MAI-Thinking-1 at 97% AIME 2025 leads on reasoning accuracy, but it is API-only. The open alternatives score lower on benchmarks but offer full control.
| Option | Platform | Cost/hr | Cost/month (24/7) |
|---|---|---|---|
| MAI-Thinking-1 inference | Microsoft Foundry | Per-token (pricing TBD) | N/A |
| DeepSeek R2 | Spheron, 4x H200 on-demand | $19.84 | ~$14,285 |
| DeepSeek R2 | Spheron, 4x H200 spot | $13.24 | ~$9,533 |
| Nemotron Ultra 253B | Spheron, 2x H200 on-demand | $9.92 | ~$7,142 |
| Nemotron Ultra 253B (MXFP4) | Spheron, 1x B200 spot | $2.74 | ~$1,973 |
For sustained high-volume inference where DeepSeek R2 or Nemotron Ultra benchmark scores are sufficient, self-hosting on Spheron typically beats per-token API pricing at roughly 100-300 million tokens per month of sustained output, depending on what Microsoft prices Foundry access at.
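The monthly figures in the table and the break-even volume follow from simple arithmetic. In the sketch below the API price is a placeholder chosen purely for illustration, since Microsoft has not published Foundry pricing. The calculation also assumes the self-hosted node can actually sustain the break-even volume, which depends on throughput that is not measured here.

```python
HOURS_PER_MONTH = 24 * 30  # the 24/7 month used in the table above


def monthly_cost(hourly_usd: float) -> float:
    """Cost of running a node around the clock for a 30-day month."""
    return hourly_usd * HOURS_PER_MONTH


def break_even_mtok(hourly_usd: float, api_usd_per_mtok: float) -> float:
    """Monthly volume (millions of tokens) where API spend equals node cost."""
    return monthly_cost(hourly_usd) / api_usd_per_mtok


# PLACEHOLDER, not a real price: Foundry per-token pricing is unpublished.
ASSUMED_API_USD_PER_MTOK = 60.0

print(round(monthly_cost(19.84)))  # 4x H200 SXM5 on-demand
print(round(break_even_mtok(19.84, ASSUMED_API_USD_PER_MTOK)))
```

Swapping in the real per-token price once it is published, along with the hourly rate of the chosen configuration, gives the volume above which self-hosting is cheaper.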
When to Use MAI-Thinking-1 API vs Self-Host an Open Alternative
| Scenario | Best Choice | Why |
|---|---|---|
| Maximum math/reasoning accuracy | MAI-Thinking-1 via Foundry | 97% AIME 2025, top-tier reasoning |
| On-premises or data residency requirements | DeepSeek R2 or Nemotron Ultra | Open weights, full control |
| Tightest budget, batch workloads | Nemotron Ultra 253B on spot B200 | $2.74/hr, fits on single B200 via MXFP4 (Blackwell FP4) |
| High-volume sustained inference | DeepSeek R2 on H200 spot | Open weights, lower per-token cost at scale |
| Clean enterprise licensing, self-hosted | Nemotron Ultra 253B | NVIDIA origin, no distillation concerns, open weights |
| Cost-efficient 256K context serving | DeepSeek R2 on 4x H200 | Full 256K at batch-4 on 564 GB combined VRAM |
MAI-Thinking-1 sets a new bar for reasoning model accuracy at 97.0% AIME 2025, but it runs as an API service for now. For self-hosted inference with open weights, H200 SXM5 instances on Spheron support both DeepSeek R2 and Nemotron Ultra 253B with per-minute billing and no contracts.
Quick Setup Guide
Request access to MAI-Thinking-1 via Microsoft Foundry
Visit microsoft.ai/models/mai-thinking-1 and apply for private preview access through Microsoft Foundry. Once approved, you receive API credentials for the Chat Completions-compatible endpoint. Access remains in private preview as of June 2026; broader availability has not been officially confirmed.
Call the MAI-Thinking-1 API via Chat Completions
MAI-Thinking-1 is served via a Chat Completions-compatible API endpoint. Configure your base URL to the Microsoft Foundry endpoint and set your API key. Send requests with the standard messages array format. Use a low temperature such as 0.1 for more repeatable output on math and coding tasks; low temperature reduces sampling variance but does not guarantee identical outputs. The model generates an extended reasoning trace before its final answer.
Provision Spheron H200 SXM5 for an open reasoning model alternative
Log in to app.spheron.ai and select H200 SXM5 or H100 SXM5. For DeepSeek R2 (roughly R1/V3-class, ~685B total parameters, unconfirmed), provision 4x H200 SXM5 minimum (564 GB combined VRAM). For Nemotron Ultra 253B (dense model, 253B params), 2x H200 SXM5 or 4x H100 SXM5 at FP8 covers the weights with KV cache headroom. Provision at least 500 GB of persistent storage for model weights.
Deploy an open reasoning model with vLLM expert-parallel configuration
For DeepSeek R2 on 4x H200: vllm serve /models/deepseek-r2 --dtype auto --quantization compressed-tensors --tensor-parallel-size 4 --max-model-len 65536 --kv-cache-dtype fp8_e5m2 --gpu-memory-utilization 0.90 --enable-expert-parallel --trust-remote-code --port 8000. The --quantization compressed-tensors flag selects the mixed FP8/INT4 layer quantization required to fit the ~685B model within 564 GB of combined VRAM, and --dtype auto takes the compute dtype from the checkpoint config. For 2-GPU configurations, drop --tensor-parallel-size to 2. For single-GPU setups, drop it to 1 and also remove --enable-expert-parallel entirely, since expert parallelism requires 2+ GPUs and will error on a single GPU.
Test reasoning output and validate chain-of-thought generation
Send a multi-step math or logic problem to /v1/chat/completions. Verify the model generates an extended reasoning trace before the final answer. Check GPU utilization with nvidia-smi dmon -s u. Monitor vLLM's /metrics endpoint for vllm:gpu_cache_usage_perc and alert above 85% to prevent KV eviction cascades.
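The KV cache alert can be prototyped by parsing the Prometheus text that the /metrics endpoint returns. This is a minimal sketch under two assumptions: the gauge is named `vllm:gpu_cache_usage_perc` as above, and it reports a fraction between 0 and 1. The sample payload is illustrative, not captured from a real server, and `kv_cache_usage` is a hypothetical helper.

```python
import re

KV_METRIC = "vllm:gpu_cache_usage_perc"
ALERT_THRESHOLD = 0.85  # the 85% alert level suggested above

# Matches "<metric>{optional labels} <value>" at the start of a line,
# which skips the "# HELP" and "# TYPE" comment lines.
_PATTERN = re.compile(
    r"^" + re.escape(KV_METRIC) + r"(?:\{[^}]*\})?\s+(\S+)",
    re.MULTILINE,
)


def kv_cache_usage(metrics_text: str) -> list[float]:
    """Extract every sample of the KV cache usage gauge."""
    return [float(value) for value in _PATTERN.findall(metrics_text)]


# Illustrative payload in Prometheus exposition format (assumed shape).
sample = """# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="/models/deepseek-r2"} 0.91
"""

usage = kv_cache_usage(sample)
over_threshold = [value for value in usage if value > ALERT_THRESHOLD]
print(usage, "ALERT" if over_threshold else "ok")
```

In practice the text would come from an HTTP GET against the server's /metrics path on the port passed to vllm serve. A production setup would more likely scrape it with Prometheus and express the threshold as an alerting rule.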
Frequently Asked Questions
Not currently. As of June 2026, MAI-Thinking-1 is available only via Microsoft Foundry (private preview). There are no open weights and no HuggingFace release. Broader access has not been officially confirmed. For self-hosted inference with comparable reasoning capability, DeepSeek R2 and Nemotron Ultra 253B are the current open-weight alternatives.
MAI-Thinking-1 has 35B active parameters but approximately 1T total parameters as a sparse MoE model. At FP8, model weights alone require about 1 TB of VRAM, putting it firmly in multi-node territory. This is why Microsoft distributes it as an API service rather than open weights. Comparable open alternatives like DeepSeek R2 (roughly R1/V3-class, ~685B total, unconfirmed) need 4x H200 SXM5 or 8x H100 SXM5 minimum, at FP8 with selective INT4 quantization on some layers.
MAI-Thinking-1 scores 97.0% on AIME 2025 and 94.5% on AIME 2026, placing it at the top of public reasoning benchmarks. Nemotron Ultra 253B scores 72.5% on AIME 2025. The gap is substantial on pure math. MAI-Thinking-1 is stronger on reasoning benchmarks, but it is API-only. DeepSeek R2 and Nemotron Ultra are self-hostable with open weights.
Self-hosting DeepSeek R2 on 4x H200 SXM5 at Spheron on-demand costs $19.84/hr ($14,285/month). Nemotron Ultra 253B on 2x H200 SXM5 costs $9.92/hr ($7,142/month). Microsoft has not yet published per-token pricing for MAI-Thinking-1 Foundry access. At high sustained volumes (500M+ tokens/month), self-hosting an open alternative typically beats per-token API pricing.
DeepSeek R2 is the closest open-weight alternative in the top-tier reasoning space, with open weights expected to be available for self-hosting. Nemotron Ultra 253B by NVIDIA is the top open dense model. Both run on Spheron H200 SXM5 instances with NVLink for multi-GPU inference.
