
URL: https://www.spheron.network/blog/deploy-ministral-3-gpu-cloud/

Deploy Ministral 3 on GPU Cloud: Self-Host the 3B, 8B, and 14B Reasoning and Vision Models (2026) | Spheron Blog


Not every production workload needs a 119B MoE model. If you want Mistral's instruction, reasoning, and vision capabilities without the multi-GPU overhead of Mistral Small 4, the Ministral 3 family covers the same capability surface in dense 3B, 8B, and 14B checkpoints that fit on a single GPU. This guide covers GPU sizing, vLLM deployment, AWQ quantization, multimodal inference, and an edge-to-cloud routing pattern for all three variants.

The Ministral 3 Family

Ministral 3 is Mistral's December 2025 multi-SKU dense model release. Unlike Small 4's MoE approach, all three size tiers are fully dense transformers, which simplifies serving infrastructure and reduces communication overhead on single-GPU deployments.

| Model | Variant | Vision | Active Params | Context Window |
| --- | --- | --- | --- | --- |
| Ministral 3 3B | base, instruct | Yes | 3B | 128K tokens |
| Ministral 3 3B | reasoning | Yes | 3B | 128K tokens |
| Ministral 3 8B | base, instruct | Yes | 8B | 128K tokens |
| Ministral 3 8B | reasoning | Yes | 8B | 128K tokens |
| Ministral 3 14B | base, instruct | Yes | 14B | 128K tokens |
| Ministral 3 14B | reasoning | Yes | 14B | 128K tokens |

All variants include image understanding.

The 3B fits in the edge tier: on-device or fractional-GPU serving for latency-sensitive queries. The 8B is the balanced production option: single L40S, good throughput, covers most instruction and vision workloads. The 14B Reasoning is for workloads that need explicit chain-of-thought: complex multi-step tasks, structured reasoning, and agentic pipelines.

For context on SLM economics broadly, see the small language models deployment guide.

Why a Small Reasoning Family Matters in 2026

Two years ago, getting useful chain-of-thought from a 14B model required careful prompt engineering, and the results were still brittle. Small reasoning models topped out at around 7B parameters, and their quality was notably lower than that of 70B-class models. Ministral 3's 14B Reasoning variant changes that: its reasoning scratchpad gives you structured chain-of-thought output at a fraction of the cost of a 70B inference call.

The cost difference compounds fast. A single L40S at $1.07/hr on-demand (or $0.72/hr spot) can serve the 14B Reasoning variant for tasks where a $1.66/hr H100 SXM5 spot instance or a $4.50+/hr multi-GPU setup was previously required. At 10,000 requests per day of 1,000 tokens each (about 116 tokens/sec on average), one always-on instance is enough, so the comparison is hourly: the gap between L40S spot and H100 spot is $0.94/hr, or about $22.56/day, saved without touching model quality for most workloads.
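The arithmetic behind that comparison is small enough to script. This is a minimal sketch using the spot prices quoted above; the function name is illustrative, and it assumes both instances run around the clock.

```python
HOURS_PER_DAY = 24

def daily_savings(cheaper_per_hr: float, pricier_per_hr: float,
                  hours: float = HOURS_PER_DAY) -> float:
    """Dollar difference per day between two always-on instances."""
    return round((pricier_per_hr - cheaper_per_hr) * hours, 2)

# Spot prices quoted in this article (USD per hour).
l40s_spot = 0.72
h100_spot = 1.66

print(daily_savings(l40s_spot, h100_spot))  # 22.56
```

Swap in on-demand prices or a shorter duty cycle (`hours`) to see how the gap changes for your own schedule.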

The vision capability across all tiers also changes the architecture calculus. Instead of deploying a separate vision model for image queries and a text model for everything else, a single Ministral 3 8B checkpoint handles both. One deployment, one VRAM budget, one serving binary.

For a detailed breakdown of how inference costs scale with model size and request volume, see the AI inference cost economics guide.

GPU Sizing Per Ministral 3 Variant

VRAM requirements at BF16: multiply parameter count by 2 bytes per parameter. Add roughly 20% for framework overhead and KV cache at standard context lengths.
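The rule of thumb above can be written out as a quick calculation. The sketch below is only that rule (2 bytes per parameter, plus roughly 20%); the function names are illustrative, and long contexts need considerably more KV cache than the 20% allowance covers.

```python
def weights_gb(params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Weight footprint in GB: one billion parameters at 2 bytes each is 2 GB."""
    return params_billion * bytes_per_param

def with_overhead(weights: float, overhead: float = 0.20) -> float:
    """Add the rough allowance for framework overhead and KV cache."""
    return weights * (1.0 + overhead)

for name, params in [("3B", 3), ("8B", 8), ("14B", 14)]:
    w = weights_gb(params)
    print(f"Ministral 3 {name}: ~{w:.0f} GB weights, ~{with_overhead(w):.1f} GB with overhead")
```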

| Model | Precision | VRAM for Weights | KV Cache at 32K | Recommended GPU | Notes |
| --- | --- | --- | --- | --- | --- |
| Ministral 3 3B | BF16 | ~6 GB | ~4 GB | RTX 4090 (24 GB) | Abundant headroom; also runs on fractional L40S |
| Ministral 3 3B | INT4 AWQ | ~2 GB | ~4 GB | Any GPU with 8+ GB | Edge-friendly; runs on smaller hardware |
| Ministral 3 8B | BF16 | ~16 GB | ~8 GB | L40S 48GB | Fits single GPU with room for batch KV cache |
| Ministral 3 8B | INT4 AWQ | ~5 GB | ~8 GB | RTX 4090 (24 GB) | Tight on KV cache at 128K context |
| Ministral 3 14B | BF16 | ~28 GB | ~12 GB | L40S 48GB or H100 80GB | Single GPU recommended; 2x GPUs for longer context |
| Ministral 3 14B | INT4 AWQ | ~8-10 GB | ~12 GB | RTX 4090 or L40S | AWQ drops quality slightly; validate for your task |

For the 3B variant, an RTX 4090 on Spheron gives you 24 GB of VRAM which is about 4x the model's weight footprint, leaving plenty of room for concurrent requests. You can also run the 3B on a fractional GPU partition if you need to share the GPU across multiple services.

For the 8B and 14B in BF16, L40S GPU instances on Spheron are the cost-optimal choice: 48 GB VRAM at the lowest on-demand rate among the 48 GB-and-larger GPUs listed here. With the 14B loaded, the L40S has about 20 GB free after weights, which leaves roughly 8 GB of headroom once the ~12 GB KV cache at 32K context is allocated.

The 14B Reasoning variant on a single H100 on Spheron gives you more KV cache headroom (80 GB total, 52 GB free after weights) which is worth it if you need full 128K context or large batch sizes.
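To make the headroom reasoning concrete, here is a small sketch that subtracts weights and KV cache from each GPU's VRAM and picks the smallest card that still fits. The VRAM sizes and footprints come from the sizing table; the helper names and the three-GPU shortlist are assumptions for illustration.

```python
# VRAM per GPU in GB, as listed in the sizing table.
GPUS_GB = {"RTX 4090": 24, "L40S": 48, "H100": 80}

def headroom_gb(gpu_gb: float, weights_gb: float, kv_cache_gb: float = 0.0) -> float:
    """VRAM left after loading weights and reserving KV cache."""
    return gpu_gb - weights_gb - kv_cache_gb

def smallest_fit(weights_gb: float, kv_cache_gb: float, min_headroom_gb: float = 0.0):
    """Return the smallest listed GPU with at least `min_headroom_gb` to spare."""
    for name, vram in sorted(GPUS_GB.items(), key=lambda item: item[1]):
        if headroom_gb(vram, weights_gb, kv_cache_gb) >= min_headroom_gb:
            return name
    return None  # nothing on the shortlist fits

# 14B in BF16: ~28 GB weights, ~12 GB KV cache at 32K context.
print(smallest_fit(28, 12))      # L40S
print(headroom_gb(48, 28, 12))   # 8
print(headroom_gb(80, 28))       # 52
```

Raising `min_headroom_gb` is a simple way to model a larger batch or longer context: with 20 GB of required headroom, the 14B moves from the L40S to the H100.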

Step-by-Step: Deploy Ministral 3 with vLLM on Spheron

Step 1: Provision a Spheron instance

Log in at app.spheron.ai and navigate to GPU Cloud. Select your target GPU based on the sizing table above.

For the 14B Reasoning variant, pick L40S PCIe or H100 SXM5. For the 3B or 8B, an RTX 4090 or L40S works; per the sizing table, the 8B in BF16 is best placed on the L40S. Use spot instances for development and batch inference wherever the SKU offers them. On Spheron, L40S spot currently costs $0.72/hr versus $1.07/hr on-demand. Deploy with the PyTorch 2.5 / CUDA 12.4 base image.

Mount persistent storage before downloading weights. Minimum sizes per variant:

  • Ministral 3 3B BF16: 10-15 GB
  • Ministral 3 8B BF16: 20-25 GB
  • Ministral 3 14B BF16: 50 GB minimum (add buffer for vLLM cache)
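Before starting a download, it can help to confirm the mounted volume actually has the space listed above. This is a minimal sketch using only the standard library; the mount path is a placeholder you would replace with your own, and sizes are treated as decimal gigabytes.

```python
import shutil

GB = 1000 ** 3  # decimal gigabyte, in bytes

def has_enough_space(path: str, required_gb: float) -> bool:
    """True if the filesystem containing `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * GB

# "." stands in for your persistent storage mount point.
if not has_enough_space(".", 50):
    print("Less than 50 GB free: not enough for the 14B BF16 checkpoint.")
```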

Step 2: Install vLLM and dependencies

bash
pip install "vllm>=0.8.4"
pip install huggingface_hub hf_transfer

export HF_HUB_ENABLE_HF_TRANSFER=1
export HF_TOKEN=your_hf_token_here

Step 3: Download model weights

Ministral 3 checkpoints may require license acceptance per variant. Before running huggingface-cli download, visit the relevant model card at huggingface.co/mistralai and accept the terms if the repository is gated.

bash
# Ministral 3 3B instruct
huggingface-cli download mistralai/Ministral-3-3B-Instruct-2512

# Ministral 3 8B instruct
huggingface-cli download mistralai/Ministral-3-8B-Instruct-2512

# Ministral 3 14B reasoning (recommended for complex workloads)
huggingface-cli download mistralai/Ministral-3-14B-Reasoning-2512

Step 4: Launch the vLLM server

Ministral 3 3B on RTX 4090 (24 GB VRAM):

bash
vllm serve mistralai/Ministral-3-3B-Instruct-2512 \
 --dtype bfloat16 \
 --max-model-len 65536 \
 --port 8000

The 65K context limit gives you generous KV cache headroom on the RTX 4090. The 3B model leaves ~18 GB free for cache.

Ministral 3 8B on L40S (48 GB VRAM):

bash
vllm serve mistralai/Ministral-3-8B-Instruct-2512 \
 --dtype bfloat16 \
 --max-model-len 65536 \
 --port 8000

Ministral 3 14B Reasoning on L40S (48 GB VRAM, single GPU):

bash
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
 --dtype bfloat16 \
 --max-model-len 32768 \
 --reasoning-parser mistral \
 --port 8000

Ministral 3 14B Reasoning on 2x GPUs for longer context:

bash
vllm serve mistralai/Ministral-3-14B-Reasoning-2512 \
 --dtype bfloat16 \
 --max-model-len 65536 \
 --reasoning-parser mistral \
 --tensor-parallel-size 2 \
 --port 8000

Tensor parallelism across 2 GPUs doubles your VRAM budget and lets you push to 65K+ context or increase batch size significantly.
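If you launch several variants from scripts, it may be convenient to generate these commands rather than copy them. The sketch below assembles only the flags shown in this step; `build_vllm_command` is a hypothetical helper, not part of vLLM.

```python
import shlex

def build_vllm_command(model: str, max_model_len: int, reasoning: bool = False,
                       tensor_parallel_size: int = 1, port: int = 8000) -> list:
    """Assemble a `vllm serve` argument list from the flags used in this guide."""
    args = ["vllm", "serve", model,
            "--dtype", "bfloat16",
            "--max-model-len", str(max_model_len)]
    if reasoning:
        # Only the reasoning checkpoints use the reasoning parser.
        args += ["--reasoning-parser", "mistral"]
    if tensor_parallel_size > 1:
        args += ["--tensor-parallel-size", str(tensor_parallel_size)]
    args += ["--port", str(port)]
    return args

cmd = build_vllm_command("mistralai/Ministral-3-14B-Reasoning-2512", 65536,
                         reasoning=True, tensor_parallel_size=2)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues.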

Step 5: Send requests

Reasoning request:

python
import openai

# Point the OpenAI client at the local vLLM server started in Step 4.
client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="token")

response = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning-2512",
    messages=[
        {"role": "user", "content": "Explain why merge sort is more efficient than bubble sort for large datasets."}
    ],
    max_tokens=1024,
)

print(response.choices[0].message.content)
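With the reasoning parser enabled, the server separates the reasoning trace from the final answer. The sketch below shows one way to read both defensively. It assumes the trace is exposed on the message as `reasoning_content` (or `reasoning` in some server versions); check the response from your vLLM version to confirm the field name. A stand-in object is used so the example runs without a server.

```python
from types import SimpleNamespace

def split_reasoning(message):
    """Return (reasoning_trace, final_answer) from a chat completion message.

    Assumption: the trace lives in `reasoning_content` or `reasoning`.
    If neither attribute is present, the trace is returned as None.
    """
    trace = getattr(message, "reasoning_content", None) or getattr(message, "reasoning", None)
    answer = getattr(message, "content", None) or ""
    return trace, answer

# Stand-in for response.choices[0].message (hypothetical values).
fake_message = SimpleNamespace(
    content="Merge sort runs in O(n log n) time; bubble sort runs in O(n^2).",
    reasoning_content="Compare the worst-case complexity of each algorithm.",
)

trace, answer = split_reasoning(fake_message)
print("trace:", trace)
print("answer:", answer)
```

In a real call, pass `response.choices[0].message` instead of `fake_message`.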

Vision request (multimodal with image):

python
# Reuses the client created in the reasoning example above.
response = client.chat.completions.create(
    model="mistralai/Ministral-3-8B-Instruct-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/chart.png"},
                },
                {"type": "text", "text": "Describe what this chart shows."},
            ],
        }
    ],
    max_tokens=512,
)

Deploying Ministral 3 Vision: Multimodal Requests

All Ministral 3 variants load a vision encoder alongside the language model. When you serve via vLLM, no extra flags are needed for multimodal support: vLLM detects the vision encoder from the model config and loads it automatically.

For image preprocessing, keep input images at 1024px max on the longer edge before sending. Larger images increase tokenization time and can saturate memory on smaller GPUs. You can resize in Python before encoding to base64:

python
from PIL import Image
import base64
import io

def prepare_image(path: str, max_size: int = 1024) -> str:
    """Resize an image to fit within max_size and return it as base64 JPEG."""
    img = Image.open(path)
    # thumbnail() resizes in place and preserves the aspect ratio.
    img.thumbnail((max_size, max_size))
    # JPEG has no alpha channel, so convert before saving.
    img = img.convert('RGB')
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=85)
    return base64.b64encode(buffer.getvalue()).decode()

b64_image = prepare_image("your_image.jpg")

# Reuses the client created in Step 5.
response = client.chat.completions.create(
    model="mistralai/Ministral-3-14B-Reasoning-2512",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
                },
                {"type": "text", "text": "What does this image show?"},
            ],
        }
    ],
    max_tokens=256,
)

Latency expectations: Image tokenization adds roughly 50-200ms per image depending on resolution and GPU. For the 14B on L40S at 32K context, expect first-token latency of 300-600ms with a single image. Batch multiple images only if your use case genuinely needs it: each additional image adds VRAM pressure proportional to its token count.
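To see what the 1024px cap does to a given image before you encode it, you can compute the target dimensions directly. This is a sketch of aspect-ratio-preserving downscaling; `scaled_size` is an illustrative helper, and its rounding may differ by a pixel from what an imaging library produces.

```python
def scaled_size(width: int, height: int, max_edge: int = 1024):
    """Return (width, height) scaled so the longer edge is at most `max_edge`."""
    longer = max(width, height)
    if longer <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / longer
    return max(1, round(width * scale)), max(1, round(height * scale))

print(scaled_size(4000, 3000))  # (1024, 768)
print(scaled_size(800, 600))    # (800, 600)
```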

Quantization: AWQ for the 14B Reasoning Variant

AWQ INT4

AWQ drops the 14B's VRAM requirement from ~28 GB to ~8-10 GB, putting it on a single RTX 4090. Quality loss is minimal on standard instruction tasks but can be more noticeable on precise reasoning chains. Test against your specific workload before committing.

To quantize from BF16:

bash
pip install autoawq

python -c "
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Ministral-3-14B-Reasoning-2512'
quant_path = './ministral-3-14b-awq'

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

quant_config = {'zero_point': True, 'q_group_size': 128, 'w_bit': 4, 'version': 'GEMM'}
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
"

Then serve with vLLM:

bash
vllm serve ./ministral-3-14b-awq \
 --quantization awq \
 --dtype float16 \
 --max-model-len 32768 \
 --port 8000

For a comprehensive walkthrough of AWQ quantization across model families, see the AWQ quantization deployment guide.
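As a rough cross-check on the VRAM figures in this section, the sketch below estimates the quantized weight footprint from the settings in the quantization config (`w_bit` of 4, `q_group_size` of 128). It assumes every parameter is quantized and that each group stores one 16-bit scale and one 4-bit zero point, so treat the result as a lower bound: layers that stay in 16-bit push the real figure toward the ~8-10 GB quoted above.

```python
def awq_weight_gb(params_billion: float, w_bit: int = 4,
                  group_size: int = 128, scale_bits: int = 16) -> float:
    """Lower-bound weight size in GB for group-wise weight quantization."""
    # Each group of `group_size` weights shares one scale and one zero point.
    per_group_overhead_bits = scale_bits + w_bit
    bits_per_weight = w_bit + per_group_overhead_bits / group_size
    return params_billion * bits_per_weight / 8

bf16_gb = 14 * 2  # 2 bytes per parameter
int4_gb = awq_weight_gb(14)
print(f"BF16: ~{bf16_gb} GB, INT4 lower bound: ~{int4_gb:.1f} GB")
```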

Blackwell (B200/B300)

As of May 2026, no dedicated Blackwell-optimized checkpoint (MXFP4 or NVFP4) has been published for Ministral 3 14B. You can run BF16 on B200 or B300 instances today using the same vLLM command with --dtype bfloat16. The 192 GB HBM3e on B200 leaves ample room for large batch sizes and long context. Watch Mistral's HuggingFace page for quantized Blackwell variants when they are released.

Throughput and Cost: Ministral 3 vs Mistral Small 4 and Llama 4 Scout

| Model | GPU Config | Tokens/sec (approx) | On-demand $/hr | Spot $/hr | Best Use Case |
| --- | --- | --- | --- | --- | --- |
| Ministral 3 3B | RTX 4090 (24 GB) | 1,200-1,800 | $0.53 | N/A | Edge serving, simple queries, high throughput |
| Ministral 3 8B | L40S 48GB | 600-900 | $1.07 | $0.72 | Balanced: instruction + vision, moderate reasoning |
| Ministral 3 14B Reasoning | L40S 48GB | 350-500 | $1.07 | $0.72 | Complex reasoning, chain-of-thought, agents |
| Ministral 3 14B Reasoning | H100 SXM5 | 500-700 | $3.70 | $1.66 | Same model, more KV cache, higher concurrency |
| Mistral Small 4 | 2x H200 SXM5 | 400-600 | ~$8.72 | ~$3.52 | All-in-one: vision + reasoning + code, 119B MoE |
| Llama 4 Scout | H100 SXM5 | 500-750 | $3.70 | $1.66 | Llama 4 Community License, Meta ecosystem |

When does Ministral 3 14B beat Mistral Small 4? When you need lower GPU cost, simpler serving infrastructure, or strict single-GPU constraints. Mistral Small 4 is the right call when a single model needs to handle extreme task diversity without routing, or when you need 256K context windows. See the Mistral Small 4 deployment guide for multi-GPU setup details.

Llama 4 Scout covers use cases where Meta's Llama 4 Community License fits your legal or organizational requirements, or where tooling built around Llama models is already in place.
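The table's hourly prices and throughput ranges can be combined into a cost per million tokens, which is often the easier number to compare. This is a minimal sketch; it assumes the tokens/sec figures are sustained aggregate throughput, which real traffic rarely achieves, so the results are best-case.

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hr = tokens_per_sec * 3600
    return price_per_hr / tokens_per_hr * 1_000_000

# Ministral 3 14B Reasoning, spot pricing, low and high ends of the table's ranges.
for gpu, price, low, high in [("L40S", 0.72, 350, 500), ("H100 SXM5", 1.66, 500, 700)]:
    best = cost_per_million_tokens(price, high)
    worst = cost_per_million_tokens(price, low)
    print(f"{gpu}: ${best:.2f}-${worst:.2f} per 1M tokens")
```

Under these assumptions the L40S comes out at roughly $0.40-$0.57 per million tokens and the H100 at roughly $0.66-$0.92, which is why the L40S is listed as the cost-per-token pick.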

Live Spheron Pricing for Recommended GPU SKUs

| GPU | On-demand $/hr | Spot $/hr | Recommended For |
| --- | --- | --- | --- |
| RTX 4090 PCIe | $0.53 | N/A | Ministral 3 3B, edge serving, high-throughput small models |
| L40S PCIe | $1.07 | $0.72 | Ministral 3 8B and 14B (best cost-per-token in this range) |
| H100 SXM5 | $3.70 | $1.66 | Ministral 3 14B Reasoning with large KV cache or high concurrency |
| H200 SXM5 | $4.36 | $1.76 | Multi-model serving, very long context, future headroom |
| B200 SXM6 | $6.76 | $3.50 | Ministral 3 14B BF16, large-scale deployment, Blackwell performance |

Pricing fluctuates based on GPU availability. The prices above are as of 12 May 2026 and may have changed. Check the current GPU pricing page for live rates.

Edge-to-Cloud Routing Pattern

A practical deployment pattern for Ministral 3 pairs the 3B at the edge with the 14B Reasoning on cloud, with a lightweight router dispatching based on query complexity.

The routing logic:

  • Simple queries (short, factual, single-turn): Ministral 3 3B on a low-cost fractional GPU
  • Complex queries (multi-step, code, long context, or explicit reasoning requested): Ministral 3 14B Reasoning on Spheron cloud

For a complete LLM routing implementation with classification approaches and cost analysis, see the LLM inference router guide.

Here is a simplified Python example of the routing logic:

python
import openai

TIER1_URL = "http://edge-node:8000/v1"   # Ministral 3 3B
TIER2_URL = "http://cloud-node:8000/v1"  # Ministral 3 14B Reasoning

def classify_complexity(query: str) -> str:
    """Returns 'simple' or 'complex' based on query heuristics."""
    complex_signals = [
        len(query) > 400,
        any(kw in query.lower() for kw in ["step by step", "explain why", "compare", "analyze"]),
        query.count("?") > 2,
    ]
    return "complex" if any(complex_signals) else "simple"

def route_query(query: str) -> str:
    """Send the query to the edge 3B or the cloud 14B based on its complexity."""
    tier = classify_complexity(query)
    if tier == "simple":
        client = openai.OpenAI(base_url=TIER1_URL, api_key="token")
        model = "mistralai/Ministral-3-3B-Instruct-2512"
    else:
        client = openai.OpenAI(base_url=TIER2_URL, api_key="token")
        model = "mistralai/Ministral-3-14B-Reasoning-2512"

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": query}],
        max_tokens=1024,
    )
    return response.choices[0].message.content

For production, replace the heuristic classifier with an embedding-based classifier (e.g., sentence-transformers/all-MiniLM-L6-v2) for better accuracy without adding significant latency. The LLM inference router guide covers embedding classifiers, NGINX proxy setup, and multi-tier monitoring in depth.
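The core of an embedding-based classifier is a nearest-centroid lookup. The sketch below shows that step in plain Python with toy three-dimensional vectors; in practice the vectors would come from the embedding model, and the centroids would be averaged from labeled example queries. All values and names here are hypothetical.

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def classify_by_centroid(vec, centroids: dict) -> str:
    """Return the label whose centroid is most similar to `vec`."""
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))

# Toy centroids standing in for averaged embeddings of labeled queries.
centroids = {
    "simple": [1.0, 0.0, 0.1],
    "complex": [0.1, 1.0, 0.8],
}

print(classify_by_centroid([0.9, 0.1, 0.0], centroids))  # simple
print(classify_by_centroid([0.0, 0.9, 0.7], centroids))  # complex
```

The returned label can be used in place of the heuristic's 'simple' or 'complex' result when choosing a tier.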

For use cases that span cloud and on-device deployment, see the hybrid cloud-edge AI inference guide.

Production Checklist

Before moving a Ministral 3 deployment to production, cover these areas:

  • Observability. vLLM exposes Prometheus metrics at /metrics by default. Track vllm:num_requests_running, vllm:gpu_cache_usage_perc, and vllm:e2e_request_latency_seconds. Alert on cache usage above 90% and p95 latency above your SLA threshold.
  • Guardrails. For user-facing deployments, add input/output content filtering before and after the model. A lightweight classifier (e.g., a fine-tuned BERT-class model) running on CPU is sufficient for most content policies and adds under 5ms per request.
  • Fine-tuning. The instruct variants support LoRA fine-tuning. For domain adaptation on the 8B, Unsloth is an efficient option: it reduces memory overhead by 60-70% compared to standard Hugging Face PEFT, letting you fine-tune on a single L40S in a few hours.
  • Structured output. For agentic or data extraction workloads, start vLLM with --guided-decoding-backend xgrammar (or outlines for older versions) and pass response_format or guided_json fields in each request to enforce output schemas. The reasoning variant handles JSON schema constraints well because the scratchpad phase can plan the structure before outputting.
  • Spot preemption handling. If using spot instances on Spheron for batch workloads, implement checkpoint saves after each request or batch segment. Store checkpoints on persistent storage, not ephemeral instance storage. The router's fallback chain should automatically requeue preempted requests.
  • Auto-scaling. For variable traffic, Spheron's per-second billing makes burst scaling practical. Keep a warm single-GPU Tier 1 (3B) instance running permanently and scale Tier 2 (14B) up during peak hours. Scale down during off-peak to save 70%+ on the 14B GPU cost.
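For the observability item, a dedicated Prometheus setup is the usual route, but a small script can cover the cache-usage alert while you are getting started. The sketch below parses metrics text in the Prometheus exposition format and flags cache usage above 90%. The sample payload is hypothetical, and it assumes vllm:gpu_cache_usage_perc is reported as a fraction between 0 and 1; in practice you would fetch the text from the server's /metrics endpoint.

```python
# Hypothetical sample of what the /metrics endpoint might return.
SAMPLE_METRICS = """\
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="ministral"} 0.93
vllm:num_requests_running{model_name="ministral"} 4.0
"""

def parse_samples(text: str) -> dict:
    """Map metric name to its last sample value, ignoring labels and comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            samples[name] = float(value)
        except ValueError:
            continue  # skip lines that do not end in a number
    return samples

def cache_alert(text: str, threshold: float = 0.90) -> bool:
    """True when KV cache usage exceeds the alert threshold."""
    usage = parse_samples(text).get("vllm:gpu_cache_usage_perc")
    return usage is not None and usage > threshold

print(cache_alert(SAMPLE_METRICS))  # True
```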

Ministral 3's multi-SKU design means you can start with the 3B for low-cost inference and graduate to the 14B Reasoning variant as your workload grows, without rewriting your serving stack.



Quick Setup Guide

  1. Choose your Ministral 3 variant and GPU

    Select 3B for edge or fractional-GPU serving (RTX 4090 or fractional L40S). Select 8B for balanced cost and capability (L40S 48GB single GPU). Select 14B Reasoning for production chain-of-thought workloads (L40S 48GB or H100 80GB). All variants support image inputs.

  2. Provision a Spheron GPU instance

    Log in at app.spheron.ai, navigate to GPU Cloud, and select your target GPU. Use spot pricing for development and batch workloads. Enable persistent storage (at least 50 GB for 14B BF16 weights, 10-15 GB for 3B). Deploy with the PyTorch 2.5 / CUDA 12.4 base image.

  3. Install vLLM and download weights

    Run pip install 'vllm>=0.8.4'. Export your HuggingFace token: export HF_TOKEN=your_token. Enable fast downloads: export HF_HUB_ENABLE_HF_TRANSFER=1. Download the target checkpoint with huggingface-cli download mistralai/Ministral-3-14B-Reasoning-2512.

  4. Launch the vLLM server

    For the 14B reasoning variant on L40S: vllm serve mistralai/Ministral-3-14B-Reasoning-2512 --dtype bfloat16 --max-model-len 32768 --reasoning-parser mistral --port 8000. For the 3B on RTX 4090: use --dtype bfloat16 --max-model-len 65536 for more KV cache headroom. For multi-GPU 14B: add --tensor-parallel-size 2.

  5. Send a multimodal or reasoning request

    For reasoning: launch vLLM with --reasoning-parser mistral to enable reasoning trace parsing. For vision: add an image_url content block to the messages array alongside your text prompt. The API is OpenAI-compatible and works with any client that targets the OpenAI Chat Completions endpoint.


Frequently Asked Questions

Ministral 3 14B in BF16 requires approximately 28 GB of VRAM for weights. A single L40S 48GB or H100 80GB is the recommended production config, with plenty of headroom for KV cache at 32K context. For AWQ INT4, 8-10 GB fits the model (RTX 4090 or any L40S), though KV cache will be tighter. On Spheron, the L40S is the best cost-per-token option for the 14B reasoning variant.

Yes. Ministral 3 3B in BF16 needs roughly 6 GB of VRAM. An RTX 4090 (24 GB) can serve it with abundant KV cache headroom, or you can run it on a fractional GPU share via Spheron's MPS-partitioned instances. For edge deployments, a 4-bit quantized 3B checkpoint runs on hardware with as little as 4 GB of VRAM.

The reasoning variant generates internal chain-of-thought tokens before producing a final answer. When serving with vLLM, add --reasoning-parser mistral to parse the internal reasoning traces. The instruct variant skips this scratchpad entirely and returns answers immediately.

Yes. All three size tiers - 3B, 8B, and 14B - include multimodal image understanding. The multimodal request format uses the standard OpenAI vision message structure with image_url content blocks alongside text.

On a single L40S 48GB on-demand instance, the cost is around $1.07/hr. Spot rates are currently $0.72/hr for this SKU, making spot the cheaper option for batch and development workloads. For the 3B variant, costs drop further. Check /pricing/ for live rates.
