

⚠️ DeepSeek V4 is now the current flagship — see our V4 guide. This guide documents V3.2 (the previous generation) and the R1-Distill models that remain the best local reasoning option. DeepSeek consolidated the API around V4 in April 2026 — V3.2 is no longer a separately callable model, and the deepseek-chat / deepseek-reasoner names retire on July 24.

DeepSeek V3.2 matched GPT-5 on benchmarks at a fraction of the API cost when it launched in February 2026. On GPQA Diamond (PhD-level science), it jumped from V3’s 59.1 to ~80. On AIME math, from 39.6 to 59.4. On LiveCodeBench, from 39.2 to 49.2. Those weren’t incremental improvements — they were generational. Then V4 shipped April 23 and took the flagship spot. The architecture and benchmark deltas below are the historical record of what V3.2 changed; the API guidance later in this article points at V4 Flash, V3.2’s successor.

The full V3.2 model was always a 685B-parameter MoE that needed 350GB+ even at Q4. You weren’t running it on a 3090 then, and DeepSeek doesn’t list it as a separately accessible model now — the legacy model names route to V4 Flash, with a hard cutoff on July 24.

The durable part of this story: the R1-Distill models — dense reasoning specialists distilled from DeepSeek-R1 — are genuinely excellent, fit on consumer hardware, and haven’t been superseded by V4. There’s no V4-Distill family yet. The 32B distill rivals o1-mini. The 14B is still the best reasoning model you can run on 12GB. The R1-Distill sections below are current; the V3.2 flagship sections are the previous-gen reference.

What Changed: V3 vs V3.2

Benchmark	V3	V3.2	Improvement
MMLU-Pro	75.9	~85	+9.1
GPQA Diamond (PhD-level science)	59.1	~80	+20.9
AIME (math competition)	39.6	59.4	+19.8
LiveCodeBench	39.2	49.2	+10.0
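
The point gains in the table can be read two ways: as absolute points or relative to the V3 baseline. A quick sketch of that arithmetic, using the table's own scores (the MMLU-Pro and GPQA figures for V3.2 are approximate, so their outputs are approximate too):

```python
# Scores copied from the table above; the "~" values are treated as exact here.
scores = {
    "MMLU-Pro": (75.9, 85.0),
    "GPQA Diamond": (59.1, 80.0),
    "AIME": (39.6, 59.4),
    "LiveCodeBench": (39.2, 49.2),
}

def gains(v3: float, v32: float) -> tuple[float, float]:
    """Return (absolute point gain, relative gain in percent), rounded to 1 decimal."""
    absolute = v32 - v3
    relative = absolute / v3 * 100
    return round(absolute, 1), round(relative, 1)

for name, (v3, v32) in scores.items():
    absolute, relative = gains(v3, v32)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

Read this way, the AIME jump is a 50% relative improvement over V3, not just 19.8 points.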

The core architecture stayed the same: 685B total parameters, ~37B active per token, MoE with 256 routed experts plus 1 shared expert. What changed is training (more data, better RL alignment) and the attention mechanism: DeepSeek Sparse Attention (DSA) drops context processing from O(L^2) to roughly O(L*k), where L is the sequence length and k is the number of tokens each query attends to.
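
To get a feel for what O(L^2) versus O(L*k) means at long context, here is a rough sketch that counts query-key pairs only. The value of k is an assumption picked for illustration, and the count ignores whatever it costs to choose which k tokens to keep, so treat the ratio as an upper bound on the savings rather than a measured speedup.

```python
def dense_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full attention: every token attends to every token."""
    return seq_len * seq_len

def sparse_attention_pairs(seq_len: int, top_k: int) -> int:
    """Query-key pairs when each token attends to at most top_k selected tokens."""
    return seq_len * min(top_k, seq_len)

L = 128 * 1024  # a 128K-token context
k = 2048        # assumed selection budget, for illustration only

ratio = dense_attention_pairs(L) / sparse_attention_pairs(L, k)
print(f"dense/sparse pair ratio at L={L}, k={k}: {ratio:.0f}x")
```

The ratio is simply L/k, which is why the benefit grows with context length and vanishes for prompts shorter than k.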

At launch V3.2 competed directly with GPT-5 on general benchmarks and with Claude on reasoning tasks. The high-compute variant (V3.2-Speciale) surpassed GPT-5 on several benchmarks at the time. Since then V4 has shipped (April 23) and OpenAI/Anthropic have moved past GPT-5 and Claude 3.5 — the V3.2 numbers above are the architectural-improvement record, not the current frontier.

The DeepSeek Model Lineup

Model	Type	Params	Active	Context	Best For
V3.2	MoE	685B	~37B	128K	Previous-gen flagship (Feb 2026)
R1	MoE + reasoning	685B	~37B	128K	Deep chain-of-thought reasoning
R1-Distill-7B	Dense	7B	7B	128K	Budget reasoning
R1-Distill-14B	Dense	14B	14B	128K	Sweet spot reasoning
R1-Distill-32B	Dense	32B	32B	128K	Near-full R1 quality
R1-Distill-70B	Dense	70B	70B	128K	Best distill, needs serious VRAM
Coder-V2	MoE	236B	21B	128K	Coding specialist

The R1-Distill models were created by generating 800,000 reasoning samples from the full R1 model and fine-tuning smaller open-source models (Qwen 2.5 and Llama 3 bases) on that data. They’re dense — not MoE — so VRAM requirements scale linearly with parameter count. All are MIT licensed. As of June 2026, DeepSeek has not released a V4-Distill family — the R1-Distills remain the official open-weight reasoning checkpoints from DeepSeek, and they’re still hosted on Ollama under the deepseek-r1 tag family.

VRAM Requirements

Full V3.2 (685B MoE)

Precision	VRAM	Hardware	Practical?
FP16	~1,543GB	Multi-node H200 cluster	Datacenter
INT8	~700GB	8-10x H100	Enterprise
Q4	~350-400GB	5-8x H100	Enterprise

Bottom line for the full model: Not happening on consumer hardware. Use the API.

R1-Distill Models (Dense — The Practical Ones)

Model	VRAM (Q4)	VRAM (Q8)	Ollama Command
R1-Distill-1.5B	~1.5GB	~2.5GB	`ollama run deepseek-r1:1.5b`
R1-Distill-7B	~5GB	~8GB	`ollama run deepseek-r1:7b`
R1-Distill-8B	~5.5GB	~9GB	`ollama run deepseek-r1:8b`
R1-Distill-14B	~9GB	~15GB	`ollama run deepseek-r1:14b`
R1-Distill-32B	~20GB	~34GB	`ollama run deepseek-r1:32b`
R1-Distill-70B	~40GB	~75GB	`ollama run deepseek-r1:70b`

→ Use our Planning Tool to check exact VRAM for your setup.

These are the models that matter for local AI users. Dense architecture means predictable VRAM usage — no MoE surprises.
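
Because the distills are dense, you can approximate the table above with simple arithmetic: parameter count times bits per weight, plus a little overhead. The sketch below is a rule of thumb, not a measurement. The effective bit widths (~4.5 for Q4, ~8.5 for Q8) and the flat 1GB overhead are assumptions, and real usage depends on the exact quant and context length.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8

def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead_gb: float = 1.0) -> float:
    """Weights plus a flat allowance for runtime buffers and a short context."""
    return estimate_weights_gb(params_billion, bits_per_weight) + overhead_gb

# Assumed effective bit widths: ~4.5 bits for Q4-style quants, ~8.5 for Q8.
for size in (7, 14, 32, 70):
    q4 = estimate_vram_gb(size, 4.5)
    q8 = estimate_vram_gb(size, 8.5)
    print(f"{size}B: ~{q4:.1f} GB at Q4, ~{q8:.1f} GB at Q8")
```

The estimates land within a gigabyte or two of the table, which is about as precise as this kind of back-of-envelope math gets.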

Which Distill for Which Hardware

8GB VRAM (What can you run?)

Pick: R1-Distill-7B (~5GB at Q4)

Strong reasoning for its size. The chain-of-thought output from R1 distillation means it works through problems step-by-step, catching errors that standard 7B models miss. Leaves headroom for a decent context window.

Compared to Qwen 3.5 9B: Qwen 3.5 is more versatile (chat, coding, translation). The R1-Distill-7B is better specifically at reasoning and math. If reasoning is your priority, go DeepSeek. For general use, Qwen 3.5.
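
The step-by-step reasoning shows up in the raw output, typically wrapped in <think>...</think> tags ahead of the final answer. If you only want the answer in your app, a small helper can separate the two. This is a minimal sketch that assumes a single <think> block per response; the function name is hypothetical.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the think block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

sample = "<think>17 is not divisible by 2 or 3, and 5*5 > 17.</think>\n17 is prime."
reasoning, answer = split_reasoning(sample)
print(answer)
```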

12GB VRAM (What can you run?)

Pick: R1-Distill-14B (~9GB at Q4)

Probably the best reasoning model at this VRAM tier. The jump from 7B to 14B is significant for complex reasoning — fewer hallucinated steps, better at multi-hop logic, stronger at math.

Compared to Qwen 3.5 9B at Q8: similar tradeoff to the 7B tier. Qwen 3.5 has better tool calling and broader general coverage. R1-Distill-14B has deeper reasoning from the R1 distillation. Both fit comfortably at 12GB — pick based on your primary use case.

24GB VRAM (What can you run?)

Pick: R1-Distill-32B (~20GB at Q4)

This model rivals o1-mini on reasoning benchmarks and outperforms it on several evaluations. At ~20GB, it fits on a single 3090 with room for ~8K context.
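
The ~8K context figure follows from KV cache arithmetic. The sketch below assumes the 32B distill's Qwen 2.5 base uses 64 layers, 8 key-value heads, and a head dimension of 128 with an FP16 cache. Those shape numbers are assumptions here, so check the model's config file before relying on them.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes for keys and values across all layers (the leading 2 is K plus V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens

# Assumed model shape: 64 layers, 8 KV heads (grouped-query attention),
# head dimension 128, FP16 cache entries (2 bytes each).
cache = kv_cache_bytes(tokens=8192, layers=64, kv_heads=8, head_dim=128)
print(f"8K-token KV cache: {cache / 2**30:.1f} GiB")
```

Under those assumptions an 8K context costs about 2 GiB on top of the ~20GB of weights, which is roughly where a 24GB card runs out.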

The updated R1-0528 distill (built on a Qwen3-8B base) is even stronger on math: it surpasses both Qwen3-8B and Qwen3-32B on AIME benchmarks.

48GB+ VRAM (Dual GPU)

Pick: R1-Distill-70B (~40GB at Q4)

Needs dual GPUs or a Mac with 64GB+ unified memory. The quality jump from 32B to 70B is real but the hardware requirement doubles. Worth it if you have the setup, but the 32B is the better value.
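
The tier picks above boil down to "largest distill that fits with some room for context". Here is that rule as a small sketch, limited to the recommended tiers plus the 1.5B as a fallback. The 2GB headroom default is an assumption, and the function name is hypothetical.

```python
# (Ollama tag, approximate Q4 VRAM in GB) from the VRAM table, largest first.
DISTILLS_Q4 = [
    ("deepseek-r1:70b", 40.0),
    ("deepseek-r1:32b", 20.0),
    ("deepseek-r1:14b", 9.0),
    ("deepseek-r1:7b", 5.0),
    ("deepseek-r1:1.5b", 1.5),
]

def pick_distill(vram_gb: float, headroom_gb: float = 2.0):
    """Return the largest tag whose Q4 weights fit with headroom to spare, else None."""
    for tag, need_gb in DISTILLS_Q4:
        if need_gb + headroom_gb <= vram_gb:
            return tag
    return None

for vram in (8, 12, 24, 48):
    print(f"{vram}GB -> {pick_distill(vram)}")
```

Raise headroom_gb if you want longer contexts; the KV cache grows with every token you keep in the window.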

Using the DeepSeek API today (V4 Flash, not V3.2)

V3.2 is no longer a separately callable model. DeepSeek consolidated the API around V4 in April 2026. The legacy model strings deepseek-chat and deepseek-reasoner currently route to V4 Flash as a compatibility shim, but they’re being fully retired on July 24, 2026, at 15:59 UTC. After that date, calls using those names fail outright. If you have V3.2-era code in production, migrate to deepseek-v4-flash before July 24.
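
If you have legacy names scattered across a codebase, one low-risk approach is to route them through a single lookup before the cutoff. The sketch below mirrors the current routing described above (both legacy names map to V4 Flash); the helper names are hypothetical, and you may want a different target for workloads that need V4 Pro.

```python
from datetime import datetime, timezone

# Cutoff stated above: July 24, 2026 at 15:59 UTC.
LEGACY_CUTOFF = datetime(2026, 7, 24, 15, 59, tzinfo=timezone.utc)

# Both legacy names currently route to V4 Flash, per the text above.
LEGACY_TO_CURRENT = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}

def resolve_model(name: str) -> str:
    """Map a legacy model string to its replacement; pass other names through."""
    return LEGACY_TO_CURRENT.get(name, name)

def legacy_names_still_work(now: datetime) -> bool:
    """True while the compatibility shim is expected to be active."""
    return now < LEGACY_CUTOFF

print(resolve_model("deepseek-chat"))
```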

For new builds today, the API is V4 Flash (with V4 Pro as the higher-tier reasoning option — see our V4 Flash vs Pro guide for the full comparison).

Current V4 Flash pricing (the practical successor to V3.2 for cost-sensitive use):

Token type	Price
Input tokens (cache miss)	$0.14/M
Output tokens	$0.28/M
Cache hit input	$0.0028/M

V4 Flash is now cheaper than V3.2 was at deprecation ($0.28 input / $0.42 output by the end), with a 1M-token context window and unified 384K max output across thinking and non-thinking modes. For most workloads that V3.2 used to handle, V4 Flash is the right choice today. V4 Pro ($0.435 input / $0.87 output — the standing rate since the launch promo became permanent May 31, 2026) is the higher-tier reasoning option for tasks where V4 Flash isn’t enough.
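
To see what those V4 Flash rates mean for a real bill, here is a rough cost estimator using the prices from the table above. The workload numbers are hypothetical, and the cache hit ratio is something you would need to measure for your own traffic.

```python
# USD per million tokens, copied from the V4 Flash pricing table above.
PRICE_INPUT_MISS = 0.14
PRICE_INPUT_HIT = 0.0028
PRICE_OUTPUT = 0.28

def estimate_cost_usd(input_tokens: int, output_tokens: int,
                      cache_hit_ratio: float = 0.0) -> float:
    """Estimate cost; cache_hit_ratio is the share of input tokens served from cache."""
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (
        miss_tokens * PRICE_INPUT_MISS
        + hit_tokens * PRICE_INPUT_HIT
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000

# Hypothetical monthly workload: 50M input tokens, 10M output tokens.
print(f"no cache:  ${estimate_cost_usd(50_000_000, 10_000_000):.2f}")
print(f"80% cache: ${estimate_cost_usd(50_000_000, 10_000_000, 0.8):.2f}")
```

Cache hits are priced at a fiftieth of a cache miss, so prompts that share a long, stable prefix benefit the most.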

Current example (V4 Flash):

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com",
)
response = client.chat.completions.create(
    model="deepseek-v4-flash",  # current; deepseek-chat retires July 24, 2026
    messages=[{"role": "user", "content": "Explain group theory"}],
)
# Print the assistant's reply text
print(response.choices[0].message.content)

The API stays OpenAI-compatible — swap your base URL and API key and most existing code works unchanged. If you’re maintaining a V3.2 client, the only required code change is the model string.
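
"OpenAI-compatible" means the wire format matches, so you can also skip the SDK entirely. The sketch below builds the same request with only the standard library. It assumes the chat endpoint lives at /chat/completions under the base URL; the function name is hypothetical, and nothing is sent until you call urlopen yourself.

```python
import json
import urllib.request

def build_chat_request(api_key: str, prompt: str,
                       model: str = "deepseek-v4-flash") -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        "https://api.deepseek.com/chat/completions",  # assumed path under the base URL
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(build_chat_request(key, "Explain group theory")) as resp:
#       reply = json.load(resp)["choices"][0]["message"]["content"]
```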

V3.2 launch pricing (historical reference): $0.25 input / $0.38 output / $0.028 cache-hit per million tokens at the February 2026 launch, roughly 10× cheaper than GPT-4o at the time. Pricing drifted slightly upward by deprecation ($0.28 / $0.42); the consolidation into V4 Flash then brought rates below even V3.2’s launch pricing.

V3.2 vs the Competition (at V3.2’s launch, Feb 2026)

Flagship Tier — point-in-time, not current

Model	MMLU-Pro	GPQA Diamond	AIME	LiveCodeBench
DeepSeek V3.2	~85	~80	59.4	49.2
GPT-5	~85	~78	—	—
Claude 3.5 Sonnet	~84	~65	—	—
Qwen3-235B-A22B	—	—	85.7	70.7

At launch, V3.2 competed with GPT-5 on general knowledge and beat Claude on science reasoning. Qwen3-235B led on math (AIME) and coding (LiveCodeBench) but needed ~143GB to run.

Where the frontier sits now (June 2026): OpenAI has shipped GPT-5.5, Anthropic has shipped Opus 4.7 / Opus 4.8 / Claude Fable 5, and DeepSeek’s own V4 took the flagship spot in April. The numbers in the table above are V3.2’s launch-day comparators, preserved as the architectural record — they aren’t current-frontier claims. For where V4 sits, see our V4 Flash vs Pro guide.

Budget Local Tier (What You Can Actually Run Today)

Model	VRAM	Reasoning	Coding	Chat
R1-Distill-14B (Q4)	~9GB	Excellent	Good	Good
Qwen 3.5 9B (Q8)	~10GB	Very good	Very good	Excellent
Llama 3.1 8B (Q4)	~5GB	Decent	Decent	Good

At 12GB VRAM, R1-Distill-14B and Qwen 3.5 9B are both top-tier. The DeepSeek model wins on pure reasoning depth. Qwen 3.5 wins on versatility, tool calling, and broader general coverage.

The China Factor

DeepSeek is a Chinese company. Some users have concerns about data privacy. Here’s what matters:

If you run locally: Your data never leaves your machine. The model weights are open, the community has audited them, and inference happens entirely on your hardware. It doesn’t matter where the company is headquartered if the computation happens on your GPU.

If you use the API: Your prompts go to DeepSeek’s servers in China. If that’s a concern for your use case, run locally or use a different API provider. Several third-party providers serve DeepSeek models from US/EU infrastructure.

The license is real: MIT for the R1-Distill models. The weights are genuinely open. You can inspect, modify, and redistribute them.

Practical take: For local AI, the origin of the model doesn’t affect your privacy. That’s the whole point of running locally. If you’re paranoid, run the R1-Distills on your own hardware and your data goes nowhere.

Getting Started

Local (R1-Distill Models)

# Pick based on your VRAM
ollama run deepseek-r1:7b # 8GB VRAM
ollama run deepseek-r1:14b # 12GB VRAM — sweet spot
ollama run deepseek-r1:32b # 24GB VRAM — rivals o1-mini
ollama run deepseek-r1:70b # 48GB+ VRAM
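
Once a model is pulled, you can also call it from code through Ollama's local HTTP API instead of the interactive prompt. A minimal sketch, assuming Ollama is running on its default port (11434) and the 14B tag is already downloaded; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # Ollama's default local address

def build_ollama_request(prompt: str,
                         model: str = "deepseek-r1:14b") -> urllib.request.Request:
    """Build a non-streaming chat request for a locally running Ollama server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask_local(prompt: str, model: str = "deepseek-r1:14b") -> str:
    """Send the request and return the reply text; requires Ollama to be running."""
    with urllib.request.urlopen(build_ollama_request(prompt, model)) as resp:
        return json.load(resp)["message"]["content"]
```

Calling ask_local("Is 221 prime?") returns the model's full reply, and the request never leaves localhost.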

Full flagship (API — V4 Flash, the V3.2 successor)

Sign up at platform.deepseek.com, grab an API key, and use the OpenAI-compatible endpoint with model="deepseek-v4-flash". At $0.14/M input tokens and $0.0028/M cache-hit, V4 Flash is cheaper than V3.2 ever was. For the higher-tier reasoning option, use deepseek-v4-pro — see our V4 Flash vs Pro guide for which to pick when.

Which Path?

Budget builder, single GPU: Run the R1-Distill that fits your VRAM. The 14B at 12GB is the reasoning sweet spot.
Need flagship API: Use V4 Flash today (deepseek-v4-flash). If you have V3.2-era code using deepseek-chat, migrate before July 24.
Privacy-first: Run distills locally. Your data stays on your machine.
Best overall quality at 24GB: Compare R1-Distill-32B vs Qwen 3.6 27B — both are excellent, different strengths.

The R1-Distill models prove that distillation from a frontier model can produce genuinely good small models, and DeepSeek hasn’t shipped a V4-Distill family yet — so R1-Distills remain the official open-weight reasoning checkpoints to run on consumer hardware. At $200 for a used 12GB GPU running the 14B distill, you get reasoning quality that didn’t exist at this price point eighteen months ago.


URL: https://insiderllm.com/guides/deepseek-v3-2-guide/

DeepSeek V3.2 Guide: What Changed and How to Run It Locally | InsiderLLM