Related: DeepSeek V4 Preview (what we knew before) · DeepSeek V3.2 Guide · VRAM Requirements · llama.cpp vs Ollama vs vLLM · Qwen 3.5 Setup Guide · Run 31B Models on a Laptop
Contents
- What actually dropped
- V4-Flash vs V4-Pro: the real tradeoff
- Can you actually run this locally?
- Early community reports
- Independent evaluations now in
- Where to use which
- How to try it today
- Bottom line
DeepSeek V4 preview went live the evening of April 23, 2026. Two MoE checkpoints, both MIT, both 1M context. r/LocalLLaMA has been in steady eruption since, Hacker News has multiple front-page threads, and Simon Willison has his pelican-on-a-bicycle post up. This is the time-sensitive read — here’s what’s real, what’s claimed, and what’s still waiting on independent testing.
👁 Image: DeepSeek V4 Pro and Flash logos side by side with parameter counts
What actually dropped
Four checkpoints on the Hub under deepseek-ai:
- DeepSeek-V4-Pro — 1.6T total, 49B active, instruct
- DeepSeek-V4-Pro-Base — same params, base (no instruction tuning)
- DeepSeek-V4-Flash — 284B total, 13B active, instruct
- DeepSeek-V4-Flash-Base — same params, base
Instruct checkpoints ship as FP4 for MoE experts and FP8 for everything else. Base checkpoints are FP8 throughout. Both sizes carry a 1M-token context window. License is MIT on every file. That last point matters — no “community license” nonsense, no usage restriction clauses. Download, run, ship, done.
Architecture is where DeepSeek earned the headlines. Per DeepSeek’s V4 technical report and the Hugging Face blog post, V4 keeps the MoE direction but replaces V3.2’s attention stack with a hybrid of Compressed Sparse Attention and Heavily Compressed Attention, alternating across 61 layers. KV cache is stored in FP8 with BF16 RoPE dimensions.
The efficiency numbers DeepSeek claims for this are substantial:
- V4-Pro: 27% of single-token FLOPs, 10% of KV cache vs V3.2 at 1M tokens
- V4-Flash: 10% of FLOPs, 7% of KV cache vs V3.2
If these hold up under independent testing, million-token context stops being a theoretical feature and starts being something you can actually serve. We’ll see.
Three reasoning modes ship on both models: Non-think (fast, no chain of thought), Think High (explicit <think> blocks), and Think Max (maximum reasoning, requires 384K+ context window allocated). Recommended sampling is temperature 1.0, top-p 1.0.
V4-Flash vs V4-Pro: the real tradeoff
| Spec | V4-Flash | V4-Pro |
|---|---|---|
| Total params | 284B | 1.6T |
| Active params | 13B | 49B |
| Context window | 1M tokens | 1M tokens |
| License | MIT | MIT |
| Weights format | FP4 (MoE) + FP8 | FP4 (MoE) + FP8 |
| FLOPs vs V3.2 @ 1M | ~10% | ~27% |
| KV cache vs V3.2 @ 1M | ~7% | ~10% |
| API input (cache miss) | $0.14/M | $0.435/M |
| API input (cache hit) | $0.0028/M | $0.003625/M |
| API output | $0.28/M | $0.87/M |
| Reasoning modes | Non-think, Think High, Think Max | same |
| Best for | Agents, tool calling, high-volume | Research, max-quality long context |
All pricing per DeepSeek’s official API pricing page, verified June 12, 2026. Two post-launch pricing changes have landed: cache-hit rates were reduced 10× on April 26 (Flash + Pro), and on May 22 DeepSeek announced V4-Pro pricing would be cut 75% permanently, taking effect when the launch promo period ended on May 31. The table reflects current rates. Pro launch reference prices were $1.74/M input, $3.48/M output, $0.0145/M cache hit — useful only as historical context now.
The gap between these two isn’t what you’d expect from “small vs large.” Flash is ~3.1× cheaper per output token at current rates (it was ~12× at V4 launch — Pro’s permanent 75% cut on May 31 narrowed the gap considerably, though Flash is still the cheaper option). Flash is also ~5.6× smaller on disk. And per DeepSeek’s technical report, Flash approaches Pro on reasoning-heavy tasks when Think High or Think Max is enabled. That claim deserves real independent verification before anyone bets a pipeline on it — but if it holds even partially, Flash is the model most teams should be testing first.
Pro is what it sounds like: the frontier-adjacent model. DeepSeek’s own numbers put V4-Pro-Base at 90.1% MMLU, 76.8% HumanEval, 92.6% GSM8K. On agent benchmarks V4-Pro-Max scores 80.6 on SWE-bench Verified (roughly parity with Claude Opus 4.6) and 67.9 on Terminal Bench 2.0 (behind GPT-5.4-xHigh at 75.1). Those are DeepSeek’s own reported numbers from the tech report, not independent evaluations — see the independent evaluations section below. One note on the Opus 4.6 / GPT-5.4 parity anchors: those are as of this article’s April 24 publication date. Since then Anthropic has shipped Opus 4.7 / Opus 4.8 / Claude Fable 5 and OpenAI has shipped GPT-5.5, so treat the Opus 4.6 / GPT-5.4 comparisons as point-in-time historical framing, not current-frontier claims.
Can you actually run this locally?
Honest answer: one of these, maybe. The other, no.
V4-Pro. 1.6T total params. Even at Q4, you’re looking at ~800GB just for weights, plus KV cache for however much context you load, plus activation memory. This is not a homelab story. It’s a workstation with a terabyte of fast RAM, a high-end GPU for hybrid offload, and patience. An 8-GPU H100 or H200 box can serve it comfortably. A pair of 5090s and a ThreadRipper cannot. No independent benchmarks on home hardware yet — if you see a tok/s number for V4-Pro on consumer gear in the next week, treat it skeptically until someone reproduces it.
V4-Flash. 284B total, 13B active. At FP4 weights plus FP8 everywhere else, the checkpoint is in the ~150GB range. That’s still not laptop territory, but it’s in range for serious homelabs: two RTX 6000 Ada at 48GB each, or a Mac Studio M3 Ultra 512GB with unified memory, or a Threadripper with enough DDR5 channels to feed llama.cpp-style hybrid offload. Community reports in the first 24 hours suggest it runs — we don’t have reproducible tok/s numbers from independent testers yet. Flash at 13B active is structurally the same scale of “hot path” as Mixtral 8x7B or GLM-4.5 — if you have a rig that runs those, Flash is at least in the conversation.
For comparison: a 1T-class model like Kimi K2 has previously required similar hardware to V4-Pro — dedicated server or deep-pocket workstation. V4-Pro isn’t materially harder than that class. V4-Flash is meaningfully easier to run than V4-Pro, but still harder than a 70B dense model at Q4.
Quantization status as of June 10. vLLM supports the native FP4/FP8 checkpoints out of the box — see the vLLM DeepSeek V4 blog post for deployment notes. For llama.cpp, the story is messier than expected six weeks in: upstream llama.cpp does not yet fully support DeepSeek V4’s architecture (tracked in Discussion #22376). Unsloth has published a safetensors mirror at unsloth/DeepSeek-V4-Flash — typically the staging ground for their Dynamic 2.0 GGUFs — but no Dynamic 2.0 GGUFs have shipped yet. Community GGUFs do exist on Hugging Face (teamblobfish/DeepSeek-V4-Flash-GGUF is the most-cited), and WIP llama.cpp branches exist at antirez/llama.cpp-deepseek-v4-flash and nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF. If you need a clean llama.cpp + GGUF path right now, you’re choosing one of those forks or accepting that you’re on the experimental track. The upstream-Unsloth-Dynamic-2.0 combination people were originally waiting for isn’t here yet.
Early community reports
Everything in this section is community anecdote from the first 24 hours. Treat accordingly.
Vibe Code Benchmark. Vals AI reported V4 “overwhelmingly” topped open-source models on the Vibe Code Benchmark, with a roughly 10× jump from V3.2. That’s a Vals AI claim, not an independent re-run — useful signal, not a settled number. For independent verdicts since, see the next section.
Flash as a Haiku / Gemma replacement. Multiple r/LocalLLaMA threads note Flash is fast enough and cheap enough via the API to stand in for Claude Haiku or Gemma 4 in tool-calling pipelines. One benefit people keep pointing out: Flash’s tool-call schema uses interleaved thinking that survives across tool boundaries, which matters if you’re running multi-step agents. This is the claim to test first if you’re running agent workloads on closed APIs today.
Ollama cloud availability. deepseek-v4-flash:cloud went live on Ollama’s cloud catalog within hours of release. r/ollama threads on April 24 report intermittent timeouts on long requests — possibly just load from the release spike. Local Ollama pulls for V4 weights are not available yet; the cloud tag proxies through Ollama’s hosted inference.
Simon Willison’s pelican test. Simon’s post has the pelican-on-a-bicycle output for both models. Short version: Flash drew a better bicycle than Pro did, and Pro’s pelican came out with “a VERY large body, only one wing” (Simon’s words). He positions V4-Pro as roughly 3 to 6 months behind the American frontier labs while costing a fraction of their API rates — a fair take at this stage, though one weekend’s results aren’t a verdict.
Independent evaluations now in
Two independent evaluations dropped after this article first published. Both are worth folding into your read on V4.
NIST CAISI evaluation (report released May 3, 2026). The first US government-grade independent evaluation of a Chinese open-weight model, focused specifically on V4-Pro. The CAISI report tested V4-Pro across 9 benchmarks in 5 domains — cybersecurity, software engineering, natural sciences, abstract reasoning, mathematics — including two held-out non-public benchmarks (ARC-AGI-2 semi-private + CAISI’s own PortBench, built specifically to resist contamination by models trained on public benchmarks). The methodology disclosure is one of the cleaner parts of this story; the benchmark choice is more contamination-resistant than most open evaluations.
The headline findings, straight:
- V4-Pro lags the US frontier by about 8 months — it performs similarly to GPT-5, released roughly 8 months earlier.
- Most capable PRC AI model CAISI has evaluated to date.
- Cost-efficient against US peers: V4-Pro is more cost-efficient than GPT-5.4 mini on 5 out of 7 benchmarks tested.
That’s the read. V4-Pro is a real frontier-class Chinese model, but it isn’t matching this week’s best US labs — and pretending otherwise would be misreading the methodology. The cost-efficiency angle is the part that should matter most to operators running real workloads, and it lines up with what FoodTruck Bench found independently.
FoodTruck Bench (May 2026). FoodTruck Bench is an independent 30-day agentic business simulation — AI models manage a virtual food truck using 34 tools with persistent memory across the run. It’s the closest available proxy for a real production agent workload. The team launched it in February 2026 and tested V4-Pro about ten weeks after GPT-5.2 went through the same harness.
Result: V4-Pro matched GPT-5.2 within 3% on the benchmark, at roughly 17× lower API cost. The pricing table above is the authoritative cite for the absolute dollar numbers; the FoodTruck cost ratio is what matters for the comparison. Two honest caveats worth keeping in mind: “matched, didn’t exceed” is the right framing — V4-Pro is in the same band as GPT-5.2 on this workload, not above it. And the benchmark hasn’t been peer-reviewed; the team publishes methodology and maintains a public leaderboard, but treat it as a strong independent signal rather than a settled industry standard.
Taken together, CAISI and FoodTruck describe the same model from two angles: not the absolute frontier, but credible frontier-class capability at a cost structure that changes the math for anyone running agents at volume. That matches the release-day read — the independent evidence backs the cost-efficiency thesis without elevating V4-Pro to “best model overall.”
Where to use which
V4-Flash. Pick this when:
- You’re running agents or tool-calling pipelines and currently paying Anthropic or OpenAI per token
- You need long-context document work (contracts, codebases, transcripts) and don’t want to chunk
- You were running Qwen 3.6 27B or Gemma 4 for cost reasons and want to try a bigger active-param footprint
- You want MIT-licensed open weights as a hedge against closed-API policy changes
V4-Pro. Pick this when:
- You’re running research-grade reasoning tasks at $0.435 / $0.87 per million tokens (post-May-31 permanent rate)
- You have a 1M-token use case that actually needs maximum quality — multi-document analysis, very long code review, full-book translation
- You already run Kimi K2 or similar 1T-class models and have the hardware to host them
- Budget isn’t the constraint; model quality is
Either. Pick V4 at all when you want open weights under a permissive license and don’t want your production workload depending on closed-API pricing or availability decisions.
How to try it today
DeepSeek’s own API. OpenAI-compatible, cheapest direct path. See the DeepSeek API docs. Register, get a key, switch your base URL. Works as a drop-in for most OpenAI SDK code.
Vercel AI Gateway. Hacker News commenters on the V4 release thread noted Vercel’s gateway has the cheapest effective rate for V4 via prompt caching — their cache layer survives longer than DeepSeek’s native cache window. Verify against your own traffic before switching.
Ollama cloud. ollama run deepseek-v4-flash:cloud if you have cloud access set up. Currently seeing some stability reports on long runs — fine for testing, check your error rates before putting it in front of users.
vLLM self-hosted. The vLLM recipe for V4-Flash is published. Native FP4/FP8 serving. You need the hardware described in the hardware section above — this is not a weekend project on a single 3090.
llama.cpp + GGUF. Still WIP at the upstream level (see the Quantization status note above). If you need llama.cpp compatibility today, you’re choosing between community GGUFs (teamblobfish/DeepSeek-V4-Flash-GGUF) and WIP forks (antirez/llama.cpp-deepseek-v4-flash, nsparks/DeepSeek-V4-Flash-FP4-FP8-GGUF) — both work, but neither is the upstream-Unsloth-Dynamic-2.0 combination people were originally expecting. For production agent pipelines that need stability, the API or vLLM is the lower-friction choice for now.
Bottom line
V4-Flash is the actual news. Frontier-adjacent capability, MIT license, Haiku-tier API pricing (with cache-hit rates now 10× lower than launch), and architecture that’s legitimately easier to serve at 1M context than anything in its class. If you’re running agentic workloads on Claude or GPT today, test Flash this week. The pricing alone justifies a half-day of effort.
V4-Pro is impressive but it’s not a consumer-hardware story. Research groups, enterprise teams, and well-funded homelabs only. If you’re running Kimi K2 locally, Pro is in the same league. If you’re running a 3090, it isn’t.
The independent benchmarks have now landed — see Independent evaluations now in above. NIST CAISI rates V4-Pro about 8 months behind the US frontier and the most capable PRC model evaluated; FoodTruck Bench has it matching GPT-5.2 within 3% at 17× lower cost. Both reads point the same direction: V4-Pro is not the best model overall, but the cost structure genuinely changes the math for cost-sensitive agent workloads at volume. Pull Flash, try your actual workload against it, and form your own read.
Last updated June 12, 2026 — V4-Pro pricing corrected to the post-May-31 permanent 75% cut ($0.435 input / $0.87 output / $0.003625 cache hit). The June 10 refresh added NIST CAISI and FoodTruck Bench independent evaluations, the April 26 cache-hit 10× reduction note, and an updated read on llama.cpp / GGUF status. Original publication: April 24, 2026.
Get notified when we publish new guides.
Subscribe — free, no spam