Published 21st August 2025

How to deploy and self-host DeepSeek-V3.1 on Northflank

This guide shows you how to deploy and self-host DeepSeek-V3.1 on Northflank using our one-click template or by setting it up manually. The model runs with vLLM for high-throughput inference and includes an OpenAI-compatible endpoint plus a full Open WebUI interface.

DeepSeek-V3.1 supports both thinking and non-thinking chat modes and features a 128K context window, large enough to hold a 300-page book.

📌 TL;DR

DeepSeek-V3.1 is a 671B parameter Mixture-of-Experts model with 128K context, hybrid thinking modes, and improved reasoning speed.
Runs best on 8× NVIDIA H200 GPUs with vLLM.
Deploy and self-host on Northflank in minutes with our one-click template, or configure it manually for full control.
Once deployed, you get a rate-limit-free, OpenAI-compatible API and a user-friendly web interface.

👉 Deploy DeepSeek-V3.1 (128K) on Northflank now

What is DeepSeek-V3.1?

DeepSeek-V3.1 is the latest upgrade in the DeepSeek family of large language models. It builds on V3 and R1 with better reasoning speed, hybrid inference modes, and agentic improvements.

Key details:

Architecture: Mixture-of-Experts (671B total parameters, ~37B active per token)
Context window: 128K tokens
Modes: Chat vs Think (toggleable in WebUI with “DeepThink” button)
Efficiency: FP8 (UE8M0) optimizations for H200 and Chinese domestic accelerator chips
Inference: Faster than R1 and V3 in thinking mode, higher throughput in non-thinking mode

These improvements make DeepSeek-V3.1 one of the most capable open-weight LLMs available today.
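To see why the model is paired with eight H200s, a back-of-the-envelope sizing helps. The sketch below assumes FP8 weights at one byte per parameter and 141 GB of HBM per H200; both figures are assumptions introduced here, and the estimate ignores KV cache, activations, and runtime overhead, so read it as a lower bound rather than a measured footprint.

```python
# Rough sizing sketch. Byte-per-parameter and HBM figures are assumptions, not measurements.
TOTAL_PARAMS = 671e9       # total parameters (from the key details above)
ACTIVE_PARAMS = 37e9       # approximate active parameters per token
BYTES_PER_PARAM_FP8 = 1    # assumption: weights stored in FP8
H200_HBM_GB = 141          # assumption: HBM capacity per H200
NUM_GPUS = 8


def weight_footprint_gb(params, bytes_per_param):
    """Size of the raw weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


weights_gb = weight_footprint_gb(TOTAL_PARAMS, BYTES_PER_PARAM_FP8)
cluster_hbm_gb = H200_HBM_GB * NUM_GPUS
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"Weights: ~{weights_gb:.0f} GB, cluster HBM: {cluster_hbm_gb} GB")
print(f"Left for KV cache and activations: ~{cluster_hbm_gb - weights_gb:.0f} GB")
print(f"Active per token: ~{active_fraction:.1%} of parameters")
```

Under these assumptions the weights alone need several GPUs' worth of memory, which is why the model is sharded across all eight cards.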

Why DeepSeek-V3.1 matters

Hybrid inference: Choose between standard chat or reasoning-heavy “Think” mode.
Faster reasoning: V3.1-Think responds quicker than R1 and earlier DeepSeek releases.
Agent improvements: Stronger tool use and multi-step planning.
128K context: Enough space for large documents, codebases, or entire books.
Open weights: Can be run on your own infra with no API restrictions.
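
As a rough illustration of what 128K tokens means in practice, the sketch below estimates whether a document fits in the window. It assumes about 4 characters per token (a common heuristic for English text, not the DeepSeek tokenizer) and reads "128K" as 131,072 tokens; the function names and the reserved output budget are hypothetical.

```python
CONTEXT_WINDOW = 128 * 1024   # assumption: "128K" read as 131,072 tokens
CHARS_PER_TOKEN = 4           # assumption: rough heuristic for English text


def estimate_tokens(text: str) -> int:
    """Estimate the token count with ceiling division over character length."""
    return -(-len(text) // CHARS_PER_TOKEN)


def fits_in_context(text: str, reserve_for_output: int = 4096) -> bool:
    """Check that the prompt plus a reserved output budget stays inside the window."""
    return estimate_tokens(text) + reserve_for_output <= CONTEXT_WINDOW


# Stand-in for a 300-page book at roughly 300 words per page.
book = "word " * 90_000
print(estimate_tokens(book), fits_in_context(book))
```

For exact counts you would tokenize with the model's own tokenizer; the heuristic is only meant for quick planning.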

On Northflank, you can deploy it securely, scale on demand, and avoid rate limits.

How to deploy DeepSeek-V3.1 on Northflank

You have two options: one-click template or manual setup.

1️⃣ Option 1: One-click deploy

1. Create a Northflank account

Sign up and enable GPU regions.

2. Select the template

From the template catalog, click Deploy DeepSeek-V3.1 128K on 8×H200 Now.

3. Deploy stack

- Creates a vLLM service with a mounted volume for the 671B model.
- Deploys Open WebUI with persistent storage for user data.

4. Wait for load

The vLLM service downloads and shards the model across GPUs. First load takes ~45–60 minutes.

5. Open WebUI

Navigate to the assigned code.run domain.
Create your account and start interacting with DeepSeek-V3.1 in chat or think mode.

You’ll also get an OpenAI-compatible endpoint to use with any client library.
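
Because the first load can take the better part of an hour, it can help to poll the endpoint until the model is listed. The following is a minimal sketch using only the standard library; it assumes the service exposes the usual OpenAI-style `/v1/models` route and accepts a bearer token. The helper names, retry count, and delay are hypothetical.

```python
import json
import time
import urllib.request


def list_models(base_url: str, api_key: str) -> list:
    """Return model IDs from the OpenAI-compatible /models route."""
    req = urllib.request.Request(
        f"{base_url}/models",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        body = json.load(resp)
    return [m["id"] for m in body.get("data", [])]


def wait_until_ready(fetch, attempts: int = 120, delay_s: float = 30.0) -> list:
    """Call fetch() until it returns a non-empty model list or attempts run out."""
    for _ in range(attempts):
        try:
            models = fetch()
            if models:
                return models
        except OSError:
            # URLError and connection failures subclass OSError: server not up yet.
            pass
        time.sleep(delay_s)
    raise TimeoutError("model server did not become ready in time")
```

A call such as `wait_until_ready(lambda: list_models("https://<vllm-service>.code.run/v1", key))` would then block for up to an hour with the defaults shown.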

2️⃣ Option 2: Manual deployment

1. Create a GPU-enabled project

In the Northflank dashboard, select Create Project.
Name: deepseek-v31.
Region: select one with H200 GPUs.

2. Deploy vLLM service

Create a new Deployment service → deepseek-v31-vllm.
Source: External image
```
vllm/vllm-openai:deepseek
```
Runtime variable:
- OPENAI_API_KEY → generate a random 128-character key.
Networking:
- Add port 8000, protocol HTTP, expose publicly.
Compute:
- 8 × NVIDIA H200 GPUs.
Advanced → command (keeps the container idle so you can open a shell and download the model in step 4):
```
sleep 1d
```
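
If you would rather generate the 128-character key yourself, one possible approach is Python's `secrets` module. This is a sketch; the alphabet (letters and digits only) and the function name are choices made here, not a Northflank requirement.

```python
import secrets
import string

# Letters and digits only, so the key is safe to paste into env vars and headers.
ALPHABET = string.ascii_letters + string.digits


def generate_api_key(length: int = 128) -> str:
    """Build a key from a cryptographically secure random source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


api_key = generate_api_key()
print(len(api_key))
```

Store the result as a secret rather than in source control, and reuse the same value in the Open WebUI service later.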

3. Attach persistent storage

Add volume deepseek-models.
Size: 1TB.
Mount path: /root/.cache/huggingface.
Attach to vLLM service.
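
Before starting the download, a quick free-space check on the mounted volume can save a failed multi-hour transfer. The sketch below uses the mount path from this step; the 700 GB requirement is an assumption (roughly the FP8 weights plus some slack), and the helper names are hypothetical.

```python
import shutil

MODEL_CACHE_PATH = "/root/.cache/huggingface"  # mount path from the step above
REQUIRED_GB = 700  # assumption: ~671 GB of FP8 weights plus some slack


def free_space_gb(path: str) -> float:
    """Free space on the filesystem containing path, in decimal gigabytes."""
    return shutil.disk_usage(path).free / 1e9


def has_room(path: str, required_gb: float) -> bool:
    """True when the volume has at least required_gb of free space."""
    return free_space_gb(path) >= required_gb
```

Inside the service shell, `has_room(MODEL_CACHE_PATH, REQUIRED_GB)` should return `True` on a freshly attached 1TB volume if the assumption about model size holds.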

4. Download and serve model

In the service shell:

```
export HF_HUB_ENABLE_HF_TRANSFER=1
pip install --upgrade transformers torch hf-transfer
vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8
```

To automate this on every start, replace the sleep 1d command with:

```
bash -c "export HF_HUB_ENABLE_HF_TRANSFER=1 && pip install --upgrade transformers torch hf-transfer && vllm serve deepseek-ai/DeepSeek-V3.1 --tensor-parallel-size 8"
```
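
If you template the serve command from a script, a small builder keeps the flags in one place. This is a sketch: `--tensor-parallel-size` comes from the command above, while `--max-model-len` is a vLLM flag not used in this guide, and the value 131072 is an assumption for pinning the 128K window.

```python
import shlex
from typing import Optional


def build_serve_command(
    model: str,
    tensor_parallel_size: int,
    max_model_len: Optional[int] = None,
) -> list:
    """Assemble the vllm serve argv; max_model_len is optional."""
    cmd = ["vllm", "serve", model, "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        cmd += ["--max-model-len", str(max_model_len)]
    return cmd


cmd = build_serve_command("deepseek-ai/DeepSeek-V3.1", 8, max_model_len=131072)
print(shlex.join(cmd))
```

Returning a list rather than a string means the command can be passed straight to `subprocess.run` without shell quoting issues.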

5. Deploy Open WebUI

New service: deepseek-v31-webui.
Image:
```
ghcr.io/open-webui/open-webui:latest
```
Volume: persistent for sessions.
Port: 8080, expose publicly.
Env vars:
- OPENAI_API_BASE_URL=https://<vllm-service>.code.run/v1
- OPENAI_API_KEY=<same key>

6. Test via API

Example (Python):

```
import os
from openai import OpenAI

# Point the OpenAI client at the vLLM service instead of api.openai.com.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://<vllm-service>.code.run/v1",
)

resp = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3.1",
    messages=[
        {"role": "user", "content": "Explain DeepSeek-V3.1's benefits"}
    ],
)

# Print only the generated text rather than the whole message object.
print(resp.choices[0].message.content)
```
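
To switch between chat and think modes over the API, one option is to pass a flag through to the chat template. The sketch below uses plain HTTP so it runs without extra packages. It assumes the server forwards a `chat_template_kwargs` field to the model's chat template and that the template reads a boolean `thinking` flag; verify both against your vLLM version before relying on them. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_payload(prompt: str, thinking: bool) -> dict:
    """Build a chat completion request body with the mode flag attached."""
    return {
        "model": "deepseek-ai/DeepSeek-V3.1",
        "messages": [{"role": "user", "content": prompt}],
        # Assumption: the server passes chat_template_kwargs to the chat
        # template, and the template reads a boolean "thinking" flag.
        "chat_template_kwargs": {"thinking": thinking},
    }


def chat(base_url: str, api_key: str, payload: dict) -> str:
    """POST the payload to /chat/completions and return the reply text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the `openai` client, the same field would go through the `extra_body` argument of `chat.completions.create`.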

Cost of deploying DeepSeek-V3.1

How much does it cost to self-host DeepSeek-V3.1?

Many teams choose to self-host DeepSeek-V3.1 for cost efficiency and data privacy. Northflank lets you deploy it without managing the underlying infrastructure yourself.

Running DeepSeek-V3.1 at production scale requires 8× H200 GPUs.

Northflank GPU pricing (as of August 2025):

H200: ~$3.20/hour per GPU
8×H200 = ~$25.60/hour

Token cost equivalent with vLLM optimizations:

Input: ~$0.10 per 1M tokens
Output: ~$2.20 per 1M tokens

You pay only for the GPUs and storage you run, with no hidden charges.
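
The figures above can be turned into a quick budget estimate. The sketch below uses the quoted hourly price and output-token cost; the 730-hour month and the attribution of all cost to output tokens are simplifying assumptions, so the break-even throughput is indicative only.

```python
GPU_HOURLY_USD = 3.20      # per H200, from the pricing above
NUM_GPUS = 8
OUTPUT_USD_PER_M = 2.20    # quoted output-token cost equivalent

hourly = GPU_HOURLY_USD * NUM_GPUS
daily = hourly * 24
monthly = hourly * 730     # assumption: ~730 hours in an average month

# Output throughput at which the hourly bill equals the quoted per-token price,
# under the simplifying assumption that all cost is attributed to output tokens.
break_even_tokens_per_hour = hourly / OUTPUT_USD_PER_M * 1_000_000
break_even_tokens_per_sec = break_even_tokens_per_hour / 3600

print(f"${hourly:.2f}/hour, ${daily:.2f}/day, ~${monthly:,.0f}/month")
print(f"Break-even: ~{break_even_tokens_per_sec:,.0f} output tokens/sec")
```

If your sustained throughput sits well below the break-even figure, the effective per-token cost of self-hosting will be higher than the quoted equivalent.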

This makes Northflank one of the most cost-efficient platforms for MoE inference at scale.

DeepSeek-V3.1 vs earlier versions

DeepSeek has iterated quickly, with each release pushing reasoning, speed, and usability forward.

DeepSeek-V3.1 vs DeepSeek-V3

Architecture: Both use a 671B Mixture-of-Experts design with ~37B active parameters per forward pass.
Context window: V3 had 64K tokens, while V3.1 doubles this to 128K tokens.
Performance: V3.1 runs more efficiently on H200 GPUs thanks to FP8 (UE8M0) optimizations.
Inference modes: V3 supported standard chat-style inference only. V3.1 introduces hybrid inference with both chat and think modes.
Reasoning: V3 was capable but slower at multi-step reasoning. V3.1 improves both speed and accuracy in reasoning-heavy tasks.

👉 Verdict: DeepSeek-V3.1 is a direct upgrade, with more context, faster reasoning, and flexible inference modes.

DeepSeek-V3.1 vs DeepSeek-R1

Purpose: R1 was tuned specifically for chain-of-thought reasoning using reinforcement learning. V3.1 integrates those improvements into a general-purpose model.
Context window: R1 was limited to 64K. V3.1 expands this to 128K tokens.
Speed: R1 reasoning was accurate but often slower. V3.1’s “Think” mode is faster while maintaining quality.
Flexibility: R1 forced reasoning-heavy outputs. V3.1 gives you a toggle between fast chat and deep reasoning.
Agent performance: V3.1 shows stronger results on tool use and multi-step tasks compared to R1.

👉 Verdict: DeepSeek-V3.1 replaces R1 by offering reasoning at higher speed, with the option to switch back to standard inference.

🔗 Useful links

Deploy DeepSeek-V3.1 on Northflank
Deploy DeepSeek’s older versions in your own cloud
- GCP
- Azure
- AWS
Deploy Qwen3 on Northflank
Self-host gpt-oss on Northflank

Final thoughts

DeepSeek-V3.1 represents a leap forward in open-weight reasoning models: hybrid inference, faster chain-of-thought, and a 128K context.

On Northflank, you can run it securely, scale across H200 GPUs, and interact through an OpenAI-compatible API or a user-friendly WebUI, with no rate limits.

👉 Deploy DeepSeek-V3.1 on Northflank now

URL: https://northflank.com/blog/deploy-self-host-deep-seek-v3-1-on-northflank