Fine-Tune DeepSeek Models for Custom Use Cases

Mark Harbottle

Published in

AI·Programming·JavaScript·

March 29, 2026

How to Fine-Tune DeepSeek Models for Custom Use Cases

Select a distilled base model (e.g., DeepSeek-R1-Distill-Qwen-7B) that fits your GPU budget.
Prepare 500–5,000 labeled examples as JSONL chat messages with train/validation splits.
Install pinned dependencies (transformers, peft, trl, bitsandbytes) and authenticate with Hugging Face.
Load the base model with 4-bit QLoRA quantization to fit within 16 GB VRAM.
Configure LoRA adapters (r=16, alpha=32) targeting attention projection layers.
Train using SFTTrainer with cosine scheduling, monitoring eval loss for overfitting.
Evaluate the fine-tuned adapter against a held-out test set, comparing base vs. adapted accuracy.
Deploy via TGI Docker and a Node.js API layer consumed by your frontend.

You will take a general-purpose language model, adapt it to a specific task with a few thousand labeled examples, and deploy the result behind an API your frontend can call. DeepSeek's open-weight models, particularly DeepSeek-R1 and DeepSeek-V3, ship under permissive licenses that allow commercial use. Using parameter-efficient fine-tuning techniques like LoRA and QLoRA, teams without massive GPU budgets can adapt these models to specialized tasks, typically seeing 10-30 percentage point accuracy gains with 500-5,000 examples.

This tutorial walks through the complete pipeline: preparing a dataset, configuring and executing a fine-tuning job on a DeepSeek model, evaluating results, and serving the fine-tuned model through a Node.js API consumed by a React frontend. The target task is customer support classification, though the approach generalizes to other domains.

Prerequisites: Python 3.10+, Node.js 18+ (global fetch is built in from 18.0; on older runtimes install the node-fetch package), a Hugging Face account with a write-access token, and access to a GPU with at least 16GB VRAM (local or cloud). You must authenticate before pushing models or datasets: run huggingface-cli login before executing any push_to_hub calls. Docker with the NVIDIA Container Toolkit is required for the serving section.

Understanding DeepSeek Models and Fine-Tuning Concepts

DeepSeek Model Architecture Overview

DeepSeek-V3 is a Mixture-of-Experts (MoE) model with 671 billion total parameters, of which approximately 37 billion activate per token. It excels at general-purpose generation and instruction following. DeepSeek-R1, also a 671B MoE model, was trained specifically for extended chain-of-thought reasoning, making it stronger on tasks requiring multi-step logic, math, and code. For fine-tuning on consumer or single-cloud-GPU hardware, the distilled variants are the practical choice. DeepSeek-R1-Distill-Qwen-7B is a dense 7B parameter model that distills R1's reasoning capabilities into a Qwen2 architecture. It requires significantly less VRAM while retaining much of R1's reasoning behavior on benchmarks like MATH-500 and AIME 2024 (see the DeepSeek-R1 technical report for specific scores).

The MoE architecture of the full-size models adds a routing layer that directs tokens to different expert subnetworks. Current PEFT libraries do not automate routing-aware adapter placement, so adapters must target specific expert projections manually. Distilled dense models sidestep this entirely, making them the recommended starting point.

Fine-Tuning Strategies: Full vs. Parameter-Efficient

Full fine-tuning updates every parameter in the model. For a 7B model, weights alone require approximately 28GB in float32 (7B × 4 bytes). Adam keeps two moment buffers per parameter, adding another 56GB, so weights plus optimizer states already total 84GB; gradients add a further 28GB before any activations are counted. Mixed-precision training halves the weight footprint, but optimizer states remain large. Full fine-tuning is rarely practical outside well-funded labs.
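The memory arithmetic above can be reproduced with a short sketch. The function name and the decision to ignore activations are assumptions made for illustration; real usage will be higher once activations and framework overhead are included.

```python
def full_finetune_memory_gb(n_params: float, bytes_per_value: int = 4) -> dict:
    """Rough memory estimate for full fine-tuning with Adam (activations excluded)."""
    gb = 1e9  # decimal gigabytes, matching the 7B x 4 bytes = 28GB figure
    weights = n_params * bytes_per_value / gb
    gradients = n_params * bytes_per_value / gb        # one gradient per weight
    adam_states = 2 * n_params * bytes_per_value / gb  # first and second moments
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_states": adam_states,
        "weights_plus_adam": weights + adam_states,
        "total": weights + gradients + adam_states,
    }


estimate = full_finetune_memory_gb(7e9)
for name, value in estimate.items():
    print(f"{name}: {value:.0f} GB")
```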

LoRA (Low-Rank Adaptation) freezes the base model weights and injects small trainable low-rank matrices into specific layers, typically the attention projections. This cuts the trainable parameter count by orders of magnitude: with r=16 on the four attention projections of a 7B Qwen2-style model, the adapters hold on the order of 10M trainable parameters against 7B frozen ones. QLoRA extends this by loading the frozen base model in 4-bit quantized form, which brings the training footprint for a 7B model down from the 84GB or more of full fine-tuning to roughly 6-8GB, so the model fits on a single 16GB GPU for inference and for training with small batch sizes. On 16GB GPUs, keep the training batch size at 1-2 to avoid out-of-memory errors. For DeepSeek distilled models, QLoRA with LoRA adapters on the attention layers is the recommended approach: it keeps VRAM under 16GB while typically staying within 1-3 percentage points of full fine-tuning accuracy on classification tasks.
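As a rough check on the adapter size, each LoRA-wrapped linear layer adds r × (in_features + out_features) parameters. The sketch below applies that formula under assumed Qwen2-7B-style dimensions (hidden size 3584, 28 layers, 28 attention heads, 4 key/value heads); these numbers are assumptions for illustration, so confirm them against the config.json of the checkpoint you actually load. Under these dimensions the count comes out near 10 million.

```python
def lora_params(in_features: int, out_features: int, r: int) -> int:
    """A is (r x in_features) and B is (out_features x r), so r * (in + out) in total."""
    return r * (in_features + out_features)


# Assumed Qwen2-7B-style dimensions; verify against the model's config.json.
hidden_size = 3584
num_layers = 28
num_heads = 28
num_kv_heads = 4
head_dim = hidden_size // num_heads   # 128
kv_dim = num_kv_heads * head_dim      # 512, because of grouped-query attention

shapes = {
    "q_proj": (hidden_size, hidden_size),
    "k_proj": (hidden_size, kv_dim),
    "v_proj": (hidden_size, kv_dim),
    "o_proj": (hidden_size, hidden_size),
}

per_layer = sum(lora_params(i, o, r=16) for i, o in shapes.values())
total_trainable = per_layer * num_layers
print(f"Trainable LoRA parameters: {total_trainable:,}")
```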

Preparing Your Dataset

Dataset Format and Structure

The training script expects JSONL, with each line containing one structured conversation. For compatibility with SFTTrainer and the tokenizer's chat template, structure each entry as OpenAI-style chat messages. For LoRA fine-tuning, 500 to 5,000 high-quality examples is a common heuristic, though the right number depends on task complexity. Quality matters more than quantity: a large dataset with inconsistent labels, contradictory examples, or noise will produce a worse adapter than a smaller, clean one.

{"messages": [{"role": "system", "content": "You are a customer support classifier. Classify the customer message into one of: billing, technical, account, general."}, {"role": "user", "content": "I was charged twice for my subscription this month."}, {"role": "assistant", "content": "billing"}]}
{"messages": [{"role": "system", "content": "You are a customer support classifier. Classify the customer message into one of: billing, technical, account, general."}, {"role": "user", "content": "My dashboard won't load after the latest update."}, {"role": "assistant", "content": "technical"}]}
{"messages": [{"role": "system", "content": "You are a customer support classifier. Classify the customer message into one of: billing, technical, account, general."}, {"role": "user", "content": "How do I reset my password?"}, {"role": "assistant", "content": "account"}]}
{"messages": [{"role": "system", "content": "You are a customer support classifier. Classify the customer message into one of: billing, technical, account, general."}, {"role": "user", "content": "What are your business hours?"}, {"role": "assistant", "content": "general"}]}
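If your labeled data starts out as (text, label) pairs, a small helper can emit lines in the format shown above. This is a minimal sketch; to_chat_record and the LABELS set are hypothetical names introduced here, and the system prompt is copied from the examples.

```python
import json

SYSTEM_PROMPT = (
    "You are a customer support classifier. Classify the customer message "
    "into one of: billing, technical, account, general."
)
LABELS = {"billing", "technical", "account", "general"}


def to_chat_record(text: str, label: str) -> dict:
    """Wrap one labeled example in the system/user/assistant message structure."""
    if label not in LABELS:
        raise ValueError(f"Unknown label: {label!r}")
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
            {"role": "assistant", "content": label},
        ]
    }


pairs = [
    ("I was charged twice for my subscription this month.", "billing"),
    ("How do I reset my password?", "account"),
]
lines = [json.dumps(to_chat_record(t, l), ensure_ascii=False) for t, l in pairs]
print("\n".join(lines))
```

Rejecting unknown labels at this stage keeps typos such as "biling" out of the training set, where they would otherwise become a fifth class.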

Cleaning and Validating Your Data

Before training, deduplicate the dataset, filter by length to remove outliers, and verify that every entry contains the required fields. A 90/10 train/validation split works well for datasets under 5,000 examples. The script below covers the field check and the split, and prints a rough token estimate. That estimate (character count divided by 4) is unreliable for non-English or code-heavy content; use tokenizer.encode() for accurate token counts.

import json
import random


def validate_dataset(filepath, val_ratio=0.1):
    entries = []
    errors = []
    token_lengths = []

    # Parse each line and keep only entries with both a user and an assistant turn
    with open(filepath, "r", encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
                messages = entry.get("messages", [])
                roles = [m.get("role") for m in messages]
                if "user" not in roles or "assistant" not in roles:
                    errors.append(f"Line {i}: missing user or assistant role")
                    continue
                total_len = sum(len(m.get("content", "")) for m in messages)
                token_lengths.append(total_len // 4)  # rough token estimate
                entries.append(entry)
            except json.JSONDecodeError as exc:
                errors.append(f"Line {i}: invalid JSON: {exc}")

    print(f"Valid entries: {len(entries)}")
    print(f"Errors: {len(errors)}")
    if token_lengths:
        print(f"Avg estimated tokens: {sum(token_lengths) // len(token_lengths)}")
    else:
        print("No valid entries to estimate token length.")
    for e in errors[:5]:
        print(f"  {e}")

    if not entries:
        raise ValueError("No valid entries found; aborting split to avoid empty dataset files.")

    # Shuffle with a fixed seed so the split is reproducible
    random.seed(42)
    random.shuffle(entries)
    split = int(len(entries) * (1 - val_ratio))
    splits = [("train.jsonl", entries[:split]), ("val.jsonl", entries[split:])]

    for split_name, data in splits:
        try:
            with open(split_name, "w", encoding="utf-8") as out:
                for item in data:
                    out.write(json.dumps(item, ensure_ascii=False) + "\n")
            print(f"Wrote {len(data)} entries to {split_name}")
        except OSError as exc:
            raise RuntimeError(f"Failed to write {split_name}: {exc}") from exc


validate_dataset("dataset.jsonl")
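The deduplication and length-filtering steps mentioned above could be sketched as follows. dedupe_and_filter and the max_chars threshold are hypothetical, not part of any library; the sketch only catches exact duplicates and uses character counts as a stand-in for token counts.

```python
import json


def dedupe_and_filter(entries, max_chars=4000):
    """Drop exact duplicate conversations and entries longer than max_chars."""
    seen = set()
    kept = []
    for entry in entries:
        messages = entry.get("messages", [])
        # Serialize the conversation so identical message lists compare equal
        key = json.dumps(messages, sort_keys=True, ensure_ascii=False)
        total_chars = sum(len(m.get("content", "")) for m in messages)
        if key in seen or total_chars > max_chars:
            continue
        seen.add(key)
        kept.append(entry)
    return kept


sample = [
    {"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "general"}]},
    {"messages": [{"role": "user", "content": "hi"}, {"role": "assistant", "content": "general"}]},
    {"messages": [{"role": "user", "content": "x" * 5000}, {"role": "assistant", "content": "general"}]},
]
cleaned = dedupe_and_filter(sample)
print(f"Kept {len(cleaned)} of {len(sample)} entries")
```

Running this before the train/validation split also prevents the same example from landing on both sides of the split, which would inflate validation scores.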

Uploading to Hugging Face Hub

Once validated, push the dataset to the Hub for easy access during training:

# Run 'huggingface-cli login' before executing this block, or call login() programmatically.
from datasets import load_dataset
dataset = load_dataset("json", data_files={"train": "train.jsonl", "validation": "val.jsonl"})
dataset.push_to_hub("your-username/support-classifier-dataset", private=True)

Setting Up the Training Environment

Hardware Requirements and Cloud Options

The minimum viable setup for QLoRA fine-tuning of DeepSeek-R1-Distill-Qwen-7B is a single GPU with 16GB+ VRAM. An NVIDIA T4 (16GB) works but training will be slow; it requires fp16=True (the T4 does not support bfloat16) and a small batch size (1-2). With an A10G (24GB) or RTX 4090 (24GB), you get bfloat16 support and can run batch size 4+, roughly halving training time compared to the T4. Cloud options include RunPod (A10G instances starting around $0.50/hr as of mid-2024; verify current rates at runpod.io), Lambda Labs, and Google Colab Pro (A100 access intermittently available). For the tutorial dataset of approximately 1,000 examples over 3 epochs, expect 1 to 3 hours of training on an A10G (actual time depends on sequence length, batch size, and model loading overhead), costing roughly $0.50 to $1.50 at that hourly rate (excluding storage and egress fees).

Installing Dependencies

# requirements.txt
torch==2.1.2
transformers==4.44.2
peft==0.12.0
trl==0.9.6
bitsandbytes==0.43.1
datasets==2.20.0
accelerate==0.33.0
huggingface_hub
scipy

pip install -r requirements.txt

Pinning versions prevents breakage between bitsandbytes, transformers, and peft, which has historically been a pain point.
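To confirm that the environment actually matches the pins, a quick check with the standard library's importlib.metadata can help. This is a sketch: find_mismatches is a hypothetical helper, and it strips local build suffixes such as "+cu121" on the assumption that a CUDA build of torch should still count as matching.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from requirements.txt above
PINS = {
    "torch": "2.1.2",
    "transformers": "4.44.2",
    "peft": "0.12.0",
    "trl": "0.9.6",
    "bitsandbytes": "0.43.1",
    "datasets": "2.20.0",
    "accelerate": "0.33.0",
}


def find_mismatches(pins, get_version=version):
    """Return {package: installed_version_or_None} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name).split("+")[0]  # drop local suffix like +cu121
        except PackageNotFoundError:
            installed = None
        if installed != expected:
            mismatches[name] = installed
    return mismatches


print(find_mismatches(PINS) or "All pinned packages match.")
```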

Configuring and Running the Fine-Tuning Job

Loading the Base Model with Quantization

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Use torch.float16 on T4 or other pre-Ampere GPUs
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Review the model repo's code files before enabling in production
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
# Note: using EOS as the pad token can cause premature stopping in some cases.
# An alternative is tokenizer.add_special_tokens({'pad_token': '[PAD]'})
# followed by model.resize_token_embeddings(len(tokenizer)).
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

The nf4 quantization type is specifically designed for normally distributed weights (see Dettmers et al., "QLoRA: Efficient Finetuning of Quantized LLMs," 2023) and pairs well with bfloat16 compute on Ampere+ GPUs. Double quantization further reduces memory overhead by quantizing the quantization constants themselves, saving approximately 0.37 bits per parameter.
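The 0.37-bit figure follows from the block sizes described in the QLoRA paper: one 32-bit constant per 64 weights without double quantization, versus one 8-bit constant per 64 weights plus one 32-bit constant per 256 of those blocks with it. The sketch below reproduces that arithmetic; the block sizes are taken from the paper and are assumed, not verified, to match the bitsandbytes defaults in your installed version.

```python
def quant_constant_overhead_bits(block_size=64, second_block_size=256, double_quant=True):
    """Bits per parameter spent on quantization constants."""
    if not double_quant:
        return 32 / block_size  # one fp32 constant per block of weights
    # 8-bit first-level constants, plus fp32 second-level constants shared across blocks
    return 8 / block_size + 32 / (block_size * second_block_size)


plain = quant_constant_overhead_bits(double_quant=False)
double = quant_constant_overhead_bits(double_quant=True)
saving = plain - double
print(f"Without double quantization: {plain:.3f} bits/param")
print(f"With double quantization:    {double:.3f} bits/param")
print(f"Saving:                      {saving:.3f} bits/param")
```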

Defining LoRA Configuration

from peft import LoraConfig, TaskType

lora_config = LoraConfig(
    r=16,            # rank: higher = more capacity, more VRAM
    lora_alpha=32,   # scaling factor for adapter output magnitude; higher alpha/r amplifies adapter contributions
    target_modules=[
        "q_proj", "k_proj",  # attention query and key projections
        "v_proj", "o_proj",  # attention value and output projections
    ],
    lora_dropout=0.05,             # regularization to prevent overfitting
    bias="none",                   # don't train bias terms
    task_type=TaskType.CAUSAL_LM,  # causal language modeling objective
)

The ratio of lora_alpha to r (here 32/16 = 2) controls the effective magnitude of the adapter updates. Targeting all four attention projection matrices provides good coverage without touching the MLP layers, keeping trainable parameter count low.
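To make the role of that ratio concrete, here is a toy sketch of the LoRA forward pass, y = W x + (alpha / r) · B (A x), written in plain Python with tiny matrices. It illustrates the standard formulation rather than the PEFT implementation, and it uses r=1, alpha=2 so the ratio matches the tutorial's 32/16 = 2.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, r, alpha):
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)                # frozen weights
    update = matvec(B, matvec(A, x))   # project down to rank r, then back up
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 base weight (identity for readability)
A = [[1.0, 1.0]]               # r x in_features, here 1 x 2
B = [[0.5], [0.0]]             # out_features x r, here 2 x 1
x = [1.0, 2.0]

adapted = lora_forward(W, A, B, x, r=1, alpha=2)
untrained = lora_forward(W, A, [[0.0], [0.0]], x, r=1, alpha=2)
print("Adapted output:  ", adapted)
print("B = 0 (at init): ", untrained)
```

With B set to zero, which is how LoRA initializes it, the adapter contributes nothing and the output equals the base model's, so training starts from the unmodified model.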

Training Arguments and SFTTrainer Setup

from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

dataset = load_dataset("your-username/support-classifier-dataset")

training_args = TrainingArguments(
    output_dir="./deepseek-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=2,  # Reduce to 1 on 16GB VRAM GPUs; increase gradient_accumulation_steps proportionally.
    gradient_accumulation_steps=8,  # effective batch size = 16
    learning_rate=2e-4,
    warmup_steps=50,
    logging_steps=10,
    evaluation_strategy="steps",    # Use evaluation_strategy for transformers==4.44.2
    eval_steps=100,
    save_strategy="steps",
    save_steps=100,
    bf16=True,                      # Requires Ampere+ GPU (A10G, A100, RTX 3090+). For T4, use fp16=True instead.
    optim="paged_adamw_8bit",       # memory-efficient optimizer
    lr_scheduler_type="cosine",
    report_to="none",
)


# SFTTrainer passes a batch dict; formatting_func must handle a list of message-lists
def format_batch(batch):
    return [
        tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)
        for msgs in batch["messages"]
    ]


trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
    peft_config=lora_config,
    max_seq_length=512,
    formatting_func=format_batch,  # returns list[str], one per example in batch
)

The paged_adamw_8bit optimizer offloads optimizer states to CPU when GPU memory is tight, preventing OOM errors during backpropagation. Cosine learning rate scheduling provides a smooth decay that works well with short training runs.
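It can help to work out how many optimizer steps the run will take and what the learning rate looks like along the way. The sketch below assumes about 900 training examples (90% of a 1,000-example dataset) and approximates how the Trainer counts steps; the cosine function follows the usual linear-warmup-then-half-cosine shape. Both helpers are illustrative, not library code.

```python
import math


def optimizer_steps(n_examples, per_device_batch, grad_accum, epochs):
    """Approximate optimizer steps on one GPU: micro-batches per epoch // accumulation."""
    micro_batches = math.ceil(n_examples / per_device_batch)
    steps_per_epoch = max(micro_batches // grad_accum, 1)
    return steps_per_epoch * epochs


def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup followed by half-cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


total = optimizer_steps(n_examples=900, per_device_batch=2, grad_accum=8, epochs=3)
print(f"Approximate optimizer steps: {total}")
for step in (0, 25, 50, 100, total):
    print(f"step {step:>3}: lr = {cosine_lr(step, total, 50, 2e-4):.2e}")
```

Under these assumptions the run is well under 200 steps, so warmup_steps=50 covers a large share of training and eval_steps=100 yields only one mid-run evaluation. For datasets of this size, lowering both values may give a more useful loss curve.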

Launching Training and Monitoring

# Run 'huggingface-cli login' before executing push_to_hub calls below.
trainer.train()

# SFTTrainer wraps the base model in a PeftModel, exposed as trainer.model.
# Saving it writes the LoRA adapter only (not merged). To create a standalone model,
# call merge_and_unload() on the PEFT model before saving.
trainer.model.save_pretrained("./deepseek-finetuned/final")
tokenizer.save_pretrained("./deepseek-finetuned/final")

# Optional: push adapter to Hugging Face Hub
trainer.model.push_to_hub("your-username/deepseek-support-classifier", private=True)
tokenizer.push_to_hub("your-username/deepseek-support-classifier", private=True)

During training, watch for eval loss that decreases alongside training loss. If training loss drops but eval loss plateaus or rises after epoch 1, overfitting is occurring: reduce epochs, increase dropout, or add more data. OOM errors mid-training typically respond to reducing per_device_train_batch_size to 1 or lowering max_seq_length.
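The "eval loss stopped improving" rule can be expressed as a small helper applied to the logged eval losses. best_checkpoint_index is a hypothetical function written for illustration; transformers also ships an EarlyStoppingCallback that applies a similar patience rule during training, if you prefer to stop automatically.

```python
def best_checkpoint_index(eval_losses, patience=2):
    """Return the index of the best eval loss once `patience` later evals fail to beat it.

    Returns None if the loss was still improving (or had not stalled long enough).
    """
    best_idx = None
    best_loss = float("inf")
    for idx, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_idx, best_loss = idx, loss
        elif best_idx is not None and idx - best_idx >= patience:
            return best_idx
    return None


overfitting_run = [1.00, 0.80, 0.70, 0.75, 0.90]
healthy_run = [1.00, 0.90, 0.80]
print("Overfitting run, best eval index:", best_checkpoint_index(overfitting_run))
print("Healthy run:", best_checkpoint_index(healthy_run))
```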

Evaluating Your Fine-Tuned Model

Qualitative Testing with Sample Prompts

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
device = "cuda" if torch.cuda.is_available() else "cpu"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # Use torch.float16 on T4 or other pre-Ampere GPUs
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,  # Review the model repo's code files before enabling in production
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

finetuned_model = PeftModel.from_pretrained(base_model, "./deepseek-finetuned/final")
finetuned_model.eval()

test_prompts = [
    "I need a refund for the double charge on my card.",
    "The API returns a 500 error when I upload files larger than 10MB.",
    "Can I transfer my license to a colleague?",
]
system_msg = "You are a customer support classifier. Classify the customer message into one of: billing, technical, account, general."

for prompt in test_prompts:
    messages = [{"role": "system", "content": system_msg}, {"role": "user", "content": prompt}]
    chat_text = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(chat_text, return_tensors="pt").to(device)
    input_len = inputs["input_ids"].shape[1]

    with torch.no_grad():
        # Use disable_adapter() for base-model inference, since PeftModel wraps the base in-place
        with finetuned_model.disable_adapter():
            base_out = finetuned_model.generate(
                **inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id
            )
        # Outside the context manager the adapter is active again
        ft_out = finetuned_model.generate(
            **inputs, max_new_tokens=20, pad_token_id=tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens, skipping the prompt
    print(f"Prompt: {prompt}")
    print(f"  Base model: {tokenizer.decode(base_out[0][input_len:], skip_special_tokens=True)}")
    print(f"  Fine-tuned model: {tokenizer.decode(ft_out[0][input_len:], skip_special_tokens=True)}")
    print()

Before/After Performance Metrics

Quantitative evaluation should use metrics appropriate to the task. For classification, accuracy on a held-out test set is the primary metric. For generation tasks, BLEU or ROUGE scores apply. The following table shows illustrative results for the customer support classification task; measure your own held-out set to get numbers you can trust.

Illustrative Example Results (Not a Benchmark)

Metric	Base DeepSeek Model	Fine-Tuned Model	Improvement
Task Accuracy	62%	91%	+29 percentage points
Response Relevance (human eval)	3.1/5	4.6/5	+48%
Avg. Inference Latency	1.2s	1.2s	No change
Hallucination Rate	18%	4%	-78%

Inference latency remains unchanged when the adapter is merged into base weights before serving, because the merged model has the same architecture as the original. When using PEFT inference with an unmerged adapter, a small overhead may exist. The accuracy improvement from 62% to 91% reflects the base model's tendency to generate verbose explanations rather than clean single-label outputs, a behavior that fine-tuning corrects.
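Measuring accuracy on your own held-out set requires mapping free-form generations back to labels, since the base model in particular may wrap the label in extra text. The sketch below is one way to do that. extract_label and accuracy are hypothetical helpers, and the handling of a closing think tag assumes the model may emit a reasoning block before its answer.

```python
LABELS = ("billing", "technical", "account", "general")


def extract_label(generated: str):
    """Return the first known label mentioned in the answer, or None if none appears."""
    # Assumption: any reasoning block ends with </think>; only the text after it is the answer
    answer = generated.split("</think>")[-1].lower()
    positions = {label: answer.find(label) for label in LABELS if label in answer}
    return min(positions, key=positions.get) if positions else None


def accuracy(outputs, gold_labels):
    """Fraction of generations whose extracted label matches the gold label."""
    if not gold_labels or len(outputs) != len(gold_labels):
        raise ValueError("outputs and gold_labels must be non-empty and the same length")
    correct = sum(extract_label(o) == g for o, g in zip(outputs, gold_labels))
    return correct / len(gold_labels)


outputs = [
    "billing",
    "This looks like a Technical issue with the upload endpoint.",
    "<think>could be technical?</think>\naccount",
    "I'm not sure how to classify this.",
]
gold = ["billing", "technical", "account", "general"]
print(f"Accuracy: {accuracy(outputs, gold):.2f}")
```

Counting unparseable outputs as wrong, as this sketch does, penalizes verbose answers the same way a downstream consumer expecting a single label would.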

Serving the Fine-Tuned Model via Node.js API

Model Serving Options

For self-hosted inference, Text Generation Inference (TGI) by Hugging Face and vLLM are the two primary options. TGI provides a Docker image with built-in support for PEFT adapters, streaming, and batching. Verify adapter format compatibility with the specific TGI version used. Hugging Face Inference Endpoints offer a managed alternative. This tutorial uses TGI with Docker. Pin the TGI image tag to a specific version for reproducibility, and verify supported --quantize values by running the image with --help:

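# Pin the TGI image tag (e.g., :2.0.4) for reproducibility.
# Verify accepted flags and --quantize values: docker run ghcr.io/huggingface/text-generation-inference:2.0.4 --help
# Adapter flags differ between TGI releases (newer versions document --lora-adapters),
# so confirm the adapter flag used below against --help for your pinned tag.
# The adapter must be available as a Hub model ID (push your adapter to the Hub first).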
docker run --gpus all -p 8080:80 \
 ghcr.io/huggingface/text-generation-inference:2.0.4 \
 --model-id deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \
 --adapter-id your-username/deepseek-support-classifier

Note: The --quantize flag accepts specific values that vary by TGI version. Run the image with --help to confirm accepted values before adding quantization. If you need 4-bit inference, consider serving a pre-quantized model or verifying that your TGI version supports --quantize bitsandbytes-nf4.
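Before building the API layer, a quick smoke test from Python can confirm the container responds. This is a minimal sketch using only the standard library; it assumes the container from the command above is listening on localhost:8080 and exposes the OpenAI-compatible /v1/chat/completions route, and the helper names are hypothetical.

```python
import json
import urllib.request

SYSTEM_PROMPT = (
    "You are a customer support classifier. Classify the customer message "
    "into one of: billing, technical, account, general."
)


def build_chat_payload(user_text, model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B", max_tokens=100):
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.1,
    }


def classify(user_text, url="http://localhost:8080/v1/chat/completions", timeout=60):
    """Send one message to the model server and return the generated text."""
    body = json.dumps(build_chat_payload(user_text)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        data = json.load(response)
    return data["choices"][0]["message"]["content"]


# Inspect the request body without contacting the server
print(json.dumps(build_chat_payload("I was charged twice."), indent=2))
```

With the container running, calling classify("I was charged twice.") sends the request; a connection error at that point usually means the container is still loading weights or the port mapping differs.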

Building the Node.js API Layer

The following server uses TGI's OpenAI-compatible /v1/chat/completions endpoint, which allows TGI to apply the correct chat template server-side. This avoids the pitfall of manually formatting prompts on the client.

const express = require("express");
const rateLimit = require("express-rate-limit");
const cors = require("cors");

const app = express();
app.use(express.json());

const ALLOWED_ORIGIN = process.env.ALLOWED_ORIGIN || "http://localhost:3000";
const IS_DEV = process.env.NODE_ENV !== "production";

app.use(cors({
  origin: (origin, callback) => {
    // Allow requests with no origin (server-to-server, curl) only in dev
    if ((!origin && IS_DEV) || origin === ALLOWED_ORIGIN) {
      callback(null, true);
    } else {
      callback(new Error(`CORS: origin ${origin} not allowed`));
    }
  },
  methods: ["POST"],
  allowedHeaders: ["Content-Type"],
}));

// Limit each client to 30 requests per minute on API routes
const limiter = rateLimit({
  windowMs: 60 * 1000,
  max: 30,
  message: { error: "Rate limit exceeded. Try again shortly." },
});
app.use("/api/", limiter);

const TGI_URL = process.env.TGI_URL || "http://localhost:8080/v1/chat/completions";
const INFERENCE_TIMEOUT_MS = parseInt(process.env.INFERENCE_TIMEOUT_MS || "60000", 10);

app.post("/api/generate", async (req, res) => {
  const { prompt } = req.body;
  if (!prompt || typeof prompt !== "string" || prompt.length > 5000) {
    return res.status(400).json({ error: "A valid 'prompt' string is required (max 5000 chars)." });
  }

  // Abort the upstream request if the model server takes too long
  const controller = new AbortController();
  const timeoutId = setTimeout(() => controller.abort(), INFERENCE_TIMEOUT_MS);

  try {
    const response = await fetch(TGI_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
        messages: [
          { role: "system", content: "You are a customer support classifier. Classify the customer message into one of: billing, technical, account, general." },
          { role: "user", content: prompt },
        ],
        max_tokens: 100,
        temperature: 0.1,
      }),
      signal: controller.signal,
    });

    if (!response.ok) {
      const errText = await response.text();
      return res.status(502).json({ error: "Model server error", details: errText });
    }

    const data = await response.json();
    const result = data.choices?.[0]?.message?.content ?? "";
    res.json({ result });
  } catch (err) {
    if (err.name === "AbortError") {
      return res.status(504).json({ error: "Inference timed out. Try a shorter prompt." });
    }
    res.status(500).json({ error: "Failed to reach model server", details: err.message });
  } finally {
    clearTimeout(timeoutId);
  }
});

app.listen(3001, () => console.log("API server running on port 3001"));

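Node.js version note: The global fetch API is available by default from Node.js 18.0, although it was flagged experimental until Node.js 21. On older runtimes, install node-fetch version 2 (version 3 is ESM-only and cannot be loaded with require) and add const fetch = require('node-fetch'); at the top of the file.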

Building a React Frontend for Interaction

Proxy configuration: In development, the React dev server (typically port 3000) must proxy API requests to the Node.js server (port 3001). If using Create React App, add "proxy": "http://localhost:3001" to your package.json. If using Vite, configure server.proxy in vite.config.js. Without this, requests to /api/generate are handled by the dev server itself and return a 404; calling the API server by absolute URL instead works only from the origin allowed in its CORS configuration.

import { useState, useRef } from "react";

function SupportClassifier() {
  const [prompt, setPrompt] = useState("");
  const [result, setResult] = useState(null);
  const [loading, setLoading] = useState(false);
  const [error, setError] = useState(null);
  const abortRef = useRef(null);

  const handleSubmit = async (e) => {
    e.preventDefault();
    if (prompt.length > 5000) {
      setError("Message must be 5000 characters or fewer.");
      return;
    }

    // Cancel any in-flight request
    if (abortRef.current) abortRef.current.abort();
    const controller = new AbortController();
    abortRef.current = controller;

    setLoading(true);
    setError(null);
    setResult(null);

    try {
      const res = await fetch("/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
        signal: controller.signal,
      });
      if (!res.ok) throw new Error(`Server responded with ${res.status}`);
      const data = await res.json();
      setResult(data.result);
    } catch (err) {
      if (err.name !== "AbortError") setError(err.message);
    } finally {
      setLoading(false);
    }
  };

  return (
    <div style={{ maxWidth: 600, margin: "2rem auto", fontFamily: "system-ui" }}>
      <h2>Customer Support Classifier</h2>
      <form onSubmit={handleSubmit}>
        <textarea
          rows={4}
          value={prompt}
          onChange={(e) => setPrompt(e.target.value)}
          placeholder="Enter a customer message to classify..."
          style={{ width: "100%", padding: "0.5rem", fontSize: "1rem" }}
          maxLength={5000}
        />
        <div style={{ fontSize: "0.8rem", color: "#666" }}>{prompt.length}/5000</div>
        <button
          type="submit"
          disabled={loading || !prompt.trim()}
          style={{ marginTop: "0.5rem", padding: "0.5rem 1rem" }}
        >
          {loading ? "Classifying..." : "Classify"}
        </button>
      </form>
      {result && (
        <div style={{ marginTop: "1rem", padding: "1rem", background: "#f0f7ff", borderRadius: 8 }}>
          <strong>Classification:</strong> {result}
        </div>
      )}
      {error && (
        <div style={{ marginTop: "1rem", padding: "1rem", background: "#fff0f0", borderRadius: 8, color: "#c00" }}>
          Error: {error}
        </div>
      )}
    </div>
  );
}

export default SupportClassifier;

Implementation Checklist and Best Practices

Data Preparation

Choose base model (DeepSeek-R1-Distill-Qwen-7B recommended for most use cases)
Prepare dataset: minimum 500 examples as JSONL chat messages
Validate dataset: check fields, token lengths, deduplication
Create train/validation split

Training Setup

Set up GPU environment (16GB+ VRAM; 24GB recommended for training)
Install pinned dependencies
Authenticate with Hugging Face Hub (huggingface-cli login)
Configure quantization (4-bit QLoRA with nf4)
Set LoRA hyperparameters (r=16, alpha=32)
Configure training (lr=2e-4, 3 epochs, eval every 100 steps; use fp16 on T4)
Run training, monitor loss convergence

Evaluation

Save LoRA adapter weights (optionally merge with base using model.merge_and_unload() before saving for single-file deployment)
Run qualitative evaluation on held-out examples
Compare base vs. fine-tuned metrics

Deployment

Deploy model with TGI (pinned version) or Inference Endpoints
Build Node.js API layer
Build React frontend with proxy configuration
Test end-to-end pipeline
Document model card and dataset provenance

Common Pitfalls to Avoid

Overfitting on small datasets is the most frequent problem. With fewer than 1,000 examples, set num_train_epochs to 1 or 2 and monitor validation loss closely. If eval loss rises while training loss continues to fall, stop early.

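A wrong chat template causes garbled or repetitive outputs. The DeepSeek-R1 distilled checkpoints ship their own chat template in the tokenizer configuration, and it is not the same as the ChatML template used by stock Qwen2 instruct models, so do not hand-write ChatML markers. Verify with print(tokenizer.chat_template) after loading the tokenizer. Always use tokenizer.apply_chat_template() in Python, or let TGI apply the template server-side through /v1/chat/completions, rather than formatting prompts manually in the API layer or the frontend.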

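Ignoring licensing terms creates legal risk. DeepSeek-R1 and its distilled variants are released under the MIT license, which permits commercial use. The distilled checkpoints are derived from other base models, though, so upstream terms also apply: the Qwen-based distills build on Qwen2.5 models originally licensed under Apache 2.0, while the Llama-based distills fall under the Llama license. Check the license file and model card on the model's Hugging Face repository.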

Never skip evaluation before deployment. Running the fine-tuned model against a held-out test set catches hallucination and misclassification rates before users hit them.

Where to Go from Here

This tutorial covered the full path from raw labeled data to a deployed fine-tuned DeepSeek model accessible through a React application. The same pipeline applies to larger base models when hardware permits, multi-task fine-tuning by mixing datasets from different domains, and preference tuning via DPO (which does not require a separate reward model) or RLHF (which does). A good next step: swap the classification dataset for a conversational QA dataset and re-run the pipeline with max_seq_length=1024 to see how the model handles longer outputs. For further reference, consult the DeepSeek model cards on Hugging Face, the PEFT library documentation, and the TRL library guides for advanced trainer configurations.

Mark Harbottle is the co-founder of SitePoint, 99designs, and Flippa.

URL: https://www.sitepoint.com/finetune-deepseek-models-for-custom-use-cases/