SmolLM3-3B-summarize-dpo-lora
A LoRA adapter for HuggingFaceTB/SmolLM3-3B, preference-aligned with Direct Preference Optimization (DPO) on top of a summarization-tuned SFT adapter. This is Unit 2 of a hands-on walk through the Hugging Face smol fine-tuning course.
Lineage
HuggingFaceTB/SmolLM3-3B-Base
└─ U1: SFT (LoRA) on summarization → tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora
   └─ U2: DPO (LoRA) on human preferences → THIS MODEL
DPO continues training the U1 SFT adapter ("LoRA-on-LoRA"): the SFT'd model is the reference policy π_ref, and the same adapter is optimized further on preference pairs. The published artifact is a single adapter that carries both the SFT and DPO updates — load it on the base model and you get the fully-aligned model.
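For reference, the standard DPO objective with the sigmoid loss is written out below. In this setup, π_ref is the frozen copy of the U1 SFT adapter and π_θ is the same adapter as it is trained further. The notation (y_w for the chosen response, y_l for the rejected one, D for the preference dataset) follows the usual convention and is not taken from this repo.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[ \log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right) \right]
```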
Intended use
A small, instruction-following assistant that has been (1) tuned toward concise summarization, then (2) preference-aligned toward cleaner, better-formatted, less verbose responses. Useful as a compact local assistant and as a reference implementation of an SFT→DPO pipeline.
How to use
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base", dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "tuggspeedman-ai/SmolLM3-3B-summarize-dpo-lora")
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")
messages = [{"role": "user", "content": "Summarize: <your text here>"}]
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
    return_tensors="pt", return_dict=True,
).to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Trained in /no_think mode: pass enable_thinking=False to the chat template for inference consistent with training.
Training data
HuggingFaceTB/smoltalk2, Preference/llama_3.1_tulu_3_8b_preference_mixture_no_think (shard 0). A general-purpose human/AI preference mixture (Allen AI's Tulu 3 recipe, re-judged with Llama 3.1), in /no_think mode to match the SFT model. 12,000 pairs were used (11,400 train / 600 eval); the eval split was held out before training to guard against leakage.
This is general preference alignment rather than summarization-specific — it improves response quality and formatting broadly on top of the summarization-focused SFT step.
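The held-out split described above can be sketched as follows. This is a minimal illustration on placeholder records, not the notebook's actual code: the helper name `hold_out`, the seed, and the record format are all hypothetical, and the real pipeline may well use a library utility instead.

```python
import random

def hold_out(pairs, n_eval, seed=42):
    """Shuffle a copy of `pairs` and split off `n_eval` items for evaluation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_eval:], shuffled[:n_eval]

# Placeholder records standing in for the 12,000 preference pairs.
pairs = [{"id": i} for i in range(12_000)]
train, eval_ = hold_out(pairs, n_eval=600)

# No pair may appear in both splits, otherwise eval metrics would leak.
train_ids = {p["id"] for p in train}
eval_ids = {p["id"] for p in eval_}
print(len(train), len(eval_), train_ids.isdisjoint(eval_ids))  # 11400 600 True
```

Splitting once, before any training step, is what keeps the eval metrics honest: the eval pairs never contribute gradient signal.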
Training procedure
DPO via TRL DPOTrainer (v1.2). The U1 SFT adapter was loaded as a PeftModel; TRL automatically cloned it into a frozen reference adapter, so no separate reference model was held in memory.
| Hyperparameter | Value |
|---|---|
| Method | DPO (sigmoid loss), LoRA-on-LoRA |
| β (beta) | 0.1 |
| Learning rate | 1e-6, cosine, warmup ratio 0.1 |
| Epochs | 1 (1,425 steps) |
| Effective batch size | 8 (1 × grad-accum 8) |
| Max sequence length | 1024 |
| LoRA | r=16, α=32, dropout=0.05, targets: q/k/v/o/gate/up/down |
| Precision | bf16, gradient checkpointing |
| Hardware | HF Jobs a100-large, ~2.35h |
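The step count in the table follows directly from the dataset size and batch settings, which the short calculation below reproduces. It assumes a single GPU (so the effective batch is 1 × 8) and rounds the warmup step count up; both are assumptions made for illustration rather than details confirmed by the training logs.

```python
import math

train_pairs = 11_400      # training split size from the data section
per_device_batch = 1
grad_accum = 8
num_devices = 1           # assumption: one A100
epochs = 1
warmup_ratio = 0.1

effective_batch = per_device_batch * grad_accum * num_devices
steps_per_epoch = math.ceil(train_pairs / effective_batch)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: rounded up

lora_r, lora_alpha = 16, 32
lora_scale = lora_alpha / lora_r  # standard LoRA scaling factor alpha / r

print(effective_batch, total_steps, warmup_steps, lora_scale)  # 8 1425 143 2.0
```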
Results
Training metrics at the start and end of the run, with final metrics on the held-out eval split:
| Metric | Start | End |
|---|---|---|
| loss | 0.698 | 0.591 |
| rewards/accuracies (train) | 0.34 | 0.65 |
| rewards/margins (train) | −0.007 | +0.409 |
| eval_loss | — | 0.567 |
| eval rewards/accuracies | — | 0.675 |
| eval rewards/margins | — | 0.472 |
rewards/margins turning solidly positive (chosen scored above rejected) and accuracies rising past the 0.5 random baseline are the signatures of successful preference learning.
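To make these metrics concrete, here is a minimal pure-Python sketch of how the sigmoid DPO loss, reward margin, and reward accuracy relate to sequence log-probabilities. The function name `dpo_metrics` and the toy log-prob values are hypothetical; this is an illustration of the definitions, not TRL's implementation.

```python
import math

def dpo_metrics(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return mean (loss, margin, accuracy) for a batch of preference pairs.

    Each argument is a list of summed sequence log-probabilities, one per pair.
    """
    losses, margins, correct = [], [], []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        reward_chosen = beta * (pc - rc)    # implicit reward of the chosen response
        reward_rejected = beta * (pr - rr)  # implicit reward of the rejected response
        margin = reward_chosen - reward_rejected
        losses.append(math.log1p(math.exp(-margin)))  # equals -log(sigmoid(margin))
        margins.append(margin)
        correct.append(1.0 if margin > 0 else 0.0)
    n = len(losses)
    return sum(losses) / n, sum(margins) / n, sum(correct) / n

# Policy identical to the reference: zero margin, so the loss is ln 2.
loss0, margin0, acc0 = dpo_metrics([-50.0], [-60.0], [-50.0], [-60.0])
print(round(loss0, 3), margin0, acc0)  # 0.693 0.0 0.0
```

When the policy still equals the reference, every margin is zero and the loss is ln 2 ≈ 0.693, which is consistent with the starting loss of 0.698 in the table.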
Qualitative effect. Compared to the SFT baseline on held-out prompts, the DPO model produces shorter, cleaner, better-terminated responses — and in at least one case fixed a degenerate repetition loop the SFT model fell into. (See generations_before.json / generations_after.json in this repo.)
Limitations
- ~20% of preference pairs exceeded the 1024-token cap and had their completions truncated (truncation_mode="keep_start"), so very long responses contributed only partial signal.
- Preference data is general-purpose, not summarization-specific; gains are in response quality/format rather than summarization accuracy per se.
- Inherits the base model's and SFT data's biases and knowledge cutoff. Not safety-tuned for production use.
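The truncation limitation above can be illustrated with a simplified sketch of keep-start truncation and of counting over-length pairs. The helper names and the toy token lists are hypothetical, the logic is a simplification rather than TRL's code, and the source does not state how the ~20% figure was actually measured.

```python
def truncate_keep_start(prompt_ids, completion_ids, max_length=1024):
    """Keep the first `max_length` tokens of prompt + completion.

    Returns the surviving prompt and completion token lists.
    """
    kept = (prompt_ids + completion_ids)[:max_length]
    n_prompt = min(len(prompt_ids), max_length)
    return kept[:n_prompt], kept[n_prompt:]

def fraction_truncated(pairs, max_length=1024):
    """Share of pairs whose prompt plus longer completion exceeds the cap."""
    over = sum(
        1 for p in pairs
        if len(p["prompt"]) + max(len(p["chosen"]), len(p["rejected"])) > max_length
    )
    return over / len(pairs)

# Toy token-id lists with made-up lengths, not the real data.
pairs = [
    {"prompt": [0] * 200, "chosen": [1] * 300, "rejected": [2] * 400},
    {"prompt": [0] * 300, "chosen": [1] * 900, "rejected": [2] * 500},
]
_, completion = truncate_keep_start(pairs[1]["prompt"], pairs[1]["chosen"])
print(len(completion), fraction_truncated(pairs))  # 724 0.5
```

In the second toy pair, the 900-token chosen response loses its last 176 tokens, so the tail of a long response never contributes to the loss.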
Links
- Code: github.com/tuggspeedman-ai/hf-smol-course (see notebooks/unit2/exercise2_dpo_lora.py)
- SFT predecessor (U1): tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora
- Course: HF smol fine-tuning course
