SmolLM3-3B-summarize-dpo-lora

A LoRA adapter for HuggingFaceTB/SmolLM3-3B, preference-aligned with Direct Preference Optimization (DPO) on top of a summarization-tuned SFT adapter. This is Unit 2 of a hands-on walkthrough of the Hugging Face smol fine-tuning course.

Lineage

HuggingFaceTB/SmolLM3-3B-Base
 └─ U1: SFT (LoRA) on summarization → tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora
     └─ U2: DPO (LoRA) on human preferences → THIS MODEL

DPO continues training the U1 SFT adapter ("LoRA-on-LoRA"): the SFT'd model serves as the reference policy π_ref, and the same adapter is optimized further on preference pairs. The published artifact is a single adapter that carries both the SFT and DPO updates, so loading it on the base model gives you the fully aligned model.
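For reference, the role of π_ref can be read off the standard sigmoid DPO objective, written here as a sketch. The symbols x (prompt), y_w (chosen response), y_l (rejected response), π_θ (the policy being trained) and σ (the logistic function) are notation introduced for this illustration and do not appear elsewhere in this card; β is the value listed under Training procedure.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right]
```

Because training starts with π_θ equal to π_ref (both are the U1 SFT adapter), both log-ratios are zero at step 0 and the loss starts at ln 2 ≈ 0.693.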

Intended use

A small, instruction-following assistant that has been (1) tuned toward concise summarization, then (2) preference-aligned toward cleaner, better-formatted, less verbose responses. Useful as a compact local assistant and as a reference implementation of an SFT→DPO pipeline.

How to use

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model in bf16, then attach the adapter (it carries both the SFT and DPO updates).
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base", dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "tuggspeedman-ai/SmolLM3-3B-summarize-dpo-lora")
# The tokenizer from HuggingFaceTB/SmolLM3-3B supplies the chat template used below.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM3-3B")

messages = [{"role": "user", "content": "Summarize: <your text here>"}]
# enable_thinking=False selects /no_think mode, matching how the adapter was trained.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, enable_thinking=False,
    return_tensors="pt", return_dict=True,
).to(model.device)
# Greedy decoding; slice off the prompt so only newly generated tokens are printed.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

The adapter was trained in /no_think mode, so pass enable_thinking=False to the chat template to keep inference consistent with training.

Training data

HuggingFaceTB/smoltalk2, Preference/llama_3.1_tulu_3_8b_preference_mixture_no_think (shard 0). A general-purpose human/AI preference mixture (Allen AI's Tulu 3 recipe, re-judged with Llama 3.1), in /no_think mode to match the SFT model. 12,000 pairs were used: 11,400 for training and 600 for evaluation, with the evaluation pairs held out before training to guard against leakage.
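A held-out split of this shape can be sketched in a few lines. This is a minimal illustration only: the card does not say how the split was actually produced, so the function name `split_pairs`, the seed, and the integer stand-ins for preference pairs are all assumptions.

```python
import random

def split_pairs(pairs, n_eval, seed=42):
    """Shuffle once with a fixed seed, then carve off n_eval items for evaluation.

    Hypothetical helper: the seed and the shuffle-then-slice strategy are
    assumptions, not the procedure used for this model.
    """
    if not 0 < n_eval < len(pairs):
        raise ValueError("n_eval must be between 1 and len(pairs) - 1")
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    # First n_eval items become the eval split; the rest are used for training.
    return shuffled[n_eval:], shuffled[:n_eval]

pairs = list(range(12_000))  # integer stand-ins for 12,000 preference pairs
train, eval_ = split_pairs(pairs, n_eval=600)
print(len(train), len(eval_))        # 11400 600
print(set(train).isdisjoint(eval_))  # True: no pair appears in both splits
```

Splitting before any training step touches the data is what makes the eval metrics reported under Results a leakage-free estimate.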

This is general preference alignment rather than summarization-specific alignment: it improves response quality and formatting broadly on top of the summarization-focused SFT step.

Training procedure

DPO via TRL DPOTrainer (v1.2). The U1 SFT adapter was loaded as a PeftModel; TRL automatically cloned it into a frozen reference adapter, so no separate reference model was held in memory.

Hyperparameter	Value
Method	DPO (sigmoid loss), LoRA-on-LoRA
β (beta)	0.1
Learning rate	1e-6, cosine, warmup ratio 0.1
Epochs	1 (1,425 steps)
Effective batch size	8 (per-device batch 1 × grad-accum 8)
Max sequence length	1024
LoRA	r=16, α=32, dropout=0.05, targets: q/k/v/o/gate/up/down
Precision	bf16, gradient checkpointing
Hardware	HF Jobs `a100-large`, ~2.35h
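The step count in the table follows from the dataset size and the effective batch size. The sketch below reproduces that arithmetic and adds a plain-Python version of a linear-warmup, cosine-decay schedule. It assumes a single device, a warmup step count rounded up, and decay to zero; `lr_at` is a hypothetical helper for illustration, not a TRL or transformers function.

```python
import math

train_pairs = 11_400
per_device_batch = 1
grad_accum = 8
num_devices = 1      # assumption: one GPU
warmup_ratio = 0.1
peak_lr = 1e-6

effective_batch = per_device_batch * grad_accum * num_devices
total_steps = math.ceil(train_pairs / effective_batch)
warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumption: rounded up
print(effective_batch, total_steps, warmup_steps)     # 8 1425 143

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup to peak_lr, then cosine decay to zero (illustrative only)."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, at the end of warmup, and at the final step.
for step in (0, warmup_steps, total_steps):
    print(step, lr_at(step, total_steps, warmup_steps, peak_lr))
```

With these settings the learning rate peaks at 1e-6 after roughly the first tenth of the run and decays for the remaining steps.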

Results

Metrics over the run (start → end). Train metrics are logged on training batches; eval metrics are computed on the held-out split:

Metric	Start	End
loss	0.698	0.591
rewards/accuracies (train)	0.34	0.65
rewards/margins (train)	−0.007	+0.409
eval_loss	—	0.567
eval rewards/accuracies	—	0.675
eval rewards/margins	—	0.472

rewards/margins turning solidly positive (chosen scored above rejected) and accuracies rising past the 0.5 random baseline are the signatures of successful preference learning.
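To make these metrics concrete, the sketch below computes the per-pair sigmoid DPO loss, reward margin, and accuracy from sequence log-probabilities, using β = 0.1 as in the table. The function name `dpo_stats` and the example log-probabilities are hypothetical; TRL averages these quantities over a batch, which this single-pair version does not attempt to reproduce.

```python
import math

def dpo_stats(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair sigmoid DPO loss, reward margin, and accuracy.

    Inputs are summed log-probabilities of the chosen/rejected responses
    under the policy (pi_*) and the frozen reference (ref_*).
    """
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    loss = math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))
    accuracy = float(margin > 0)          # 1.0 when chosen outscores rejected
    return loss, margin, accuracy

# Policy identical to the reference (start of training): margin 0, loss ln 2.
loss0, margin0, acc0 = dpo_stats(-50.0, -60.0, -50.0, -60.0)
print(round(loss0, 3), margin0)  # 0.693 0.0

# Policy raised the chosen log-prob by 2 nats and lowered the rejected by 2 nats.
loss1, margin1, acc1 = dpo_stats(-48.0, -62.0, -50.0, -60.0)
print(loss1 < loss0, margin1 > 0, acc1)  # True True 1.0
```

The first case explains why the run's starting loss of 0.698 sits near ln 2 ≈ 0.693: with the policy initialized as a copy of the reference, the margin is close to zero.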

Qualitative effect. Compared to the SFT baseline on held-out prompts, the DPO model produces shorter, cleaner, better-terminated responses, and in at least one case it fixed a degenerate repetition loop that the SFT model fell into. (See generations_before.json / generations_after.json in this repo.)

Limitations

~20% of preference pairs exceeded the 1024-token cap and had their completions truncated (truncation_mode="keep_start"), so very long responses contributed only partial signal.
Preference data is general-purpose, not summarization-specific; gains are in response quality/format rather than summarization accuracy per se.
Inherits the base model's and SFT data's biases and knowledge cutoff. Not safety-tuned for production use.
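The truncation limitation above can be illustrated with a small sketch of "keep the start" truncation over token ids. This is a simplified stand-in written for this card, not TRL's implementation: `truncate_keep_start` is a hypothetical helper, and the token counts in the example are made up.

```python
def truncate_keep_start(prompt_ids, completion_ids, max_length=1024):
    """Keep the first max_length tokens of prompt + completion.

    Illustrative only: tokens past the cap are dropped from the end,
    so a long completion loses its tail.
    """
    full = list(prompt_ids) + list(completion_ids)
    kept = full[:max_length]
    n_prompt = min(len(prompt_ids), max_length)
    return kept[:n_prompt], kept[n_prompt:]

# Made-up example: a 300-token prompt with a 900-token completion.
prompt = list(range(300))
completion = list(range(1000, 1900))
kept_prompt, kept_completion = truncate_keep_start(prompt, completion)
print(len(kept_prompt), len(kept_completion))  # 300 724
```

In this example the last 176 completion tokens never reach the loss, which is the sense in which very long responses contributed only partial signal.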

URL: https://huggingface.co/tuggspeedman-ai/SmolLM3-3B-summarize-dpo-lora
