Qwen3-8B-Base SFT+DPO LoRA — UltraFeedback-zh
LoRA adapter (r=32, α=64) trained with DPO (β=0.1, sigmoid loss) on top of
the SFT-merged base. Reference policy = SFT-merged base (trl's default with
peft_config + ref_model=None).
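For orientation, the sigmoid DPO objective named above can be sketched for a single preference pair. This is a minimal plain-Python illustration, not the training code: the function name `dpo_sigmoid_loss` and the example log-probabilities are hypothetical, and the inputs are assumed to be summed sequence log-probabilities (in nats) under the policy and under the frozen SFT-merged reference.

```python
import math

def dpo_sigmoid_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Return (loss, reward_margin) for one preference pair.

    All four inputs are sequence log-probabilities in nats (assumed summed
    over response tokens). beta=0.1 matches the value stated in this card.
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    chosen_reward = beta * (pol_chosen - ref_chosen)
    rejected_reward = beta * (pol_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), written in a stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin

# Hypothetical numbers: policy identical to the reference.
print(dpo_sigmoid_loss(-120.0, -125.0, -120.0, -125.0))
# Hypothetical numbers: chosen up 6 nats, rejected down 4 nats vs. reference.
print(dpo_sigmoid_loss(-114.0, -129.0, -120.0, -125.0))
```

When the policy matches the reference, the margin is 0 and the loss is ln 2 ≈ 0.693. A margin of 1.0 gives a loss of about 0.313.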
Preference pairs come from the binarized variant of opencsg/UltraFeedback-chinese
(7,600 train + 400 eval). Trained on a RunPod B200 in 46 minutes.
⚠️ This adapter is calibrated for the SFT-merged base, not raw Qwen3-Base. Apply SFT first, merge, then load this DPO adapter on top — see loading example.
Project repo: https://github.com/tutucheng99/qwen3-sft-dpo-eval · Full eval writeup: https://github.com/tutucheng99/qwen3-sft-dpo-eval/blob/main/docs/REPORT.md
Quick load
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", dtype="bfloat16", device_map="cuda", trust_remote_code=True,
)
# 1) merge SFT into base
base = PeftModel.from_pretrained(base, "JeffCheng12138/qwen3-8b-sft-coig-cqia").merge_and_unload()
# 2) apply DPO LoRA on top
model = PeftModel.from_pretrained(base, "JeffCheng12138/qwen3-8b-dpo-ultrafeedback-zh")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base", trust_remote_code=True)
Eval highlights
Evaluated against BASE and SFT on 40 hand-curated Chinese prompts (DeepSeek-chat judge with order-bias control + bootstrap 95% CI):
| Pair | DPO win rate | 95% CI |
|---|---|---|
| DPO vs SFT | 0.625 | [0.537, 0.713] (significant) |
| DPO vs BASE | 0.463 | [0.350, 0.562] (statistical tie) |
Under the judge, DPO recovered the regression introduced by SFT and reached statistical parity with BASE. Reward margins climbed from 0.02 to 1.0 nats over training.
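One way a win rate with order-bias control and a bootstrap 95% CI could be computed is sketched below. It is an assumption-laden illustration, not the project's evaluation code: it assumes each prompt is judged twice with the answer order swapped, that the two verdicts are averaged into a per-prompt score in {0, 0.5, 1}, and that the percentile bootstrap resamples prompts. The function names and the ten toy verdicts are hypothetical and are not the actual judge outputs.

```python
import random

def debiased_scores(first_order_wins, swapped_order_wins):
    """Average the two judge verdicts per prompt (1 = DPO preferred, 0 = not)."""
    return [(a + b) / 2 for a, b in zip(first_order_wins, swapped_order_wins)]

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap over prompts; returns (mean, lower, upper)."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample prompts with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, lower, upper

# Toy verdicts for 10 hypothetical prompts (synthetic, for illustration only).
first = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
swapped = [1, 0, 1, 0, 1, 1, 1, 1, 0, 0]
win_rate, lo, hi = bootstrap_ci(debiased_scores(first, swapped))
print(f"win rate {win_rate:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Under this reading, a pair would be called a statistical tie when the interval contains 0.5, and significant when it does not.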
Known regression
Dimension attribution found that DPO weakened refusal robustness under jailbreak framing by 0.27 nats: under fictional or research framing, the model assigns substantially less probability to refusal text. Mitigation strategies for deployment are out of scope here, but this is flagged as a known side effect.
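To give a feel for the size of a 0.27-nat shift, the sketch below converts a log-probability change into a probability ratio. It assumes the figure is a drop in the log-probability of refusal text. The card does not say whether that is per sequence or per token, so both readings are shown; `probability_ratio` and the 20-token refusal length are hypothetical.

```python
import math

def probability_ratio(delta_nats, n_tokens=1):
    """Factor by which probability changes for a log-prob shift of
    delta_nats applied to each of n_tokens tokens."""
    return math.exp(delta_nats * n_tokens)

# Reading 1: 0.27 nats is the shift for the whole refusal sequence.
print(probability_ratio(-0.27))
# Reading 2: 0.27 nats per token, over a hypothetical 20-token refusal.
print(probability_ratio(-0.27, n_tokens=20))
```

Under the first reading the refusal text keeps roughly 76% of its former probability. Under the second, the reduction compounds with length and is far larger.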
Training config
- LoRA: r=32, α=64, dropout=0.05, all attention + MLP linear layers
- DPO: β=0.1, loss_type=sigmoid, max_length=2048
- Optimizer: AdamW fused, lr 5e-6, cosine schedule, warmup 0.1
- 1 epoch, effective batch 16 (per-device 2 × accum 8), bf16, sdpa attention
- Reference model: SFT-merged base (peft + ref_model=None default)
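The batch and schedule figures above imply a step count and a learning-rate curve, worked through in the sketch below. Several things here are assumptions rather than facts from this card: that "warmup 0.1" is a warmup ratio, that training ran on a single device (consistent with 2 × 8 = 16), that no training pairs were filtered out, and that the cosine schedule uses linear warmup followed by a half-cosine decay to zero. All names in the code are illustrative.

```python
import math

TRAIN_PAIRS = 7_600       # from the card
PER_DEVICE_BATCH = 2      # from the card
GRAD_ACCUM = 8            # from the card
NUM_DEVICES = 1           # assumption: implied by effective batch 16
PEAK_LR = 5e-6            # from the card
WARMUP_RATIO = 0.1        # assumption: "warmup 0.1" read as a ratio

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
total_steps = math.ceil(TRAIN_PAIRS / effective_batch)   # one epoch
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

print(effective_batch, total_steps, warmup_steps)
print(lr_at(0), lr_at(warmup_steps), lr_at(total_steps))
```

Under these assumptions, one epoch is 475 optimizer steps, with the first 48 spent in warmup.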