Qwen3-8B-Base SFT+DPO LoRA — UltraFeedback-zh

LoRA adapter (r=32, α=64) trained with DPO (β=0.1, sigmoid loss) on top of the SFT-merged base. Reference policy = SFT-merged base (trl's default with peft_config + ref_model=None).
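For reference, the sigmoid DPO objective with β=0.1 can be sketched in plain Python as follows. This is a minimal illustration and not the training code: the function name `dpo_sigmoid_loss` and its arguments are hypothetical, and the inputs are assumed to be summed sequence log-probabilities of the chosen and rejected responses under the policy (SFT-merged base plus LoRA) and under the reference (the same model with the adapter disabled).

```python
import math

def dpo_sigmoid_loss(policy_chosen_logp, policy_rejected_logp,
                     ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Return (loss, reward_margin) for one preference pair.

    Implicit rewards are beta-scaled log-ratios between policy and reference;
    the loss is -log(sigmoid(margin)), computed as a numerically stable softplus.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # softplus(-margin) == -log(sigmoid(margin)), stable for large |margin|
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, margin

# When the policy equals the reference, the margin is 0 and the loss is log(2).
loss, margin = dpo_sigmoid_loss(-50.0, -60.0, -50.0, -60.0)
print(f"margin={margin:.3f} loss={loss:.4f}")
```

Under this formulation a growing margin means the policy prefers the chosen response more strongly, relative to the reference, than it prefers the rejected one.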

Preference pairs come from the binarized variant of opencsg/UltraFeedback-chinese (7,600 train + 400 eval). Training took 46 minutes on a RunPod B200.

⚠️ This adapter is calibrated for the SFT-merged base, not raw Qwen3-Base. Apply the SFT adapter first, merge it, then load this DPO adapter on top (see the Quick load example below).

Project repo: https://github.com/tutucheng99/qwen3-sft-dpo-eval · Full eval writeup: https://github.com/tutucheng99/qwen3-sft-dpo-eval/blob/main/docs/REPORT.md

Quick load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the raw base model in bf16 on a single GPU.
# Note: `dtype` is the newer argument name; older transformers releases use `torch_dtype`.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", dtype="bfloat16", device_map="cuda", trust_remote_code=True,
)
# 1) Apply the SFT adapter and merge it into the base weights
base = PeftModel.from_pretrained(base, "JeffCheng12138/qwen3-8b-sft-coig-cqia").merge_and_unload()
# 2) Apply the DPO LoRA on top of the SFT-merged model
model = PeftModel.from_pretrained(base, "JeffCheng12138/qwen3-8b-dpo-ultrafeedback-zh")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base", trust_remote_code=True)

Eval highlights

Evaluated against BASE and SFT on 40 hand-curated Chinese prompts (DeepSeek-chat judge with order-bias control + bootstrap 95% CI):

Pair	DPO win rate	95% CI	Verdict
DPO vs SFT	0.625	[0.537, 0.713]	significant
DPO vs BASE	0.463	[0.350, 0.562]	statistical tie
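The intervals in the table above can be reproduced in spirit with a percentile bootstrap over per-prompt scores. The sketch below is an assumption about the procedure, not the project's evaluation code: it assumes order-bias control means judging each pair in both presentation orders and averaging the two verdicts, the helper names are hypothetical, and the verdicts are made-up toy data rather than the results reported here.

```python
import random

def order_balanced_score(win_when_first, win_when_second):
    """Average the judge's verdict over both presentation orders.

    Each argument is 1 if the DPO answer won in that ordering, else 0,
    so the result is 1.0 (wins both), 0.5 (split), or 0.0 (loses both).
    """
    return (win_when_first + win_when_second) / 2.0

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-prompt scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample prompts with replacement and record each resample's mean.
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = means[int(n_resamples * alpha / 2)]
    upper = means[int(n_resamples * (1 - alpha / 2)) - 1]
    return sum(scores) / n, lower, upper

# Toy verdicts for 10 prompts (NOT this card's data): (DPO shown first, DPO shown second)
verdicts = [(1, 1)] * 6 + [(1, 0)] * 2 + [(0, 0)] * 2
scores = [order_balanced_score(a, b) for a, b in verdicts]
win_rate, lower, upper = bootstrap_ci(scores)
# A result counts as significant when the interval excludes 0.5.
significant = lower > 0.5 or upper < 0.5
print(f"win rate {win_rate:.3f}, 95% CI [{lower:.3f}, {upper:.3f}], significant={significant}")
```

Resampling whole prompts, rather than individual judge calls, keeps the two orderings of a prompt together so that they are not treated as independent observations.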

DPO recovered the regression introduced by SFT and reached statistical parity with BASE according to the judge. Reward margins climbed from 0.02 to 1.0 nats over training.

Known regression

Dimension attribution found that DPO weakened refusal robustness under jailbreak framing by 0.27 nats: under fictional or research framing, the model assigns substantially less probability to refusal text. Mitigation strategies for deployment are out of scope here, but this is flagged as a known side effect.
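To get a feel for the size of a 0.27-nat drop, the arithmetic below converts a log-probability change into a probability ratio. How the 0.27 figure is aggregated is not stated here, so both readings are assumptions: a change in the total log-probability of the refusal text, or a mean per-token change compounded over a hypothetical refusal length.

```python
import math

def probability_ratio(delta_nats):
    """Ratio p_after / p_before implied by a log-probability change in nats."""
    return math.exp(delta_nats)

DELTA_NATS = -0.27  # refusal log-probability change from the text above

# Reading 1 (assumption): the delta applies to the whole refusal sequence.
sequence_ratio = probability_ratio(DELTA_NATS)

# Reading 2 (assumption): the delta is a per-token mean; REFUSAL_TOKENS is hypothetical.
REFUSAL_TOKENS = 20
per_token_ratio = probability_ratio(DELTA_NATS * REFUSAL_TOKENS)

print(f"sequence-level ratio: {sequence_ratio:.3f}")
print(f"per-token reading over {REFUSAL_TOKENS} tokens: {per_token_ratio:.4f}")
```

Under the first reading the refusal text keeps roughly three quarters of its former probability; under the second the reduction compounds with length and is far larger.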

Training config

LoRA: r=32, α=64, dropout=0.05, all attention + MLP linear layers
DPO: β=0.1, loss_type=sigmoid, max_length=2048
Optimizer: AdamW fused, lr 5e-6, cosine schedule, warmup 0.1
1 epoch, effective batch 16 (per-device 2 × accum 8), bf16, sdpa attention
Reference model: SFT-merged base (peft + ref_model=None default)
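The batch and schedule settings above imply a fairly short run. The sketch below works through the step count and learning-rate curve under stated assumptions: a single GPU, "warmup 0.1" read as a warmup ratio, and the usual linear-warmup plus half-cosine-decay shape. The names are illustrative and not taken from the training script.

```python
import math

TRAIN_PAIRS = 7_600
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
NUM_GPUS = 1         # assumption: one B200
PEAK_LR = 5e-6
WARMUP_RATIO = 0.1   # assumption: "warmup 0.1" is a ratio of total steps

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS
total_steps = TRAIN_PAIRS // effective_batch          # optimizer steps in 1 epoch
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)

def lr_at(step):
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))

print(effective_batch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```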

URL: https://huggingface.co/JeffCheng12138/qwen3-8b-dpo-ultrafeedback-zh