URL: https://huggingface.co/JeffCheng12138/qwen3-8b-dpo-ultrafeedback-zh

Qwen3-8B-Base SFT+DPO LoRA — UltraFeedback-zh

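LoRA adapter (r=32, α=64) trained with DPO (β=0.1, sigmoid loss) on top of the SFT-merged base. The reference policy is the SFT-merged base itself: this is trl's default when a peft_config is passed with ref_model=None, in which case the adapter is disabled to compute reference log-probabilities.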
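To make the objective concrete, here is a minimal per-example sketch of the sigmoid DPO loss with β=0.1. It assumes the four inputs are summed log-probabilities (in nats) of the chosen and rejected responses under the policy and the reference; the function name and the toy numbers in the usage note are illustrative and not taken from the training code.

```python
import math

def dpo_sigmoid_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example sigmoid DPO loss and reward margin.

    Inputs are sequence log-probabilities in nats (an assumption of this sketch).
    """
    # Implicit rewards: beta-scaled log-ratio of policy to reference.
    reward_chosen = beta * (pi_chosen - ref_chosen)
    reward_rejected = beta * (pi_rejected - ref_rejected)
    margin = reward_chosen - reward_rejected
    # loss = -log(sigmoid(margin)) = softplus(-margin), computed stably.
    x = -margin
    loss = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return loss, margin
```

At initialization the policy equals the reference, so the margin is 0 and the loss is log 2. For example, `dpo_sigmoid_loss(-40.0, -55.0, -45.0, -50.0)` gives a positive margin and a loss below log 2. The margin returned here is the per-example quantity that trl averages and logs as `rewards/margins`.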

Preference pairs come from the binarized variant of opencsg/UltraFeedback-chinese (7,600 train + 400 eval). Training took 46 minutes on a RunPod B200.

⚠️ This adapter is calibrated for the SFT-merged base, not raw Qwen3-Base. Apply the SFT adapter first, merge it, then load this DPO adapter on top; see the Quick load example.

Project repo: https://github.com/tutucheng99/qwen3-sft-dpo-eval · Full eval writeup: https://github.com/tutucheng99/qwen3-sft-dpo-eval/blob/main/docs/REPORT.md

Quick load

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the raw base model in bf16 (older transformers releases take torch_dtype= instead of dtype=).
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B-Base", dtype="bfloat16", device_map="cuda", trust_remote_code=True,
)
# 1) Apply the SFT adapter and merge it into the base weights.
base = PeftModel.from_pretrained(base, "JeffCheng12138/qwen3-8b-sft-coig-cqia").merge_and_unload()
# 2) Load the DPO LoRA on top of the SFT-merged model.
model = PeftModel.from_pretrained(base, "JeffCheng12138/qwen3-8b-dpo-ultrafeedback-zh")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B-Base", trust_remote_code=True)

Eval highlights

Evaluated against BASE and SFT on 40 hand-curated Chinese prompts (DeepSeek-chat judge with order-bias control + bootstrap 95% CI):

Pair        | DPO win rate | 95% CI
DPO vs SFT  | 0.625        | [0.537, 0.713] (significant)
DPO vs BASE | 0.463        | [0.350, 0.562] (statistical tie)

DPO recovered the SFT regression and reached parity with BASE according to the judge (the DPO vs BASE interval contains 0.5). Reward margins climbed from 0.02 to 1.0 nats over training.
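As a rough illustration of how a win rate with order-bias control and a bootstrap 95% CI can be computed, the sketch below assumes each prompt is judged twice with the two answers swapped, that each verdict is scored 1.0 (win), 0.5 (tie), or 0.0 (loss) from DPO's point of view, and that a percentile bootstrap over prompts is used. These are assumptions; the exact protocol is described in the full eval writeup, and the function names here are hypothetical.

```python
import random

def combine_orders(score_ab, score_ba):
    """Average the judge's verdict over both presentation orders.

    Each score is from the candidate's view: 1.0 win, 0.5 tie, 0.0 loss.
    """
    return 0.5 * (score_ab + score_ba)

def bootstrap_win_rate(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    point = sum(scores) / n
    # Resample prompts with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lo, hi
```

With per-prompt scores in hand, `bootstrap_win_rate(scores)` returns `(win_rate, ci_low, ci_high)`. A pair is then read as a tie when the interval contains 0.5, which matches how the table labels its two rows.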

Known regression

Dimension attribution found that DPO weakened refusal robustness under jailbreak framing by 0.27 nats: under fictional or research framing, the model assigns substantially less probability to refusal text. Mitigation strategies for deployment are out of scope here, but this is flagged as a known side effect.
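For readers less used to log-probability units, the sketch below shows one way a shift in nats can be computed and read. It assumes the figure is a difference in the total log-probability of a fixed refusal continuation between two policies. That reading is an assumption, the card does not spell out the metric, and the per-token values in the usage note are made up for illustration.

```python
import math

def sequence_logprob(token_logprobs):
    """Total log-probability (nats) of a fixed continuation from per-token natural-log probabilities."""
    return sum(token_logprobs)

def refusal_shift(before, after):
    """Change in refusal log-probability between two policies.

    A negative value means the refusal text became less likely.
    """
    return sequence_logprob(after) - sequence_logprob(before)

def probability_ratio(shift_nats):
    """Convert a log-probability shift in nats into a multiplicative factor."""
    return math.exp(shift_nats)
```

For instance, `refusal_shift([-0.5, -0.5], [-0.6, -0.67])` is about -0.27 nats, and `probability_ratio(-0.27)` is roughly 0.76, meaning the refusal continuation would be about three quarters as likely as before under this reading.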

Training config

  • LoRA: r=32, α=64, dropout=0.05, all attention + MLP linear layers
  • DPO: β=0.1, loss_type=sigmoid, max_length=2048
  • Optimizer: AdamW fused, lr 5e-6, cosine schedule, warmup 0.1
  • 1 epoch, effective batch 16 (per-device 2 × accum 8), bf16, sdpa attention
  • Reference model: SFT-merged base (peft + ref_model=None default)
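
The settings above could be expressed roughly as the keyword arguments below. This is a sketch, not the project's actual config: the keys follow the field names of `peft.LoraConfig` and `trl.DPOConfig`, while the explicit `target_modules` list, `task_type`, and the single-device assumption are guesses made for illustration.

```python
# Keyword arguments one might pass to peft.LoraConfig(**lora_kwargs).
lora_kwargs = dict(
    r=32,
    lora_alpha=64,
    lora_dropout=0.05,
    # Assumed module names for "all attention + MLP linear layers".
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Keyword arguments one might pass to trl.DPOConfig(**dpo_kwargs).
dpo_kwargs = dict(
    beta=0.1,
    loss_type="sigmoid",
    max_length=2048,
    optim="adamw_torch_fused",
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    bf16=True,
)

num_devices = 1  # assumption: a single GPU
effective_batch = (dpo_kwargs["per_device_train_batch_size"]
                   * dpo_kwargs["gradient_accumulation_steps"]
                   * num_devices)
train_pairs = 7600  # training split size stated in the card
optimizer_steps = train_pairs // effective_batch
```

Under these assumptions the effective batch is 16 and one epoch over 7,600 pairs is 475 optimizer steps. The sdpa attention setting is not part of either config; it would be passed as `attn_implementation="sdpa"` when loading the model.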