VOOZH about

URL: https://huggingface.co/danielcherubini/Qwen3.5-DeltaCoder-9B

โ‡ฑ danielcherubini/Qwen3.5-DeltaCoder-9B ยท Hugging Face


Qwen3.5-DeltaCoder-9B

Reliable tool-calling for agentic coding โ€” LoRA fine-tune of Qwen3.5-9B v1.1-DPO released โ€” DPO alignment improves code correctness and self-verification. If you downloaded before March 28, 2026, please re-pull to get v1.1-DPO.

๐Ÿ‘ License: Apache 2.0
๐Ÿ‘ Base Model
๐Ÿ‘ HuggingFace
๐Ÿ‘ LoRA

Small language models can reason about code, but they struggle to call tools reliably. DeltaCoder takes a strong reasoning base and teaches it to produce correctly-formatted JSON tool calls โ€” the kind that coding agents like OpenCode, Pi, and Cline depend on.

v1.1-DPO adds Direct Preference Optimization to further improve code correctness โ€” the model now self-corrects its own bugs rather than submitting wrong answers.

Downloads

Format Link Size
GGUF Q4_K_M (recommended) HuggingFace ~5.5 GB
GGUF Q5_K_M HuggingFace ~6.5 GB
GGUF BF16 HuggingFace ~17.9 GB
DPO LoRA adapter HuggingFace ~700 MB

The Problem

Jackrong's Qwen3.5-9B reasoning distill scores 53.7% on HumanEval โ€” best-in-class at 9B. But when used as a coding agent, it frequently produces malformed JSON tool calls:

tool=edit, error=JSON Parse error: Property name must be a string literal
tool=bash, error=JSON Parse error: Expected '}'

DeltaCoder fixes this, and v1.1-DPO further improves code correctness through preference learning.

What's New in v1.1-DPO

  • Self-correcting behavior โ€” detects and fixes its own bugs during agentic tasks
  • Improved code correctness โ€” trained on 4,519 preference pairs from AceCode-V2-122K
  • Two-stage merge โ€” v1 SFT tool-calling improvements + DPO code quality improvements combined
  • 13 GGUF quants โ€” from Q2_K to BF16, covering all VRAM configurations

Training Details

v1 โ€” SFT (Tool-Call Reliability)

Parameter Value
Base model Qwen3.5-9B (hybrid GDN architecture)
Method LoRA (r=64, alpha=32)
Dataset CoderForge-Preview filtered_reward1 (50K subset)
Sequence length 4096
Effective batch size 16
Learning rate 1e-4 (cosine)
Epochs 1
Hardware NVIDIA H200 140GB (Vast.ai)
Training time ~10 hours
Final loss ~0.94

v1.1 โ€” DPO (Code Correctness)

Parameter Value
Method DPO (Direct Preference Optimization)
Dataset AceCode-V2-122K โ€” 4,519 preference pairs
Pair generation 10K problems ร— 8 samples, keep if โ‰ฅ1 pass AND โ‰ฅ1 fail (45% keep rate)
Beta 0.1
Loss type sigmoid
Learning rate 5e-6 (cosine)
Effective batch size 16
Hardware NVIDIA H100 80GB (Vast.ai)
Training time ~3.7 hours
Final loss 0.538
Rewards/margins (final) ~1.0
Rewards/accuracies (final) ~80%

LoRA Target Modules

All major weight matrices adapted across the hybrid architecture:

  • Full Attention (8/32 layers): q_proj, k_proj, v_proj, o_proj
  • Gated Delta Net (24/32 layers): in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj
  • MLP (all 32 layers): gate_proj, up_proj, down_proj

Usage

Ollama

ollama create deltacoder -f Modelfile

llama.cpp / ik_llama.cpp

./llama-server -m DeltaCoder-9B-v1.1-DPO-Q5_K_M.gguf -ngl 999 -c 131072 -ctk f16 -ctv q4_0 -fa 1 --jinja

With PEFT (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
 "Jackrong/Qwen3.5-9B-Claude-4.6-Opus-Reasoning-Distilled-v2",
 torch_dtype=torch.bfloat16,
 trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "danielcherubini/Qwen3.5-DeltaCoder-9B")
tokenizer = AutoTokenizer.from_pretrained("danielcherubini/Qwen3.5-DeltaCoder-9B")

Benchmarks

Model HumanEval HumanEval+ Terminal-Bench Easy
Jackrong Qwen3.5-9B-v2 (base) 53.7% โ€” โ€”
DeltaCoder-9B v1 (temp=0.6) 50.6% 49.4% 2/4 (50%)
DeltaCoder-9B v1.1-DPO (temp=0.6) TBD TBD 2/4 (50%)*

*v1.1-DPO timed out on 2 tasks that v1 answered incorrectly โ€” behavioral improvement confirmed, re-evaluating with extended timeout.

Recommended Sampling Settings

Parameter Value
temperature 0.6
top_k 20
top_p 0.95
min_p 0.0
presence_penalty 0.0
repeat_penalty 1.0

Do not use temperature below 0.5 โ€” low temperatures cause deterministic looping in multi-turn agentic use.

KV Cache Quantization

Context Length KV Cache VRAM (Q4_K_M) Generation Speed
102,400 f16/q4_0 ~8.5 GB ~111 tok/s
131,072 f16/q4_0 ~9.1 GB ~110 tok/s

Key Findings

Qwen3.5 is a VLM โ€” Unsloth treats it as a vision model. For text-only DPO training, use standard HuggingFace + PEFT + TRL directly (no Unsloth DPOTrainer).

Do not use flash_attention_2 with sample packing on Qwen3.5 โ€” training loss goes to 0. Use attn_implementation="eager" instead.

  • Qwen3.5 uses Gated Delta Networks โ€” include in_proj_qkv, in_proj_z, in_proj_b, in_proj_a, out_proj in LoRA target modules or 75% of attention layers are untrained
  • DPO pairs generated on-policy using Qwen/Qwen3.5-9B base with vLLM async inference (32 concurrent requests)
  • Keep rate of 45.2% from 10K AceCode problems (4,519 pairs used for training)

Project Structure

scripts/
 train_unsloth.py # v1 SFT training
 train_dpo.py # v1.1 DPO training (HF + PEFT + TRL)
 generate_dpo_pairs.py # Async on-policy pair generation
 merge_and_export_dpo.py # Two-stage merge + GGUF export

Status

  • v1 SFT fine-tune (CoderForge, H200, ~10hrs)
  • GGUF export (all quants Q2_K โ†’ BF16)
  • HumanEval benchmarking (50.6% / 49.4%)
  • Terminal-Bench evaluation (2/4 easy tasks)
  • DPO pair generation (4,519 pairs from AceCode-V2-122K)
  • v1.1-DPO training (H100, ~3.7hrs)
  • v1.1-DPO GGUF export + HuggingFace release
  • v1.1-DPO HumanEval benchmarking
  • v1.1-DPO Terminal-Bench extended timeout evaluation

Acknowledgements

Downloads last month

-

Downloads are not tracked for this model. How to track

Model tree for danielcherubini/Qwen3.5-DeltaCoder-9B

Finetuned
Qwen/Qwen3.5-9B
Adapter
(364)
this model
Quantizations
1 model

Datasets used to train danielcherubini/Qwen3.5-DeltaCoder-9B