
URL: https://huggingface.co/bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF

bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF


Reasoning · Pi Tune · MTP · GGUF

Qwen3.6 27B tuned for reasoning-heavy local coding agents.

A 4-bit QLoRA SFT reasoning release for Pi-style terminal loops: inspect the repo, plan the fix, run commands, edit files, validate, and recover when the first attempt fails.

🧠 Qwen3.6 27B base · 🔎 Reasoning-supervised · ⚡ MTP speculative decoding · 🛠️ Coding · DevOps · Agents · 📦 llama.cpp GGUF · 🖼️ Vision sidecar compatible · 🪟 128k tested context

🧭 Better planning before action
Trained on passed agent trajectories with reasoning preserved, so the model can decompose tasks before committing to commands and edits.

🧪 Verifier-driven debugging
Strongest in loops where the agent can run tests, inspect failures, patch code, and validate the whole task end-to-end.

MTP at every quant
The MTP next-token draft heads are kept at Q8_0 precision inside each quant, enabling speculative decoding even with low-bit GGUF files.

🚀 Start with Q4_K_M
The default recommendation is Q4_K_M: a practical quality / memory tradeoff for local coding-agent experiments.

Model overview

| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Format | GGUF |
| Runtime target | llama.cpp / OpenAI-compatible local serving |
| Tuning focus | Pi-style terminal agents, repository work, command use, file edits, validation loops |
| Fine-tuning style | 4-bit QLoRA SFT on private passed agent trajectories |
| Reasoning supervision | Preserved reasoning traces for complex agent tasks |
| Recommended first file | Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf |
| Vision support | Compatible with the Qwen3.6 mmproj-F16.gguf sidecar |
| Technical writeup | Qwen3.6 27B reasoning writeup |

This is the reasoning-oriented sibling to bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF. Use this release for the stronger Pi-style coding-agent behavior. Use the no-thinking / direct tune when you specifically want lower-latency instruct-style turns.

Terminal-Bench 2.0 Pi snapshot

All results below are single-run pass@1 scores on the 89-task Terminal-Bench 2.0 set, using Pi with llama.cpp and a Q4 GGUF. See the technical writeup for the broader training and evaluation context.

| Run | Harness | Runtime | Quantization | Pass | Total | Score |
|---|---|---|---|---|---|---|
| Base Qwen3.6 27B | Pi | llama.cpp | Q4 GGUF | 38 | 89 | 42.70% |
| Qwen3.6 MTP pi-tune | Pi | llama.cpp | Q4 GGUF | 25 | 89 | 28.09% |
| Qwen3.6 MTP pi-reasoning | Pi | llama.cpp | Q4 GGUF | 36 | 89 | 40.45% |

The base model remained the strongest single Pi run, but the reasoning-tuned model recovered most of the performance lost by the earlier direct/no-thinking tune and showed a different behavioral profile:

  • stronger task decomposition;
  • more targeted debugging loops;
  • better end-to-end validation on some infrastructure and data tasks;
  • more willingness to investigate system state before editing.

The tradeoff is that the same exploratory behavior can become over-exploration when the model fails to converge. On some failures it spends too long investigating or validating and can time out.

These are applied harness-fluency results, not a complete measure of Qwen3.6 27B coding ability. They are single-run pass@1; the base run used a longer timeout than the reasoning run, so timeout-related comparisons should be read carefully.
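
The scores in the table above follow directly from the pass counts, and the "recovered most of the performance" claim can be checked the same way: the direct tune lost 13 tasks relative to the base run and the reasoning tune won back 11 of them. A small sketch of that arithmetic (the function names are illustrative, not part of any harness):

```python
# Pass counts copied from the Terminal-Bench 2.0 Pi table; each run covers 89 tasks.
RUNS = {
    "base": 38,
    "pi-tune": 25,
    "pi-reasoning": 36,
}
TOTAL_TASKS = 89


def pass_rate(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return pass@1 as a percentage rounded to two decimals."""
    return round(100.0 * passed / total, 2)


def recovered_share(base: int, regressed: int, recovered: int) -> float:
    """Fraction of the base-to-regressed drop that the later run won back."""
    return (recovered - regressed) / (base - regressed)


for name, passed in RUNS.items():
    print(f"{name}: {passed}/{TOTAL_TASKS} = {pass_rate(passed):.2f}%")

share = recovered_share(RUNS["base"], RUNS["pi-tune"], RUNS["pi-reasoning"])
print(f"share of the drop recovered: {share:.0%}")
```

Because these are single runs, a difference of two tasks between the base and reasoning rows is within what run-to-run variation could plausibly produce.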

Why this tune exists

Qwen3.6 already has strong coding ability and supports long-context, reasoning-capable inference. The goal of this release was narrower: improve how the model behaves inside a lightweight coding harness where the next useful action is often a terminal command, file edit, or verifier check.

The training target is the full agent loop:

understand task → inspect files → plan → run commands → edit → test → recover → finish

The reasoning-tuned variant is most useful when tasks are ambiguous, multi-step, or failure-prone: the cases where a little deliberation before action can save several bad edits later.
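
To make the loop concrete, here is a minimal sketch of a plan, act, verify, recover cycle with a hard attempt cap. It is an illustration only and not how Pi is implemented: `propose_fix` and `apply_and_test` are hypothetical callbacks standing in for the model call and the verifier, and `max_attempts` is an assumed guard against the over-exploration described earlier.

```python
from typing import Callable, Tuple


def run_agent_loop(
    propose_fix: Callable[[str], str],
    apply_and_test: Callable[[str], Tuple[bool, str]],
    task: str,
    max_attempts: int = 3,
) -> Tuple[bool, int]:
    """Plan, act, verify, and retry with the failure output as new context.

    Returns (succeeded, attempts_used).
    """
    context = task
    for attempt in range(1, max_attempts + 1):
        patch = propose_fix(context)        # plan + edit (hypothetical model call)
        ok, output = apply_and_test(patch)  # run commands + test (hypothetical verifier)
        if ok:
            return True, attempt            # finish
        # recover: feed the verifier output back into the next plan
        context = f"{task}\n\nPrevious attempt failed:\n{output}"
    return False, max_attempts              # give up instead of exploring forever
```

A real harness would track wall-clock time and token budgets as well as attempts; the cap here only shows where such a limit would sit in the loop.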

Pick a file

Recommended starting point: Q4_K_M.

| Quant | File size | VRAM estimate | Suggested use |
|---|---|---|---|
| Q2_K | ~11 GB | ~13 GB | Smallest footprint; expect quality tradeoffs. |
| Q3_K_S | ~12 GB | ~15 GB | Low-memory 3-bit option. |
| Q3_K_M | ~14 GB | ~16 GB | Balanced 3-bit option. |
| Q3_K_L | ~15 GB | ~17 GB | Higher-quality 3-bit option. |
| Q4_K_S | ~16 GB | ~18 GB | Smaller 4-bit option. |
| Q4_K_M | ~17 GB | ~19 GB | Best first choice for most local users. |
| Q5_K_S | ~19 GB | ~21 GB | Higher-quality 5-bit option. |
| Q5_K_M | ~20 GB | ~22 GB | Strong quality/memory tradeoff. |
| Q6_K | ~22 GB | ~25 GB | High-quality local inference if you have memory headroom. |
| bf16 | ~55 GB | ~58 GB | BF16 GGUF reference. |

VRAM estimates are approximate and depend on context length, KV cache type, GPU offload, batch settings, and runtime build. Longer contexts require more memory.
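
As a rough aid for choosing a file, the table can be turned into a lookup that returns the largest quant whose VRAM estimate fits a given budget. This is a sketch built only on the approximate figures above; `pick_quant` and its `headroom_gb` parameter are hypothetical names, and the headroom is an assumed way to reserve space for longer contexts or other GPU users.

```python
from typing import Optional

# (quant, approx file size in GB, approx VRAM in GB), copied from the table above.
# Ordered from smallest to largest.
QUANTS = [
    ("Q2_K", 11, 13),
    ("Q3_K_S", 12, 15),
    ("Q3_K_M", 14, 16),
    ("Q3_K_L", 15, 17),
    ("Q4_K_S", 16, 18),
    ("Q4_K_M", 17, 19),
    ("Q5_K_S", 19, 21),
    ("Q5_K_M", 20, 22),
    ("Q6_K", 22, 25),
    ("bf16", 55, 58),
]


def pick_quant(vram_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest quant whose VRAM estimate fits, or None if none fit."""
    budget = vram_gb - headroom_gb
    best = None
    for name, _file_gb, est_vram_gb in QUANTS:
        if est_vram_gb <= budget:
            best = name  # later entries are larger, so keep overwriting
    return best


print(pick_quant(24))                 # 24 GB card, no reserved headroom
print(pick_quant(24, headroom_gb=4))  # same card, 4 GB kept free
```

Since the estimates shift with context length and KV cache type, treat the result as a starting point and confirm against actual memory use.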

Every quant in this release keeps the MTP nextn prediction heads at Q8_0 precision. If your llama.cpp build supports draft-mtp, speculative decoding can be used with any quant above.

Quickstart

llama.cpp with MTP speculative decoding

This profile matches the Pi evaluation style most closely: long context, reasoning enabled by the harness/client, and MTP draft decoding.

llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M \
 --spec-type draft-mtp \
 --spec-draft-n-max 3 \
 -np 1 \
 --jinja -ngl 99 -fa -c 131072 \
 --cache-type-k q4_0 --cache-type-v q4_0 \
 --temp 1.0 --top-p 0.95 --top-k 0 --min-p 0

Notes:

  • -c 131072 sets a 128k (131,072-token) context window. Reduce it if you are memory constrained.
  • The Pi evaluations used a maximum generation budget of 8192 tokens per turn and a reasoning budget of 4096 tokens where the client/harness exposed that control.

llama.cpp without MTP

llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M \
 --jinja -ngl 99 -fa -c 131072 \
 --cache-type-k q4_0 --cache-type-v q4_0 \
 --temp 1.0 --top-p 0.95 --top-k 0 --min-p 0
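
The two server profiles above differ only in the MTP flags (and `-np 1`, which appears only in the MTP command). If you launch the server from a script, a small builder keeps them in sync. This is a sketch: `build_server_args` is a hypothetical helper, every flag is copied verbatim from the commands in this card, and whether your llama.cpp build accepts them is not something the helper checks.

```python
import shlex
from typing import List

REPO = "bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF"


def build_server_args(quant: str = "Q4_K_M", ctx: int = 131072, use_mtp: bool = True) -> List[str]:
    """Assemble the llama-server argument list used in this card's quickstart."""
    args = ["llama-server", "-hf", f"{REPO}:{quant}"]
    if use_mtp:
        # MTP speculative decoding flags, as given in the MTP profile.
        args += ["--spec-type", "draft-mtp", "--spec-draft-n-max", "3", "-np", "1"]
    args += [
        "--jinja", "-ngl", "99", "-fa", "-c", str(ctx),
        "--cache-type-k", "q4_0", "--cache-type-v", "q4_0",
        "--temp", "1.0", "--top-p", "0.95", "--top-k", "0", "--min-p", "0",
    ]
    return args


# Print a copy-pasteable command with a smaller context for tighter memory.
print(shlex.join(build_server_args(ctx=32768)))
```

Pass the list to `subprocess.Popen` to start the server from Python, or use the printed string in a shell.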

Direct / instruct-style turns

For lower-latency direct responses, use Qwen3.6's instruct-style sampling and disable thinking in clients that expose chat-template options.

llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M \
 --jinja -ngl 99 -fa -c 131072 \
 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
 --presence-penalty 1.5

If direct/no-thinking mode is your main use case, the sibling release may be a better fit: bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF.
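
On the client side, a direct turn can carry the same sampling values in the request body. The sketch below builds such a body; it rests on two assumptions that you should verify against your setup: that your llama-server build honors a per-request `chat_template_kwargs` field, and that the chat template reads an `enable_thinking` switch as earlier Qwen3 templates do. `direct_turn_payload` is a hypothetical helper name.

```python
import json


def direct_turn_payload(user_text: str, model: str = "Qwen3.6-27B-MTP-pi-reasoning") -> dict:
    """Build a chat-completions body using the direct-mode sampling values above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        # Sampling values copied from the direct / instruct-style server command.
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "min_p": 0,
        "presence_penalty": 1.5,
        # ASSUMPTION: the server forwards this to the Jinja template and the
        # template understands enable_thinking. Remove it if your build rejects it.
        "chat_template_kwargs": {"enable_thinking": False},
    }


print(json.dumps(direct_turn_payload("Summarize the last test failure in one line."), indent=2))
```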

llama.cpp --jinja template note

Some llama.cpp versions may fail with the embedded Qwen Jinja chat template:

No user query found in messages.

If this happens, use a llama.cpp-compatible fixed Qwen template, for example:

https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

This appears to be a chat-template compatibility issue rather than a model-weight issue.

Ollama

ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M

Download one file

hf download bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF \
 Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
 --local-dir .

Vision / image-text-to-text

This release is compatible with the Qwen3.6 mmproj-F16.gguf sidecar for vision-language inference. The fine-tune is language-side only; image/video understanding is inherited from the upstream base model.

hf download bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF \
 Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
 --local-dir .

hf download unsloth/Qwen3.6-27B-MTP-GGUF \
 mmproj-F16.gguf \
 --local-dir .

llama-server -m ./Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
 --mmproj ./mmproj-F16.gguf \
 --jinja -ngl 99 -fa -c 131072 \
 --temp 1.0 --top-p 0.95 --top-k 0 --min-p 0

For a quick text-and-image session without spinning up a server:

llama-mtmd-cli -m ./Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
 --mmproj ./mmproj-F16.gguf

OpenAI-compatible local API

llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so existing OpenAI-style clients and custom harnesses can point at it directly.

from openai import OpenAI

# Point the OpenAI client at the local llama-server instance.
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",  # placeholder value for a local server
)

resp = client.chat.completions.create(
    model="Qwen3.6-27B-MTP-pi-reasoning",
    messages=[
        {"role": "system", "content": "You are a precise coding agent."},
        {"role": "user", "content": "Inspect this failing test output and propose the next terminal command."},
    ],
    # Reasoning-profile sampling values from the quickstart.
    temperature=1.0,
    top_p=0.95,
)

print(resp.choices[0].message.content)

The same endpoint can be used with streaming and OpenAI-style tool definitions, depending on your client and harness.
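
For harnesses that do not use the OpenAI client, the same request can be assembled with the standard library. The sketch below builds a tool-enabled request and parses tool calls out of a response. The `run_command` tool is a made-up example, not a tool defined by Pi or by this release; the request and response shapes follow the OpenAI chat-completions format, and the 8192-token cap mirrors the per-turn budget mentioned in the quickstart notes.

```python
import json
import urllib.request
from typing import List, Tuple

# Hypothetical tool definition in the OpenAI function-calling schema.
RUN_COMMAND_TOOL = {
    "type": "function",
    "function": {
        "name": "run_command",
        "description": "Run a shell command in the task workspace.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}


def build_tool_request(messages: list, base_url: str = "http://localhost:8080/v1",
                       max_tokens: int = 8192) -> urllib.request.Request:
    """Build (but do not send) a POST request to /chat/completions."""
    payload = {
        "model": "Qwen3.6-27B-MTP-pi-reasoning",
        "messages": messages,
        "tools": [RUN_COMMAND_TOOL],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_tool_calls(response: dict) -> List[Tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from a chat-completions response."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in (message.get("tool_calls") or [])
    ]
```

With a server running, send the request with `urllib.request.urlopen(req)`, decode the JSON body, and pass it to `extract_tool_calls`. Whether the model emits well-formed tool calls depends on the chat template and on the quant in use.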

Training notes

The reasoning release was trained with 4-bit QLoRA SFT on private, passed agent trajectories. The v2 training set contained roughly 1,200 passed trajectories with reasoning preserved as a first-class target.

The dataset emphasized:

  • terminal and shell-environment tasks;
  • tool / function-calling interactions;
  • multi-language code editing and repair;
  • repository issue resolution and test-driven patching;
  • infrastructure, data, package, migration, and verifier-driven tasks.

The key change from the earlier pi-tune release is that reasoning was not stripped out of the training target. The model was trained to preserve the plan-and-act structure that showed up in successful trajectories, then hand off into concrete commands and edits.

Sample traces

A compact public sample of successful reasoning trajectories from the DeepSeek V4 Pro teacher, run under Pi, is available here:

bytkim/deepseek-v4-pro-pi-reasoning-sample-traces

The sample contains 70 pass-only trajectories across seven task providers, formatted as Qwen-style reasoning-target segments with Pi tool-use structure.

Performance profile

Representative local profile from the llama.cpp / Pi setup:

| Measure | Approximate value | What it means |
|---|---|---|
| Prompt / prefill | ~615 tok/s | Reading and processing prompt/context tokens. |
| Decode / generation | ~40 tok/s | Raw generated-token speed reported by llama.cpp. |
| End-to-end request | ~71 tok/s | Combined prompt + decode request throughput. |
| MTP draft acceptance | ~78% | Share of drafted tokens accepted by the main decode path. |
| Effective agent output | ~33 tok/s | Output tokens divided by full agent execution time. |

Throughput depends on hardware, context length, KV cache type, llama.cpp build, draft settings, and task shape. Treat these as a practical profile rather than a guarantee.
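
For rough planning, the prefill and decode rates can be combined into a per-turn latency estimate. This is a back-of-the-envelope sketch using the approximate figures above as defaults; it assumes prefill and decode run back to back and ignores prompt caching, tool execution time, and any slowdown at long context, so real turns will differ.

```python
# Approximate rates from the performance profile table (tokens per second).
PREFILL_TOK_S = 615.0
DECODE_TOK_S = 40.0


def estimate_turn_seconds(prompt_tokens: int, output_tokens: int,
                          prefill_tok_s: float = PREFILL_TOK_S,
                          decode_tok_s: float = DECODE_TOK_S) -> float:
    """Estimate wall-clock seconds for one request: prefill time plus decode time."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


# A 6,150-token prompt with a 400-token reply: 10 s of prefill plus 10 s of decode.
print(f"{estimate_turn_seconds(6150, 400):.1f} s")

# Decode time alone if a turn uses the full 8192-token generation budget.
print(f"{estimate_turn_seconds(0, 8192):.1f} s")
```

The second figure shows why long reasoning turns dominate agent wall-clock time: at about 40 tok/s, a turn that spends its whole generation budget takes a few minutes before any tool runs.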

Recommended use cases

  • Local coding-agent experiments.
  • Tool-heavy chat and function-calling experiments.
  • Repository navigation, patch planning, and test iteration.
  • DevOps troubleshooting and runbook drafting.
  • Long-context engineering workflows where local inference is preferred.
  • Reasoning-heavy debugging, architecture review, data tasks, and verifier-guided repair loops.

Limitations

  • This is a community research release, not a guaranteed drop-in replacement for the base model.
  • All reported Terminal-Bench results are single-run pass@1.
  • The base Pi run used a longer timeout than the reasoning run; timeout-sensitive comparisons should be interpreted carefully.
  • Low-bit quantizations may reduce instruction following, tool-call reliability, and long-horizon task success.
  • Reasoning mode emits extra tokens before final answers; use direct-response mode or the no-thinking sibling release when latency matters more than deliberation.
  • The full training dataset is private; the linked public sample is provided to show representative trace format and task coverage, not to reproduce the full training mixture.
  • The language fine-tune does not improve the inherited vision tower.
  • Use normal caution for generated code, shell commands, dependency changes, and security-sensitive workflows.

License

Released under the Apache 2.0 license inherited from the upstream Qwen3.6-27B base model. You may use, modify, and redistribute the model and derivatives subject to that license.

Acknowledgements

Thanks to the Qwen team for releasing Qwen3.6-27B, to the llama.cpp / ggml community for local inference and MTP support, and to the open-source infrastructure around GGUF quantization and local model serving.

This work also benefited from the broader ecosystem around Terminal-Bench, Pi, OpenHands, Codex-style terminal agents, Hugging Face hosting, and trajectory-analysis tooling.

Built for local agent loops
Reasoning where it helps · tool use where it matters · GGUF for the terminal loop.

  • 40.45%: Pi Terminal-Bench 2.0 pass@1
  • Q4_K_M: recommended start
  • MTP: speculative decoding ready

Format: GGUF · Model size: 27B params · Architecture: qwen35