Qwen3.6 27B tuned for reasoning-heavy local coding agents.
A 4-bit QLoRA SFT reasoning release for Pi-style terminal loops: inspect the repo, plan the fix, run commands, edit files, validate, and recover when the first attempt fails.
Model overview
| Attribute | Details |
|---|---|
| Base model | Qwen/Qwen3.6-27B |
| Format | GGUF |
| Runtime target | llama.cpp / OpenAI-compatible local serving |
| Tuning focus | Pi-style terminal agents, repository work, command use, file edits, validation loops |
| Fine-tuning style | 4-bit QLoRA SFT on private passed agent trajectories |
| Reasoning supervision | Preserved reasoning traces for complex agent tasks |
| Recommended first file | Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf |
| Vision support | Compatible with the Qwen3.6 mmproj-F16.gguf sidecar |
| Technical writeup | Qwen3.6 27B reasoning writeup |
This is the reasoning-oriented sibling to bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF. Use this release for the stronger Pi-style coding-agent behavior. Use the no-thinking / direct tune when you specifically want lower-latency instruct-style turns.
Terminal-Bench 2.0 Pi snapshot
All results below are single-run pass@1 on the 89-task Terminal-Bench 2.0 set, using Pi with llama.cpp and a Q4 GGUF. See the technical writeup for the broader training and evaluation context.
| Run | Harness | Runtime | Quantization | Pass | Total | Score |
|---|---|---|---|---|---|---|
| Base Qwen3.6 27B | Pi | llama.cpp | Q4 GGUF | 38 | 89 | 42.70% |
| Qwen3.6 MTP pi-tune | Pi | llama.cpp | Q4 GGUF | 25 | 89 | 28.09% |
| Qwen3.6 MTP pi-reasoning | Pi | llama.cpp | Q4 GGUF | 36 | 89 | 40.45% |
The base model remained the strongest single Pi run, but the reasoning-tuned model recovered most of the performance lost by the earlier direct/no-thinking tune and showed a different behavioral profile:
- stronger task decomposition;
- more targeted debugging loops;
- better end-to-end validation on some infrastructure and data tasks;
- more willingness to investigate system state before editing.
The tradeoff is that the same exploratory behavior can become over-exploration when the model fails to converge. On some failures it spends too long investigating or validating and can time out.
These are applied harness-fluency results, not a complete measure of Qwen3.6 27B coding ability. They are single-run pass@1; the base run used a longer timeout than the reasoning run, so timeout-related comparisons should be read carefully.
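The scores in the snapshot table, and the statement that the reasoning tune recovered most of the lost performance, can be checked with a few lines of arithmetic. This is a minimal sketch using only the pass counts from the table; the variable names are illustrative, and the recovery figure compares aggregate counts rather than matching individual tasks.

```python
# Pass counts copied from the Terminal-Bench 2.0 Pi snapshot table.
TOTAL_TASKS = 89
PASSES = {
    "base": 38,
    "pi-tune": 25,
    "pi-reasoning": 36,
}


def score(passed: int, total: int = TOTAL_TASKS) -> float:
    """Return pass@1 as a percentage rounded to two decimals."""
    return round(100 * passed / total, 2)


scores = {name: score(n) for name, n in PASSES.items()}

# Passes lost by the direct tune relative to base, and how many the
# reasoning tune regained (an aggregate count, not a per-task match).
lost = PASSES["base"] - PASSES["pi-tune"]
regained = PASSES["pi-reasoning"] - PASSES["pi-tune"]
recovered_fraction = regained / lost

print(scores)
print(f"regained {regained} of {lost} passes ({recovered_fraction:.0%})")
```

Because each figure comes from a single run, a difference of one or two tasks between rows is within what run-to-run variation could plausibly produce.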
Why this tune exists
Qwen3.6 already has strong coding ability and supports long-context, reasoning-capable inference. The goal of this release was narrower: improve how the model behaves inside a lightweight coding harness where the next useful action is often a terminal command, file edit, or verifier check.
The training target is the full agent loop:
understand task → inspect files → plan → run commands → edit → test → recover → finish
The reasoning-tuned variant is most useful when tasks are ambiguous, multi-step, or failure-prone: the cases where a little deliberation before action can save several bad edits later.
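To make the shape of that loop concrete, here is a small hypothetical controller. It is not the Pi harness and not the training code: the stage names are shortened from the loop above, and `run_agent_loop`, `LoopResult`, and the `step` callback are names invented for illustration. The sketch assumes a failed stage triggers a recovery that restarts from planning.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Stage names shortened from the loop above; the controller itself is hypothetical.
STAGES = ["understand", "inspect", "plan", "run", "edit", "test"]


@dataclass
class LoopResult:
    finished: bool
    attempts: int
    history: List[str] = field(default_factory=list)


def run_agent_loop(step: Callable[[str, int], bool], max_attempts: int = 3) -> LoopResult:
    """Walk the stages in order; on a failed stage, record a recovery and retry from 'plan'.

    `step(stage, attempt)` is a caller-supplied stub that returns True on success.
    """
    history: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # The first attempt runs every stage; retries re-plan instead of re-reading the task.
        stages = STAGES if attempt == 1 else STAGES[STAGES.index("plan"):]
        failed = False
        for stage in stages:
            history.append(stage)
            if not step(stage, attempt):
                history.append("recover")
                failed = True
                break
        if not failed:
            history.append("finish")
            return LoopResult(True, attempt, history)
    return LoopResult(False, max_attempts, history)


# Example stub: the test stage fails on the first attempt, then passes.
def flaky_step(stage: str, attempt: int) -> bool:
    return not (stage == "test" and attempt == 1)


result = run_agent_loop(flaky_step)
print(result.attempts, result.history)
```

The `max_attempts` cap reflects the over-exploration tradeoff: a loop that keeps investigating without converging needs an explicit budget.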
Pick a file
Recommended starting point: Q4_K_M.
| Quant | File size | VRAM estimate | Suggested use |
|---|---|---|---|
| Q2_K | ~11 GB | ~13 GB | Smallest footprint; expect quality tradeoffs. |
| Q3_K_S | ~12 GB | ~15 GB | Low-memory 3-bit option. |
| Q3_K_M | ~14 GB | ~16 GB | Balanced 3-bit option. |
| Q3_K_L | ~15 GB | ~17 GB | Higher-quality 3-bit option. |
| Q4_K_S | ~16 GB | ~18 GB | Smaller 4-bit option. |
| Q4_K_M | ~17 GB | ~19 GB | Best first choice for most local users. |
| Q5_K_S | ~19 GB | ~21 GB | Higher-quality 5-bit option. |
| Q5_K_M | ~20 GB | ~22 GB | Strong quality/memory tradeoff. |
| Q6_K | ~22 GB | ~25 GB | High-quality local inference if you have memory headroom. |
| bf16 | ~55 GB | ~58 GB | BF16 GGUF reference. |
VRAM estimates are approximate and depend on context length, KV cache type, GPU offload, batch settings, and runtime build. Longer contexts require more memory.
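As a rough aid for choosing a file, the helper below picks the largest quant whose VRAM estimate fits a given budget. It is only a sketch: the numbers are the approximate estimates from the table, and `largest_quant_that_fits` and its `headroom_gb` margin are hypothetical names, not part of any tool in this release.

```python
from typing import Optional

# (quant, approximate VRAM in GB) copied from the table above, smallest first.
VRAM_ESTIMATES_GB = [
    ("Q2_K", 13), ("Q3_K_S", 15), ("Q3_K_M", 16), ("Q3_K_L", 17),
    ("Q4_K_S", 18), ("Q4_K_M", 19), ("Q5_K_S", 21), ("Q5_K_M", 22),
    ("Q6_K", 25), ("bf16", 58),
]


def largest_quant_that_fits(available_gb: float, headroom_gb: float = 0.0) -> Optional[str]:
    """Return the largest quant whose estimate fits the budget, or None if none does.

    `headroom_gb` is a hypothetical safety margin, e.g. for longer contexts.
    """
    budget = available_gb - headroom_gb
    fitting = [name for name, need in VRAM_ESTIMATES_GB if need <= budget]
    return fitting[-1] if fitting else None


print(largest_quant_that_fits(24))                 # 24 GB card, no margin
print(largest_quant_that_fits(24, headroom_gb=4))  # same card, 4 GB reserved
```

Since the estimates shift with context length and KV cache type, it is safer to treat the result as a starting point and confirm against actual memory use.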
Every quant in this release keeps the MTP nextn prediction heads at Q8_0 precision. If your llama.cpp build supports draft-mtp, speculative decoding can be used with any quant above.
Quickstart
llama.cpp with MTP speculative decoding
This profile matches the Pi evaluation style most closely: long context, reasoning enabled by the harness/client, and MTP draft decoding.
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M \
--spec-type draft-mtp \
--spec-draft-n-max 3 \
-np 1 \
--jinja -ngl 99 -fa -c 131072 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--temp 1.0 --top-p 0.95 --top-k 0 --min-p 0
Notes:
- -c 131072 advertises a 128k context window. Reduce it if you are memory constrained.
- The Pi evaluations used a max generation budget of 8192 tokens per turn and a reasoning budget of 4096 tokens where the client/harness exposed that control.
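On the client side, the per-turn generation budget from the notes above maps to `max_tokens` in a chat-completions request. The sketch below builds such a request body with the standard library only. `build_payload` and `post_chat` are illustrative names, the port is llama-server's default, and passing `top_k` and `min_p` per request is an assumption about llama.cpp's extensions to the OpenAI schema. The reasoning budget is left out because that control is client-specific.

```python
import json
import urllib.request


def build_payload(messages, max_tokens=8192):
    """Chat-completions body using the sampling values from the llama-server command above."""
    return {
        "messages": messages,
        "max_tokens": max_tokens,  # per-turn generation budget used in the Pi evaluations
        "temperature": 1.0,
        "top_p": 0.95,
        # Assumption: llama-server accepts these llama.cpp-specific fields per request.
        "top_k": 0,
        "min_p": 0.0,
    }


def post_chat(payload, base_url="http://localhost:8080/v1"):
    """POST the payload to a running llama-server and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_payload([{"role": "user", "content": "List the failing tests."}])
print(json.dumps(payload, indent=2))
```

`post_chat(payload)` only works with the server from the quickstart running; building the payload does not need one.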
llama.cpp without MTP
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M \
--jinja -ngl 99 -fa -c 131072 \
--cache-type-k q4_0 --cache-type-v q4_0 \
--temp 1.0 --top-p 0.95 --top-k 0 --min-p 0
Direct / instruct-style turns
For lower-latency direct responses, use Qwen3.6's instruct-style sampling and disable thinking in clients that expose chat-template options.
llama-server -hf bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M \
--jinja -ngl 99 -fa -c 131072 \
--temp 0.7 --top-p 0.8 --top-k 20 --min-p 0 \
--presence-penalty 1.5
If direct/no-thinking mode is your main use case, the sibling release may be a better fit: bytkim/Qwen3.6-27B-MTP-pi-tune-GGUF.
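For clients built on the OpenAI Python package, direct mode can also be requested per call instead of through server flags. The sketch below only builds the keyword arguments. It assumes the server forwards `chat_template_kwargs` to the chat template and that the template honours an `enable_thinking` switch, as Qwen3-family templates do; neither is confirmed for this release. `direct_mode_kwargs` is a hypothetical helper name.

```python
def direct_mode_kwargs(messages, model="Qwen3.6-27B-MTP-pi-reasoning"):
    """Keyword arguments for `client.chat.completions.create` in direct mode."""
    return {
        "model": model,
        "messages": messages,
        # Instruct-style sampling values from the command above.
        "temperature": 0.7,
        "top_p": 0.8,
        "presence_penalty": 1.5,
        # Fields outside the OpenAI schema travel in `extra_body`.
        "extra_body": {
            "top_k": 20,
            "min_p": 0.0,
            # Assumption: the chat template reads `enable_thinking`.
            "chat_template_kwargs": {"enable_thinking": False},
        },
    }


kwargs = direct_mode_kwargs([{"role": "user", "content": "Rename this variable."}])
print(kwargs["extra_body"])
```

With a client configured as in the OpenAI-compatible API section, the call would be `client.chat.completions.create(**kwargs)`.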
llama.cpp --jinja template note
Some llama.cpp versions may fail with the embedded Qwen Jinja chat template:
No user query found in messages.
If this happens, use a llama.cpp-compatible fixed Qwen template, for example:
https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates
This appears to be a chat-template compatibility issue rather than a model-weight issue.
Ollama
ollama run hf.co/bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF:Q4_K_M
Download one file
hf download bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF \
Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
--local-dir .
Vision / image-text-to-text
This release is compatible with the Qwen3.6 mmproj-F16.gguf sidecar for vision-language inference. The fine-tune is language-side only; image/video understanding is inherited from the upstream base model.
hf download bytkim/Qwen3.6-27B-MTP-pi-reasoning-GGUF \
Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
--local-dir .
hf download unsloth/Qwen3.6-27B-MTP-GGUF \
mmproj-F16.gguf \
--local-dir .
llama-server -m ./Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
--mmproj ./mmproj-F16.gguf \
--jinja -ngl 99 -fa -c 131072 \
--temp 1.0 --top-p 0.95 --top-k 0 --min-p 0
For a quick text-and-image session without spinning up a server:
llama-mtmd-cli -m ./Qwen3.6-27B-MTP-pi-reasoning-Q4_K_M.gguf \
--mmproj ./mmproj-F16.gguf
OpenAI-compatible local API
llama-server exposes an OpenAI-compatible /v1/chat/completions endpoint, so existing OpenAI-style clients and custom harnesses can point at it directly.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed",
)

resp = client.chat.completions.create(
    model="Qwen3.6-27B-MTP-pi-reasoning",
    messages=[
        {"role": "system", "content": "You are a precise coding agent."},
        {"role": "user", "content": "Inspect this failing test output and propose the next terminal command."},
    ],
    temperature=1.0,
    top_p=0.95,
)

print(resp.choices[0].message.content)
The same endpoint can be used with streaming and OpenAI-style tool definitions, depending on your client and harness.
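As an illustration of the tool-definition path, the sketch below declares one tool in the OpenAI function-calling format and a local dispatcher for it. The `run_command` tool, its schema, and `dispatch_tool_call` are hypothetical; Pi's actual tool set is not described here, and a real harness would add sandboxing and stricter limits.

```python
import json
import shlex
import subprocess

# Hypothetical tool schema in the OpenAI function-calling format.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_command",
        "description": "Run a command and return its exit code and output.",
        "parameters": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
}]


def dispatch_tool_call(name: str, arguments: str, timeout: float = 30.0) -> str:
    """Execute one tool call; `arguments` is the JSON string emitted by the model."""
    if name != "run_command":
        return f"error: unknown tool {name!r}"
    try:
        argv = shlex.split(json.loads(arguments)["command"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError) as exc:
        return f"error: bad arguments ({exc})"
    if not argv:
        return "error: empty command"
    try:
        # No shell is involved: argv runs directly, so pipes and globs are not expanded.
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"error: {exc}"
    return json.dumps(
        {"exit_code": proc.returncode, "stdout": proc.stdout, "stderr": proc.stderr}
    )


print(dispatch_tool_call("run_command", json.dumps({"command": "echo hi"})))
```

In use, `tools=TOOLS` would be passed to `client.chat.completions.create`, and each entry in `resp.choices[0].message.tool_calls` dispatched with its `function.name` and `function.arguments`. The result goes back to the model as a `tool` role message.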
Training notes
The reasoning release was trained with 4-bit QLoRA SFT on private, passed agent trajectories. The v2 training set contained roughly 1,200 passed trajectories with reasoning preserved as a first-class target.
The dataset emphasized:
- terminal and shell-environment tasks;
- tool / function-calling interactions;
- multi-language code editing and repair;
- repository issue resolution and test-driven patching;
- infrastructure, data, package, migration, and verifier-driven tasks.
The key change from the earlier pi-tune release is that reasoning was not stripped out of the training target. The model was trained to preserve the plan-and-act structure that showed up in successful trajectories, then hand off into concrete commands and edits.
Sample traces
A compact public sample of successful DeepSeek V4 Pro Pi reasoning teacher trajectories is available here:
bytkim/deepseek-v4-pro-pi-reasoning-sample-traces
The sample contains 70 pass-only trajectories across seven task providers, formatted as Qwen-style reasoning-target segments with Pi tool-use structure.
Performance profile
Representative local profile from the llama.cpp / Pi setup:
| Measure | Approximate value | What it means |
|---|---|---|
| Prompt / prefill | ~615 tok/s | Reading and processing prompt/context tokens. |
| Decode / generation | ~40 tok/s | Raw generated-token speed reported by llama.cpp. |
| End-to-end request | ~71 tok/s | Combined prompt + decode request throughput. |
| MTP draft acceptance | ~78% | Share of drafted tokens accepted by the main decode path. |
| Effective agent output | ~33 tok/s | Output tokens divided by full agent execution time. |
Throughput depends on hardware, context length, KV cache type, llama.cpp build, draft settings, and task shape. Treat these as a practical profile rather than a guarantee.
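For a back-of-the-envelope feel for what these rates mean per turn, prefill and decode time can be added together. This is a simplified sketch that treats the two table values as constants and ignores prompt caching, tool execution time, and harness overhead; `estimate_turn_seconds` is an illustrative name.

```python
# Approximate rates from the profile table above (tokens per second).
PREFILL_TOK_S = 615
DECODE_TOK_S = 40


def estimate_turn_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock estimate: prefill time plus decode time, ignoring prompt caching."""
    return prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S


# A turn that uses the full 8192-token generation budget on a 20k-token prompt.
print(f"{estimate_turn_seconds(20_000, 8192):.0f} s")
```

Under these assumptions decode dominates long generations, which is consistent with exploratory runs being the ones that risk timing out.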
Recommended use cases
- Local coding-agent experiments.
- Tool-heavy chat and function-calling experiments.
- Repository navigation, patch planning, and test iteration.
- DevOps troubleshooting and runbook drafting.
- Long-context engineering workflows where local inference is preferred.
- Reasoning-heavy debugging, architecture review, data tasks, and verifier-guided repair loops.
Limitations
- This is a community research release, not a guaranteed drop-in replacement for the base model.
- All reported Terminal-Bench results are single-run pass@1.
- The base Pi run used a longer timeout than the reasoning run; timeout-sensitive comparisons should be interpreted carefully.
- Low-bit quantizations may reduce instruction following, tool-call reliability, and long-horizon task success.
- Reasoning mode emits extra tokens before final answers; use direct-response mode or the no-thinking sibling release when latency matters more than deliberation.
- The full training dataset is private; the linked public sample is provided to show representative trace format and task coverage, not to reproduce the full training mixture.
- The language fine-tune does not improve the inherited vision tower.
- Use normal caution for generated code, shell commands, dependency changes, and security-sensitive workflows.
License
Released under the Apache 2.0 license inherited from the upstream Qwen3.6-27B base model. You may use, modify, and redistribute the model and derivatives subject to that license.
Acknowledgements
Thanks to the Qwen team for releasing Qwen3.6-27B, to the llama.cpp / ggml community for local inference and MTP support, and to the open-source infrastructure around GGUF quantization and local model serving.
This work also benefited from the broader ecosystem around Terminal-Bench, Pi, OpenHands, Codex-style terminal agents, Hugging Face hosting, and trajectory-analysis tooling.
