Fixed jinja chat templates for Qwen 3.5 & 3.6 (v20)
This is a drop-in Jinja template that fixes rendering errors, KV cache invalidation, token waste, and fatal agentic stalling in the official Qwen chat templates.
It has been tested with LM Studio, llama.cpp, vLLM, MLX, and oMLX, and should work with any engine that supports HuggingFace Jinja templates.
Why you need this
The official Qwen templates contain restrictions and Python-specific Jinja logic that break usage on many inference engines and agent frameworks.
Here are the critical issues this template fixes:
| Category | Problem | Impact | Fix |
|---|---|---|---|
| Agentic Loop | Premature Stalls (Stopping Bug) | Model aborts its turn (<\|im_end\|>) when trying to combine conversation and a tool call. | Resolved the System Prompt logic trap and cured "Empty Think" poisoning (v19). |
| Agentic Loop | Retry Stall & Reasoning Spiral | Model correctly diagnoses a tool error but repeatedly emits the identical failing <tool_call>. | Two-tier escalation: seeds <think> with correction directive; injects urgent out-of-band directive. |
| Agentic Loop | Post-Tool Overthinking | Forced <think> block prefilling causes model to panic and debate internal rules after fetching data. | Broadened instructions to define <think> as a dual-purpose space for planning or synthesis. |
| Agentic Loop | False-Positive Error Detection | Short successful API/JSON returns containing the word error trigger false retry loops. | Strict structural guards look for exact system failures ("error":, Traceback, etc.) instead of broad words (v18). |
| Performance | KV Cache Invalidation | History pruning dynamically mutates past turns, causing full prompt re-processing every turn. | preserve_thinking defaults to true, maintaining strict chronological rendering for a 100% KV cache hit rate (v19). |
| Performance | Empty Think Poisoning | Stripped past turns leave behind empty <think></think> tags, tricking the model into a severe in-context learning bias. | Template completely abolishes the injection of empty think blocks (v19). |
| Compatibility | Legacy Engine Crashes | Older C++ parsing engines crash when evaluating loop.previtem. | Uses strict chronological array indexing universally supported by all Jinja iterations (v18). |
| Compatibility | Wrong Tool Call Format | Qwen-native parsers (like vLLM's qwen3_coder) expect XML <function=name>. JSON format breaks them. | Restored native XML format while keeping C++ safety. |
| Compatibility | Jinja C++ Crashes | Python-specific filters (map, first on strings) crash on minijinja. | All filters replaced with universally compatible equivalents. |
| Stability | Mid-Conversation System Crash | Frameworks injecting mid-conversation steering instructions trigger a hard crash. | Native, chronological rendering for system messages anywhere in the history. |
| Stability | No-User-Query Crash | raise_exception crashes agentic loops or system-only contexts. | Graceful fallback implemented. |
| Stability | Unclosed Thinking Before Tool | Model calls a tool without closing its reasoning, bleeding XML tags into tool parsers. | Auto-injects closing tags before tool boundaries securely. |
| Edge Cases | developer Role Rejected | Modern APIs send the developer role; the official template rejects it. | Added full support for "developer". |
| Edge Cases | --reasoning off Ignored | When thinking is disabled, tool error escalation still opened a <think> block, corrupting the prompt. | Error escalation branches now fully respect enable_thinking=false. |
| Edge Cases | Reasoning Bypass Hallucinations | When thinking is disabled, Qwen models inherently hallucinate reasoning tags anyway. | Injects a safe boundary to successfully force reasoning bypass without stacking newlines (v18). |
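
To illustrate the False-Positive Error Detection row above, the difference between a broad word match and a structural guard can be sketched in Python. This is a minimal sketch and not the template's actual Jinja logic: the function names and the marker list are assumptions, and only the two markers `"error":` and `Traceback` are taken from the table.

```python
# Hypothetical marker list: only '"error":' and 'Traceback' come from the table.
FAILURE_MARKERS = ('"error":', "Traceback")


def naive_check(tool_output: str) -> bool:
    """Broad check: any mention of the word 'error' counts as a failure."""
    return "error" in tool_output.lower()


def looks_like_tool_failure(tool_output: str) -> bool:
    """Structural check: require an exact failure marker in the output."""
    return any(marker in tool_output for marker in FAILURE_MARKERS)
```

A successful payload such as `{"status": "ok", "error_count": 0}` trips the broad check but not the structural one, which is the false retry loop the fix is meant to avoid.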
Quick install
Choose your environment and update the template:
LM Studio
- Open your Qwen model in the right-side panel.
- Scroll down to Prompt Template.
- Replace the template with the contents of chat_template.jinja.
- Click Save.
llama.cpp / koboldcpp
--jinja --chat-template-file chat_template.jinja
vLLM
Replace the "chat_template" string in your tokenizer_config.json with the raw file contents. Use the qwen3_coder tool parser:
--tool-call-parser qwen3_coder
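
The tokenizer_config.json replacement described above can be scripted instead of pasted by hand. The following is a minimal sketch, assuming both files are readable local files; the function name `install_chat_template` is hypothetical. Letting the `json` module write the file avoids manually escaping the quotes and newlines in the template.

```python
import json
from pathlib import Path


def install_chat_template(config_path: Path, template_path: Path) -> None:
    """Overwrite the "chat_template" key in tokenizer_config.json."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Store the raw template text; json.dumps escapes it when writing.
    config["chat_template"] = template_path.read_text(encoding="utf-8")
    config_path.write_text(
        json.dumps(config, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
```

For example, `install_chat_template(Path("tokenizer_config.json"), Path("chat_template.jinja"))` run from inside the model directory. Keep a backup of the original config, since the file is rewritten in place.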
oMLX
Overwrite chat_template.jinja in your local model directory. Load with --jinja. Remove any chat_template_kwargs overrides because the template handles everything internally.
Which file do I use?
The templates for Qwen 3.5 and Qwen 3.6 (all sizes, including 35B, 32B, 27B, and 14B) have been consolidated. You only need the single chat_template.jinja file at the root of the repository.
One-line versions (chat_template_oneline.txt) are pre-minified for engines that require a single-line template string.
The thinking toggle
You can control the model's reasoning behavior by inserting <|think_on|> or <|think_off|> anywhere in your system or user prompt.
The template natively intercepts the tag, removes it from the final context so the model never sees it, and flips the reasoning mode instantly.
Fast answer, no reasoning:
System: You are a coding assistant. <|think_off|>
User: What's 2+2?
Deep reasoning:
System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.
(The tag syntax uses Qwen's control-token delimiters so that it does not collide with legitimate text or file paths, unlike the /think tag used by earlier community templates.)
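
The intercept-and-strip behavior described above can be sketched in Python. This is an illustration and not the template's Jinja code: the function name `apply_think_toggle` is hypothetical, message content is assumed to be a plain string, and the rule that the last tag wins when several appear is an assumption.

```python
import re

# Matches either inline toggle tag.
TOGGLE_RE = re.compile(r"<\|think_(on|off)\|>")


def apply_think_toggle(messages, default_thinking=True):
    """Strip toggle tags from system/user messages and return the mode.

    Assumption: if several tags appear, the last one wins.
    """
    thinking = default_thinking
    cleaned = []
    for message in messages:
        content = message["content"]
        if message["role"] in ("system", "user"):
            for match in TOGGLE_RE.finditer(content):
                thinking = match.group(1) == "on"
            # Remove the tag so the model never sees it, then trim the ends.
            content = TOGGLE_RE.sub("", content).strip()
        cleaned.append({**message, "content": content})
    return cleaned, thinking
```

Applied to the "Fast answer" example, the system message becomes "You are a coding assistant." and the returned mode is False.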
Token Saving: Stripping past thoughts
As of v19, this template preserves all past <think> blocks in the chat history by default. This is intentional: it prevents the model from suffering "amnesia stalls" during complex, multi-step agentic loops, and it keeps previously rendered turns unchanged, so local inference engines with prefix caching can reuse the KV cache for the entire earlier history.
However, if you are running constrained hardware and need to save context tokens, you can explicitly disable this feature in your engine's template kwargs to automatically strip past thoughts:
{
"preserve_thinking": false
}
(Note: Setting this to false reduces your KV cache hit rate during multi-turn chats, because the rendered text of past turns changes between requests.)
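
Why stripping past thoughts costs cache hits can be shown with a toy renderer. This is a sketch under stated assumptions, not the real template: `render` and `reusable_prefix_len` are hypothetical helpers, the turn format is simplified, and the rule "strip every assistant turn except the last one in the list" stands in for the template's actual pruning logic.

```python
import re
from os.path import commonprefix

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)


def render(turns, preserve_thinking=True):
    """Toy renderer for a list of (role, content) turns."""
    parts = []
    last = len(turns) - 1
    for i, (role, content) in enumerate(turns):
        # Assumed pruning rule: strip thoughts from earlier assistant turns.
        if role == "assistant" and not preserve_thinking and i != last:
            content = THINK_RE.sub("", content)
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    return "".join(parts)


def reusable_prefix_len(cached, new_prompt):
    """Number of leading characters the two prompts share."""
    return len(commonprefix([cached, new_prompt]))
```

With preserve_thinking=True the prompt for the next turn starts with exactly the text the engine already processed, so the whole cached prefix is reusable. With preserve_thinking=False the shared prefix ends where the first stripped <think> block used to begin, and everything after that point has to be processed again.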
Running the test suite
python3 scripts/test_v20.py
Tests cover: auto_disable_thinking_with_tools, payload truncation logic, parallel tool spacing, mid-conversation system rendering, deep agent loop fallback, XML tool format, <|think_off|> / <|think_on|> inline overrides, and all legacy v19 regression tests.
Authorship
| Role | Author |
|---|---|
| Original models | Alibaba Cloud (Qwen team) |
| Template fixes | froggeric |
| C++ AST optimizations | barubary / spiritbuun |
License
Apache-2.0, inherited from Qwen.
