Fixed jinja chat templates for Qwen 3.5 & 3.6 (v20)

This is a drop-in Jinja template that fixes rendering errors, KV cache invalidation, token waste, and fatal agentic stalling in the official Qwen chat templates.

It is tested to work across LM Studio, llama.cpp, vLLM, MLX, oMLX, and any engine that supports HuggingFace Jinja templates.

Why you need this

The official Qwen templates contain restrictions and Python-specific Jinja logic that break usage on many inference engines and agent frameworks.

Here are the critical issues this template fixes:

Category	Problem	Impact	Fix
Agentic Loop	Premature Stalls (Stopping Bug)	Model aborts its turn (`<\|im_end\|>`) when trying to combine conversation and a tool call.	Resolved the System Prompt logic trap and cured "Empty Think" poisoning (v19).
Agentic Loop	Retry Stall & Reasoning Spiral	Model correctly diagnoses a tool error but repeatedly emits the identical failing `<tool_call>`.	Two-tier escalation: seeds `<think>` with correction directive; injects urgent out-of-band directive.
Agentic Loop	Post-Tool Overthinking	Forced `<think>` block prefilling causes model to panic and debate internal rules after fetching data.	Broadened instructions to define `<think>` as a dual-purpose space for planning or synthesis.
Agentic Loop	False-Positive Error Detection	Short successful API/JSON returns containing the word `error` trigger false retry loops.	Strict structural guards look for exact system failures (`"error":`, `Traceback`, etc.) instead of broad words (v18).
Performance	KV Cache Invalidation	History pruning dynamically mutates past turns, causing full prompt re-processing every turn.	`preserve_thinking` defaults to `true`, maintaining strict chronological rendering for a 100% KV cache hit rate (v19).
Performance	Empty Think Poisoning	Stripped past turns leave behind empty `<think></think>` tags, tricking the model into a severe in-context learning bias.	Template completely abolishes the injection of empty think blocks (v19).
Compatibility	Legacy Engine Crashes	Older C++ parsing engines crash when evaluating `loop.previtem`.	Uses strict chronological array indexing universally supported by all Jinja iterations (v18).
Compatibility	Wrong Tool Call Format	Qwen-native parsers (like vLLM's `qwen3_coder`) expect XML `<function=name>`. JSON format breaks them.	Restored native XML format while keeping C++ safety.
Compatibility	Jinja C++ Crashes	Python-specific filters (`map`, `first` on strings) crash on `minijinja`.	All filters replaced with universally compatible equivalents.
Stability	Mid-Conversation System Crash	Frameworks injecting mid-conversation steering instructions trigger a hard crash.	Native, chronological rendering for system messages anywhere in the history.
Stability	No-User-Query Crash	`raise_exception` crashes agentic loops or system-only contexts.	Graceful fallback implemented.
Stability	Unclosed Thinking Before Tool	Model calls a tool without closing its reasoning, bleeding XML tags into tool parsers.	Auto-injects closing tags before tool boundaries securely.
Edge Cases	`developer` Role Rejected	Modern APIs send the developer role; the official template rejects it.	Added full support for `"developer"`.
Edge Cases	`--reasoning off` Ignored	When thinking is disabled, tool error escalation still opened a `<think>` block, corrupting the prompt.	Error escalation branches now fully respect `enable_thinking=false`.
Edge Cases	Reasoning Bypass Hallucinations	When thinking is disabled, Qwen models inherently hallucinate reasoning tags anyway.	Injects a safe boundary to successfully force reasoning bypass without stacking newlines (v18).

Quick install

Choose your environment and update the template:

LM Studio

Open your Qwen model in the right-side panel.
Scroll down to Prompt Template.
Replace the template with the contents of chat_template.jinja.
Click Save.

llama.cpp / koboldcpp

--jinja --chat-template-file chat_template.jinja

vLLM

Replace the "chat_template" string in your tokenizer_config.json with the raw file contents. Use the qwen3_coder tool parser:

--tool-call-parser qwen3_coder

oMLX

Overwrite chat_template.jinja in your local model directory. Load with --jinja. Remove any chat_template_kwargs overrides because the template handles everything internally.

Which file do I use?

Both Qwen 3.5 and Qwen 3.6 variants (including 35B, 32B, 27B, and 14B parameters) have been consolidated. You only need the single chat_template.jinja file at the root of the repository.

One-line versions (chat_template_oneline.txt) are pre-minified for engines that require a single-line template string.

The thinking toggle

You can control the model reasoning behavior. Insert <|think_on|> or <|think_off|> anywhere in your system or user prompt.

The template natively intercepts the tag, removes it from the final context so the model never sees it, and flips the reasoning mode instantly.

Fast answer, no reasoning:

System: You are a coding assistant. <|think_off|>
User: What's 2+2?

Deep reasoning:

System: You are a coding assistant. <|think_on|>
User: Implement a red-black tree in Rust.

(The tag syntax uses Qwen's control-token delimiters to guarantee it will never collide with legitimate text or file paths, unlike earlier community templates that used /think)

Token Saving: Stripping past thoughts

By default in v19, this template preserves all past <think> blocks in the chat history. This is intentional: it prevents the model from suffering "amnesia stalls" during complex, multi-step agentic loops, and it mathematically guarantees a 100% Prefix KV Cache hit rate on local inference engines.

However, if you are running constrained hardware and need to save context tokens, you can explicitly disable this feature in your engine's template kwargs to automatically strip past thoughts:

{
 "preserve_thinking": false
}

(Note: Setting this to false will naturally reduce your KV Cache hit rate during multi-turn chats, as the prompt string will dynamically mutate).

Running the test suite

python3 scripts/test_v20.py

Tests cover: auto_disable_thinking_with_tools, payload truncation logic, parallel tool spacing, mid-conversation system rendering, deep agent loop fallback, XML tool format, <|think_off|> / <|think_on|> inline overrides, and all legacy v19 regression tests.

Authorship

Role	Author
Original models	Alibaba Cloud (Qwen team)
Template fixes	froggeric
C++ AST optimizations	barubary / `spiritbuun`

License

Apache-2.0, inherited from Qwen.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including froggeric/Qwen-Fixed-Chat-Templates

Rewritten Jinja templates fixing 5 bugs in official Qwen 3.5/3.6. Works in LM Studio, llama.cpp, MLX, vLLM. • 1 item • Updated Apr 30 • 4

URL: https://huggingface.co/froggeric/Qwen-Fixed-Chat-Templates

⇱ froggeric/Qwen-Fixed-Chat-Templates · Hugging Face