Reasoning Outputs

vLLM offers support for reasoning models like DeepSeek R1, which are designed to generate outputs containing both reasoning steps and final conclusions.

Reasoning models return an additional reasoning field in their outputs, which contains the reasoning steps that led to the final conclusion. This field is not present in the outputs of other models.

Warning

The reasoning field was previously named reasoning_content. To migrate, replace reasoning_content with reasoning.

Supported Models

vLLM currently supports the following reasoning models:

Model Series	Parser Name	Structured Output Support	Tool Calling
Cohere Command A Reasoning	`cohere_command3`	`json`, `regex`	✅
DeepSeek R1 series	`deepseek_r1`	`json`, `regex`	❌
Gemma 4 series	`gemma4`	`json`, `regex`	✅
DeepSeek-V3.1	`deepseek_v3`	`json`, `regex`	❌
ERNIE-4.5-VL series	`ernie45`	`json`, `regex`	❌
ERNIE-4.5-21B-A3B-Thinking	`ernie45`	`json`, `regex`	✅
GLM-4.5 series	`glm45`	`json`, `regex`	✅
Holo2 series	`holo2`	`json`, `regex`	✅
Hunyuan A13B series	`hunyuan_a13b`	`json`, `regex`	✅
IBM Granite 3.2 language models	`granite`	❌	❌
MiniMax-M2	`minimax_m2_append_think`	`json`, `regex`	✅
Qwen3 series	`qwen3`	`json`, `regex`	✅
QwQ-32B	`deepseek_r1`	`json`, `regex`	✅

Note

IBM Granite 3.2 and DeepSeek-V3.1 reasoning is disabled by default; to enable it, pass thinking=True in your chat_template_kwargs.
Qwen3 series reasoning is enabled by default; to disable it, pass enable_thinking=False in your chat_template_kwargs.
Gemma 4 reasoning is disabled by default; to enable it, pass enable_thinking=True in your chat_template_kwargs or set reasoning_effort (which enables it automatically).
DeepSeek-V3.1 tool calling is supported in non-thinking mode.
Holo2 reasoning is enabled by default; to disable it, pass thinking=False in your chat_template_kwargs.

Quickstart

To use reasoning models, specify the --reasoning-parser flag when starting the server. The flag selects the reasoning parser used to extract reasoning content from the model output.

vllm serve deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B \
    --reasoning-parser deepseek_r1

Next, send a request to the model. The response should include the reasoning content.

The reasoning field contains the reasoning steps that led to the final conclusion, while the content field contains the final conclusion.
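A minimal sketch of reading both fields from a chat completion follows. It assumes the server from the command above is running and that `client` is an `openai.OpenAI` instance pointed at it; `split_reasoning` and `ask` are hypothetical helper names, not part of vLLM or the OpenAI client.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, content) from a chat completion message.

    `reasoning` is an extra field added by the server, so it is read
    defensively: it is absent when no reasoning parser is active.
    """
    reasoning = getattr(message, "reasoning", None)
    return reasoning, message.content


def ask(client, model, question):
    """Send one chat request and split the reply.

    Assumes `client` was created along the lines of
    OpenAI(base_url="http://localhost:8000/v1", api_key="dummy").
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return split_reasoning(response.choices[0].message)


# Offline check with a stand-in message object (no server needed).
fake = SimpleNamespace(reasoning="9.8 = 9.80 > 9.11", content="9.8 is greater.")
print(split_reasoning(fake))
```

With a live server, `ask(client, "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", "9.11 and 9.8, which is greater?")` would return the same kind of pair.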

Streaming chat completions

Streaming chat completions are also supported for reasoning models. The reasoning field is available in the delta field in chat completion response chunks.

The OpenAI Python client library does not officially support the reasoning attribute for streaming output, but it does preserve extra attributes in the response. You can use hasattr to check whether the reasoning attribute is present.

Always check that reasoning exists in the response before accessing it.
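The `hasattr` check can be sketched as follows. `collect_stream` is a hypothetical helper; it assumes each chunk follows the usual chat completion chunk shape (`chunk.choices[0].delta`), and the fake chunks stand in for a real stream so the logic can be exercised without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate reasoning and content text from streamed chunks."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices (e.g. usage-only chunks)
        delta = chunk.choices[0].delta
        # `reasoning` is an extra attribute, so check before accessing it.
        if hasattr(delta, "reasoning") and delta.reasoning:
            reasoning_parts.append(delta.reasoning)
        if getattr(delta, "content", None):
            content_parts.append(delta.content)
    return "".join(reasoning_parts), "".join(content_parts)


def _chunk(**delta_fields):
    """Build a stand-in chunk with the given delta attributes."""
    delta = SimpleNamespace(**delta_fields)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _chunk(reasoning="Compare 9.11 ", content=None),
    _chunk(reasoning="with 9.80.", content=None),
    _chunk(content="9.8 is greater."),  # no reasoning attribute at all
    SimpleNamespace(choices=[]),
]
print(collect_stream(fake_stream))
```

With a real client, the same function could be passed the iterator returned by `client.chat.completions.create(..., stream=True)`.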

Tool Calling

The reasoning content is also available when both tool calling and the reasoning parser are enabled. Note that tool calling only parses functions from the content field, not from the reasoning.

For more examples, please refer to examples/reasoning/openai_chat_completion_tool_calls_with_reasoning.py.
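As a rough illustration, the sketch below reads the reasoning and the parsed tool calls from the same message. It assumes the server was started with tool calling enabled in addition to the reasoning parser; the `get_weather` tool and the `summarize` helper are hypothetical, and the fake message stands in for `response.choices[0].message`.

```python
import json
from types import SimpleNamespace

# Hypothetical tool definition in the OpenAI function-calling schema.
# It would be passed as tools=[WEATHER_TOOL] to client.chat.completions.create.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def summarize(message):
    """Return the reasoning text and the (name, arguments) of each tool call."""
    calls = [
        (call.function.name, json.loads(call.function.arguments))
        for call in (getattr(message, "tool_calls", None) or [])
    ]
    return {"reasoning": getattr(message, "reasoning", None), "tool_calls": calls}


# Stand-in for a message returned by the server.
fake_message = SimpleNamespace(
    reasoning="The user asks about Paris, so call get_weather.",
    content=None,
    tool_calls=[
        SimpleNamespace(
            function=SimpleNamespace(name="get_weather", arguments='{"city": "Paris"}')
        )
    ],
)
print(summarize(fake_message))
```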

Server-Level Default Chat Template Kwargs

You can set default chat_template_kwargs at the server level using the --default-chat-template-kwargs CLI argument. This is useful for configuring reasoning behavior across all requests without requiring clients to specify it in each request.

Disabling Thinking Mode by Default

For models like Qwen3 where thinking is enabled by default, you can disable it server-wide:

vllm serve Qwen/Qwen3-8B \
    --reasoning-parser qwen3 \
    --default-chat-template-kwargs '{"enable_thinking": false}'

Enabling Thinking Mode by Default

For models like IBM Granite 3.2 or DeepSeek-V3.1 where thinking is disabled by default, you can enable it server-wide:

vllm serve ibm-granite/granite-3.2-2b-instruct \
    --reasoning-parser granite \
    --default-chat-template-kwargs '{"thinking": true}'

Request-Level Override

Request-level chat_template_kwargs always take priority over server defaults. For example, if the server is started with enable_thinking=false, a client can still enable it for a specific request:

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},  # Overrides server default
)
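The precedence rule can be pictured as a dictionary merge in which request values win. This is only an illustration of the documented behavior, not vLLM's internal code; `effective_chat_template_kwargs` is a hypothetical name.

```python
def effective_chat_template_kwargs(server_defaults, request_kwargs=None):
    """Merge server defaults with request kwargs; request values take priority."""
    merged = dict(server_defaults)       # start from the server-level defaults
    merged.update(request_kwargs or {})  # request-level keys override them
    return merged


server_defaults = {"enable_thinking": False}

# No request-level kwargs: the server default applies.
print(effective_chat_template_kwargs(server_defaults))

# The request re-enables thinking for this call only.
print(effective_chat_template_kwargs(server_defaults, {"enable_thinking": True}))
```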

Thinking Budget Control

Some models, such as Qwen3, DeepSeek, and Nemotron3, support a thinking budget that limits the maximum number of tokens used for reasoning.

Token counting starts from reasoning_start_str. Once the reasoning token count reaches the configured thinking_token_budget, vLLM forces the model to produce reasoning_end_str, effectively terminating the reasoning block.
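The budget rule can be illustrated with a toy simulation. This is not vLLM's implementation: `cap_reasoning` is a hypothetical helper, tokens are modeled as plain strings, and the sketch assumes the start marker itself does not count toward the budget.

```python
def cap_reasoning(tokens, start, end, budget):
    """Emit tokens until reasoning ends, forcing `end` once the budget is used up.

    Returns the emitted tokens up to and including the end marker.
    """
    emitted = []
    counting = False  # True once the start marker has been seen
    used = 0          # reasoning tokens emitted so far
    for tok in tokens:
        if counting and tok != end and used >= budget:
            emitted.append(end)  # budget exhausted: force the end marker
            return emitted
        emitted.append(tok)
        if tok == start:
            counting = True
        elif tok == end:
            return emitted       # the model closed the block on its own
        elif counting:
            used += 1
    return emitted


wanted = ["<think>", "a", "b", "c", "d", "</think>", "answer"]
print(cap_reasoning(wanted, "<think>", "</think>", budget=2))
print(cap_reasoning(wanted, "<think>", "</think>", budget=10))
```

In real generation the model keeps producing the final answer after the forced end marker; the toy version simply stops there.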

To use this feature:

--reasoning-parser enables reasoning extraction.
--reasoning-config defines the reasoning boundary tokens (e.g., reasoning_start_str, reasoning_end_str). If not set, vLLM will attempt to automatically initialize these tokens from the reasoning parser.
thinking_token_budget (a sampling parameter) sets the per-request reasoning token limit.

If thinking_token_budget is not specified, no explicit reasoning limit is applied beyond normal generation constraints such as max_tokens.

--reasoning-config accepts a JSON object corresponding to ReasoningConfig with the following fields:

Field	Type	Description
`reasoning_start_str`	`str | null`	String that marks the start of reasoning content
`reasoning_end_str`	`str | null`	String that marks the end of reasoning content

Note

reasoning_end_str can include a transition phrase before the reasoning end token. For example, setting reasoning_end_str to "I have to give the solution based on the reasoning directly now.</think>" instructs the model to emit that phrase when the budget is exhausted, making the reasoning termination more natural.

Online Serving

vllm serve Qwen/Qwen3-0.6B \
    --reasoning-parser qwen3 \
    --reasoning-config '{"reasoning_start_str": "<think>", "reasoning_end_str": "I have to give the solution based on the reasoning directly now.</think>"}'

Then make a request with thinking_token_budget to limit the reasoning tokens:

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen3-0.6B",
        "messages": [
            { "role": "user", "content": "9.11 and 9.8, which is greater?" }
        ],
        "thinking_token_budget": 10
    }'
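The same request can be sketched in Python using only the standard library. `build_budget_request` and `post_chat` are hypothetical helpers, and the base URL assumes the server from the command above is listening on localhost:8000; only the payload is built here, so nothing is sent unless `post_chat` is called.

```python
import json
import urllib.request


def build_budget_request(model, prompt, budget):
    """Build a chat completion payload with a per-request thinking budget."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "thinking_token_budget": budget,  # top-level field, as in the curl example
    }


def post_chat(base_url, payload):
    """POST the payload to the chat completions endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_budget_request(
    "Qwen/Qwen3-0.6B", "9.11 and 9.8, which is greater?", budget=10
)
print(json.dumps(payload, indent=2))
```

With the OpenAI Python client, the extra field would go through `extra_body={"thinking_token_budget": 10}`, which the client merges into the request body.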

Offline Inference

from vllm import LLM, SamplingParams
from vllm.config import ReasoningConfig

llm = LLM(
    model="Qwen/Qwen3-0.6B",
    reasoning_config=ReasoningConfig(
        reasoning_start_str="<think>",
        reasoning_end_str="I have to give the solution based on the thinking directly now.</think>",
    ),
)
sampling_params = SamplingParams(thinking_token_budget=10)
messages = [
    {"role": "user", "content": "9.11 and 9.8, which is greater?"},
]
outputs = llm.chat(messages, sampling_params=sampling_params)
for output in outputs:
    print("text:", output.outputs[0].text)

Automatic `enable_thinking` Activation

Some models (such as Gemma 4, DeepSeek-V4-Pro and IBM Granite 3.2) require enable_thinking: true in their chat template kwargs to activate thinking mode. Without it, reasoning tokens are never generated, regardless of other settings.

When you set reasoning_effort in a Chat Completions request (or reasoning.effort in a Responses API request), vLLM automatically injects enable_thinking into the chat template kwargs:

reasoning_effort = "low", "medium", or "high" → enable_thinking = true
reasoning_effort = "none" → enable_thinking = false
reasoning_effort not set → enable_thinking is not injected (preserves existing behavior)

This means you no longer need to manually pass chat_template_kwargs: {"enable_thinking": true} when using reasoning_effort; it is handled automatically.

Note

If you explicitly set enable_thinking in chat_template_kwargs, your value takes priority over the automatic injection. This allows you to override the behavior if needed.

For models whose templates don't declare enable_thinking (e.g., DeepSeek R1), the injected kwarg is harmlessly filtered out by resolve_chat_template_kwargs.
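The injection rules above, including the priority of an explicit value, can be sketched as a small function. `inject_enable_thinking` is a hypothetical name used for illustration and does not mirror vLLM's internal code; the filtering of undeclared kwargs is left out.

```python
def inject_enable_thinking(reasoning_effort=None, chat_template_kwargs=None):
    """Return chat template kwargs with enable_thinking derived from reasoning_effort."""
    kwargs = dict(chat_template_kwargs or {})
    if "enable_thinking" in kwargs:
        return kwargs  # an explicit value takes priority over injection
    if reasoning_effort in ("low", "medium", "high"):
        kwargs["enable_thinking"] = True
    elif reasoning_effort == "none":
        kwargs["enable_thinking"] = False
    # reasoning_effort not set: nothing is injected
    return kwargs


print(inject_enable_thinking("high"))
print(inject_enable_thinking("none"))
print(inject_enable_thinking(None))
print(inject_enable_thinking("high", {"enable_thinking": False}))
```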

Example

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# reasoning_effort automatically enables thinking for models that need it
response = client.chat.completions.create(
    model="google/gemma-4-26B-A4B-it",
    messages=[{"role": "user", "content": "What is 15 * 37?"}],
    reasoning_effort="high",  # Automatically sets enable_thinking=true
)
print(response.choices[0].message.reasoning)
print(response.choices[0].message.content)

Limitations

Reasoning content is only available through the online serving chat completion endpoint (/v1/chat/completions), the Anthropic Messages API (/v1/messages), and the Responses API (/v1/responses).

How to support a new reasoning model

You can add a new ReasoningParser similar to vllm/reasoning/deepseek_r1_reasoning_parser.py.
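The core job of such a parser is to split raw model text into reasoning and content. The standalone sketch below shows only that splitting step. It does not implement vLLM's ReasoningParser interface, `split_reasoning_text` is a hypothetical name, and treating text without an end marker as plain content is a design choice made for this example.

```python
def split_reasoning_text(text, start="<think>", end="</think>"):
    """Split raw model output into (reasoning, content) around the end marker."""
    head, sep, tail = text.partition(end)
    if not sep:
        # No end marker found: this sketch treats the whole text as content.
        return None, text
    # Drop everything up to and including the start marker, if it is present.
    reasoning = head.split(start, 1)[-1].strip()
    content = tail.strip()
    return reasoning or None, content or None


print(split_reasoning_text("<think>2 + 2 = 4</think>The answer is 4."))
print(split_reasoning_text("Plain answer without reasoning."))
```

A real parser also has to handle streaming, where the markers may arrive split across chunks.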

Additionally, to enable structured output, you'll need to create a new Reasoner similar to the one in vllm/reasoning/deepseek_r1_reasoning_parser.py.

A structured output engine such as xgrammar uses end_token_id to check whether reasoning content is present in the model output, and skips structured output if so.

Finally, you can enable reasoning for the model by using the --reasoning-parser flag.

vllm serve <model_tag> --reasoning-parser example

URL: https://docs.vllm.ai/en/latest/features/reasoning_outputs/