GPT-5.5 works best when prompts define the outcome and leave room for the model to choose an efficient solution path. Compared with earlier models, you can often use shorter, more outcome-oriented prompts: describe what good looks like, what constraints matter, what evidence is available, and what the final answer should contain.
Avoid carrying over every instruction from an older prompt stack. Legacy prompts often over-specify the process because earlier models needed more help staying on track. With GPT-5.5, that can add noise, narrow the model’s search space, or lead to overly mechanical answers.
For more detail on GPT-5.5 behavior changes, start with the Using GPT-5.5 guide. This guide focuses on prompt changes that follow from those behavior changes.
The patterns here are starting points. Adapt them to your product surface, tools, evals, and user experience goals.
Automated migration with Codex
Codex can implement the changes from this guide with the OpenAI Docs Skill.
To use this skill in other coding agents, download it from the OpenAI skills repository.
Personality and behavior
GPT-5.5’s default style is efficient, direct, and task-oriented. This is useful for production systems: responses stay focused, behavior is easier to steer, and the model avoids unnecessary conversational padding.
For customer-facing assistants, support workflows, coaching experiences, and other conversational products, define both personality and collaboration style.
- Personality controls how the assistant sounds: tone, warmth, directness, formality, humor, empathy, and level of polish.
- Collaboration style controls how the assistant works: when it asks questions, when it makes assumptions, how proactive it should be, how much context it gives, when it checks work, and how it handles uncertainty or risk.
Keep both short. Personality instructions should shape the user experience. Collaboration instructions should shape task behavior. Neither should replace clear goals, success criteria, tool rules, or stopping conditions.
Example personality block for a steady task-focused assistant:
Example personality block for an expressive collaborative assistant:
For more expressive products, add warmth, curiosity, humor, or point of view explicitly, but keep the block short. Use personality to shape the experience, not to compensate for unclear goals or missing task instructions.
Improve time to first visible token with a preamble
In streaming applications, users notice how long it takes before the first visible response appears. GPT-5.5 may spend time reasoning, planning, or preparing tool calls before emitting visible text.
For longer or tool-heavy tasks, prompt the model to start with a short preamble: a brief visible update that acknowledges the request and states the first step. This can improve perceived responsiveness without changing the underlying task.
Use this pattern when the task may take more than one step, require tool calls, or involve a long-running agent workflow.
For coding agents that expose separate message phases, you can be more explicit:
Outcome-first prompts and stopping conditions
GPT-5.5 is strongest when the prompt defines the target outcome, success criteria, constraints, and available context, then lets the model choose the path.
For many tasks, describe the destination rather than every step. This gives the model room to choose the right search, tool, or reasoning strategy for the task.
Prefer this:
Avoid unnecessary absolute rules. Older prompts often use strict instructions like ALWAYS, NEVER, must, and only to control model behavior. Use those words for true invariants, such as safety rules, required output fields, or actions that should never happen. For judgment calls, such as when to search, ask for clarification, use a tool, or keep iterating, prefer decision rules instead.
Avoid this style of instruction unless every step is truly required:
Add explicit stopping conditions:
Define missing-evidence behavior:
Formatting
GPT-5.5 is highly steerable on output format and structure. Use that control when it improves comprehension or product fit.
Set text.verbosity, describe the expected output shape, and reserve heavier structure for cases where it improves comprehension or your product UI needs a stable artifact. The API default for text.verbosity is medium; use low when you prefer shorter, more concise responses.
Plain conversational formatting:
Add explicit audience and length guidance:
For editing, rewriting, summaries, or customer-facing messages, tell the model what to preserve before asking it to improve style. This pattern is useful when you want polish without expansion.
Grounding, citations, and retrieval budgets
For grounded answers, citation behavior should be part of the prompt. Define what needs support, what counts as enough evidence, and how the model should behave when evidence is missing. Absence of evidence shouldn’t automatically become a factual “no.” For more details and examples, see the citation formatting guide.
Add an explicit retrieval budget
Retrieval budgets are stopping rules for search. They tell the model when enough evidence is enough.
Creative drafting guardrails
For drafting tasks, tell the model which claims must come from sources and which parts may be creatively written. This is especially important for slides, launch copy, customer summaries, talk tracks, leadership blurbs, and narrative framing.
Frontend engineering and visual taste
For frontend work, refer to the example instructions for practical ways to steer UI quality. They cover product and user context, design-system alignment, first-screen usability, familiar controls, expected states, responsive behavior, and common generated-UI defaults to avoid, such as generic heroes, nested cards, decorative gradients, visible instructional text, and broken layouts.
Prompt the model to check its work
Give GPT-5.5 access to tools that let it check outputs when validation is possible.
For coding agents, ask for concrete validation commands:
For visual artifacts, ask for inspection after rendering:
For engineering and planning tasks, make implementation plans traceable:
Phase parameter
Starting with GPT-5.4, long-running or tool-heavy Responses workflows can use assistant-item phase values to distinguish intermediate updates from final answers. GPT-5.5 uses the same pattern.
If you use previous_response_id, the API preserves prior assistant state automatically. If your application manually replays assistant output items into the next request, preserve each original phase value and pass it back unchanged. This matters most when a response includes preambles, repeated tool calls, or a final answer after intermediate assistant updates.
Suggested prompt structure
Use this structure as a starting point for complex prompts. Keep each section short. Add detail only where it changes behavior.
GPT-5.4 is designed to balance long-running task performance, stronger control over style and behavior, and more disciplined execution across complex workflows. Building on advances from GPT-5 through GPT-5.3-Codex, GPT-5.4 improves token efficiency, sustains multi-step workflows more reliably, and performs well on long-horizon tasks.
GPT-5.4 is designed for production-grade assistants and agents that need strong multi-step reasoning, evidence-rich synthesis, and reliable performance over long contexts. It is especially effective when prompts clearly specify the output contract, tool-use expectations, and completion criteria. In practice, the biggest gains come from choosing the right reasoning effort for the task, using explicit grounding and citation rules, and giving the model a precise definition of what “done” looks like. This guide focuses on prompt patterns and migration practices that preserve those efficiency wins. For model capabilities, API parameters, and broader migration guidance, see our latest model guide.
When troubleshooting cases where GPT-5.4 treats an intermediate update as the
final answer, verify your integration preserves the assistant message phase
field correctly. See Phase parameter for details.
Understand GPT-5.4 behavior
Where GPT-5.4 is strongest
GPT-5.4 tends to work especially well in these areas:
- Strong personality and tone adherence, with less drift over long answers
- Agentic workflow robustness, with a stronger tendency to stick with multi-step work, retry, and complete agent loops end to end
- Evidence-rich synthesis, especially in long-context or multi-tool workflows
- Instruction adherence in modular, skill-based, and block-structured prompts when the contract is explicit
- Long-context analysis across large, messy, or multi-document inputs
- Batched or parallel tool calling while maintaining tool-call accuracy
- Spreadsheet, finance, and Excel workflows that need instruction following, formatting fidelity, and stronger self-verification
Where explicit prompting still helps
Even with those strengths, GPT-5.4 benefits from more explicit guidance in a few recurring patterns:
- Low-context tool routing early in a session, when tool selection can be less reliable
- Dependency-aware workflows that need explicit prerequisite and downstream-step checks
- Reasoning effort selection, where higher effort is not always better and the right choice depends on task shape, not intuition
- Research tasks that require disciplined source collection and consistent citations
- Irreversible or high-impact actions that require verification before execution
- Terminal or coding-agent environments where tool boundaries must stay clear
These patterns are observed defaults, not guarantees. Start with the smallest prompt that passes your evals, and add blocks only when they fix a measured failure mode.
Use core prompt patterns
Keep outputs compact and structured
To improve token efficiency with GPT-5.4, constrain verbosity and enforce structured output through clear output contracts. In practice, this acts as an additional control layer alongside the verbosity parameter in the Responses API, allowing you to guide both how much the model writes and how it structures the output.
Set clear defaults for follow-through
Users often change the task, format, or tone mid-conversation. To keep the assistant aligned, define clear rules for when to proceed, when to ask, and how newer instructions override earlier defaults.
Use a default follow-through policy like this:
Make instruction priority explicit:
Higher-priority developer or system instructions remain binding.
Guidance: When instructions change mid-conversation, make the update explicit, scoped, and local. State what changed, what still applies, and whether the change affects the next turn or the rest of the conversation.
Handle mid-conversation instruction updates
For mid-conversation updates, use explicit, scoped steering messages that state:
- Scope
- Override
- Carry forward
If the task itself changes, say so directly:
Make tool use persistent when correctness depends on it
Use explicit rules to keep tool use thorough, dependency-aware, and appropriately paced, especially in workflows where later actions rely on earlier retrieval or verification. A common failure mode is skipping prerequisites because the right end state seems obvious.
GPT-5.4 can be less reliable at tool routing early in a session, when context is still thin. Prompt for prerequisites, dependency checks, and exact tool intent.
This is especially important for workflows where the final action depends on earlier lookup or retrieval steps. One of the most common failure modes is skipping prerequisites because the intended end state seems obvious.
Prompt for parallelism when the work is independent and wall-clock matters. Prompt for sequencing when dependencies, ambiguity, or irreversible actions matter more than speed.
Force completeness on long-horizon tasks
For multi-step workflows, a common failure mode is incomplete execution: the model finishes after partial coverage, misses items in a batch, or treats empty or narrow retrieval as final. GPT-5.4 becomes more reliable when the prompt defines explicit completion rules and recovery behavior.
Coverage can be achieved through sequential or parallel retrieval, but completion rules should remain explicit either way.
For workflows where empty, partial, or noisy retrieval is common:
Add a verification loop before high-impact actions
Once the workflow appears complete, add a lightweight verification step before returning the answer or taking an irreversible action. This helps catch requirement misses, grounding issues, and format drift before commit.
For agents that actively take actions, add a short execution frame:
Handle specialized workflows
Choose image detail explicitly for vision and computer use
If your workflow depends on visual precision, specify the image detail level in the prompt or integration instead of relying on auto. Use high for standard high-fidelity image understanding. Use original for large, dense, or spatially sensitive images, especially computer use, localization, OCR, and click-accuracy tasks on gpt-5.4 and future models. Use low only when speed and cost matter more than fine detail. For more details on image detail levels, see the Images and Vision guide.
Lock research and citations to retrieved evidence
When citation quality matters, make both the source boundary and the format requirement explicit. This helps reduce fabricated references, unsupported claims, and citation-format drift.
If your application requires inline citations, require inline citations. If it requires footnotes, require footnotes. The key is to lock the format and prevent the model from improvising unsupported references.
Research mode
Push GPT-5.4 into a disciplined research mode. Use this pattern for research, review, and synthesis tasks. Do not force it onto short execution tasks or simple deterministic transforms.
If your host environment uses a specific research tool or requires a submit step, combine this with the host’s finalization contract.
Clamp strict output formats
For SQL, JSON, or other parse-sensitive outputs, tell GPT-5.4 to emit only the target format and check it before finishing.
If you are extracting document regions or OCR boxes, define the coordinate system and add a drift check:
Keep tool boundaries explicit in coding and terminal agents
In coding agents, GPT-5.4 works better when the rules for shell access and file editing are unambiguous. This is especially important when you expose tools like Shell or Apply patch.
User updates
GPT-5.4 does well with brief, outcome-based updates. Reuse the user-updates pattern from the 5.2 guide, but pair it with explicit completion and verification requirements.
Recommended update spec:
For coding agents, see the Prompting patterns for coding tasks section below for more specific guidance.
Prompting patterns for coding tasks
Autonomy and persistence
GPT-5.4 is generally more thorough end to end than earlier mainline models on coding and tool-use tasks, so you often need less explicit “verify everything” prompting. Still, for high-stakes changes such as production, migrations, or security work, keep a lightweight verification clause.
Intermediary updates
Keep updates sparse and high-signal. In coding tasks, prefer updates at key points.
Formatting
GPT-5.4 often defaults to more structured formatting and may overuse bullet lists. If you want a clean final response, explicitly clamp list shape.
Frontend tasks
Use this only when additional frontend guidance is useful.
Document localization and OCR boxes
For bbox tasks, be explicit about coordinate conventions and add drift tests.
Use runtime and API integration notes
For long-running or tool-heavy agents, the runtime contract matters as much as the prompt contract.
Phase parameter
For GPT-5.4, gpt-5.3-codex, and later Responses models, the phase field can
help in the small number of long-running or tool-heavy flows where preambles or
other intermediate assistant updates are mistaken for the final answer.
phaseis optional at the API level, but it is highly recommended. Best-effort inference may exist server-side, but explicit round-tripping ofphaseis strictly better.- Use
phasefor long-running or tool-heavy agents that may emit commentary before tool calls or before a final answer. - Preserve
phasewhen replaying prior assistant items so the model can distinguish working commentary from the completed answer. This matters most in multi-step flows with preambles, tool-related updates, or multiple assistant messages in the same turn. - Do not add
phaseto user messages. - If you use
previous_response_id, that is usually the simplest path, since OpenAI can often recover prior state without manually replaying assistant items. - If you replay assistant history yourself, preserve the original
phasevalues. - Missing or dropped
phasecan cause preambles to be interpreted as final answers and degrade behavior on those multi-step tasks.
Preserve behavior in long sessions
Compaction unlocks significantly longer effective context windows, where user conversations can persist for many turns without hitting context limits or long-context performance degradation, and agents can perform very long trajectories that exceed a typical context window for long-running, complex tasks.
If you are using Compaction in the Responses API, compact after major milestones, treat compacted items as opaque state, and keep prompts functionally identical after compaction. The endpoint is ZDR compatible and returns an encrypted_content item that you can pass into future requests. GPT-5.4 tends to remain more coherent and reliable over longer, multi-turn conversations with fewer breakdowns as sessions grow.
For more guidance, see the /responses/compact API reference.
Control personality for customer-facing workflows
GPT-5.4 can be steered more effectively when you separate persistent personality from per-response writing controls. This is especially useful for customer-facing workflows such as emails, support replies, announcements, and blog-style content.
- Personality (persistent): sets the default tone, verbosity, and decision style across the session.
- Writing controls (per response): define the channel, register, formatting, and length for a specific artifact.
- Reminder: personality should not override task-specific output requirements. If the user asks for JSON, return JSON.
For natural, high-quality prose, the highest-leverage controls are:
- Give the model a clear persona.
- Specify the channel and emotional register.
- Explicitly ban formatting when you want prose.
- Use hard length limits.
For more personality patterns you can lift directly, see the Prompt Personalities cookbook.
Professional memo mode
For memos, reviews, and other professional writing tasks, general writing instructions are often not enough. These workflows benefit from explicit guidance on specificity, domain conventions, synthesis, and calibrated certainty.
This mode is especially useful for legal, policy, research, and executive-facing writing, where the goal is not just fluency, but disciplined synthesis and clear conclusions.
Tune reasoning and migration
Treat reasoning effort as a last-mile knob
Reasoning effort is not one-size-fits-all. Treat it as a last-mile tuning knob, not the primary way to improve quality. In many cases, stronger prompts, clear output contracts, and lightweight verification loops recover much of the performance teams might otherwise seek through higher reasoning settings.
Recommended defaults:
none: Best for fast, cost-sensitive, latency-sensitive tasks where the model does not need to think.low: Works well for latency-sensitive tasks where a small amount of thinking can produce a meaningful accuracy gain, especially with complex instructions.mediumorhigh: Reserve for tasks that truly require stronger reasoning and can absorb the latency and cost tradeoff. Choose between them based on how much performance gain your task gets from additional reasoning.xhigh: Avoid as a default unless your evals show clear benefits. It is best suited for long, agentic, reasoning-heavy tasks where maximum intelligence matters more than speed or cost.
In practice, most teams should default to the none, low, or medium range.
Start with none for execution-heavy workloads such as workflow steps, field extraction, support triage, and short structured transforms.
Start with medium or higher for research-heavy workloads such as long-context synthesis, multi-document review, conflict resolution, and strategy writing. With medium and a well-engineered prompt, you can squeeze out a lot of performance.
For GPT-5.4 workloads, none can already perform well on action-selection and tool-discipline tasks. If your workload depends on nuanced interpretation, such as implicit requirements, ambiguity, or cancelled-tool-call recovery, start with low or medium instead.
Before increasing reasoning effort, first add:
<completeness_contract><verification_loop><tool_persistence_rules>
If the model still feels too literal or stops at the first plausible answer, add an initiative nudge before raising reasoning effort:
Migrate prompts to GPT-5.4 one change at a time
Use the same one-change-at-a-time discipline as the 5.2 guide: switch model first, pin reasoning_effort, run evals, then iterate.
These starting points work well for many migrations:
| Current setup | Suggested GPT-5.4 start | Notes |
|---|---|---|
gpt-5.2 | Match the current reasoning effort | Preserve the existing latency and quality profile first, then tune. |
gpt-5.3-codex | Match the current reasoning effort | For coding workflows, keep the reasoning effort the same. |
gpt-4.1 or gpt-4o | none | Keep snappy behavior, and increase only if evals regress. |
| Research-heavy assistants | medium or high | Use explicit research multi-pass and citation gating. |
| Long-horizon agents | medium or high | Add tool persistence and completeness accounting. |
Small-model guidance for gpt-5.4-mini and gpt-5.4-nano
gpt-5.4-mini and gpt-5.4-nano are highly steerable, but they are less likely than larger models to infer missing steps, resolve ambiguity implicitly, or package outputs the way you intended unless you specify that behavior directly. In practice, prompts for smaller models are often a bit longer and more explicit.
How gpt-5.4-mini differs
gpt-5.4-miniis more literal and makes fewer assumptions.- It is strong when the task is clearly structured, but weaker on implicit workflows and ambiguity handling.
- By default, it may try to keep the conversation going with a follow-up question unless you suppress that behavior explicitly.
Prompting gpt-5.4-mini
- Put critical rules first.
- Specify the full execution order when tool use or side effects matter.
- Do not rely on “you MUST” alone. Use structural scaffolding such as numbered steps, decision rules, and explicit action definitions.
- Separate “do the action” from “report the action.”
- Show the correct flow, not just the final format.
- Define ambiguity behavior explicitly: when to ask, abstain, or proceed.
- Specify packaging directly: answer length, whether to ask a follow-up question, citation style, and section order.
- Be careful with
output nothing else. Prefer scoped instructions such asafter the final JSON, output nothing further.
Prompting gpt-5.4-nano
- Use
gpt-5.4-nanoonly for narrow, well-bounded tasks. - Prefer closed outputs: labels, enums, short JSON, or fixed templates.
- Avoid multi-step orchestration unless the flow is extremely constrained.
- Route ambiguous or planning-heavy tasks to a stronger model instead of over-prompting
gpt-5.4-nano.
Good default pattern
- Task
- Critical rule
- Exact step order
- Edge cases or clarification behavior
- Output format
- One correct example
Avoid
- Implied next steps
- Unspecified edge cases
- Schema-only prompts for tool workflows
- Generic instructions without structure
Web search and deep research
If you are migrating a research agent in particular, make these prompt updates before increasing reasoning effort:
- Add
<research_mode> - Add
<citation_rules> - Add
<empty_result_recovery> - Increase
reasoning_effortone notch only after prompt fixes.
You can start from the 5.2 research block and then layer in citation gating and finalization contracts as needed.
GPT-5.4 performs especially well when the task requires multi-step evidence gathering, long-context synthesis, and explicit prompt contracts. In practice, the highest-leverage prompt changes are choosing reasoning effort by task shape, defining exact output and citation formats, adding dependency-aware tool rules, and making completion criteria explicit. The model is often strong out of the box, but it is most reliable when prompts clearly specify how to search, how to verify, and what counts as done.
Next steps
- Read our latest model guide for model capabilities, parameters, and API compatibility details.
- Read Prompt engineering for broader prompting strategies that apply across model families.
- Read Compaction if you are building long-running GPT-5.4 sessions in the Responses API.
Codex models advance the frontier of intelligence and efficiency and our recommended agentic coding model. Follow this guide closely to ensure you’re getting the best performance possible from this model. This guide is for anyone using the model directly via the API for maximum customizability; we also have the Codex SDK for simpler integrations.
In the API, the Codex-tuned model is gpt-5.3-codex (see the model page).
Recent improvements to Codex models
- Faster and more token efficient: Uses fewer thinking tokens to accomplish a task. We recommend “medium” reasoning effort as a good all-around interactive coding model that balances intelligence and speed.
- Higher intelligence and long-running autonomy: Codex is very capable and will work autonomously for hours to complete your hardest tasks. You can use
highorxhighreasoning effort for your hardest tasks. - First-class compaction support: Compaction enables multi-hour reasoning without hitting context limits and longer continuous user conversations without needing to start new chat sessions.
- Codex is also much better in PowerShell and Windows environments.
Getting Started
If you already have a working Codex implementation, this model should work well with relatively minimal updates, but if you’re starting with a prompt and set of tools that’s optimized for GPT-5-series models, or a third-party model, we recommend making more significant changes. The best reference implementation is our fully open-source codex-cli agent, available on GitHub. Clone this repo and use Codex (or any coding agent) to ask questions about how things are implemented. From working with customers, we’ve also learned how to customize agent harnesses beyond this particular implementation.
Key steps to migrate your harness to codex-cli:
- Update your prompt: If you can, start with our standard Codex-Max prompt as your base and make tactical additions from there.
a) The most critical snippets are those covering autonomy and persistence, codebase exploration, tool use, and frontend quality.
b) You should also remove all prompting for the model to communicate an upfront plan, preambles, or other status updates during the rollout, as this can cause the model to stop abruptly before the rollout is complete. - Update your tools, including our apply_patch implementation and other best practices below. This is a major lever for getting the most performance.
Prompting
Recommended Starter Prompt
This prompt began as the default GPT-5.1-Codex-Max prompt and was further optimized against internal evals for answer correctness, completeness, quality, correct tool usage and parallelism, and bias for action. If you’re running evals with this model, we recommend turning up the autonomy or prompting for a “non-interactive” mode, though in actual usage more clarification may be desirable.
Mid-Rollout User Updates
The Codex model family can surface mid-rollout user updates while it’s working. For codex versions prior to gpt-5.3-codex, these updates are system-generated rather than promptable, so we advise against adding instructions to the prompt about intermediate plans or messages to the user for those. For gpt-5.3-codex and after, these updates are more communicative and provide more critical information about what’s happening and why and work similarly to how intermediate messages work for other GPT-5 series models and can be prompted according to the Preambles & Personality section below.
Using agents.md
Codex-cli automatically enumerates these files and injects them into the conversation; the model has been trained to closely adhere to these instructions.
1. Files are pulled from ~/.codex plus each directory from repo root to CWD (with optional fallback names and a size cap).
2. They’re merged in order, later directories overriding earlier ones.
3. Each merged chunk shows up to the model as its own user-role message like so:
Additional details
- Each discovered file becomes its own user-role message that starts with # AGENTS.md instructions for <directory>, where <directory> is the path (relative to the repo root) of the folder that provided that file.
- Messages are injected near the top of the conversation history, before the user prompt, in root-to-leaf order: global instructions first, then repo root, then each deeper directory. If an AGENTS.override.md was used, its directory name still appears in the header (e.g., # AGENTS.md instructions for backend/api), so the context is obvious in the transcript.
Compaction
Compaction unlocks significantly longer effective context windows, where user conversations can persist for many turns without hitting context window limits or long context performance degradation, and agents can perform very long trajectories that exceed a typical context window for long-running, complex tasks. A weaker version of this was previously possible with ad-hoc scaffolding and conversation summarization, but our first-class implementation, available via the Responses API, is integrated with the model and is highly performant.
How it works:
- You use the Responses API as today, sending input items that include tool calls, user inputs, and assistant messages.
- When your context window grows large, you can invoke /compact to generate a new, compacted context window. Two things to note:
- The context window that you send to /compact should fit within your model’s context window.
- The endpoint is ZDR compatible and will return an “encrypted_content” item that you can pass into future requests.
- For subsequent calls to the /responses endpoint, you can pass your updated, compacted list of conversation items (including the added compaction item). The model retains key prior state with fewer conversation tokens.
For endpoint details see our /responses/compact docs.
Tools
- We strongly recommend using our exact
apply_patchimplementation as the model has been trained to excel at this diff format. For terminal commands we recommend ourshelltool, and for plan/TODO items ourupdate_plantool should be most performant. - If you prefer your agent to use more “terminal-like tools” (like
file_read()instead of calling `sed` in the terminal), this model can reliably call them instead of terminal (following the instructions below) - For other tools, including semantic search, MCPs, or other custom tools, they can work but it requires more tuning and experimentation.
Apply_patch
The easiest way to implement apply_patch is with our first-class implementation in the Responses API, but you can also use our freeform tool implementation with context-free grammar. Both are demonstrated below.
Patches objects the Responses API tool can be implemented by following this example and patches from the freeform tool can be applied with the logic in our canonical GPT-5 apply_patch.py implementation.
Shell_command
This is our default shell tool. Note that we have seen better performance with a command type “string” rather than a list of commands.
If you’re using Windows PowerShell, update to this tool description.
You can check out codex-cli for the implementation for exec_command, which launches a long-lived PTY when you need streaming output, REPLs, or interactive sessions; and write_stdin, to feed extra keystrokes (or just poll output) for an existing exec_command session.
Update Plan
This is our default TODO tool; feel free to customize as you’d prefer. See the ## Plan tool section of our starter prompt for additional instructions to maintain hygiene and tweak behavior.
View_image
This is a basic function used in codex-cli for the model to view images.
Dedicated terminal-wrapping tools
If you would prefer your codex agent to use terminal-wrapping tools (like a dedicated list_dir(‘.’) tool instead of terminal(‘ls .’), this generally works well. We see the best results when the name of the tool, the arguments, and the output are as close as possible to those from the underlying command, so it’s as in-distribution as possible for the model (which was primarily trained using a dedicated terminal tool). For example, if you notice the model using git via the terminal and would prefer it to use a dedicated tool, we found that creating a related tool, and adding a directive in the prompt to only use that tool for git commands, fully mitigated the model’s terminal usage for git commands.
Other Custom Tools (web search, semantic search, memory, etc.)
The model hasn’t necessarily been post-trained to excel at these tools, but we have seen success here as well. To get the most out of these tools, we recommend:
- Making the tool names and arguments as semantically “correct” as possible, for example “search” is ambiguous but “semantic_search” clearly indicates what the tool does, relative to other potential search-related tools you might have. “Query” would be a good param name for this tool.
- Be explicit in your prompt about when, why, and how to use these tools, including good and bad examples.
- It could also be helpful to make the results look different from outputs the model is accustomed to seeing from other tools, for example ripgrep results should look different from semantic search results to avoid the model collapsing into old habits.
Parallel Tool Calling
In codex-cli, when parallel tool calling is enabled, the responses API request sets parallel_tool_calls: true and the following snippet is added to the system instructions:
We’ve found it to be helpful and more in-distribution if parallel tool call items and responses are ordered in the following way:
Tool Response Truncation
We recommend doing tool call response truncation as follows to be as in-distribution for the model as possible:
- Limit to 10k tokens. You can cheaply approximate this by computing
num_bytes/4. - If you hit the truncation limit, you should use half of the budget for the beginning, half for the end, and truncate in the middle with
…3 tokens truncated…
New features in GPT-5.3 Codex
Preamble messages
The Responses API has been updated to include a new phase parameter intended to prevent early stopping and other misbehaviors when preamble messages are requested by the prompt. phase is currently only supported with gpt-5.3-codex. Check out implementation details below. Correctly implementing this parameter is required for gpt-5.3-codex; otherwise, significant performance degradation can occur.
Phase
To better support preamble messages with gpt-5.3-codex, the Responses API includes a phase field designed to prevent early stopping on longer-running tasks and other misbehaviors.
Values
phase is one of:
null"commentary""final_answer"
Where it appears
You’ll receive phase on assistant output items (for example, output_item.done). Your integration must persist assistant output items, including their phase, and pass those assistant items back in subsequent requests.
Important: phase is only supported on assistant items. Do not add phase to user messages.
How it’s used downstream
When the model marks an output item with:
phase: "commentary": the corresponding assistant message should be treated as commentary/preamble-style content.phase: "final_answer": the corresponding assistant message should be treated as the final closeout.
Correctly preserving phase on assistant items is required for gpt-5.3-codex. If assistant phase metadata is dropped during history reconstruction, significant performance degradation can occur.
Preambles & Personality
Preambles are messages sent along with tool calls that provide user updates while working: short, human-readable progress and intent snapshots that keep the user oriented without turning the transcript into a tool-call log. GPT-5.3-Codex preambles have been tuned toward the following characteristics:
- Acknowledge then plan before any tool calls (1 sentence acknowledgement, 1–2 sentence plan).
- Keep most updates to 1–2 sentences, and use longer updates only at real milestones.
- Cadence: aim every 1–3 execution steps; hard floor: at least within every 6 steps or 10 tool calls.
- Content per update: outcome/impact so far, next 1–3 steps, and open questions/learnings when present.
- Tone: real person pairing, low-ceremony; avoid headings/status labels and log voice.
Personality (Friendly vs Pragmatic)
Personality is the higher-level vibe and collaboration posture that sits above preamble mechanics (cadence, length, and grounding). It affects word choice, how eagerly the model explains tradeoffs, and how much warmth it brings to the interaction.
The Codex app and CLI ship with support for two personalities provided here as example implementations for your harness.
Friendly
- More human, partner-y pairing energy.
- Slightly more acknowledgement, reassurance, and context-setting.
- Better when the user benefits from narrative orientation (onboarding, ambiguous tasks, higher-stakes changes).
Example Friendly personality prompt snippet from codex-cli
This snippet can be used in your system prompt to steer the pair programming personality of the model.
Pragmatic
- More terse, direct, let’s ship delivery.
- Fewer social flourishes; higher ratio of actionable information per token.
- Better when latency/throughput matters, or your users already know the workflow and just want progress and results.
Troubleshooting & Metaprompting
Common failure modes we’ve been explicitly tracking:
- Overthinking / long time before first useful action (tool call or concrete plan).
- Loggy / unnatural status updates instead of pair programmer collaboration.
- Awkward preamble phrasing and repetitive tics (“Good catch”, “Aha”, “Got it–”, etc.).
Metaprompting for targeted fixes
Failure modes like the ones above can typically be addressed through metaprompting. It’s possible to ask the model at the end of a turn that didn’t perform up to expectations how to improve its own instructions. The following prompt was used to produce some of the solutions to overthinking problems above and can be modified to meet your particular needs.
When metaprompting inside a specific context, it is important to generate responses a few times if possible and pay attention to elements of the responses that are common between them. Some improvements or changes the model proposes might be overly specific to that particular situation, but you can often simplify them to arrive at a general improvement. We recommend creating an eval to measure whether a particular prompt change is better or worse for your particular use case.
Some examples
- For overthinking / slow starts: ask it to propose instruction changes that reduce time-to-first-tool-call or first concrete plan.
- For overly loggy preambles: ask it to rewrite your user updates instructions to satisfy your particular preference constraints.
1. Introduction
GPT-5.2 is a flagship model for enterprise and agentic workloads, designed to deliver higher accuracy, stronger instruction following, and more disciplined execution across complex workflows. Building on GPT-5.1, GPT-5.2 improves token efficiency on medium-to-complex tasks, produces cleaner formatting with less unnecessary verbosity, and shows clear gains in structured reasoning, tool grounding, and multimodal understanding.
GPT-5.2 is especially well-suited for production agents that prioritize reliability, evaluability, and consistent behavior. It performs strongly across coding, document analysis, finance, and multi-tool agentic scenarios, often matching or exceeding leading models on task completion. At the same time, it remains prompt-sensitive and highly steerable in tone, verbosity, and output shape, making explicit prompting an important part of successful deployments.
While GPT-5.2 works well out of the box for many use cases, this guide focuses on prompt patterns and migration practices that maximize performance in real production systems. These recommendations are drawn from internal testing and customer feedback, where small changes to prompt structure, verbosity constraints, and reasoning settings often translate into large gains in correctness, latency, and developer trust.
2. Key behavioral differences
Compared with previous generation models (e.g. GPT-5 and GPT-5.1), GPT-5.2 delivers:
- More deliberate scaffolding: Builds clearer plans and intermediate structure by default; benefits from explicit scope and verbosity constraints.
- Generally lower verbosity: More concise and task-focused, though still prompt-sensitive and preference needs to be articulated in the prompt.
- Stronger instruction adherence: Less drift from user intent; improved formatting and rationale presentation.
- Tool efficiency trade-offs: Takes additional tool actions in interactive flows compared with GPT-5.1, can be further optimized via prompting.
- Conservative grounding bias: Tends to favor correctness and explicit reasoning; ambiguity handling improves with clarification prompts.
This guide focuses on prompting GPT-5.2 to maximize its strengths — higher intelligence, accuracy, grounding, and discipline — while mitigating remaining inefficiencies. Existing GPT-5 / GPT-5.1 prompting guidance largely carries over and remains applicable.
3. Prompting patterns
Adapt following themes into your prompts for better steer on GPT-5.2
3.1 Controlling verbosity and output shape
Give clear and concrete length constraints especially in enterprise and coding agents.
Example clamp adjust based on desired verbosity:
3.2 Preventing Scope drift (e.g., UX / design in frontend tasks)
GPT-5.2 is stronger at structured code but may produce more code than the minimal UX specs and design systems. To stay within the scope, explicitly forbid extra features and uncontrolled styling.
For design system enforcement, reuse your 5.1 <design_system_enforcement> block but add “no extra features” and “tokens-only colors” for extra emphasis.
3.3 Long-context and recall
For long-context tasks, the prompt may benefit from force summarization and re-grounding. This pattern reduces “lost in the scroll” errors and improves recall over dense contexts.
3.4 Handling ambiguity & hallucination risk
Configure the prompt for overconfident hallucinations on ambiguous queries (e.g., unclear requirements, missing constraints, or questions that need fresh data but no tools are called).
Mitigation prompt:
You can also add a short self-check step for high-risk outputs:
4. Compaction (Extending Effective Context)
For long-running, tool-heavy workflows that exceed the standard context window, GPT-5.2 with Reasoning supports response compaction via the /responses/compact endpoint. Compaction performs a loss-aware compression pass over prior conversation state, returning encrypted, opaque items that preserve task-relevant information while dramatically reducing token footprint. This allows the model to continue reasoning across extended workflows without hitting context limits.
When to use compaction
- Multi-step agent flows with many tool calls
- Long conversations where earlier turns must be retained
- Iterative reasoning beyond the maximum context window
Key properties
- Produces opaque, encrypted items (internal logic may evolve)
- Designed for continuation, not inspection
- Compatible with GPT-5.2 and Responses API
- Safe to run repeatedly in long sessions
Compact a Response
Endpoint
What it does
Runs a compaction pass over a conversation and returns a compacted response object. Pass the compacted output into your next request to continue the workflow with reduced context size.
Best practices
- Monitor context usage and plan ahead to avoid hitting context window limits
- Compact after major milestones (e.g., tool-heavy phases), not every turn
- Keep prompts functionally identical when resuming to avoid behavior drift
- Treat compacted items as opaque; don’t parse or depend on internals
For guidance on when and how to compact in production, see the Conversation State guide and Compact a Response page.
Here is an example:
5. Agentic steerability & user updates
GPT-5.2 is strong on agentic scaffolding and multi-step execution when prompted well. You can reuse your GPT-5.1 <user_updates_spec> and <solution_persistence> blocks.
Two key tweaks could be added to further push the performance of GPT-5.2:
- Clamp verbosity of updates (shorter, more focused).
- Make scope discipline explicit (don’t expand problem surface area).
Example updated spec:
6. Tool-calling and parallelism
GPT-5.2 improves on 5.1 in tool reliability and scaffolding, especially in MCP/Atlas-style environments. Best practices as applicable to GPT-5 / 5.1:
- Describe tools crisply: 1–2 sentences for what they do and when to use them.
- Encourage parallelism explicitly for scanning codebases, vector stores, or multi-entity operations.
- Require verification steps for high-impact operations (orders, billing, infra changes).
Example tool usage section:
7. Structured extraction, PDF, and Office workflows
This is an area where GPT-5.2 clearly shows strong improvements. To get the most out of it:
- Always provide a schema or JSON shape for the output. You can use structured outputs for strict schema adherence.
- Distinguish between required and optional fields.
- Ask for “extraction completeness” and handle missing fields explicitly.
Example:
For multi-table/multi-file extraction, add guidance to:
- Serialize per-document results separately.
- Include a stable ID (filename, contract title, page range).
8. Prompt Migration Guide to GPT 5.2
This section helps you migrate prompts and model configs to GPT-5.2 while keeping behavior stable and cost/latency predictable. GPT-5-class models support a reasoning_effort knob (e.g., none|minimal|low|medium|high|xhigh) that trades off speed/cost vs. deeper reasoning.
Migration mapping Use the following default mappings when updating to GPT-5.2
| Current model | Target model | Target reasoning_effort | Notes |
|---|---|---|---|
| GPT-4o | GPT-5.2 | none | Treat 4o/4.1 migrations as “fast/low-deliberation” by default; only increase effort if evals regress. |
| GPT-4.1 | GPT-5.2 | none | Same mapping as GPT-4o to preserve snappy behavior. |
| GPT-5 | GPT-5.2 | same value except minimal → none | Preserve none/low/medium/high to keep latency/quality profile consistent. |
| GPT-5.1 | GPT-5.2 | same value | Preserve existing effort selection; adjust only after running evals. |
*Note that default reasoning level for GPT-5 is medium, and for GPT-5.1 and GPT-5.2 is none.
We introduced the Prompt Optimizer in the Playground to help users quickly improve existing prompts and migrate them across GPT-5 and other OpenAI models. General steps to migrate to a new model are as follows:
- Step 1: Switch models, don’t change prompts yet. Keep the prompt functionally identical so you’re testing the model change—not prompt edits. Make one change at a time.
- Step 2: Pin reasoning_effort. Explicitly set GPT-5.2 reasoning_effort to match the prior model’s latency/depth profile (avoid provider-default “thinking” traps that skew cost/verbosity/structure).
- Step 3: Run Evals for a baseline. After model + effort are aligned, run your eval suite. If results look good (often better at med/high), you’re ready to ship.
- Step 4: If regressions, tune the prompt. Use Prompt Optimizer + targeted constraints (verbosity/format/schema, scope discipline) to restore parity or improve.
- Step 5: Re-run Evals after each small change. Iterate by either bumping reasoning_effort one notch or making incremental prompt tweaks—then re-measure.
9. Web search and research
GPT-5.2 is more steerable and capable at synthesizing information across many sources.
Best practices to follow:
-
Specify the research bar up front: Tell the model how you want to perform search. Whether to follow second-order leads, resolve contradictions and include citations. Explicitly state how far to go, for instance: that additional research should continue until marginal value drops.
-
Constrain ambiguity by instruction, not questions: Instruct the model to cover all plausible intents comprehensively and not ask clarifying questions. Require breadth and depth when uncertainty exists.
-
Dictate output shape and tone: Set expectations for structure (Markdown, headers, tables for comparisons), clarity (define acronyms, concrete examples) and voice (conversational, persona-adaptive, non-sycophantic)
10. Conclusion
GPT-5.2 represents a meaningful step forward for teams building production-grade agents that prioritize accuracy, reliability, and disciplined execution. It delivers stronger instruction following, cleaner output, and more consistent behavior across complex, tool-heavy workflows. Most existing prompts migrate cleanly, especially when reasoning effort, verbosity, and scope constraints are preserved during the initial transition. Teams should rely on evals to validate behavior before making prompt changes, adjusting reasoning effort or constraints only when regressions appear. With explicit prompting and measured iteration, GPT-5.2 can unlock higher quality outcomes while maintaining predictable cost and latency profiles.
Appendix
Example prompt for a web research agent:
Introduction
GPT-5.1 is designed to balance intelligence and speed for a variety of agentic and coding tasks, while also introducing a new none reasoning mode for low-latency interactions. Building on the strengths of GPT-5, GPT-5.1 is better calibrated to prompt difficulty, consuming far fewer tokens on easy inputs and more efficiently handling challenging ones. Along with these benefits, GPT-5.1 is more steerable in personality, tone, and output formatting.
While GPT-5.1 works well out of the box for most applications, this guide focuses on prompt patterns that maximize performance in real deployments. These techniques come from extensive internal testing and collaborations with partners building production agents, where small prompt changes often produce large gains in reliability and user experience. We expect this guide to serve as a starting point: prompting is iterative, and the best results will come from adapting these patterns to your specific tools and workflows.
Migrating to GPT-5.1
For developers using GPT-4.1, GPT-5.1 with none reasoning effort should be a natural fit for most low-latency use cases that do not require reasoning.
For developers using GPT-5, we have seen strong success with customers who follow a few key pieces of guidance:
- Persistence: GPT-5.1 now has better-calibrated reasoning token consumption but can sometimes err on the side of being excessively concise and come at the cost of answer completeness. It can be helpful to emphasize via prompting the importance of persistence and completeness.
- Output formatting and verbosity: While overall more detailed, GPT-5.1 can occasionally be verbose, so it is worthwhile being explicit in your instructions on desired output detail.
- Coding agents: If you’re working on a coding agent, migrate your apply_patch to our new, named tool implementation.
- Instruction following: For other behavior issues, GPT-5.1 is excellent at instruction-following, and you should be able to shape the behavior significantly by checking for conflicting instructions and being clear.
We also released GPT-5.1-codex. That model behaves a bit differently than GPT-5.1, and we recommend you check out the Codex prompting guide for more information. The current Codex model in the API is gpt-5.2-codex (see the model page).
Agentic steerability
GPT-5.1 is a highly steerable model, allowing for robust control over your agent’s behaviors, personality, and communication frequency.
Shaping your agent’s personality
GPT-5.1’s personality and response style can be adapted to your use case. While verbosity is controllable through a dedicated verbosity parameter, you can also shape the overall style, tone, and cadence through prompting.
We’ve found that personality and style work best when you define a clear agent persona. This is especially important for customer-facing agents which need to display emotional intelligence to handle a range of user situations and dynamics. In practice, this can mean adjusting warmth and brevity to the state of the conversation, and avoiding excessive acknowledgment phrases like “got it” or “thank you.”
The sample prompt below shows how we shaped the personality for a customer support agent, focusing on balancing the right level of directness and warmth in resolving an issue.
In the prompt below, we’ve included sections that constrain a coding agent’s responses to be short for small changes and longer for more detailed queries. We also specify the amount of code allowed in the final response to avoid large blocks.
Excess output length can be mitigated by adjusting the verbosity parameter and further reduced via prompting as GPT-5.1 adheres well to concrete length guidance:
Eliciting user updates
User updates, also called preambles, are a way for GPT-5.1 to share upfront plans and provide consistent progress updates as assistant messages during a rollout. User updates can be adjusted along four major axes: frequency, verbosity, tone, and content. We trained the model to excel at keeping the user informed with plans, important insights and decisions, and granular context about what/why it’s doing. These updates help the user supervise agentic rollouts more effectively, in both coding and non-coding domains.
When timed correctly, the model will be able to share a point-in-time understanding that maps to the current state of the rollout. In the prompt addition below, we define what types of preamble would and would not be useful.
In longer-running model executions, providing a fast initial assistant message can improve perceived latency and user experience. We can achieve this behavior with GPT-5.1 through clear prompting.
Optimizing intelligence and instruction-following
GPT-5.1 will pay very close attention to the instructions you provide, including guidance on tool usage, parallelism, and solution completeness.
Encouraging complete solutions
On long agentic tasks, we’ve noticed that GPT-5.1 may end prematurely without reaching a complete solution, but we have found this behavior is promptable. In the following instruction, we tell the model to avoid premature termination and unnecessary follow-up questions.
Tool-calling format
In order to make tool-calling most effective, we recommend describing functionality in the tool definition and how/when to use tools in the prompt. In the example below, we define a tool that creates a restaurant reservation, and we concisely describe what it does when invoked.
In the prompt, you may have a section that references the tool like this:
GPT-5.1 also executes parallel tool calls more efficiently. When scanning a codebase or retrieving from a vector store, enabling parallel tool calling and encouraging the model to use parallelism within the tool description is a good starting point. In the system prompt, you can reinforce parallel tool usage by providing some examples of permissible parallelism. An example instruction may look like:
Using the “none” reasoning mode for improved efficiency
GPT-5.1 introduces a new reasoning mode: none. Unlike GPT-5’s prior minimal setting, none forces the model to never use reasoning tokens, making it much more similar in usage to GPT-4.1, GPT-4o, and other prior non-reasoning models. Importantly, developers can now use hosted tools like web search and file search with none, and custom function-calling performance is also substantially improved. With that in mind, prior guidance on prompting non-reasoning models like GPT-4.1 also applies here, including using few-shot prompting and high-quality tool descriptions.
While GPT-5.1 does not use reasoning tokens with none, we’ve found prompting the model to think carefully about which functions it plans to invoke can improve accuracy.
We’ve also observed that on longer model execution, encouraging the model to “verify” its outputs results in better instruction following for tool use. Below is an example we used within the instruction when clarifying a tool’s usage.
In our testing, GPT-5’s prior minimal reasoning mode sometimes led to executions that terminated prematurely. Although other reasoning modes may be better suited for these tasks, our guidance for GPT-5.1 with none is similar. Below is a snippet from our Tau bench prompt.
Maximizing coding performance from planning to execution
One tool we recommend implementing for long-running tasks is a planning tool. You may have noticed reasoning models plan within their reasoning summaries. Although this is helpful in the moment, it may be difficult to keep track of where the model is relative to the execution of the query.
A plan tool can be used with minimal scaffolding. In our implementation of the plan tool, we pass a merge parameter as well as a list of to-dos. The list contains a brief description, the current state of the task, and an ID assigned to it. Below is an example of a function call that GPT-5.1 may make to record its state.
Design system enforcement
When building frontend interfaces, GPT-5.1 can be steered to produce websites that match your visual design system. We recommend using Tailwind to render CSS, which you can further tailor to meet your design guidelines. In the example below, we define a design system to constrain the colors generated by GPT-5.1.
New tool types in GPT-5.1
GPT-5.1 has been post-trained on specific tools that are commonly used in coding use cases. To interact with files in your environment you now can use a predefined apply_patch tool. Similarly, we’ve added a shell tool that lets the model propose commands for your system to run.
Using apply_patch
The apply_patch tool lets GPT-5.1 create, update, and delete files in your codebase using structured diffs. Instead of just suggesting edits, the model emits patch operations that your application applies and then reports back on, enabling iterative, multi-step code editing workflows. You can find additional usage details and context in the GPT-4.1 prompting guide.
With GPT-5.1, you can use apply_patch as a new tool type without writing custom descriptions for the tool. The description and handling are managed via the Responses API. Under the hood, this implementation uses a freeform function call rather than a JSON format. In testing, the named function decreased apply_patch failure rates by 35%.
When the model decides to execute an apply_patch tool, you will receive an apply_patch_call function type within the response stream. Within the operation object, you’ll receive a type field (with one of create_file, update_file, or delete_file) and the diff to implement.
This repository contains the expected implementation for the apply_patch tool executable. When your system finishes executing the patch tool, the Responses API expects a tool output in the following form:
Using the shell tool
We’ve also built a new shell tool for GPT-5.1. The shell tool allows the model to interact with your local computer through a controlled command-line interface. The model proposes shell commands; your integration executes them and returns the outputs. This creates a simple plan-execute loop that lets models inspect the system, run utilities, and gather data until they finish the task.
The shell tool is invoked in the same way as apply_patch: include it as a tool of type shell.
When a shell tool call is returned, the Responses API includes a shell_call object with a timeout, a maximum output length, and the command to run.
After executing the shell command, return the untruncated stdout/stderr logs as well as the exit-code details.
How to metaprompt effectively
Building prompts can be cumbersome, but it’s also the highest-leverage thing you can do to resolve most model behavior issues. Small inclusions can unexpectedly steer the model undesirably. Let’s walk through an example of an agent that plans events. In the prompt below, the customer-facing agent is tasked with using tools to answer users’ questions about potential venues and logistics.
Although this is a strong starting prompt, there are a few issues we noticed upon testing:
-
Small conceptual questions (like asking about a 20-person leadership dinner) triggered unnecessary tool calls and very concrete venue suggestions, despite the prompt allowing internal knowledge for simple, high-level questions.
-
The agent oscillated between being overly verbose (multi-day Austin offsites turning into dense, multi-section essays) and overly hesitant (refusing to propose a plan without more questions) and occasionally ignored unit rules (a Berlin summit described in miles and °F instead of km and °C).
Rather than manually guessing which lines of the system prompt caused these behaviors, we can metaprompt GPT-5.1 to inspect its own instructions and traces.
Step 1: Ask GPT-5.1 to diagnose failures
Paste the system prompt and a small batch of failure examples into a separate analysis call. Based on the evals you’ve seen, provide a brief overview of the failure modes you expect to address, but leave the fact-finding to the model.
Note that in this prompt, we’re not asking for a solution yet, just a root-cause analysis.
Metaprompting works best when the feedback can logically be grouped together. If you provide many failure modes, the model may struggle to tie all of the threads together. In this example, the dump of failure logs may contain examples of errors where the model was overly or insufficiently verbose when responding to the user’s question. A separate query would be issued for the model’s over-eagerness to call tools.
Step 2: Ask GPT-5.1 how it would patch the prompt to fix those behaviors
Once you have that analysis, you can run a second, separate call that focuses on implementation: tightening the prompt without fully rewriting it.
In this example, the first metaprompt helps GPT-5.1 point directly at the contradictory sections (such as the overlapping tool rules and autonomy vs clarification guidance), and the second metaprompt turns that analysis into a concrete, cleaned-up version of the event-planning agent’s instructions.
The output from the second prompt might look something like this:
After this iteration cycle, run the queries again to observe any regressions and repeat this process until your failure modes have been identified and triaged.
As you continue to grow your agentic systems (e.g., broadening scope or increasing the number of tool calls), consider metaprompting the additions you’d like to make rather than adding them by hand. This helps maintain discrete boundaries for each tool and when they should be used.
What’s next
To summarize, GPT-5.1 builds on the foundation set by GPT-5 and adds things like quicker thinking for easy questions, steerability when it comes to model output, new tools for coding use cases, and the option to set reasoning to none when your tasks don’t require heavy thinking.
Get started with GPT-5.1 in the docs, or read the blog post to learn more.
GPT-5 represents a substantial leap forward in agentic task performance, coding, raw intelligence, and steerability.
While we trust it will perform excellently “out of the box” across a wide range of domains, in this guide we’ll cover prompting tips to maximize the quality of model outputs, derived from our experience training and applying the model to real-world tasks. We discuss concepts like improving agentic task performance, ensuring instruction adherence, making use of newly API features, and optimizing coding for frontend and software engineering tasks - with key insights into AI code editor Cursor’s prompt tuning work with GPT-5.
We’ve seen significant gains from applying these best practices and adopting our canonical tools whenever possible, and we hope that this guide, along with the prompt optimizer tool we’ve built, will serve as a launchpad for your use of GPT-5. But, as always, remember that prompting is not a one-size-fits-all exercise - we encourage you to run experiments and iterate on the foundation offered here to find the best solution for your problem.
Agentic workflow predictability
We trained GPT-5 with developers in mind: we’ve focused on improving tool calling, instruction following, and long-context understanding to serve as the best foundation model for agentic applications. If adopting GPT-5 for agentic and tool calling flows, we recommend upgrading to the Responses API, where reasoning is persisted between tool calls, leading to more efficient and intelligent outputs.
Controlling agentic eagerness
Agentic scaffolds can span a wide spectrum of control—some systems delegate the vast majority of decision-making to the underlying model, while others keep the model on a tight leash with heavy programmatic logical branching. GPT-5 is trained to operate anywhere along this spectrum, from making high-level decisions under ambiguous circumstances to handling focused, well-defined tasks. In this section we cover how to best calibrate GPT-5’s agentic eagerness: in other words, its balance between proactivity and awaiting explicit guidance.
Prompting for less eagerness
GPT-5 is, by default, thorough and comprehensive when trying to gather context in an agentic environment to ensure it will produce a correct answer. To reduce the scope of GPT-5’s agentic behavior—including limiting tangential tool-calling action and minimizing latency to reach a final answer—try the following:
- Switch to a lower
reasoning_effort. This reduces exploration depth but improves efficiency and latency. Many workflows can be accomplished with consistent results at medium or even lowreasoning_effort. - Define clear criteria in your prompt for how you want the model to explore the problem space. This reduces the model’s need to explore and reason about too many ideas:
If you’re willing to be maximally prescriptive, you can even set fixed tool call budgets, like the one below. The budget can naturally vary based on your desired search depth.
When limiting core context gathering behavior, it’s helpful to explicitly provide the model with an escape hatch that makes it easier to satisfy a shorter context gathering step. Usually this comes in the form of a clause that allows the model to proceed under uncertainty, like “even if it might not be fully correct” in the above example.
Prompting for more eagerness
On the other hand, if you’d like to encourage model autonomy, increase tool-calling persistence, and reduce occurrences of clarifying questions or otherwise handing back to the user, we recommend increasing reasoning_effort, and using a prompt like the following to encourage persistence and thorough task completion:
Generally, it can be helpful to clearly state the stop conditions of the agentic tasks, outline safe versus unsafe actions, and define when, if ever, it’s acceptable for the model to hand back to the user. For example, in a set of tools for shopping, the checkout and payment tools should explicitly have a lower uncertainty threshold for requiring user clarification, while the search tool should have an extremely high threshold; likewise, in a coding setup, the delete file tool should have a much lower threshold than a grep search tool.
Tool preambles
We recognize that on agentic trajectories monitored by users, intermittent model updates on what it’s doing with its tool calls and why can provide for a much better interactive user experience - the longer the rollout, the bigger the difference these updates make. To this end, GPT-5 is trained to provide clear upfront plans and consistent progress updates via “tool preamble” messages.
You can steer the frequency, style, and content of tool preambles in your prompt—from detailed explanations of every single tool call to a brief upfront plan and everything in between. This is an example of a high-quality preamble prompt:
Here’s an example of a tool preamble that might be emitted in response to such a prompt—such preambles can drastically improve the user’s ability to follow along with your agent’s work as it grows more complicated:
Reasoning effort
We provide a reasoning_effort parameter to control how hard the model thinks and how willingly it calls tools; the default is medium, but you should scale up or down depending on the difficulty of your task. For complex, multi-step tasks, we recommend higher reasoning to ensure the best possible outputs. Moreover, we observe peak performance when distinct, separable tasks are broken up across multiple agent turns, with one turn for each task.
Reusing reasoning context with the Responses API
We strongly recommend using the Responses API when using GPT-5 to unlock improved agentic flows, lower costs, and more efficient token usage in your applications.
We’ve seen statistically significant improvements in evaluations when using the Responses API over Chat Completions—for example, we observed Tau-Bench Retail score increases from 73.9% to 78.2% just by switching to the Responses API and including previous_response_id to pass back previous reasoning items into subsequent requests. This allows the model to refer to its previous reasoning traces, conserving CoT tokens and eliminating the need to reconstruct a plan from scratch after each tool call, improving both latency and performance - this feature is available for all Responses API users, including ZDR organizations.
Maximizing coding performance, from planning to execution
GPT-5 leads all frontier models in coding capabilities: it can work in large codebases to fix bugs, handle large diffs, and implement multi-file refactors or large new features. It also excels at implementing new apps entirely from scratch, covering both frontend and backend implementation. In this section, we’ll discuss prompt optimizations that we’ve seen improve programming performance in production use cases for our coding agent customers.
Frontend app development
GPT-5 is trained to have excellent baseline aesthetic taste alongside its rigorous implementation abilities. We’re confident in its ability to use all types of web development frameworks and packages; however, for new apps, we recommend using the following frameworks and packages to get the most out of the model’s frontend capabilities:
- Frameworks: Next.js (TypeScript), React, HTML
- Styling / UI: Tailwind CSS, shadcn/ui, Radix Themes
- Icons: Material Symbols, Heroicons, Lucide
- Animation: Motion
- Fonts: San Serif, Inter, Geist, Mona Sans, IBM Plex Sans, Manrope
Zero-to-one app generation
GPT-5 is excellent at building applications in one shot. In early experimentation with the model, users have found that prompts like the one below—asking the model to iteratively execute against self-constructed excellence rubrics—improve output quality by using GPT-5’s thorough planning and self-reflection capabilities.
Matching codebase design standards
When implementing incremental changes and refactors in existing apps, model-written code should adhere to existing style and design standards, and “blend in” to the codebase as neatly as possible. Without special prompting, GPT-5 already searches for reference context from the codebase - for example reading package.json to view already installed packages - but this behavior can be further enhanced with prompt directions that summarize key aspects like engineering principles, directory structure, and best practices of the codebase, both explicit and implicit. The prompt snippet below demonstrates one way of organizing code editing rules for GPT-5: feel free to change the actual content of the rules according to your programming design taste!
Collaborative coding in production: Cursor’s GPT-5 prompt tuning
We’re proud to have had AI code editor Cursor as a trusted alpha tester for GPT-5: below, we show a peek into how Cursor tuned their prompts to get the most out of the model’s capabilities. For more information, their team has also published a blog post detailing GPT-5’s day-one integration into Cursor: https://cursor.com/blog/gpt-5
System prompt and parameter tuning
Cursor’s system prompt focuses on reliable tool calling, balancing verbosity and autonomous behavior while giving users the ability to configure custom instructions. Cursor’s goal for their system prompt is to allow the Agent to operate relatively autonomously during long horizon tasks, while still faithfully following user-provided instructions.
The team initially found that the model produced verbose outputs, often including status updates and post-task summaries that, while technically relevant, disrupted the natural flow of the user; at the same time, the code outputted in tool calls was high quality, but sometimes hard to read due to terseness, with single-letter variable names dominant. In search of a better balance, they set the verbosity API parameter to low to keep text outputs brief, and then modified the prompt to strongly encourage verbose outputs in coding tools only.
This dual usage of parameter and prompt resulted in a balanced format combining efficient, concise status updates and final work summary with much more readable code diffs.
Cursor also found that the model occasionally deferred to the user for clarification or next steps before taking action, which created unnecessary friction in the flow of longer tasks. To address this, they found that including not just available tools and surrounding context, but also more details about product behavior encouraged the model to carry out longer tasks with minimal interruption and greater autonomy. Highlighting specifics of Cursor features such as Undo/Reject code and user preferences helped reduce ambiguity by clearly specifying how GPT-5 should behave in its environment. For longer horizon tasks, they found this prompt improved performance:
Cursor found that sections of their prompt that had been effective with earlier models needed tuning to get the most out of GPT-5. Here is one example below:
While this worked well with older models that needed encouragement to analyze context thoroughly, they found it counterproductive with GPT-5, which is already naturally introspective and proactive at gathering context. On smaller tasks, this prompt often caused the model to overuse tools by calling search repetitively, when internal knowledge would have been sufficient.
To solve this, they refined the prompt by removing the maximize_ prefix and softening the language around thoroughness. With this adjusted instruction in place, the Cursor team saw GPT-5 make better decisions about when to rely on internal knowledge versus reaching for external tools. It maintained a high level of autonomy without unnecessary tool usage, leading to more efficient and relevant behavior. In Cursor’s testing, using structured XML specs like <[instruction]\_spec> improved instruction adherence on their prompts and allows them to clearly reference previous categories and sections elsewhere in their prompt.
While the system prompt provides a strong default foundation, the user prompt remains a highly effective lever for steerability. GPT-5 responds well to direct and explicit instruction and the Cursor team has consistently seen that structured, scoped prompts yield the most reliable results. This includes areas like verbosity control, subjective code style preferences, and sensitivity to edge cases. Cursor found allowing users to configure their own custom Cursor rules to be particularly impactful with GPT-5’s improved steerability, giving their users a more customized experience.
Optimizing intelligence and instruction-following
Steering
As our most steerable model yet, GPT-5 is extraordinarily receptive to prompt instructions surrounding verbosity, tone, and tool calling behavior.
Verbosity
In addition to being able to control the reasoning_effort as in previous reasoning models, in GPT-5 we introduce a new API parameter called verbosity, which influences the length of the model’s final answer, as opposed to the length of its thinking. Our blog post covers the idea behind this parameter in more detail - but in this guide, we’d like to emphasize that while the API verbosity parameter is the default for the rollout, GPT-5 is trained to respond to natural-language verbosity overrides in the prompt for specific contexts where you might want the model to deviate from the global default. Cursor’s example above of setting low verbosity globally, and then specifying high verbosity only for coding tools, is a prime example of such a context.
Instruction following
Like GPT-4.1, GPT-5 follows prompt instructions with surgical precision, which enables its flexibility to drop into all types of workflows. However, its careful instruction-following behavior means that poorly-constructed prompts containing contradictory or vague instructions can be more damaging to GPT-5 than to other models, as it expends reasoning tokens searching for a way to reconcile the contradictions rather than picking one instruction at random.
Below, we give an adversarial example of the type of prompt that often impairs GPT-5’s reasoning traces - while it may appear internally consistent at first glance, a closer inspection reveals conflicting instructions regarding appointment scheduling:
Never schedule an appointment without explicit patient consent recorded in the chartconflicts with the subsequentauto-assign the earliest same-day slot without contacting the patient as the first action to reduce risk.- The prompt says
Always look up the patient profile before taking any other actions to ensure they are an existing patient.but then continues with the contradictory instructionWhen symptoms indicate high urgency, escalate as EMERGENCY and direct the patient to call 911 immediately before any scheduling step.
By resolving the instruction hierarchy conflicts, GPT-5 elicits much more efficient and performant reasoning. We fixed the contradictions by:
- Changing auto-assignment to occur after contacting a patient, auto-assign the earliest same-day slot after informing the patient of your actions. to be consistent with only scheduling with consent.
- Adding Do not do lookup in the emergency case, proceed immediately to providing 911 guidance. to let the model know it is ok to not look up in case of emergency.
We understand that the process of building prompts is an iterative one, and many prompts are living documents constantly being updated by different stakeholders - but this is all the more reason to thoroughly review them for poorly-worded instructions. Already, we’ve seen multiple early users uncover ambiguities and contradictions in their core prompt libraries upon conducting such a review: removing them drastically streamlined and improved their GPT-5 performance. We recommend testing your prompts in our prompt optimizer tool to help identify these types of issues.
Minimal reasoning
In GPT-5, we introduce minimal reasoning effort for the first time: our fastest option that still reaps the benefits of the reasoning model paradigm. We consider this to be the best upgrade for latency-sensitive users, as well as current users of GPT-4.1.
Perhaps unsurprisingly, we recommend prompting patterns that are similar to GPT-4.1 for best results. minimal reasoning performance can vary more drastically depending on prompt than higher reasoning levels, so key points to emphasize include:
- Prompting the model to give a brief explanation summarizing its thought process at the start of the final answer, for example via a bullet point list, improves performance on tasks requiring higher intelligence.
- Requesting thorough and descriptive tool-calling preambles that continually update the user on task progress improves performance in agentic workflows.
- Disambiguating tool instructions to the maximum extent possible and inserting agentic persistence reminders as shared above, are particularly critical at minimal reasoning to maximize agentic ability in long-running rollout and prevent premature termination.
- Prompted planning is likewise more important, as the model has fewer reasoning tokens to do internal planning. Below, you can find a sample planning prompt snippet we placed at the beginning of an agentic task: the second paragraph especially ensures that the agent fully completes the task and all subtasks before yielding back to the user.
Markdown formatting
By default, GPT-5 in the API does not format its final answers in Markdown, in order to preserve maximum compatibility with developers whose applications may not support Markdown rendering. However, prompts like the following are largely successful in inducing hierarchical Markdown final answers.
Occasionally, adherence to Markdown instructions specified in the system prompt can degrade over the course of a long conversation. In the event that you experience this, we’ve seen consistent adherence from appending a Markdown instruction every 3-5 user messages.
Metaprompting
Finally, to close with a meta-point, early testers have found great success using GPT-5 as a meta-prompter for itself. Already, several users have deployed prompt revisions to production that were generated simply by asking GPT-5 what elements could be added to an unsuccessful prompt to elicit a desired behavior, or removed to prevent an undesired one.
Here is an example metaprompt template we liked:
Appendix
SWE-Bench verified developer instructions
Agentic coding tool definitions
As shared in the GPT-4.1 prompting guide, here is our most updated apply_patch implementation: we highly recommend using apply_patch for file edits to match the training distribution. The newest implementation should match the GPT-4.1 implementation in the overwhelming majority of cases.
Taubench-Retail minimal reasoning instructions
Terminal-Bench prompt
The GPT-4.1 family of models represents a significant step forward from GPT-4o in capabilities across coding, instruction following, and long context. In this prompting guide, we collate a series of important prompting tips derived from extensive internal testing to help developers fully leverage the improved abilities of this new model family.
Many typical best practices still apply to GPT-4.1, such as providing context examples, making instructions as specific and clear as possible, and inducing planning via prompting to maximize model intelligence. However, we expect that getting the most out of this model will require some prompt migration. GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors, which tended to more liberally infer intent from user and system prompts. This also means, however, that GPT-4.1 is highly steerable and responsive to well-specified prompts - if model behavior is different from what you expect, a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model on course.
Please read on for prompt examples you can use as a reference, and remember that while this guidance is widely applicable, no advice is one-size-fits-all. AI engineering is inherently an empirical discipline, and large language models are inherently nondeterministic; in addition to following this guide, we advise building informative evals and iterating often to ensure your prompt engineering changes are yielding benefits for your use case.
1. Agentic Workflows
GPT-4.1 is a great place to build agentic workflows. In model training we emphasized providing a diverse range of agentic problem-solving trajectories, and our agentic harness for the model achieves state-of-the-art performance for non-reasoning models on SWE-bench Verified, solving 55% of problems.
System Prompt Reminders
In order to fully utilize the agentic capabilities of GPT-4.1, we recommend including three key types of reminders in all agent prompts. The following prompts are optimized specifically for the agentic coding workflow, but can be easily modified for general agentic use cases.
- Persistence: this ensures the model understands it is entering a multi-message turn, and prevents it from prematurely yielding control back to the user. Our example is the following:
- Tool-calling: this encourages the model to make full use of its tools, and reduces its likelihood of hallucinating or guessing an answer. Our example is the following:
- Planning [optional]: if desired, this ensures the model explicitly plans and reflects upon each tool call in text, instead of completing the task by chaining together a series of only tool calls. Our example is the following:
GPT-4.1 is trained to respond very closely to both user instructions and system prompts in the agentic setting. The model adhered closely to these three simple instructions and increased our internal SWE-bench Verified score by close to 20% - so we highly encourage starting any agent prompt with clear reminders covering the three categories listed above. As a whole, we find that these three instructions transform the model from a chatbot-like state into a much more “eager” agent, driving the interaction forward autonomously and independently.
Tool Calls
Compared to previous models, GPT-4.1 has undergone more training on effectively utilizing tools passed as arguments in an OpenAI API request. We encourage developers to exclusively use the tools field to pass tools, rather than manually injecting tool descriptions into your prompt and writing a separate parser for tool calls, as some have reported doing in the past. This is the best way to minimize errors and ensure the model remains in distribution during tool-calling trajectories - in our own experiments, we observed a 2% increase in SWE-bench Verified pass rate when using API-parsed tool descriptions versus manually injecting the schemas into the system prompt.
Developers should name tools clearly to indicate their purpose and add a clear, detailed description in the “description” field of the tool. Similarly, for each tool param, lean on good naming and descriptions to ensure appropriate usage. If your tool is particularly complicated and you’d like to provide examples of tool usage, we recommend that you create an # Examples section in your system prompt and place the examples there, rather than adding them into the “description’ field, which should remain thorough but relatively concise. Providing examples can be helpful to indicate when to use tools, whether to include user text alongside tool calls, and what parameters are appropriate for different inputs. Remember that you can use “Generate Anything” in the Prompt Playground to get a good starting point for your new tool definitions.
Prompting-Induced Planning & Chain-of-Thought
As mentioned already, developers can optionally prompt agents built with GPT-4.1 to plan and reflect between tool calls, instead of silently calling tools in an unbroken sequence. GPT-4.1 is not a reasoning model - meaning that it does not produce an internal chain of thought before answering - but in the prompt, a developer can induce the model to produce an explicit, step-by-step plan by using any variant of the Planning prompt component shown above. This can be thought of as the model “thinking out loud.” In our experimentation with the SWE-bench Verified agentic task, inducing explicit planning increased the pass rate by 4%.
Sample Prompt: SWE-bench Verified
Below, we share the agentic prompt that we used to achieve our highest score on SWE-bench Verified, which features detailed instructions about workflow and problem-solving strategy. This general pattern can be used for any agentic task.
2. Long context
GPT-4.1 has a performant 1M token input context window, and is useful for a variety of long context tasks, including structured document parsing, re-ranking, selecting relevant information while ignoring irrelevant context, and performing multi-hop reasoning using context.
Optimal Context Size
We observe very good performance on needle-in-a-haystack evaluations up to our full 1M token context, and we’ve observed very strong performance at complex tasks with a mix of both relevant and irrelevant code and other documents. However, long context performance can degrade as more items are required to be retrieved, or perform complex reasoning that requires knowledge of the state of the entire context (like performing a graph search, for example).
Tuning Context Reliance
Consider the mix of external vs. internal world knowledge that might be required to answer your question. Sometimes it’s important for the model to use some of its own knowledge to connect concepts or make logical jumps, while in others it’s desirable to only use provided context
Prompt Organization
Especially in long context usage, placement of instructions and context can impact performance. If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context, as we found this to perform better than only above or below. If you’d prefer to only have your instructions once, then above the provided context works better than below.
3. Chain of Thought
As mentioned above, GPT-4.1 is not a reasoning model, but prompting the model to think step by step (called “chain of thought”) can be an effective way for a model to break down problems into more manageable pieces, solve them, and improve overall output quality, with the tradeoff of higher cost and latency associated with using more output tokens. The model has been trained to perform well at agentic reasoning about and real-world problem solving, so it shouldn’t require much prompting to perform well.
We recommend starting with this basic chain-of-thought instruction at the end of your prompt:
From there, you should improve your chain-of-thought (CoT) prompt by auditing failures in your particular examples and evals, and addressing systematic planning and reasoning errors with more explicit instructions. In the unconstrained CoT prompt, there may be variance in the strategies it tries, and if you observe an approach that works well, you can codify that strategy in your prompt. Generally speaking, errors tend to occur from misunderstanding user intent, insufficient context gathering or analysis, or insufficient or incorrect step by step thinking, so watch out for these and try to address them with more opinionated instructions.
Here is an example prompt instructing the model to focus more methodically on analyzing user intent and considering relevant context before proceeding to answer.
4. Instruction Following
GPT-4.1 exhibits outstanding instruction-following performance, which developers can leverage to precisely shape and control the outputs for their particular use cases. Developers often extensively prompt for agentic reasoning steps, response tone and voice, tool calling information, output formatting, topics to avoid, and more. However, since the model follows instructions more literally, developers may need to include explicit specification around what to do or not to do. Furthermore, existing prompts optimized for other models may not immediately work with this model, because existing instructions are followed more closely and implicit rules are no longer being as strongly inferred.
Recommended Workflow
Here is our recommended workflow for developing and debugging instructions in prompts:
- Start with an overall “Response Rules” or “Instructions” section with high-level guidance and bullet points.
- If you’d like to change a more specific behavior, add a section to specify more details for that category, like
# Sample Phrases. - If there are specific steps you’d like the model to follow in its workflow, add an ordered list and instruct the model to follow these steps.
- If behavior still isn’t working as expected:
- Check for conflicting, underspecified, or wrong instructions and examples. If there are conflicting instructions, GPT-4.1 tends to follow the one closer to the end of the prompt.
- Add examples that demonstrate desired behavior; ensure that any important behavior demonstrated in your examples are also cited in your rules.
- It’s generally not necessary to use all-caps or other incentives like bribes or tips. We recommend starting without these, and only reaching for these if necessary for your particular prompt. Note that if your existing prompts include these techniques, it could cause GPT-4.1 to pay attention to it too strictly.
Note that using your preferred AI-powered IDE can be very helpful for iterating on prompts, including checking for consistency or conflicts, adding examples, or making cohesive updates like adding an instruction and updating instructions to demonstrate that instruction.
Common Failure Modes
These failure modes are not unique to GPT-4.1, but we share them here for general awareness and ease of debugging.
- Instructing a model to always follow a specific behavior can occasionally induce adverse effects. For instance, if told “you must call a tool before responding to the user,” models may hallucinate tool inputs or call the tool with null values if they do not have enough information. Adding “if you don’t have enough information to call the tool, ask the user for the information you need” should mitigate this.
- When provided sample phrases, models can use those quotes verbatim and start to sound repetitive to users. Ensure you instruct the model to vary them as necessary.
- Without specific instructions, some models can be eager to provide additional prose to explain their decisions, or output more formatting in responses than may be desired. Provide instructions and potentially examples to help mitigate.
Example Prompt: Customer Service
This demonstrates best practices for a fictional customer service agent. Observe the diversity of rules, the specificity, the use of additional sections for greater detail, and an example to demonstrate precise behavior that incorporates all prior rules.
Try running the following notebook cell - you should see both a user message and tool call, and the user message should start with a greeting, then echo back their answer, then mention they’re about to call a tool. Try changing the instructions to shape the model behavior, or trying other user messages, to test instruction following performance.
5. General Advice
Prompt Structure
For reference, here is a good starting point for structuring your prompts.
Add or remove sections to suit your needs, and experiment to determine what’s optimal for your usage.
Delimiters
Here are some general guidelines for selecting the best delimiters for your prompt. Please refer to the Long Context section for special considerations for that context type.
- Markdown: We recommend starting here, and using markdown titles for major sections and subsections (including deeper hierarchy, to H4+). Use inline backticks or backtick blocks to precisely wrap code, and standard numbered or bulleted lists as needed.
- XML: These also perform well, and we have improved adherence to information in XML with this model. XML is convenient to precisely wrap a section including start and end, add metadata to the tags for additional context, and enable nesting. Here is an example of using XML tags to nest examples in an example section, with inputs and outputs for each:
- JSON is highly structured and well understood by the model particularly in coding contexts. However it can be more verbose, and require character escaping that can add overhead.
Guidance specifically for adding a large number of documents or files to input context:
- XML performed well in our long context testing.
- Example:
<doc id='1' title='The Fox'>The quick brown fox jumps over the lazy dog</doc>
- Example:
- This format, proposed by Lee et al. (ref), also performed well in our long context testing.
- Example:
ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog
- Example:
- JSON performed particularly poorly.
- Example:
[{'id': 1, 'title': 'The Fox', 'content': 'The quick brown fox jumped over the lazy dog'}]
- Example:
The model is trained to robustly understand structure in a variety of formats. Generally, use your judgement and think about what will provide clear information and “stand out” to the model. For example, if you’re retrieving documents that contain lots of XML, an XML-based delimiter will likely be less effective.
Caveats
- In some isolated cases we have observed the model being resistant to producing very long, repetitive outputs, for example, analyzing hundreds of items one by one. If this is necessary for your use case, instruct the model strongly to output this information in full, and consider breaking down the problem or using a more concise approach.
- We have seen some rare instances of parallel tool calls being incorrect. We advise testing this, and considering setting the parallel_tool_calls param to false if you’re seeing issues.
Appendix: Generating and Applying File Diffs
Developers have provided us feedback that accurate and well-formed diff generation is a critical capability to power coding-related tasks. To this end, the GPT-4.1 family features substantially improved diff capabilities relative to previous GPT models. Moreover, while GPT-4.1 has strong performance generating diffs of any format given clear instructions and examples, we open-source here one recommended diff format, on which the model has been extensively trained. We hope that in particular for developers just starting out, that this will take much of the guesswork out of creating diffs yourself.
Apply Patch
See the example below for a prompt that applies our recommended tool call correctly.
Reference Implementation: apply_patch.py
Here’s a reference implementation of the apply_patch tool that we used as part of model training. You’ll need to make this an executable and available as `apply_patch` from the shell where the model will execute commands:
Other Effective Diff Formats
If you want to try using a different diff format, we found in testing that the SEARCH/REPLACE diff format used in Aider’s polyglot benchmark, as well as a pseudo-XML format with no internal escaping, both had high success rates.
These diff formats share two key aspects: (1) they do not use line numbers, and (2) they provide both the exact code to be replaced, and the exact code with which to replace it, with clear delimiters between the two.
