VOOZH about

URL: https://developers.openai.com/api/docs/guides/realtime-models-prompting

⇱ Using realtime models | OpenAI API


Search the API docs

Primary navigation

Evaluation

Legacy APIs

gpt-realtime-2 is our state-of-the-art reasoning voice model for low-latency speech-to-speech applications. It can think before it speaks, follow instructions more reliably, use a larger context window, and call tools with greater precision than earlier realtime models.

To take advantage of these gains, design prompts with more intent. Define the assistant’s responsibilities, decision points, tool-calling behavior, and guardrails clearly: what it should do, when it should do it, and what it should avoid.

Start simple. Do not over-prompt upfront. Begin with a minimal prompt, run evaluations, then add instructions only for behaviors that fail in testing.

Choose a model

ModelUse whenPrompting focus
gpt-realtime-2

You need the strongest realtime reasoning, tool use, and instruction following.

Tune reasoning effort, preambles, tool policies, exact entity capture, and long-session state.

gpt-realtime-1.5You need a fast, reliable non-reasoning speech-to-speech model.

Follow the core realtime prompt structure and test for latency-sensitive behavior.

Realtime 2.0 Prompting Guide

Use gpt-realtime-2 when the voice agent needs stronger reasoning, tool selection, exact entity handling, or long-session state. Start with reasoning.effort: “low”, test default preamble behavior, and define clear confirmation boundaries before write actions.

What changed in Realtime 2

Prompt Realtime 2 as a reasoning voice agent, not as a basic voice bot.

ChangeWhat it means for prompts
ReasoningAllow the model to reason internally for complex tasks before speaking or calling tools. Use preambles to avoid awkward silence or unnecessary filler.
Prompt precision matters moreReplace broad guidance like “be helpful” with clear trigger, action, and exception rules: when to act, what to do, and when not to do it.
Instruction conflicts are more costlyRemove overlapping always, never, only, and must rules unless they are truly required. Define priority when rules compete.
Tool behavior is more steerableSpecify when the assistant should act immediately, ask for missing information, confirm high-precision details, retry after failure, or escalate.
Preambles are first-class behaviorThe model may speak brief updates before longer reasoning or tool-use flows. Steer when preambles should appear, how short they should be, and when to skip them.
Expanded context windowgpt-realtime-2 expands the realtime context window from 32k to 128k tokens, making it better suited for long sessions and larger system prompts.

Preambles aren’t hidden chain-of-thought. They’re short spoken updates such as “I’ll check that order now.” Don’t ask the model to reveal private reasoning.

Recommended prompt structure

Use short, labeled sections. The model should be able to find the relevant instructions quickly.

Not every use case needs every section. Add the sections that are relevant for your product.

Set reasoning effort

gpt-realtime-2 can trade latency for deeper reasoning. Use the lowest reasoning level that still gives the assistant enough intelligence for the workflow.

Start with low for most production voice agents. Tune up or down based on task complexity, latency tolerance, and failure cost.

EffortUse whenExample
minimalLowest latency matters most and the task is simple.Smart-home commands, timers, simple calendar checks.
lowYou need responsiveness plus basic reasoning.Customer support, order lookup, simple policy questions.
mediumThe assistant must reason through multi-step tasks.Technical support, diagnostics, complex routing.
highDeeper reasoning materially improves success.High-precision workflows, escalation decisions, tasks with constraints.
xhighMaximum reasoning is worth added latency and cost.Complex planning, critical triage, high-stakes tool orchestration.

Beyond the API setting, steer the model on when and how much to reason.

Use preambles intentionally

Preambles are short spoken updates that keep a voice agent feeling responsive while it reasons, looks something up, or calls a tool. Used well, they reassure the user that the assistant is working. Used poorly, they become filler and increase perceived latency.

gpt-realtime-2 generates preambles by default. Start by testing the default behavior. If it does not match your product experience, tune it explicitly.

👁 Preamble generation and playback timeline

Control response length

gpt-realtime-2 follows length guidance best when the prompt specifies how much detail to give for each task type. Instead of telling the model to “be concise,” define what concise means in context: direct answers, tool results, troubleshooting, comparisons, and escalations may each need different response lengths.

Example:

User: Which plan should I choose?

Assistant: If you want the lowest cost, choose Basic. If you need team permissions and shared billing, choose Pro. If compliance review or admin controls matter, choose Enterprise.

Design tool behavior

gpt-realtime-2 is stronger at tool calling, but tool behavior still depends on prompt and tool-spec design. If the prompt does not define when to act, ask, confirm, or recover, the assistant may call tools too early, ask unnecessary questions, or repeat failed calls.

Set tool-call eagerness

High eagerness works well for read-only, low-risk actions. Low eagerness is better when tools modify data, trigger external effects, or depend on exact identifiers.

Tool typeDefault behavior
Read-only, low-risk lookupCall when intent and required fields are clear.
Read-only with exact identifierConfirm the identifier before lookup.
User-visible communicationDraft or summarize before sending.
Account changesConfirm before calling.
Purchases, cancellations, paymentsConfirm amount, target, and consequence before calling.
Irreversible or high-impact actionsConfirm explicitly and offer escalation when appropriate.

Use this balanced default when you have a mix of read and write actions. Tailor it based on your use case.

High-risk example:

User: Charge my card for the remaining balance.

Bad:

Assistant: I’ve charged your card.

Good:

Assistant: To confirm, you want me to charge the card on file $248.16 for the remaining balance. Should I proceed?

Recover from tool failures

Tool failures are part of the conversation. A good recovery should explain what happened and give the user a clear next step.

Do not treat every failure the same. Recovery behavior should depend on the tool type, failure mode, and user impact. Some failures should be handled silently with a retry. Others require asking the user to clarify, correct an identifier, confirm a new action, or choose an alternate path.

Bad:

Assistant: Something went wrong.

Good:

Assistant: I couldn’t find a match for O R D dash 3 1 2 5 B 2 3. Did I get any part of that wrong?

Keep tool availability synchronized

Realtime models are eager to help. If the prompt mentions a tool that is not actually available, or if the tool list does not match the prompt, the model may invent a tool name or pretend it completed the action.

For example, if the prompt references lookup_order, but the provided tool is named search_orders, the model may call the wrong name or simulate the action.

Use the prompt audit meta prompt in the appendix to review production prompts for contradictions, missing tools, and brittle instructions.

Handle silence and background audio

Voice agents tend to respond by default. In production, they often hear audio that should not receive a spoken response, such as silence, background noise, hold music, TV audio, or side conversations.

Use a no-op wait tool when the assistant should stay quiet and keep listening. The tool gives the model a valid non-speaking action instead of making it say things like “I’m here” or “I didn’t catch that.”

Tool design:

Pair it with prompt instructions:

Use this for non-addressed audio, not for unclear user requests. If the user is clearly speaking to the assistant but the content is unintelligible, ask for clarification instead.

Use message channels deliberately

gpt-realtime-2 can produce user-visible intermediate messages in the commentary channel and final user-facing responses in the final channel. Use channel-specific instructions when the behavior depends on where it appears.

ChannelUser-visible?Used for
commentaryYesPreambles and tool calls.
finalYesFinal user-facing message.

For example, tool calls happen in the commentary channel. If you want the assistant to say something before, during, or after tool use, specify that behavior in relation to the commentary channel.

gpt-realtime-2 can emit multiple response phases in a single turn. In API output, this distinction is represented by the response.done event, which includes a phase value that indicates whether the content is commentary or the final answer.

You can use this field to handle each phase differently in your application. For example, commentary can be played or displayed as a short intermediate update, while final_answer can be reserved for the assistant’s completed response.

Handle unclear audio

The model should only act on audio it can understand with confidence. If the audio is unclear, the model should ask a brief clarification question instead of guessing.

Do not let the model infer missing words, call tools, capture entities, generate preambles, or spend hidden reasoning time trying to reconstruct what the user may have said.

Example:

User audio: “Check order three one-” [cut off]

Bad:

Assistant: I’ll check order 31 now.

Good:

Assistant: I heard only part of the order number. Could you repeat it digit by digit?

Capture exact entities

Many realtime workflows depend on exact values: order IDs, tracking numbers, email addresses, confirmation codes, account numbers, claim numbers, ticket IDs, support references, and phone numbers.

Voice makes this hard. Users speak quickly, group numbers in different ways, spell partial values, use filler, correct themselves mid-turn, or pronounce characters that sound alike. One wrong digit can fail a lookup or retrieve the wrong account.

Capture entities conservatively. Collect one value at a time, normalize only what is clear, confirm high-precision values before tool calls, and make every correction recoverable.

Collect one entity at a time

When a workflow needs multiple values, collect them one at a time. This prevents fields from blending together, especially in voice conversations.

Handle spelled-out characters

Use this when users spell IDs, codes, names, or email addresses one character at a time. The spoken form is input, not the final value.

Normalize spoken numbers carefully

For numeric identifiers, users may say digits individually, group them, or use natural number phrases. If the field expects one continuous numeric value, convert clear numeric speech into digits.

Confirm exact identifiers before tool calls

Order IDs, tracking numbers, account numbers, claim numbers, confirmation codes, and similar identifiers are high-precision fields. Confirm them before using them in a tool call.

For numeric identifiers, read the value back digit by digit. Reading the value as a full number can hide errors.

Example:

Assistant: Just to confirm, I heard 8… 3… 5… 2… 1. Is that right?

If the user corrects one character or digit, repeat the full corrected value before calling the tool.

Example:

Assistant: Got it. I have 8… 3… 5… 7… 1. Is that correct?

Confirm emails character by character

Email addresses are important values. Dots, dashes, underscores, repeated letters, and similar-sounding names can cause account lookup failures or send messages to the wrong address.

Ask the user to spell the email address:

Assistant: Could you spell the email address character by character so I can make sure I have it exactly right?

When reading it back, confirm the exact final address:

Assistant: Just to confirm, that is c-h-e-n at example dot com, right?

Entity collection workflow

Avoid literal instruction traps

gpt-realtime-2 follows instructions more literally than earlier realtime models. Prompts that worked well on older models may need tuning.

Use precise language. The model may prioritize the exact wording of an instruction over the broader behavior you intended. Broad or rigid rules can dominate the assistant’s behavior in surprising ways, especially when multiple rules overlap.

Be careful with constraint words such as must, only, never, and always. Use them when the behavior is truly required, not as general emphasis. Overusing hard constraints can make the assistant rigid, overly cautious, or unable to handle reasonable exceptions.

Prefer precise scope:

Avoid broad scope:

The broad version may cause unnecessary confirmations before harmless read-only lookups, such as checking order status, retrieving availability, or reading account information.

Literal interpretation example

General prompting recommendations:

  • Prefer explicit instructions over implied intent.
  • Avoid unnecessary constraint words unless behavior truly must be rigid.
  • Minimize contradictory guidance.
  • Be cautious with layered or competing priority instructions.
  • Test prompts incrementally. Small wording changes can have large behavioral effects.
  • When migrating from earlier realtime models, expect some prompts to require restructuring for best results.

Control language and accent separately

Language and accent should be controlled separately.

A user’s accent is not the same as their intended language. A user may speak English with a Hindi, Spanish, French, or Mandarin accent and still expect English responses.

Avoid broad language instructions such as:

These are too broad. The model may interpret accent, filler words, backchannels, or isolated foreign words as a reason to switch languages.

English language policy

Multilingual policy

Accent control

gpt-realtime-2 can follow accent instructions more strongly, but vague accent prompts can cause drift or unintended language switching.

Accent-control prompts work best when they specify:

  • the target accent;
  • which characteristics should remain stable;
  • the intended pacing, stress, and prosody;
  • whether accent adaptation should affect language choice.

Instead of:

Use:

Custom voices

Use Custom Voices when standard voices cannot reliably meet brand, accent, or character requirements.

Prompting can steer accent, pacing, and delivery, but it cannot fully replace voice design. For use cases that require consistent branded voice identity or accent fidelity, consider Custom Voices.

Custom Voices are available only to approved customers. Contact your account team for access.

Maintain state in long sessions

gpt-realtime-2 expands the realtime context window from 32k to 128k tokens, making it better suited for long sessions. For dense two-way conversations, 128k tokens is best thought of as roughly 1-2 hours of dense raw audio context. This will vary depending on tool use, internal reasoning, injected records, and other session details.

For long-context use cases, gpt-realtime-2 performs best when it can tell what information is current, what is background, and what should be ignored if sources conflict. Do not rely on the model to infer source priority from a raw transcript or large context dump. Use structure.

Use a structured pattern when starting a session with a large amount of context, such as retrieved records, prior conversation history, policies, summaries, account notes, or background documents.

Migrate from earlier realtime models

When migrating from earlier realtime models, treat the prompt as a behavior surface, not just text to port.

  1. Use Codex or a strong reasoning model to restructure the prompt around the latest Realtime prompting guidance. Include a link to this prompting guide to ground the migration in best practices.
  2. Set reasoning effort to low instead of the default. Increase only for workflows that require deeper planning.
  3. Audit tool names, parameters, enums, JSON schemas, and other settings to make sure they match the expected implementation.
  4. Remove stale examples. Add short examples for happy paths, ambiguity, interruptions, tool calls, and fallback behavior.
  5. Compare representative conversations before and after migration. Check for regressions against an existing eval and document intentional behavior changes.
  6. Run a final consistency pass. Confirm the prompt clearly separates hard requirements, defaults, tool rules, safety rules, and fallback behavior.
  7. Run evals, inspect representative failures, and iterate on the prompt until the target behaviors are reliable.

Realtime 1.5 Prompting Guide

gpt-realtime-1.5 is a speech-to-speech model in the Realtime API. The same gpt-realtime prompting guidance applies to this model.

Speech-to-speech systems are essential for enabling voice as a core AI interface. gpt-realtime-1.5 supports robust, usable realtime voice agents that can handle mission-critical workflows at scale.

Compared with earlier realtime preview models, gpt-realtime-1.5 delivers stronger instruction following, more reliable tool calling, better voice quality, and an overall smoother feel. These gains make it practical to move from chained approaches to true realtime experiences, cutting latency and producing responses that sound more natural and expressive.

Realtime models benefit from prompting techniques that wouldn’t directly apply to text-based models. This prompting guide starts with a suggested prompt skeleton, then walks through each part with practical tips, small patterns you can copy, and examples you can adapt to your use case.

General Tips

  • Iterate relentlessly: Small wording changes can make or break behavior.
    • Example: For unclear audio instruction, we swapped “inaudible” → “unintelligible” which improved noisy input handling.
  • Prefer bullets over paragraphs: Clear, short bullets outperform long paragraphs.
  • Guide with examples: The model closely follows sample phrases.
  • Be precise: Ambiguity or conflicting instructions = degraded performance similar to GPT-5.
  • Control language: Pin output to a target language if you see unwanted language switching.
  • Reduce repetition: Add a Variety rule to reduce robotic phrasing.
  • Use capitalized text for emphasis: Capitalizing key rules makes them stand out and easier for the model to follow.
  • Convert non-text rules to text: instead of writing “IF x > 3 THEN ESCALATE”, write, “IF MORE THAN THREE FAILURES THEN ESCALATE”.

Prompt Structure

Organizing your prompt makes it easier for the model to understand context and stay consistent across turns. It also makes it easier for you to iterate and modify problematic sections.

  • What it does: Use clear, labeled sections in your system prompt so the model can find and follow them. Keep each section focused on one thing.
  • How to adapt: Add domain-specific sections (e.g., Compliance, Brand Policy). Remove sections you don’t need (e.g., Reference Pronunciations if not struggling with pronunciation).

Example

Role and Objective

This section defines who the agent is and what “done” means. The examples show two different identities to demonstrate how tightly the model will adhere to role and objective when they’re explicit.

  • When to use: The model is not taking on the persona, role, or task scope you need.
  • What it does: Pins identity of the voice agent so that its responses are conditioned to that role description
  • How to adapt: Modify the role based on your use case

Example (model takes on a specific accent)

Earlier realtime preview:

gpt-realtime-1.5:

Example (model takes on a character)

Earlier realtime preview:

gpt-realtime-1.5:

gpt-realtime-1.5 is able to enact the specified role more reliably than earlier realtime preview models.

Personality and Tone

gpt-realtime-1.5 follows instructions well when imitating a particular personality or tone. You can tailor the voice experience and delivery depending on what your use case expects.

  • When to use: Responses feel flat, overly verbose, or inconsistent across turns.
  • What it does: Sets voice, brevity, and pacing so replies sound natural and consistent.
  • How to adapt: Tune warmth/formality and default length. For regulated domains, favor neutral precision. Add other subsections that are relevant to your use case.

Example

Example (multi-emotion)

gpt-realtime-1.5:

The model is able to adhere to the complex instructions and switch between three emotions throughout the audio response.

Speed Instructions

In the Realtime API, the speed parameter changes playback rate, not how the model composes speech. To actually sound faster, add instructions that can guide the pacing.

  • When to use: Users want faster speaking voice; playback speed (with speed parameter) alone doesn’t fix speaking style.
  • What it does: Tunes speaking style (brevity, cadence) independent of client playback speed.
  • How to adapt: Modify speed instruction to meet use case requirements.

Example

Earlier realtime preview:

gpt-realtime-1.5:

With explicit pacing instructions, gpt-realtime-1.5 can produce a noticeably faster pace without sounding too hurried.

Language Constraint

Language constraints ensure the model consistently responds in the intended language, even in challenging conditions like background noise or multilingual inputs.

  • When to use: To prevent accidental language switching in multilingual or noisy environments.
  • What it does: Locks output to the chosen language to prevent accidental language changes.
  • How to adapt: Switch “English” to your target language; or add more complex instructions based on your use case.

Example (pinning to one language)

These are the responses after applying the instruction using gpt-realtime-1.5.

👁 lang constraint en

Example (model teaches a language)

These are the responses after applying the instruction using gpt-realtime-1.5.

👁 multi language

The model is able to code-switch from one language to another based on custom instructions.

Reduce Repetition

The realtime model can follow sample phrases closely to stay on-brand, but it may overuse them, making responses sound robotic or repetitive. Adding a repetition rule helps maintain variety while preserving clarity and brand voice.

  • When to use: Outputs recycle the same openings, fillers, or sentence patterns across turns or sessions.
  • What it does: Adds a variety constraint—discourages repeated phrases, nudges synonyms and alternate sentence structures, and keeps required terms intact.
  • How to adapt: Tune strictness (e.g., “don’t reuse the same opener more than once every N turns”), whitelist must-keep phrases (legal/compliance/brand), and allow tighter phrasing where consistency matters.

Example

These are the responses before applying the instruction using gpt-realtime-1.5. The model repeats the same confirmation: Got it.

👁 repeat before

These are the responses after applying the instruction using gpt-realtime-1.5.

👁 repeat after

Now the model is able to vary its responses and confirmation and not sound robotic.

Reference Pronunciations

This section covers how to ensure the model pronounces important words, numbers, names, and terms correctly during spoken interactions.

  • When to use: Brand names, technical terms, or locations are often mispronounced.
  • What it does: Improves trust and clarity with phonetic hints.
  • How to adapt: Keep to a short list; update as you hear errors.

Example

Earlier realtime preview:

gpt-realtime-1.5:

With the reference pronunciation instructions, gpt-realtime-1.5 can correctly pronounce SQL as “sequel.”

Alphanumeric Pronunciations

Realtime S2S can blur or merge digits/letters when reading back key info (phone, credit card, order IDs). Explicit character-by-character confirmation prevents mishearing and drives clearer synthesis.

  • When to use: If the model struggles to capture or read back phone numbers, card numbers, 2FA codes, order IDs, serials, addresses, unit numbers, or mixed alphanumeric strings.
  • What it does: Forces the model to speak one character at a time with separators, then confirm with the user and reconfirm after corrections. Optionally uses a phonetic disambiguator for letters (e.g., “A as in Alpha”).

Example (general instruction section)

Tip: If you are following a conversation flow prompting strategy, you can specify which conversation state needs to apply the alpha-numeric pronunciations instruction.

Example (instruction in conversation state)

(taken from the conversation flow of the prompt of our openai-realtime-agents)

These are the responses before applying the instruction using gpt-realtime-1.5.

Sure! The number is 55119765423. Let me know if you need anything else!

These are the responses after applying the instruction using gpt-realtime-1.5.

Sure! The number is: 5-5-1-1-1-9-7-6-5-4-2-3. Please let me know if you need anything else!

Instructions

This section covers prompt guidance for instructing your model to solve your task, apply best practices, and fix possible problems.

Perhaps unsurprisingly, we recommend prompting patterns that are similar to GPT-4.1 for best results.

Instruction Following

Like GPT-4.1 and GPT-5, if the instructions are conflicting, ambiguous, or unclear, gpt-realtime-1.5 will perform worse.

  • When to use: Outputs drift from rules, skip phases, or misuse tools.
  • What it does: Uses an LLM to point out ambiguity, conflicts, and missing definitions before you ship.

Instructions Quality Prompt (can be used in ChatGPT or with API)

Use the following prompt with GPT-5 to identify problematic areas in your prompt that you can fix.

Prompt Optimization Meta Prompt (can be used in ChatGPT or with API)

This meta-prompt helps you improve your base system prompt by targeting a specific failure mode. Provide the current prompt and describe the issue you’re seeing, the model (GPT-5) will suggest refined variants that tighten constraints and reduce the problem.

No Audio or Unclear Audio

Sometimes the model thinks it hears something and tries to respond. You can add a custom instruction telling the model how to behave when it hears unclear audio or user input. Modify the desired behavior to fit your use case. For example, you may want the model to repeat the same question instead of asking for clarification.

  • When to use: Background noise, partial words, or silence trigger unwanted replies.
  • What it does: Stops spurious responses and creates graceful clarification.
  • How to adapt: Choose whether to ask for clarification or repeat the last question depending on use case.

Example (coughing and unclear audio)

These are the responses after applying the instruction using gpt-realtime-1.5.

In this example, the model asks for clarification after my (very) loud cough and unclear audio.

Background Music or Sounds

Occasionally, the model may generate unintended background music, humming, rhythmic noises, or sound-like artifacts during speech generation. These artifacts can diminish clarity, distract users, or make the assistant feel less professional. The following instructions help prevent or significantly reduce these occurrences.

  • When to use: Use when you observe unintended musical elements or sound effects in Realtime audio responses.
  • What it does: Steers the model to avoid generating these unwanted audio artifacts.
  • How to adapt: Adjust the instruction to try to explicitly suppress the specific sound patterns you are encountering.

Example

Tools

Use this section to tell the model how to use your functions and tools. Spell out when and when not to call a tool, which arguments to collect, what to say while a call is running, and how to handle errors or partial results.

Tool Selection

gpt-realtime-1.5 follows instructions closely. However, if you have instructions that conflict with what the model can access, such as mentioning tools in your prompt that are NOT passed in the tools list, it can lead to bad responses.

  • When to use: Prompts mention tools that aren’t actually available.
  • What it does: Reviews the available tools and system prompt to ensure they align.

Example

We need to ensure the same tools are available and the descriptions do not contradict each other:

Tool Call Preambles

Some use cases could benefit from the Realtime model providing an audio response at the same time as calling a tool. This leads to a better user experience, masking latency. You can modify the sample phrase to fit your use case.

  • When to use: Users need immediate confirmation at the same time as a tool call; helps mask latency.
  • What it does: Adds a short, consistent preamble before a tool call.

Example

These are the responses after applying the instruction using gpt-realtime-1.5.

👁 tool proactive

Using the instruction, the model outputs an audio response “I’m checking that right now” at the same time as the tool call.

Tool Call Preambles + Sample Phrases

If you want to control more closely what type of phrases the model outputs at the same time it calls a tool, you can add sample phrases in the tool spec description.

Example

Tool Calls Without Confirmation

Sometimes the model might ask for confirmation before a tool call. For some use cases, this can lead to poor experience for the end user since the model is not being proactive.

  • When to use: The agent asks for permission before obvious tool calls.
  • What it does: Removes unnecessary confirmation loops.

Example

These are the responses after applying the instruction using gpt-realtime-1.5.

👁 tool no confirm

In the example, you notice that the realtime model did not produce any response audio; it directly called the respective tool.

Tip: If you notice the model is jumping too quickly to call a tool, try softening the wording. For example, swapping out stronger terms like “proactive” with something gentler can help guide the model to take a calmer, less eager approach.

Tool Call Performance

As use cases grow more complex and the number of available tools increases, it becomes critical to explicitly guide the model on when to use each tool and just as importantly, when not to. Clear usage rules not only improve tool call accuracy but also help the model choose the right tool at the right time.

  • When to use: Model is struggling with tool call performance and needs the instructions to be explicit to reduce misuse.
  • What it does: Add instructions on when to “use/avoid” each tool. You can also add instructions on sequences of tool calls (after Tool call A, you can call Tool call B or C)

Example

Tip: If a tool call can fail unpredictably, add clear failure-handling instructions so the model responds gracefully.

Tool Level Behavior

You can fine-tune how the model behaves for specific tools instead of applying one global rule. For example, you may want READ tools to be called proactively, while WRITE tools require explicit confirmation.

  • When to use: Global instructions for proactiveness, confirmation, or preambles don’t suit every tool.
  • What it does: Adds per-tool behavior rules that define whether the model should call the tool immediately, confirm first, or speak a preamble before the call.

Example

Tool Output Formatting

Some tool outputs, especially long strings that must be repeated verbatim, can be out-of-distribution for the model. During training, tool outputs commonly look like JSON objects with named fields. If your tool returns a raw string and separately asks the model to “repeat exactly,” the model may be more prone to paraphrasing, truncation, or blending in its own preamble.

A practical fix is to make the tool output look like a normal tool result and make the verbatim requirement machine-explicit.

  • When to use: A tool returns long or complex structured content (multi-sentence instructions, handoff packets, IDs/links, policy summaries, multi-step procedures, etc.) and you observe truncation, paraphrasing, dropped fields, reordering, or the model blending in its own preamble/commentary.

  • What it does: Wraps the tool output in a small, explicit JSON envelope (e.g., response_text plus flags like require_repeat_verbatim, format, or content_type) so the response looks more in-distribution and the expected realization behavior is machine-clear.

  • How to adapt: Keep the schema minimal and stable. Clearly document the expected tool output shape in both your Tools instructions and next to the tool definition (e.g., “If require_repeat_verbatim is true, output exactly response_text and nothing else,” or “Render response_text as-is; do not add, omit, or reorder fields from the tool output.”).

Examples

Example: raw string (more error-prone)

Tool returns:

Model sometimes says:

  • “I’ve emailed you a verification link…” (paraphrase)

  • Drops the last sentence (truncation)

  • Adds extra commentary (“Can I help with anything else?”)

Example: wrapped JSON (more in-distribution, more reliable)

Tool returns:

Because this looks like a typical tool result (JSON object), the model generally has an easier time:

  • recognizing what the “authoritative” content is (response_text)

  • understanding the realization constraint (require_repeat_verbatim)

  • reproducing the tool output cleanly, without truncation or extra commentary

Rephrase Supervisor Tool (Responder-Thinker Architecture)

In many voice setups, the realtime model acts as the responder (speaks to the user) while a stronger text model acts as the thinker (does planning, policy lookups, SOP completion). Text replies are not automatically good for speech, so the responder must rephrase the thinker’s text into an audio-friendly response before generating audio.

  • When to use: When the responder’s spoken output sounds robotic, too long, or awkward after receiving a thinker response.
  • What it does: Adds clear instructions that guide the responder to rephrase the thinker’s text into a short, natural, speech-first reply.
  • How to adapt: Tweak phrasing style, openers, and brevity limits to match your use case expectations.

Example

Here’s an example without the rephrasing instruction:

Assistant: Your current credit card balance is positive at 32,323,232 AUD.

Here’s the same example with the rephrasing instruction:

Assistant: Just finished checking that—your credit card balance is thirty-two million three hundred twenty-three thousand two hundred thirty-two dollars in your favor. Your last payment was processed on August first. Does that match what you expected?

Common Tools

gpt-realtime-1.5 has been trained to effectively use the following common tools. If your use case needs similar behavior, keep the names, signatures, and descriptions close to these to maximize reliability and to be more in-distribution.

Below are some of the important common tools that the model has been trained on:

Example

Conversation Flow

This section covers how to structure the dialogue into clear, goal-driven phases so the model knows exactly what to do at each step. It defines the purpose of each phase, the instructions for moving through it, and the concrete “exit criteria” for transitioning to the next. This prevents the model from stalling, skipping steps, or jumping ahead, and ensures the conversation stays organized from greeting to resolution.

As well, by organizing your prompt into various conversation states, it becomes easier to identify error modes and iterate more effectively.

  • When to use: If conversations feel disorganized, stall before reaching the goal, or the model struggles to effectively complete the objective.
  • What it does: Breaks the interaction into phases with clear goals, instructions and exit criteria.
  • How to adapt: Rename phases to match your workflow; modify instructions for each phase to follow your intended behavior; keep “Exit when” concrete and minimal.

Example

Sample Phrases

Sample phrases act as “anchor examples” for the model. They show the style, brevity, and tone you want it to follow, without locking it into one rigid response.

  • When to use: Responses lack your brand style or are not consistent.
  • What it does: Provides sample phrases the model can vary to stay natural and brief.
  • How to adapt: Swap examples for brand-fit; keep the “do not always use” warning.

Example

Note: If your voice system ends up consistently only repeating the sample phrases, leading to a more robotic voice experience, try adding the Variety constraint. We’ve seen this fix the issue.

Conversation flow + Sample Phrases

It is a useful pattern to add sample phrases in the different conversation flow states to teach the model what a good response looks like:

Example

Advanced Conversation Flow

As use cases grow more complex, you’ll need a structure that scales while keeping the model effective. The key is balancing maintainability with simplicity: too many rigid states can overload the model, hurting performance and making conversations feel robotic.

A better approach is to design flows that reduce the model’s perceived complexity. By handling state in a structured but flexible way, you make it easier for the model to stay focused and responsive, which improves user experience.

Two common patterns for managing complex scenarios are:

  1. Conversation Flow as State Machine
  2. Dynamic Conversation Flow via session.updates

Conversation Flow as State Machine

Define your conversation as a JSON structure that encodes both states and transitions. This makes it easy to reason about coverage, identify edge cases, and track changes over time. Since it’s stored as code, you can version, diff, and extend it as your flow evolves. A state machine also gives you fine-grained control over exactly how and when the conversation moves from one state to another.

Example

Dynamic Conversation Flow

In this pattern, the conversation adapts in real time by updating the system prompt and tool list based on the current state. Instead of exposing the model to all possible rules and tools at once, you only provide what’s relevant to the active phase of the conversation.

When the end conditions for a state are met, you use session.update to transition, replacing the prompt and tools with those needed for the next phase.

This approach reduces the model’s cognitive load, making it easier for it to handle complex tasks without being distracted by unnecessary context.

Example

Safety & Escalation

Often with Realtime voice agents, having a reliable way to escalate to a human is important. In this section, you should modify the instructions on WHEN to escalate depending on your use case.

  • When to use: Model is struggling to determine when to properly escalate to a human or fallback system
  • What it does: Defines fast, reliable escalation and what to say.
  • How to adapt: Insert your own thresholds and what the model has to say.

Example

The first example shows conversation responses from gpt-4o-realtime-preview-2025-06-03 using the instruction.

👁 escalate 06

The second example shows conversation responses from gpt-realtime-1.5 using the instruction.

👁 escalate 07

gpt-realtime-1.5 is able to follow the instruction and escalate to a human more reliably.

Next steps

Loading docs agent...