GPT Realtime models do not properly detect input audio on all SIP calls

Majad 0 Reputation points

I would like to request assistance with using OpenAI Realtime models over SIP connections, with the models provisioned through Microsoft Foundry.

I use a ‘gpt-realtime’ (GA version) model to handle SIP calls through a SIP trunk provider, together with a backend that authorizes calls and supervises events. Over the last week I’ve had intermittent issues where the model is unable to receive or process any audio input as communication starts. The behavior is inconsistent: some calls do not show the issue at all.

Up until the end of April, the service functioned normally, and all calls were able to interact with the model. The issue has appeared only in recent days.

I’ve ruled out issues with the SIP trunk provider. We suspect there might be issues with the gpt-realtime SIP connection or configuration. gpt-realtime-1.5 was also tested and showed similar issues.

As such, I would appreciate your support clarifying the following:

- If any changes have occurred to the core functionality of this model regarding SIP/API usage. (I’ve followed "Migration from Preview to GA version of Realtime API - Microsoft Foundry | Microsoft Learn" and applied those changes, but I would like to know if any of them are critical for call flow, or if any other changes to SIP communications or Realtime models are affecting my case.)

- If any issues or interferences related to gpt-realtime models and/or SIP communication are currently happening that might have an impact on functionality.

I appreciate your support with this matter.

SRILAKSHMI C 19,195 Reputation points • Microsoft External Staff • Moderator
Hello @Majad

Thank you for reaching out to Microsoft Q&A.

Based on your description and the behavior observed (intermittent loss of audio ingestion while session.created is consistently successful), this does not appear to be a SIP trunk or full session creation failure. Instead, it points toward a Realtime session initialization + early audio ingestion timing/contract issue within Azure AI Foundry using Azure OpenAI.

From your findings:

session.created is always received

SIP codec (G.711 μ-law) is validated

SIP provider confirms ACK/BYE behavior

VAD tuning changes had no impact

No Azure Service Health incidents identified

Issue persists across gpt-realtime and gpt-realtime-1.5

This suggests the issue is not model-specific, but instead related to session initiation behavior and audio streaming alignment.

Key Areas Identified

1. Realtime API GA endpoint contract

For GA Realtime usage, the service expects:

Correct GA endpoint format using /openai/v1

No legacy date-based api-version usage

No deprecated URL patterns for deployment routing

If the integration still uses legacy-style endpoints or mixed routing behavior, it can result in:

Inconsistent session initialization

Intermittent audio ingestion failures at session start
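A quick way to audit the URLs your backend builds is a small check like the one below. It is only a sketch: the hostname and deployment name are placeholders, and treating `api-version` and `deployment` query parameters as signs of the preview-style contract is an assumption that should be confirmed against the migration guide.

```python
from urllib.parse import urlparse, parse_qs


def check_ga_endpoint(url: str) -> list[str]:
    """Return the legacy patterns spotted in a Realtime URL (empty list = none found)."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    problems = []
    # The GA guidance in this thread: the path should go through /openai/v1
    if "/openai/v1" not in parsed.path:
        problems.append("path does not contain /openai/v1")
    # Assumption: a date-based api-version parameter indicates the preview contract
    if "api-version" in query:
        problems.append("legacy api-version query parameter present")
    # Assumption: routing by a deployment query parameter is the older URL pattern
    if "deployment" in query:
        problems.append("deployment query parameter present")
    return problems


# Placeholder URLs, not real resources
ga_url = "wss://example-resource.openai.azure.com/openai/v1/realtime?model=my-deployment"
old_url = ("wss://example-resource.openai.azure.com/openai/realtime"
           "?api-version=2025-04-01-preview&deployment=my-deployment")
print(check_ga_endpoint(ga_url))
print(check_ga_endpoint(old_url))
```

Running this against every URL in the stack (SIP accept call, sideband WebSocket, any REST calls) helps confirm that no component still mixes the two styles.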

2. Early audio streaming vs session readiness

A common pattern in Realtime SIP integrations is:

SIP session is established

session.created event is received

Audio streaming begins immediately

If audio frames are sent before the internal Realtime session is fully ready to accept media:

Initial RTP/audio frames may be dropped

Session remains active but “silent”

Audio ingestion resumes only after re-sync

This aligns closely with your intermittent behavior pattern.

3. VAD configuration limitations

Even though you are using:

server_vad

Adjusted threshold, padding, and silence duration

These parameters control voice activity detection, but do not fully control:

Initial audio buffer readiness

First-frame ingestion behavior during session bootstrap

So VAD tuning alone will not fully resolve startup ingestion issues.

4. SIP media timing or early audio buffering

Given that some calls terminate quickly (ACK/BYE observed), a possible contributing factor is:

No valid audio detected within initial ingestion window

Early RTP packets not aligned or delayed

Session interpreted as inactive or failed

This is often sensitive to:

First RTP packet timing

Jitter at call start

SIP trunk early media behavior

5. Model + quota + availability checks

Even though less likely in your case, it is still recommended to confirm:

Deployment quota is not exhausted (TPM limits)

Model is still available in your region/SKU

No RBAC or permission constraints affecting runtime behavior

Recommended Actions

1. Verify GA endpoint compliance

Ensure your integration uses:

/openai/v1 GA endpoint format

No legacy API versioning or hybrid routing patterns

2. Introduce session stabilization delay

After session.created is received, wait roughly 300-800 ms before sending audio frames.

This allows:

Session initialization completion

Internal buffer readiness

3. Align SIP --> Realtime audio start sequence

Recommended sequence:

SIP call connected

session.created received

Small delay (buffer stabilization)

Start RTP audio streaming
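The sequence above can be sketched as a small gate that holds audio frames until the session is reported ready and a short settle delay has passed. This assumes your backend actually sits in the media path; if the SIP trunk sends RTP directly to the service, the backend cannot gate audio this way. `AudioGate` and `send_frame` are hypothetical names, and the 0.5 s default is just a value inside the 300-800 ms range suggested above.

```python
import asyncio


class AudioGate:
    """Hold audio frames until the session is ready plus a settle delay."""

    def __init__(self, send_frame, settle_seconds: float = 0.5):
        self._send_frame = send_frame  # hypothetical async callable taking bytes
        self._settle_seconds = settle_seconds
        self._pending: list[bytes] = []
        self._open = False

    async def push(self, frame: bytes) -> None:
        """Forward the frame if the gate is open, otherwise queue it."""
        if self._open:
            await self._send_frame(frame)
        else:
            self._pending.append(frame)

    async def on_session_ready(self) -> None:
        """Call when the readiness event arrives (session.created in this thread)."""
        await asyncio.sleep(self._settle_seconds)
        # Flush in arrival order; frames pushed during the flush are drained too
        while self._pending:
            await self._send_frame(self._pending.pop(0))
        self._open = True
```

Frames pushed before `on_session_ready()` completes are delivered in their original order, so nothing from the start of the call is dropped by the gate itself.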

4. Validate RTP stream behavior at call start

Work with SIP provider to ensure:

Consistent first RTP packet delivery

No delayed or bursty audio start

Stable 20ms packet intervals (for G.711 μ-law)
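If the provider can share a packet capture, a short script can summarize how the RTP stream behaves at call start. The sketch below assumes you have already extracted packet arrival times in milliseconds, measured from the moment the call was answered; the function name and the 5 ms tolerance are illustrative choices, not values from any specification.

```python
from statistics import mean


def rtp_start_report(arrival_ms: list[float],
                     expected_ms: float = 20.0,
                     tolerance_ms: float = 5.0) -> dict:
    """Summarize first-packet delay and gap regularity for one call."""
    if len(arrival_ms) < 2:
        raise ValueError("need at least two packets to measure gaps")
    # Gap between each packet and the one before it
    gaps = [b - a for a, b in zip(arrival_ms, arrival_ms[1:])]
    # Gaps that deviate from the expected packet interval by more than the tolerance
    irregular = [g for g in gaps if abs(g - expected_ms) > tolerance_ms]
    return {
        "first_packet_ms": arrival_ms[0],
        "mean_gap_ms": mean(gaps),
        "max_gap_ms": max(gaps),
        "irregular_gaps": len(irregular),
    }


steady = rtp_start_report([0, 20, 40, 60])
bursty = rtp_start_report([350, 370, 371, 372, 430])
print(steady)
print(bursty)
```

Comparing these numbers between calls that work and calls that stay silent shows whether a late first packet or a bursty start correlates with the failures. For reference, a 20 ms G.711 packet at 8 kHz carries 160 bytes of payload.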

5. Validate full request/response contract consistency

Confirm:

No mixed usage of preview + GA Realtime contracts

Consistent deployment + endpoint pairing

No fallback to older API behaviors in any part of the stack

Please refer to this:

Use the GPT Realtime API for speech and audio (connection methods, SIP/WebRTC/WebSocket, GA /openai/v1 guidance): https://learn.microsoft.com/azure/foundry/openai/how-to/realtime-audio

I hope this helps. Do let me know if you have any further queries.

Thank you!

SRILAKSHMI C 19,195 Reputation points • Microsoft External Staff • Moderator

Hi @Majad

Did you get a chance to review the above response? Do let me know if you have any further queries.

Thank you!

Majad 0 Reputation points

Hi SRILAKSHMI, I'm still debugging this issue, so I'll have an answer soon.

Thank you for your patience.

Majad 0 Reputation points

Hi SRILAKSHMI, based on the response provided I determined the issue occurs when a call is started and the model is expected to provide a greeting message.

This was not an issue before, since conversations used to be started when the WebSocket "open" message arrived. This has since been changed so that they are started after "open" has completed.
I've been trying to time these responses to allow the model to properly start the conversation, but I noticed that I never receive session.created on any call (even successful ones), and only get a session.updated when I send the first directives.

Is this normal behavior? Or is there another way to ensure the calls are properly started?
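One way to avoid depending on session.created here is to key the greeting off session.updated, since that event is the one observed on every call. The sketch below sends the first session.update, waits for session.updated, and only then sends response.create. It is an assumption-laden outline: `ws_send` and the `events` queue are hypothetical stand-ins for your own WebSocket plumbing, and whether session.updated is the right readiness signal for SIP calls is exactly what should be confirmed with support.

```python
import asyncio
import json


async def _wait_for_event(events: asyncio.Queue, wanted: str) -> dict:
    """Consume parsed server events until one of the wanted type arrives."""
    while True:
        event = await events.get()
        if event.get("type") == wanted:
            return event


async def send_greeting_when_ready(ws_send, events: asyncio.Queue,
                                   session_update: dict,
                                   timeout: float = 5.0) -> bool:
    """Send session.update, wait for session.updated, then request the greeting."""
    await ws_send(json.dumps(session_update))
    try:
        await asyncio.wait_for(_wait_for_event(events, "session.updated"), timeout)
    except asyncio.TimeoutError:
        return False  # no confirmation; the caller decides whether to retry or hang up
    await ws_send(json.dumps({"type": "response.create"}))
    return True
```

Returning `False` on timeout gives the backend a clear point to log the call ID and timestamp, which is useful evidence for a support ticket.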

2 answers

Majad 0 Reputation points
Hello Jerald,

Thank you for your response. Based on my testing, I was able to gather the following findings:
- No incidents were identified in Azure Service Health related to the use of OpenAI Realtime or SIP in any of the supported regions.
- The latest updates to the Realtime models on the Microsoft website do not indicate changes that would impact this implementation, as they mainly apply to the gpt-realtime-2 model.
- Adjustments were made on the backend to ensure that the session.created event is successfully established; however, interruptions during certain calls still persist.
- I am currently using server_vad. My initial configuration values were:
  
  Threshold: 0.6
  
  Prefix padding (ms): 300
  
  Silence duration (ms): 800
  
  I have tested several variations—primarily lowering the threshold and increasing both prefix padding and silence duration—but none of these changes resolved the issue.
- I verified that the audio codec and format are correctly supported by the SIP trunk provider, using G.711 µ-law (ulaw). The provider also reported that the ACK/BYE messages are being sent by the model, which appears to terminate the call almost immediately.
- The OpenAI_Webhook_Secret remains consistent across all requests, and, as mentioned previously, every call successfully results in a valid session.created event.

Jerald Felix 13,500 Reputation points • Volunteer Moderator

Hello Majad,

Greetings! Thanks for raising this question in the Q&A forum.

The intermittent audio input detection failure you're experiencing on SIP calls with the GPT Realtime model is most likely caused by a combination of two things: a subtle race condition in how the Realtime model initializes audio input processing at the start of a session, and potential codec or audio format negotiation timing differences between individual SIP calls. Since the issue is inconsistent (some calls work, some don't) and started appearing only recently without any changes on your end, it strongly suggests a backend service-side change or regression introduced after April that is affecting how the SIP audio stream gets picked up at session start.

This type of intermittent audio failure is a known behavior pattern with the current GPT Realtime models. It is related to how the model handles audio stream initialization internally, and is not necessarily an error in your integration logic.

Here are the steps I'd recommend to investigate and work around this:

First, check the Azure Service Health dashboard at https://status.azure.com and filter for Azure OpenAI or Azure AI Foundry in your deployed regions (East US 2 or Sweden Central). There may be an ongoing or recently resolved incident that correlates with the late-April timeframe when things started breaking.

Review the Azure OpenAI What's New page at https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new for any updates pushed after April. The Realtime API has recently received SIP support for telephony connections, and new model features have been added — it's possible a backend rollout introduced a regression in audio detection timing for existing SIP integrations.

On your backend, check whether the session.created event is being fully received and acknowledged before your SIP trunk starts sending audio. A common cause of intermittent input failure is that audio from the caller arrives at the model endpoint before the session is fully ready. Add a small buffer or gate the audio stream until you receive session.created confirmation.

Verify your turn_detection configuration in the session.update event. If you're using server_vad (Voice Activity Detection), confirm that the threshold, silence_duration_ms, and prefix_padding_ms values are appropriately set. Too high a silence threshold can cause the model to miss the beginning of a caller's speech, especially for calls where audio starts immediately.
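For reference, a session.update event carrying these settings could be built as in the sketch below. The values are the ones reported in this thread (0.6 / 300 / 800). Placing turn_detection under `session.audio.input` reflects my understanding of the GA event layout, whereas the preview API placed it directly under `session`; treat that nesting as an assumption and verify it against the migration guide.

```python
import json


def build_vad_session_update(threshold: float = 0.6,
                             prefix_padding_ms: int = 300,
                             silence_duration_ms: int = 800) -> str:
    """Build a session.update event with server_vad settings (assumed GA layout)."""
    event = {
        "type": "session.update",
        "session": {
            "type": "realtime",
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "server_vad",
                        "threshold": threshold,
                        "prefix_padding_ms": prefix_padding_ms,
                        "silence_duration_ms": silence_duration_ms,
                    }
                }
            },
        },
    }
    return json.dumps(event)


payload = build_vad_session_update()
print(payload)
```

Logging the session.updated event that comes back and comparing its turn_detection block with what was sent is a simple way to confirm the settings were accepted rather than silently ignored.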

Check the audio codec and format being negotiated on the SIP calls that fail versus those that succeed. The GPT Realtime API over SIP expects PCM16 audio at 24kHz. If some SIP calls are negotiating a different codec (such as G.711 or G.729) and your transcoder introduces even a slight delay or format mismatch, the model may fail to pick up the initial audio stream.

Also verify your OPENAI_WEBHOOK_SECRET environment variable matches the secret from webhook creation, and ensure you're passing raw request body bytes — not parsed JSON — to the unwrap function, and that no middleware is modifying the request body before verification. These subtle configuration issues can cause intermittent failures that look like audio processing problems.
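The reason raw bytes matter can be shown with a generic HMAC example. This is not the actual webhook signature scheme; the secret, payload, and `sign` helper are made up for illustration. It only demonstrates that parsing and re-serializing JSON can change the bytes and therefore break any signature computed over the original body.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Illustrative HMAC-SHA256 over the body (not the real webhook scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


secret = b"hypothetical-secret"
# Made-up payload as it arrived on the wire; note the double space after the comma
raw_body = b'{"type": "realtime.call.incoming",  "data": {"call_id": "abc"}}'
signature = sign(secret, raw_body)

# What a body-parsing middleware effectively hands you if you re-serialize
reserialized = json.dumps(json.loads(raw_body)).encode()

raw_ok = hmac.compare_digest(signature, sign(secret, raw_body))
reserialized_ok = hmac.compare_digest(signature, sign(secret, reserialized))
print(raw_ok, reserialized_ok)
```

The two bodies are equivalent JSON, yet only the untouched bytes verify, which is why the verification step should run before any JSON middleware touches the request.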

Since gpt-realtime-1.5 also showed the same issue, this helps rule out a model-version-specific bug and points more toward the SIP session initialization or backend infrastructure. I would strongly recommend raising an Azure Support ticket with the following details: your deployment region, the model version used, sample call timestamps where the failure occurred, and the specific event logs from your backend showing what events were received (or not received) from the Realtime API during the failed calls. This will allow the Azure OpenAI engineering team to check for any backend changes that may have impacted SIP audio handling after April.

The most actionable next step while you wait for support is to add a guard in your backend that buffers incoming SIP audio and only begins forwarding it to the Realtime API once the session.created event has been confirmed — this alone has resolved similar intermittent input detection issues for other developers.

If this answer helps, kindly accept it; that will help others who have similar questions.

Best Regards,

Jerald Felix.

Majad 0 Reputation points

Hello Jerald, pardon the delay in properly posting this as a comment.


URL: https://learn.microsoft.com/en-us/answers/questions/5911976/gpt-realtime-models-do-not-properly-detect-input-a
