GPT Realtime models do not properly detect input audio on all SIP calls
I would like to request assistance with using OpenAI Realtime models over SIP connections, with the models provisioned through Microsoft Foundry.
I use a ‘gpt-realtime’ (GA version) model to handle SIP calls through a SIP trunk provider, together with a backend that authorizes calls and supervises events. Over the last week I’ve had intermittent cases where the model is unable to receive or process any audio input when the call starts. The behavior is inconsistent: some calls do not present the issue at all.
Up until the end of April, the service functioned normally, and all calls were able to interact with the model. The issue has appeared only in recent days.
I’ve ruled out issues with the SIP trunk provider. We suspect there might be issues with the gpt-realtime SIP connection or configuration. gpt-realtime-1.5 was also tested and presented similar issues.
As such, I would appreciate your support clarifying the following:
- If any changes have occurred to the core functionality of this model regarding SIP/API usage. (I’ve followed “Migration from Preview to GA version of Realtime API - Microsoft Foundry | Microsoft Learn” and have applied those changes, but I would like to know if any of them are critical for call flow, or if any other changes to SIP communications or Realtime models are affecting my case.)
- If any issues or interferences related to gpt-realtime models and/or SIP communication are currently happening that might have an impact on functionality.
I appreciate your support with this matter.
-
SRILAKSHMI C 19,195 Reputation points • Microsoft External Staff • Moderator
Hello @Majad
Thank you for reaching out to Microsoft Q&A.
Based on your description and the behavior observed (intermittent loss of audio ingestion while session.created is consistently successful), this does not appear to be a SIP trunk or full session creation failure. Instead, it points toward a Realtime session initialization + early audio ingestion timing/contract issue within Azure AI Foundry using Azure OpenAI.
From your findings:
- session.created is always received
- SIP codec (G.711 μ-law) is validated
- SIP provider confirms ACK/BYE behavior
- VAD tuning changes had no impact
- No Azure Service Health incidents identified
- Issue persists across gpt-realtime and gpt-realtime-1.5
This suggests the issue is not model-specific, but instead related to session initiation behavior and audio streaming alignment.
Key Areas Identified
1. Realtime API GA endpoint contract
For GA Realtime usage, the service expects:
- Correct GA endpoint format using /openai/v1
- No legacy date-based api-version usage
- No deprecated URL patterns for deployment routing
If the integration still uses legacy-style endpoints or mixed routing behavior, it can result in:
- Inconsistent session initialization
- Intermittent audio ingestion failures at session start
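As a rough illustration of the endpoint checks above, here is a minimal sketch that flags the two patterns mentioned (a path without /openai/v1 and a date-based api-version query parameter). The function name, the host name, and the example URLs are hypothetical; the exact URL shape for your resource should be taken from the migration guide, not from this snippet.

```python
from urllib.parse import urlparse, parse_qs


def check_realtime_endpoint(url: str) -> list[str]:
    """Return warnings for a Realtime endpoint URL (empty list if none found)."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    warnings = []
    # GA usage is expected to route through the /openai/v1 path.
    if "/openai/v1" not in parsed.path:
        warnings.append("path does not contain /openai/v1")
    # A date-based api-version parameter indicates a legacy-style call.
    if "api-version" in query:
        warnings.append("legacy api-version query parameter present")
    return warnings


# Hypothetical URLs, for illustration only.
ga_url = "wss://example-resource.openai.azure.com/openai/v1/realtime?model=my-deployment"
legacy_url = (
    "wss://example-resource.openai.azure.com/openai/realtime"
    "?api-version=2025-04-01-preview&deployment=my-deployment"
)
print(check_realtime_endpoint(ga_url))
print(check_realtime_endpoint(legacy_url))
```

Running a check like this over every place the stack builds a Realtime URL (backend, webhook handler, any helper scripts) can help confirm that no component still uses the older pattern.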
2. Early audio streaming vs session readiness
A common pattern in Realtime SIP integrations is:
- SIP session is established
- session.created event is received
- Audio streaming begins immediately
If audio frames are sent before the internal Realtime session is fully ready to accept media:
- Initial RTP/audio frames may be dropped
- Session remains active but “silent”
- Audio ingestion resumes only after re-sync
This aligns closely with your intermittent behavior pattern.
3. VAD configuration limitations
Even though you are using:
- server_vad
- Adjusted threshold, padding, and silence duration
These parameters control voice activity detection, but do not fully control:
- Initial audio buffer readiness
- First-frame ingestion behavior during session bootstrap
So VAD tuning alone will not fully resolve startup ingestion issues.
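For reference, a session.update event carrying the server_vad values reported in this thread (threshold 0.6, prefix padding 300 ms, silence duration 800 ms) might be built as in the sketch below. The nesting under session.audio.input reflects my understanding of the GA event shape and is an assumption here; the preview contract placed turn_detection directly under session, so please verify against the migration guide. The helper name is hypothetical.

```python
import json


def build_vad_session_update(
    threshold: float = 0.6,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 800,
) -> str:
    """Serialize a session.update event with server_vad turn detection."""
    event = {
        "type": "session.update",
        "session": {
            "type": "realtime",
            # Assumed GA nesting; confirm against the current documentation.
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "server_vad",
                        "threshold": threshold,
                        "prefix_padding_ms": prefix_padding_ms,
                        "silence_duration_ms": silence_duration_ms,
                    }
                }
            },
        },
    }
    return json.dumps(event)


payload = build_vad_session_update()
print(payload)
```

Logging the exact payload that is sent, next to the session.updated event that comes back, makes it easier to see whether the service accepted the values as written.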
4. SIP media timing or early audio buffering
Given that some calls terminate quickly (ACK/BYE observed), a possible contributing factor is:
- No valid audio detected within initial ingestion window
- Early RTP packets not aligned or delayed
- Session interpreted as inactive or failed
This is often sensitive to:
- First RTP packet timing
- Jitter at call start
- SIP trunk early media behavior
5. Model + quota + availability checks
Even though less likely in your case, it is still recommended to confirm:
- Deployment quota is not exhausted (TPM limits)
- Model is still available in your region/SKU
- No RBAC or permission constraints affecting runtime behavior
Recommended Actions
1. Verify GA endpoint compliance
Ensure your integration uses:
- /openai/v1 GA endpoint format
- No legacy API versioning or hybrid routing patterns
2. Introduce session stabilization delay
After session.created, wait ~300-800 ms before sending audio frames.
This allows:
- Session initialization completion
- Internal buffer readiness
3. Align SIP --> Realtime audio start sequence
Recommended sequence:
- SIP call connected
- session.created received
- Small delay (buffer stabilization)
- Start RTP audio streaming
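The sequence above can be sketched as a small gate that buffers caller audio until the session is reported ready and a short settle delay has passed. This is only a sketch and assumes your backend actually relays audio frames; in a setup where RTP flows directly from the SIP trunk to the service, the backend may not be in the media path at all. AudioGate, settle_seconds, max_buffered, and the send callable are hypothetical names, and 0.5 s is simply a value inside the ~300-800 ms range mentioned above.

```python
import asyncio
from collections import deque


class AudioGate:
    """Hold audio frames until the session is ready, then forward in order."""

    def __init__(self, send, settle_seconds: float = 0.5, max_buffered: int = 250):
        self._send = send  # async callable that forwards one frame
        self._settle = settle_seconds
        # Bounded buffer: when full, the oldest frame is dropped.
        self._buffer = deque(maxlen=max_buffered)
        self._open = False

    async def on_session_ready(self) -> None:
        """Call when session.created (or your chosen ready signal) arrives."""
        await asyncio.sleep(self._settle)  # stabilization delay
        while self._buffer:
            await self._send(self._buffer.popleft())
        self._open = True  # set right after the drain, with no await in between

    async def push(self, frame: bytes) -> None:
        """Forward a frame if the gate is open, otherwise buffer it."""
        if self._open:
            await self._send(frame)
        else:
            self._buffer.append(frame)


async def demo() -> list[bytes]:
    sent: list[bytes] = []

    async def send(frame: bytes) -> None:
        sent.append(frame)

    gate = AudioGate(send, settle_seconds=0.01)
    await gate.push(b"frame-1")  # buffered, session not ready yet
    await gate.push(b"frame-2")
    await gate.on_session_ready()  # flushes the buffer after the delay
    await gate.push(b"frame-3")  # forwarded immediately
    return sent


forwarded = asyncio.run(demo())
print(forwarded)
```

Buffering rather than discarding early frames keeps the caller's first words intact, at the cost of a small added delay at call start.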
4. Validate RTP stream behavior at call start
Work with SIP provider to ensure:
- Consistent first RTP packet delivery
- No delayed or bursty audio start
- Stable 20ms packet intervals (for G.711 μ-law)
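To make the 20 ms figure concrete: G.711 is sampled at 8 kHz with one byte per sample, so a 20 ms packet carries 8000 × 0.020 = 160 bytes of payload. The sketch below computes that and checks a list of packet arrival times for gaps at call start. The function names and the 5 ms tolerance are hypothetical choices for illustration, and the arrival times would come from a packet capture supplied by you or the SIP provider.

```python
G711_SAMPLE_RATE_HZ = 8000  # G.711 uses 8 kHz sampling, one byte per sample


def g711_payload_bytes(packet_ms: int = 20) -> int:
    """Payload size in bytes of one G.711 packet of the given duration."""
    return G711_SAMPLE_RATE_HZ * packet_ms // 1000


def packet_interval_report(
    arrival_times_ms: list[float],
    expected_ms: float = 20.0,
    tolerance_ms: float = 5.0,  # hypothetical threshold, tune as needed
) -> dict:
    """Summarize inter-packet intervals and count those outside tolerance."""
    intervals = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    off = [i for i in intervals if abs(i - expected_ms) > tolerance_ms]
    return {
        "intervals_ms": intervals,
        "out_of_tolerance": len(off),
        "max_gap_ms": max(intervals, default=0.0),
    }


# Made-up arrival times (ms) with one 60 ms gap, for illustration only.
report = packet_interval_report([0.0, 20.0, 40.0, 100.0, 120.0])
print(g711_payload_bytes(), report)
```

Comparing this report between failing and successful calls may show whether delayed or bursty audio at call start correlates with the silent sessions.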
5. Validate full request or response contract consistency
Confirm:
- No mixed usage of preview + GA Realtime contracts
- Consistent deployment + endpoint pairing
- No fallback to older API behaviors in any part of the stack
Please refer to this:
Use the GPT Realtime API for speech and audio (connection methods, SIP/WebRTC/WebSocket, GA /openai/v1 guidance): https://learn.microsoft.com/azure/foundry/openai/how-to/realtime-audio
I hope this helps. Do let me know if you have any further queries.
Thank you!
-
-
SRILAKSHMI C 19,195 Reputation points • Microsoft External Staff • Moderator
Hi @Majad
Did you get a chance to review the above response? Do let me know if you have any further queries.
Thank you!
-
Majad 0 Reputation points
Hi SRILAKSHMI, based on the response provided, I determined that the issue occurs when a call is started and the model is expected to provide a greeting message.
This was not an issue before, since conversations used to be started on the "open" WebSocket message. I have since changed this so that the conversation is started after "open" has completed.
I've been trying to time these responses to allow the model to properly start the conversation, but I noticed that I never receive session.created on any calls (even successful ones), and only get a session.updated when I send the first directives.
Is this normal behavior? Or is there another way to ensure the calls are properly started?
2 answers
-
Jerald Felix 13,500 Reputation points • Volunteer Moderator
Hello Majad,
Greetings! Thanks for raising this question in Q&A forum.
The intermittent audio input detection failure you're experiencing on SIP calls with the GPT Realtime model is most likely caused by a combination of two things: a subtle race condition in how the Realtime model initializes audio input processing at the start of a session, and potential codec or audio format negotiation timing differences between individual SIP calls. Since the issue is inconsistent (some calls work, some don't) and started appearing only recently without any changes on your end, it strongly suggests a backend service-side change or regression introduced after April that is affecting how the SIP audio stream gets picked up at session start.
This type of intermittent audio failure is a known behavior pattern with the current GPT Realtime models. It is related to how the model handles audio stream initialization internally, and is not necessarily an error in your integration logic.
Here are the steps I'd recommend to investigate and work around this:
First, check the Azure Service Health dashboard at https://status.azure.com and filter for Azure OpenAI or Azure AI Foundry in your deployed regions (East US 2 or Sweden Central). There may be an ongoing or recently resolved incident that correlates with the late-April timeframe when things started breaking.
Review the Azure OpenAI What's New page at https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new for any updates pushed after April. The Realtime API has recently received SIP support for telephony connections, and new model features have been added — it's possible a backend rollout introduced a regression in audio detection timing for existing SIP integrations.
On your backend, check whether the session.created event is being fully received and acknowledged before your SIP trunk starts sending audio. A common cause of intermittent input failure is that audio from the caller arrives at the model endpoint before the session is fully ready. Add a small buffer or gate the audio stream until you receive session.created confirmation.
Verify your turn_detection configuration in the session.update event. If you're using server_vad (Voice Activity Detection), confirm that the threshold, silence_duration_ms, and prefix_padding_ms values are appropriately set. Too high a silence threshold can cause the model to miss the beginning of a caller's speech, especially for calls where audio starts immediately.
Check the audio codec and format being negotiated on the SIP calls that fail versus those that succeed. The GPT Realtime API over SIP expects PCM16 audio at 24kHz. If some SIP calls are negotiating a different codec (such as G.711 or G.729) and your transcoder introduces even a slight delay or format mismatch, the model may fail to pick up the initial audio stream.
Also verify that your OPENAI_WEBHOOK_SECRET environment variable matches the secret from webhook creation, and ensure you're passing raw request body bytes (not parsed JSON) to the unwrap function, and that no middleware is modifying the request body before verification. These subtle configuration issues can cause intermittent failures that look like audio processing problems.
Since gpt-realtime-1.5 also showed the same issue, this helps rule out a model-version-specific bug and points more toward the SIP session initialization or backend infrastructure. I would strongly recommend raising an Azure Support ticket with the following details: your deployment region, the model version used, sample call timestamps where the failure occurred, and the specific event logs from your backend showing what events were received (or not received) from the Realtime API during the failed calls. This will allow the Azure OpenAI engineering team to check for any backend changes that may have impacted SIP audio handling after April.
The most actionable next step while you wait for support is to add a guard in your backend that buffers incoming SIP audio and only begins forwarding it to the Realtime API once the session.created event has been confirmed. This alone has resolved similar intermittent input detection issues for other developers.
If this answer helps you, kindly accept the answer, which will help others who have similar questions.
Best Regards,
Jerald Felix.
-
Majad 0 Reputation points
Hello Jerald, apologies for the delay in properly posting this as a comment.
Thank you for your response. Based on my testing, I was able to gather the following findings:
- No incidents were identified in Azure Service Health related to the use of OpenAI Realtime or SIP in any of the supported regions.
- The latest updates to the Realtime models on the Microsoft website do not indicate changes that would impact this implementation, as they mainly apply to the gpt-realtime-2 model.
- Adjustments were made on the backend to ensure that the session.created event is successfully established; however, interruptions during certain calls still persist.
- I am currently using server_vad. My initial configuration values were:
  - Threshold: 0.6
  - Prefix padding (ms): 300
  - Silence duration (ms): 800
- I verified that the audio codec and format are correctly supported by the SIP trunk provider, using G.711 µ-law (ulaw). The provider also reported that the ACK/BYE messages are being sent by the model, which appears to terminate the call almost immediately.
- The OpenAI_Webhook_Secret remains consistent across all requests, and, as mentioned previously, every call successfully results in a valid session.created event.
