GPT Realtime models do not properly detect input audio on all SIP calls

Majad 0 Reputation points

I would like to request assistance regarding the use of OpenAI Realtime models over SIP connections, with the models provisioned through Microsoft Foundry.

I use a ‘gpt-realtime’ (GA version) model to handle SIP calls through a SIP trunk provider, together with a backend that authorizes calls and supervises events. Over the last week I’ve had intermittent issues where the model is unable to receive or process any audio input when communication starts. The behavior is inconsistent: some calls do not present the issue at all.

Up until the end of April, the service functioned normally, and all calls were able to interact with the model. The issue has appeared only in recent days.

I’ve discarded any issues with the SIP Trunk provider. We suspect there might be issues with the gpt-realtime SIP connection or configuration. gpt-realtime-1.5 was also tested but presented similar issues.

As such, I would appreciate your support clarifying the following:

- If any changes have occurred to the core functionality of this model regarding SIP/API usage. I’ve followed "Migration from Preview to GA version of Realtime API - Microsoft Foundry | Microsoft Learn" and applied those changes, but I would like to know if any of them are critical for call flow, or if any other changes to SIP communications or Realtime models are affecting my case.

- If any issues or interferences related to gpt-realtime models and/or SIP communication are currently happening that might have an impact on functionality.

I appreciate your support with this matter.

  1. SRILAKSHMI C 19,195 Reputation points Microsoft External Staff Moderator

    Hello @Majad

    Thank you for reaching out to Microsoft Q&A.

    Based on your description and the behavior observed (intermittent loss of audio ingestion while session.created is consistently successful), this does not appear to be a SIP trunk or full session creation failure. Instead, it points toward a Realtime session initialization + early audio ingestion timing/contract issue within Azure AI Foundry using Azure OpenAI.

    From your findings:

    • session.created is always received
    • SIP codec (G.711 μ-law) is validated
    • SIP provider confirms ACK/BYE behavior
    • VAD tuning changes had no impact
    • No Azure Service Health incidents identified
    • Issue persists across gpt-realtime and gpt-realtime-1.5

    This suggests the issue is not model-specific, but instead related to session initiation behavior and audio streaming alignment.

    Key Areas Identified

    1. Realtime API GA endpoint contract

    For GA Realtime usage, the service expects:

    • Correct GA endpoint format using /openai/v1
    • No legacy date-based api-version usage
    • No deprecated URL patterns for deployment routing

    If the integration still uses legacy-style endpoints or mixed routing behavior, it can result in:

    • Inconsistent session initialization
    • Intermittent audio ingestion failures at session start

    2. Early audio streaming vs session readiness

    A common pattern in Realtime SIP integrations is:

    • SIP session is established
    • session.created event is received
    • Audio streaming begins immediately

    If audio frames are sent before the internal Realtime session is fully ready to accept media:

    • Initial RTP/audio frames may be dropped
    • Session remains active but “silent”
    • Audio ingestion resumes only after re-sync

    This aligns closely with your intermittent behavior pattern.

    3. VAD configuration limitations

    Even though you are using:

    • server_vad
    • Adjusted threshold, padding, and silence duration

    These parameters control voice activity detection, but do not fully control:

    • Initial audio buffer readiness
    • First-frame ingestion behavior during session bootstrap

    So VAD tuning alone will not fully resolve startup ingestion issues.

    4. SIP media timing or early audio buffering

    Given that some calls terminate quickly (ACK/BYE observed), a possible contributing factor is:

    • No valid audio detected within initial ingestion window
    • Early RTP packets not aligned or delayed
    • Session interpreted as inactive or failed

    This is often sensitive to:

    • First RTP packet timing
    • Jitter at call start
    • SIP trunk early media behavior

    5. Model + quota + availability checks

    Even though less likely in your case, it is still recommended to confirm:

    • Deployment quota is not exhausted (TPM limits)
    • Model is still available in your region/SKU
    • No RBAC or permission constraints affecting runtime behavior

    Recommended Actions

    1. Verify GA endpoint compliance

    Ensure your integration uses:

    • /openai/v1 GA endpoint format
    • No legacy API versioning or hybrid routing patterns
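A quick way to spot the two red flags above in a configured URL is a small check like the following. This is a minimal sketch, not an official validator: the helper name `check_ga_endpoint`, the resource host, and the example query strings are all hypothetical, and it only looks for the `/openai/v1` path segment and a legacy `api-version` query parameter.

```python
from urllib.parse import urlparse, parse_qs


def check_ga_endpoint(url: str) -> list[str]:
    """Return a list of red flags; an empty list means none were found."""
    problems = []
    parsed = urlparse(url)
    # GA guidance in this thread: the path should use /openai/v1.
    if "/openai/v1" not in parsed.path:
        problems.append("path does not contain /openai/v1")
    # Date-based api-version parameters belong to the legacy contract.
    if "api-version" in parse_qs(parsed.query):
        problems.append("legacy api-version query parameter present")
    return problems


# Hypothetical URLs, for illustration only.
ga_url = "wss://example-resource.openai.azure.com/openai/v1/realtime?model=gpt-realtime"
legacy_url = (
    "wss://example-resource.openai.azure.com/openai/realtime"
    "?api-version=2025-04-01-preview&deployment=gpt-realtime"
)
print(check_ga_endpoint(ga_url))
print(check_ga_endpoint(legacy_url))
```

Running the same check over every URL in the stack (webhook handler, call-accept request, monitoring WebSocket) helps catch a single component that still uses the older pattern.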

    2. Introduce session stabilization delay

    After session.created, wait ~300-800 ms before sending audio frames.

    This allows:

    • Session initialization completion
    • Internal buffer readiness

    3. Align SIP --> Realtime audio start sequence

    Recommended sequence:

    1. SIP call connected
    2. session.created received
    3. Small delay (buffer stabilization)
    4. Start RTP audio streaming
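The sequence above can be sketched as a small gate that buffers frames until the session is reported ready, waits a short settle time, and then flushes. This is an illustrative sketch under assumptions: `AudioGate`, `send`, and `settle_seconds` are hypothetical names, and it only applies where your backend actually relays audio. If the SIP trunk delivers RTP directly to the service, the same pattern could instead hold back control messages such as the first greeting request.

```python
import asyncio


class AudioGate:
    """Buffers frames until the session is marked ready, then forwards them."""

    def __init__(self, send, settle_seconds: float = 0.5):
        self._send = send              # async callable that delivers one frame
        self._settle = settle_seconds  # stabilization delay after readiness
        self._ready = False
        self._pending: list[bytes] = []

    async def on_session_ready(self) -> None:
        # Call this when the first session event for the call arrives.
        await asyncio.sleep(self._settle)
        # Flush buffered frames in arrival order before opening the gate.
        while self._pending:
            await self._send(self._pending.pop(0))
        self._ready = True

    async def push(self, frame: bytes) -> None:
        if self._ready:
            await self._send(frame)
        else:
            self._pending.append(frame)


async def demo() -> list[bytes]:
    sent: list[bytes] = []

    async def fake_send(frame: bytes) -> None:
        sent.append(frame)

    gate = AudioGate(fake_send, settle_seconds=0.01)
    await gate.push(b"a")          # arrives before the session is ready
    await gate.push(b"b")
    await gate.on_session_ready()  # settle, then flush a and b
    await gate.push(b"c")          # forwarded immediately
    return sent


print(asyncio.run(demo()))
```

Buffering rather than dropping early frames keeps the caller's first words, at the cost of a small added delay at call start.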

    4. Validate RTP stream behavior at call start

    Work with SIP provider to ensure:

    • Consistent first RTP packet delivery
    • No delayed or bursty audio start
    • Stable 20ms packet intervals (for G.711 μ-law)
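If you can export packet arrival times from a capture at call start, a short script can flag delayed or bursty delivery. The sketch below assumes arrival timestamps in milliseconds and a hypothetical 5 ms tolerance. For G.711 at 8 kHz, a 20 ms packet carries 160 one-byte samples, which is where the expected interval comes from.

```python
def irregular_intervals(
    arrival_ms: list[float],
    expected_ms: float = 20.0,
    tolerance_ms: float = 5.0,
) -> list[tuple[int, float]]:
    """Return (packet index, gap in ms) for gaps outside expected +/- tolerance."""
    flagged = []
    for i in range(1, len(arrival_ms)):
        gap = arrival_ms[i] - arrival_ms[i - 1]
        if abs(gap - expected_ms) > tolerance_ms:
            flagged.append((i, gap))
    return flagged


# Made-up arrival times: a late packet followed by a burst.
arrivals = [0.0, 20.0, 40.0, 120.0, 121.0, 140.0]
print(irregular_intervals(arrivals))
```

Comparing the flagged gaps for failing calls against working calls gives the SIP provider something concrete to investigate.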

    5. Validate full request or response contract consistency

    Confirm:

    • No mixed usage of preview + GA Realtime contracts
    • Consistent deployment + endpoint pairing
    • No fallback to older API behaviors in any part of the stack

    Please refer to this:

    Use the GPT Realtime API for speech and audio (connection methods, SIP/WebRTC/WebSocket, GA /openai/v1 guidance): https://learn.microsoft.com/azure/foundry/openai/how-to/realtime-audio

    I hope this helps. Do let me know if you have any further queries.

    Thank you!

  2. SRILAKSHMI C 19,195 Reputation points Microsoft External Staff Moderator

    Hi @Majad

    Did you get a chance to review the above response? Do let me know if you have any further queries.

    Thank you!

  3. Majad 0 Reputation points

    Hi SRILAKSHMI, I'm still debugging this issue, so I'll have an answer soon.

    Thank you for your patience.

  4. Majad 0 Reputation points

    Hi SRILAKSHMI, based on the response provided I determined the issue occurs when a call is started and the model is expected to provide a greeting message.

    This was not an issue before, since conversations used to be started at the "open" WebSocket message. This has since been changed so that they are started after "open" is finalized.
    I've been trying to time these responses to allow the model to properly start the conversation, but I noted I never receive session.created on any calls (even successful ones), and only get a session.updated when I send the first directives.

    Is this normal behavior? Or is there another way to ensure the calls are properly started?



2 answers

  1. Majad 0 Reputation points

    Hello Jerald,

    Thank you for your response. Based on my testing, I was able to gather the following findings:

    • No incidents were identified in Azure Service Health related to the use of OpenAI Realtime or SIP in any of the supported regions.
    • The latest updates to the Realtime models on the Microsoft website do not indicate changes that would impact this implementation, as they mainly apply to the gpt-realtime-2 model.
    • Adjustments were made on the backend to ensure that the session.created event is successfully established; however, interruptions during certain calls still persist.
    • I am currently using server_vad. My initial configuration values were:
      • Threshold: 0.6
      • Prefix padding (ms): 300
      • Silence duration (ms): 800
      I have tested several variations—primarily lowering the threshold and increasing both prefix padding and silence duration—but none of these changes resolved the issue.
    • I verified that the audio codec and format are correctly supported by the SIP trunk provider, using G.711 µ-law (ulaw). The provider also reported that the ACK/BYE messages are being sent by the model, which appears to terminate the call almost immediately.
    • The OpenAI_Webhook_Secret remains consistent across all requests, and, as mentioned previously, every call successfully results in a valid session.created event.
  2. Jerald Felix 13,500 Reputation points Volunteer Moderator

    Hello Majad,

    Greetings! Thanks for raising this question in Q&A forum.

    The intermittent audio input detection failure you're experiencing on SIP calls with the GPT Realtime model is most likely caused by a combination of two things: a subtle race condition in how the Realtime model initializes audio input processing at the start of a session, and potential codec or audio format negotiation timing differences between individual SIP calls. Since the issue is inconsistent (some calls work, some don't) and started appearing only recently without any changes on your end, it strongly suggests a backend service-side change or regression introduced after April that is affecting how the SIP audio stream gets picked up at session start.

    This type of intermittent audio failure is a known behavior pattern with the current GPT Realtime models; it is related to how the model handles audio stream initialization internally, and is not necessarily an error in your integration logic.

    Here are the steps I'd recommend to investigate and work around this:

    First, check the Azure Service Health dashboard at https://status.azure.com and filter for Azure OpenAI or Azure AI Foundry in your deployed regions (East US 2 or Sweden Central). There may be an ongoing or recently resolved incident that correlates with the late-April timeframe when things started breaking.

    Review the Azure OpenAI What's New page at https://learn.microsoft.com/en-us/azure/ai-foundry/openai/whats-new for any updates pushed after April. The Realtime API has recently received SIP support for telephony connections, and new model features have been added — it's possible a backend rollout introduced a regression in audio detection timing for existing SIP integrations.

    On your backend, check whether the session.created event is being fully received and acknowledged before your SIP trunk starts sending audio. A common cause of intermittent input failure is that audio from the caller arrives at the model endpoint before the session is fully ready. Add a small buffer or gate the audio stream until you receive session.created confirmation.

    Verify your turn_detection configuration in the session.update event. If you're using server_vad (Voice Activity Detection), confirm that the threshold, silence_duration_ms, and prefix_padding_ms values are appropriately set. Too high a silence threshold can cause the model to miss the beginning of a caller's speech, especially for calls where audio starts immediately.
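For reference, a session.update payload carrying these values could be built as below. Treat this as a sketch: the values are the ones reported in this thread, the helper name `build_vad_update` is hypothetical, and the nesting under `session.audio.input` is an assumption based on the GA event shape (preview-era integrations placed `turn_detection` directly under `session`). Check the current event reference before relying on it.

```python
import json


def build_vad_update(
    threshold: float = 0.6,
    prefix_padding_ms: int = 300,
    silence_duration_ms: int = 800,
) -> str:
    """Serialize a session.update event with server_vad turn detection."""
    event = {
        "type": "session.update",
        "session": {
            "type": "realtime",
            # Assumed GA nesting; preview used session.turn_detection.
            "audio": {
                "input": {
                    "turn_detection": {
                        "type": "server_vad",
                        "threshold": threshold,
                        "prefix_padding_ms": prefix_padding_ms,
                        "silence_duration_ms": silence_duration_ms,
                    }
                }
            },
        },
    }
    return json.dumps(event)


print(build_vad_update())
```

Logging the serialized event next to the session.updated reply makes it easy to confirm the service accepted the values you intended.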

    Check the audio codec and format being negotiated on the SIP calls that fail versus those that succeed. The GPT Realtime API over SIP expects PCM16 audio at 24kHz. If some SIP calls are negotiating a different codec (such as G.711 or G.729) and your transcoder introduces even a slight delay or format mismatch, the model may fail to pick up the initial audio stream.

    Also verify your OPENAI_WEBHOOK_SECRET environment variable matches the secret from webhook creation, and ensure you're passing raw request body bytes — not parsed JSON — to the unwrap function, and that no middleware is modifying the request body before verification. These subtle configuration issues can cause intermittent failures that look like audio processing problems.
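To see why raw bytes matter, the sketch below signs the same JSON document twice: once as received and once after a parse and re-serialize round trip. It uses a plain HMAC-SHA256 purely for illustration and is not the exact scheme the SDK's unwrap function implements; the secret and payload are made up.

```python
import hashlib
import hmac
import json


def sign(secret: bytes, body: bytes) -> str:
    """Illustrative HMAC-SHA256 signature over the exact body bytes."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


secret = b"example-secret"                         # hypothetical secret
raw_body = b'{"id":"evt_123","type":"example"}'    # bytes as sent on the wire
# A JSON middleware would hand you a dict; re-serializing changes whitespace.
reparsed_body = json.dumps(json.loads(raw_body)).encode()

signatures_match = hmac.compare_digest(
    sign(secret, raw_body), sign(secret, reparsed_body)
)
print(raw_body == reparsed_body, signatures_match)
```

The two bodies carry the same data but differ byte for byte, so any signature computed over them differs too. Capture the request body before any JSON middleware touches it.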

    Since gpt-realtime-1.5 also showed the same issue, this helps rule out a model-version-specific bug and points more toward the SIP session initialization or backend infrastructure. I would strongly recommend raising an Azure Support ticket with the following details: your deployment region, the model version used, sample call timestamps where the failure occurred, and the specific event logs from your backend showing what events were received (or not received) from the Realtime API during the failed calls. This will allow the Azure OpenAI engineering team to check for any backend changes that may have impacted SIP audio handling after April.
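To collect the per-call event evidence for a ticket, a small in-memory recorder is often enough. The sketch below is hypothetical (`CallEventLog` and the call ID are made up); it records each event type with a UTC timestamp and reports which expected events never arrived.

```python
from collections import defaultdict
from datetime import datetime, timezone


class CallEventLog:
    """Records Realtime event types per call and reports missing ones."""

    def __init__(self) -> None:
        self._events: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def record(self, call_id: str, event_type: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self._events[call_id].append((stamp, event_type))

    def missing(self, call_id: str, expected: list[str]) -> list[str]:
        seen = {event_type for _, event_type in self._events[call_id]}
        return [name for name in expected if name not in seen]

    def timeline(self, call_id: str) -> list[tuple[str, str]]:
        return list(self._events[call_id])


log = CallEventLog()
log.record("call-1", "session.updated")
print(log.missing("call-1", ["session.created", "session.updated"]))
```

Attaching the timeline of a failed call next to that of a successful one shows support exactly where the two diverge.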

    The most actionable next step while you wait for support is to add a guard in your backend that buffers incoming SIP audio and only begins forwarding it to the Realtime API once the session.created event has been confirmed — this alone has resolved similar intermittent input detection issues for other developers.

    If this answer helps you, kindly accept it, which will help others who have similar questions.

    Best Regards,

    Jerald Felix.

    1. Majad 0 Reputation points

      Hello Jerald, pardon the delay in properly posting this as a comment.

      Thank you for your response. Based on my testing, I was able to gather the following findings:

      • No incidents were identified in Azure Service Health related to the use of OpenAI Realtime or SIP in any of the supported regions.
      • The latest updates to the Realtime models on the Microsoft website do not indicate changes that would impact this implementation, as they mainly apply to the gpt-realtime-2 model.
      • Adjustments were made on the backend to ensure that the session.created event is successfully established; however, interruptions during certain calls still persist.
      • I am currently using server_vad. My initial configuration values were:
        • Threshold: 0.6
        • Prefix padding (ms): 300
        • Silence duration (ms): 800
        I have tested several variations, primarily lowering the threshold and increasing both prefix padding and silence duration, but none of these changes resolved the issue.
      • I verified that the audio codec and format are correctly supported by the SIP trunk provider, using G.711 µ-law (ulaw). The provider also reported that the ACK/BYE messages are being sent by the model, which appears to terminate the call almost immediately.
      • The OpenAI_Webhook_Secret remains consistent across all requests, and, as mentioned previously, every call successfully results in a valid session.created event.
