Azure Speech Service: ConversationTranscriber via Private Endpoint returns 0 segments with 140s session_stopped delay - canadacentral

Amandeep Sadioura 0 Reputation points

Service: Azure Cognitive Services, Speech Service (azure-cognitiveservices-speech==1.46.0, Python), AKS canadacentral, Private Endpoint.

Scenario: Using ConversationTranscriber with the universal/v2 real-time endpoint, accessed via a Cognitive Services Private Endpoint from an AKS cluster. The session establishes successfully but no transcription results are returned.

Result: session_started fires in under 1s, confirming that the WebSocket connection is established. All audio is streamed successfully. However, session_stopped fires after ~140s with 0 segments and no CANCELED event or error details, regardless of audio length.

Environment

  • Speech resource region: canadacentral
  • SDK: azure-cognitiveservices-speech==1.46.0 (Python)
  • Endpoint: wss://<resource>.cognitiveservices.azure.com/stt/speech/universal/v2
  • Private Endpoint sub-resource: account
  • Private DNS zone: privatelink.cognitiveservices.azure.com with an A record pointing to the private IP
  • AKS to Private Endpoint: TCP 443 reachable, NSG rules allow traffic in both directions

Troubleshooting steps taken

Check                          | Status
DNS resolution                 | Resolves to private IP via private DNS zone
Private endpoint sub-resource  | account
TCP 443 to private endpoint    | Reachable from AKS pod
NSG rules                      | Bidirectional TCP 443 allowed between AKS and PE subnet
session_started                | Fires in <1s

All infrastructure verified on our end. No firewall between AKS and private endpoint — only NSGs.
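
For anyone who wants to repeat the DNS check from inside a pod, a minimal sketch follows. It uses only the Python standard library; the helper names resolve_ips and all_private are hypothetical, and "private" here simply means what ipaddress reports as a private address, which is an assumption about how the Private Endpoint IP is allocated.

```python
import ipaddress
import socket


def resolve_ips(host, port=443):
    """Return the sorted, de-duplicated IP addresses the resolver gives for host."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def all_private(addresses):
    """True only if at least one address was given and every one is private."""
    return bool(addresses) and all(
        ipaddress.ip_address(a).is_private for a in addresses
    )


# Usage (hypothetical host, run inside the pod):
#   ips = resolve_ips("<resource>.cognitiveservices.azure.com")
#   print(ips, all_private(ips))
```

If all_private returns False for the resource host name, the pod is probably resolving the public address rather than the private DNS zone record.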

Minimal reproducible steps (run inside AKS pod, 3s of silence)

import os
import time

import azure.cognitiveservices.speech as s

# Key-based auth against the resource's universal/v2 endpoint.
cfg = s.SpeechConfig(
    subscription=os.environ['AZURE_SPEECH_KEY'],
    endpoint='wss://<resource>.cognitiveservices.azure.com/stt/speech/universal/v2'
)
# 16 kHz, 16-bit, mono PCM push stream.
fmt = s.audio.AudioStreamFormat(16000, 16, 1)
ps = s.audio.PushAudioInputStream(stream_format=fmt)
r = s.transcription.ConversationTranscriber(cfg, s.audio.AudioConfig(stream=ps))

# Print session events with their offset from t, plus any transcribed text.
t = time.time()
r.session_started.connect(lambda e: print(f'session_started T+{time.time()-t:.1f}s'))
r.session_stopped.connect(lambda e: print(f'session_stopped T+{time.time()-t:.1f}s'))
r.transcribed.connect(lambda e: print(f'transcribed: {e.result.text}'))

r.start_transcribing_async().get()
time.sleep(3)
ps.close()       # signal end of audio (nothing is written to the stream)
time.sleep(180)  # keep the process alive long enough to observe session_stopped

Output:

session_started T+0.4s
session_stopped T+140.2s

Question: Does ConversationTranscriber via universal/v2 fully support Cognitive Services Private Endpoints in canadacentral? Specifically, does the private link account sub-resource cover the complete real-time diarization result delivery path, or are additional endpoints required that are not covered by the private endpoint configuration?


2 answers

  1. Amandeep Sadioura 0 Reputation points

    Hi Harshitha,

    First of all, thank you for your helpful response. By following it, we were able to resolve the issue. Have a nice day.

    1. Manas Mohanty 17,185 Reputation points Microsoft External Staff Moderator

      Hey Amandeep Sadioura,

      Thank you for your inputs here on the forum.

      I was analysing the code and architecture on my side.

      From my past experience:

      The endpoint syntax in a VNET scenario differs from the public endpoint, and traffic to AKS has to be allowed in the outbound rules.

      The custom domain point was mentioned earlier.

      We also need Cognitive Services service tags in the outbound rules.

      cfg = s.SpeechConfig(subscription=KEY, region="canadacentral")
      
      

      Reference - https://learn.microsoft.com/en-us/azure/ai-services/speech-service/speech-service-vnet-service-endpoint

      Feel free to share any observations to help others in the same context.

      I have converted Harshitha's comment to an answer in case you found our inputs useful.

      Thank you


  2. Harshitha Eligeti 10 Reputation points Microsoft External Staff Moderator

    Hello @Amandeep Sadioura
    Based on what you described, this looks like a real-time ConversationTranscriber session that connects successfully (WebSocket up) but then stops after ~140s with no transcribed segments and no canceled error details.

    Does ConversationTranscriber (universal/v2) fully support Speech Private Endpoint?

    From the provided documentation, we can say the following:

    • For Speech Private Endpoint scenarios, the service expects the client to use a custom domain for the Speech resource (required for private endpoints) and then replace the host name in request URLs with that custom domain. The private-link article explains that the URL construction changes, while the rest of the path stays the same.
    • The doc also explicitly calls out that there are different endpoint sets for different Speech APIs (e.g., REST APIs vs SDK/other operations) and that you must use the correct endpoint URL pattern in private-link scenarios.
    • However, the provided documentation does not explicitly confirm whether ConversationTranscriber (universal/v2) diarization result delivery is covered entirely by the Private Endpoint sub-resource account for Speech in canadacentral, nor does it enumerate any additional sub-resources/endpoints specific to ConversationTranscriber beyond the general private-link guidance.
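
The host-name replacement described above can be sketched as follows. This is only an illustration: with_custom_domain and the name my-custom-name are hypothetical, and the sketch swaps the host while leaving scheme, path, and query untouched. The exact path pattern for your API should still be checked against the private-link article.

```python
from urllib.parse import urlsplit, urlunsplit


def with_custom_domain(url, custom_name):
    """Swap the host of url for <custom_name>.cognitiveservices.azure.com."""
    parts = urlsplit(url)
    host = f"{custom_name}.cognitiveservices.azure.com"
    if parts.port is not None:
        # Preserve an explicit port if the original URL had one.
        host = f"{host}:{parts.port}"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


# Hypothetical example: only the host changes, the rest of the URL is preserved.
print(with_custom_domain(
    "wss://old-name.cognitiveservices.azure.com/stt/speech/universal/v2",
    "my-custom-name",
))
```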

    So, using only the info available here: we can’t definitively answer whether account covers the entire ConversationTranscriber result path or whether additional private endpoints / endpoint transformations are required.

    What you can validate with the available guidance

    1. Custom domain / endpoint URL transformation
      • The private-link doc emphasizes that after you enable private endpoints (and thus a custom domain), you typically need to replace the host name in your SDK endpoint URLs with the custom domain host name.
      • If you’re currently using https://<resource>.cognitiveservices.azure.com/... rather than https://<custom-name>.cognitiveservices.azure.com/... (or the documented equivalent transformation for SDK), the connection may still establish, but other parts of the session/result flow can behave unexpectedly.
    2. Use SDK logging to capture the real cause
      • The “client issues” doc recommends enabling Speech SDK logging (by setting Speech_LogFilename) because it provides diagnostics and includes the session id—useful when diagnosing slow responses or cancellations.
      • Since you’re not seeing a CANCELED event, logging is especially important to determine what the service/runtime did during the session_stopped at ~140s.
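
A minimal sketch of the logging step, assuming the Python Speech SDK is installed. The helper names (enable_sdk_logging, watch_cancellation, describe_cancellation) and the log path are hypothetical. The sketch also assumes the canceled event exposes cancellation_details with reason, error_code, and error_details, so it reads them defensively with getattr.

```python
def describe_cancellation(evt):
    """Format whatever cancellation details the event carries, if any."""
    details = getattr(evt, "cancellation_details", None)
    if details is None:
        return "canceled: no cancellation_details on event"
    reason = getattr(details, "reason", None)
    code = getattr(details, "error_code", None)
    text = getattr(details, "error_details", None)
    return f"canceled: reason={reason} code={code} details={text}"


def enable_sdk_logging(speech_config, log_path):
    """Turn on Speech SDK file logging; call before creating the transcriber."""
    # Imported lazily so describe_cancellation can be used without the SDK.
    import azure.cognitiveservices.speech as speechsdk

    speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, log_path)


def watch_cancellation(transcriber):
    """Print cancellation details if the service or SDK cancels the session."""
    transcriber.canceled.connect(lambda evt: print(describe_cancellation(evt)))


# Usage with the objects from the question (cfg, r), path is hypothetical:
#   enable_sdk_logging(cfg, "/tmp/speech-sdk.log")   # before building r
#   watch_cancellation(r)                            # after building r
```

The log file can then be searched for the session id around the time session_stopped fires.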
