Cognitive services STT batch transcription: incomplete/cut-off transcripts

STT 0 Reputation points

Hi,

The URLs/sources in this ticket have been replaced with placeholders, but are of course available for support upon request.

We are using batch transcription through the endpoint:

Locale: nl-NL

contentUrls: ['https://www.example.com/audio.wav']

properties: [timeToLiveHours: 6, displayFormWordLevelTimestampsEnabled: true, wordLevelTimestampsEnabled: true]

The file in question is 4h36m long.

On the first attempt, we received a transcript for the first 3.5 hours.

On the second attempt we received a transcript for less than 2 hours.

See more details below.

ffprobe -hide_banner audio.wav
Input #0, wav, from 'audio.wav':

 Metadata:

 encoder : Lavf58.29.100

 Duration: 04:36:20.37, bitrate: 256 kb/s

 Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz, 1 channels, s16, 256 kb/s

From another server:

curl -v -o /dev/null https://www.example.com/audio.wav

Output ends with:

100 505M 0 505M 0 0 10.5M 0 --:--:-- 0:00:48 --:--:-- 9503k

* Connection #0 to host www.example.com left intact

505M is the correct/total size of the file.

Most recent attempt, transcription 7199515d-f7df-4444-9d7d-7a61ff90fbf9

Part of the response JSON:

"source": "https://www.example.com/audio.wav",

"timestamp": "2026-05-18T13:33:45Z",

"durationInTicks": 68881900000,

"durationMilliseconds": 6888190,

"duration": "PT1H54M48.19S",

A previous attempt (76d0e7b8-0a53-4f4e-9a1c-a77da62a2555) with the exact same WAV file (which then had a different name/ID) yielded:

"duration": "PT3H33M41.25S",

In both cases the transcript is cut off prematurely; the WAV file contains speech beyond those cut-off points.
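To make the gap concrete, the reported duration can be compared with the ffprobe duration. This is a minimal sketch, assuming durationInTicks uses 100-nanosecond ticks (which is consistent with 68881900000 ticks and 6888190 ms in the response above). The function names and the 1-second tolerance are illustrative and not part of the API.

```python
# Assumption: 1 tick = 100 ns, so 10,000,000 ticks per second.
TICKS_PER_SECOND = 10_000_000


def hms_to_seconds(hms: str) -> float:
    """Convert an ffprobe-style 'HH:MM:SS.ss' duration to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def missing_seconds(duration_in_ticks: int, source_duration: str) -> float:
    """Seconds of source audio not covered by the reported transcript duration."""
    return hms_to_seconds(source_duration) - duration_in_ticks / TICKS_PER_SECOND


def is_truncated(duration_in_ticks: int, source_duration: str,
                 tolerance_s: float = 1.0) -> bool:
    """True if the transcript is shorter than the source by more than the tolerance."""
    return missing_seconds(duration_in_ticks, source_duration) > tolerance_s


# Values taken from the most recent attempt and the ffprobe output above.
gap = missing_seconds(68881900000, "04:36:20.37")
print(f"missing: {gap:.2f} s, truncated: {is_truncated(68881900000, '04:36:20.37')}")
```

For the most recent attempt this gives a gap of roughly 2 h 41 min. A check like this could run after every job, so that a Succeeded status with a short transcript is flagged instead of passing unnoticed.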

We are aware of the 240-minute (4-hour) maximum when requesting speaker diarization, so when we do request speaker diarization we cut the file into 4-hour chunks.

However, for this transcription we are not requesting speaker diarization.

Our initial thought was to maybe chunk the audio into 2 hour chunks and then glue the transcripts together (correcting timestamps).
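A rough sketch of what that gluing step could look like is below. It assumes each chunk's result contains a list of phrase objects with an offsetInTicks field (and optionally nested word entries with their own offsetInTicks), and that the start of each chunk in the original audio is known. The function names and sample data are hypothetical. ISO-8601 string fields such as "offset" are not rewritten here and would need the same shift.

```python
import copy


def shift_offsets(node, shift_ticks):
    """Recursively add shift_ticks to every 'offsetInTicks' value, in place."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "offsetInTicks":
                node[key] = value + shift_ticks
            else:
                shift_offsets(value, shift_ticks)
    elif isinstance(node, list):
        for item in node:
            shift_offsets(item, shift_ticks)


def merge_chunks(chunks):
    """Merge per-chunk phrase lists into one timeline.

    chunks: list of (chunk_start_ticks, phrases) tuples in playback order.
    The input phrase lists are left unmodified.
    """
    merged = []
    for chunk_start_ticks, phrases in chunks:
        shifted = copy.deepcopy(phrases)
        shift_offsets(shifted, chunk_start_ticks)
        merged.extend(shifted)
    return merged


# Hypothetical sample data: two chunks, the second starting at the 2-hour mark.
TWO_HOURS_TICKS = 2 * 3600 * 10_000_000  # assuming 100 ns ticks
chunk_a = [{"offsetInTicks": 5_000_000, "durationInTicks": 10_000_000}]
chunk_b = [{
    "offsetInTicks": 0,
    "durationInTicks": 20_000_000,
    "nBest": [{"words": [{"word": "hallo", "offsetInTicks": 1_000_000}]}],
}]
merged = merge_chunks([(0, chunk_a), (TWO_HOURS_TICKS, chunk_b)])
```

Durations are left untouched because only the start positions change when a chunk is moved onto the full timeline.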

However, we are now also noticing this issue on much shorter audio files.

For example a 1h37m04s long audio file yields a transcript that says:

"duration": "PT1H16M54.02S",

So there's over 20 minutes missing.

Unfortunately, this job was run over a month ago, so we can no longer retrieve the transcription ID from the logs.

We are using the batch transcription API on a daily basis for multiple long audio files and it is of course paramount that the transcripts are complete.

Question: why do we not receive a transcript for the full file?
And why does this fail 'quietly', returning a success status, no error?

1 answer

  1. Jerald Felix 13,500 Reputation points Volunteer Moderator

    Hello STT,

    Greetings! Thanks for raising this question in the Q&A forum.

    The issue of incomplete batch transcription in Azure Cognitive Services Speech-to-Text (STT) is a known challenge. This typically happens due to audio quality issues, job processing behavior, or configuration problems. Let me walk you through the common causes and fixes.

    Why Batch Transcription May Be Incomplete

    • Asynchronous processing not fully awaited: Batch transcription is asynchronous; if you retrieve results before the job fully completes, you only get partial output
    • Silent failures with speaker diarization: If diarization is enabled and the audio has more than two speakers, the job may fail silently or produce incomplete results
    • Custom model issues: Transcription jobs using custom-trained models can encounter transient backend problems, resulting in partial or stuck transcriptions
    • Audio file format issues: Unsupported or malformed audio files in a batch may cause some files to be skipped silently
    • Peak hour delays: At peak times, jobs can take up to 30 minutes or longer to even start processing

    Step-by-Step Troubleshooting

    1. Verify job status properly: use the REST API to check the job status. Do not assume it's complete just because time has passed:
       GET https://<region>.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/<transcriptionId>
       Wait until the status shows Succeeded before fetching results
    2. Check individual file statuses — A batch job can succeed overall but have individual files that failed. Check the files endpoint:
       GET https://<region>.api.cognitive.microsoft.com/speechtotext/v3.2/transcriptions/<transcriptionId>/files
      Look for files with status: Failed and note their error messages
    3. Disable speaker diarization temporarily — If diarization is enabled, turn it off and resubmit the job to check if it resolves the incomplete output
    4. Switch to a base model — If using a custom model, try resubmitting the job with the default base model to rule out custom model issues
    5. Poll at the right interval — Avoid polling every few seconds. Check status every 10 minutes at minimum; polling too frequently adds load and doesn't speed up processing
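The polling advice in steps 1 and 5 can be sketched as follows. This is an illustrative outline, not an official client: get_status is a hypothetical callable that would perform the GET request on the transcription and return its status string, and the terminal values Succeeded and Failed are assumed from the statuses mentioned in this thread. The 600-second default mirrors the 10-minute suggestion above.

```python
import time

# Assumed terminal statuses; anything else is treated as "still in progress".
TERMINAL_STATUSES = {"Succeeded", "Failed"}


def wait_for_transcription(get_status, poll_seconds=600, max_polls=100,
                           sleep=time.sleep):
    """Poll get_status() until a terminal status is returned.

    get_status: hypothetical callable returning the job's status string.
    sleep: injectable so the loop can be exercised without real waiting.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"no terminal status after {max_polls} polls")


# Dry run with canned statuses instead of real HTTP calls.
canned = iter(["NotStarted", "Running", "Succeeded"])
waits = []
final_status = wait_for_transcription(lambda: next(canned),
                                      poll_seconds=600, sleep=waits.append)
```

In a real client, get_status would wrap the GET call from step 1. Only when the function returns Succeeded would the result files be fetched.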

    Best Practices to Avoid Incomplete Transcriptions

    • Submit ~1,000 files per single Transcription_Create request for optimal throughput
    • Distribute requests across multiple Azure regions to balance load for large-scale workloads
    • Spread submissions over time rather than sending all jobs within a short burst
    • Ensure audio files are in supported formats and accessible via valid SAS URIs (not expired)

    If this answer helps you, kindly accept it, which will help others who have similar questions.

    Best Regards,

    Jerald Felix.

    1. STT 0 Reputation points

      Hi @Jerald Felix Thanks for your answer. In response to your troubleshooting steps:

      1. We already do this. We periodically poll the status after starting the job, and only fetch the transcript after the status is Succeeded
      2. Our batches only contain 1 file (we use the batch API because we do not need the response in real time, a delay is fine, and of course the batch API is cheaper)
      3. Does not apply, in the cases described we were not using speaker diarization (and when we do, we provide maxSpeakers: 35)
      4. Does not apply, we are not using a custom model
      5. We do check more frequently than once every 10 minutes, and even though it obviously does not speed up processing, it seems unlikely that checking the status would cause the job to fail/stop prematurely (yet still return a Succeeded status)
    2. STT 0 Reputation points

      @Jerald Felix We upgraded our "free" support plan to the paid "developer" support with the understanding that this:

      "Business hours access to Support Engineers via prioritized responses on Microsoft Q&A"

      "Minimal business impact (Sev C): <8 business hours"

      Meant that there would be a response from an MS engineer.

      Can you as a moderator have this escalated? Or would this require the "standard" plan?

