Azure OpenAI Realtime WebSocket returns replacement characters (U+FFFD) in Chinese/Japanese transcripts and translations

YangQi 0 Reputation points

Summary

When using Azure OpenAI Realtime over WebSocket for live speech transcription and translation, we intermittently receive replacement characters (U+FFFD, rendered as �) in Chinese and Japanese text.

This affects:

  • Chinese speech transcription (input transcript)
  • Japanese speech transcription (input transcript)
  • Chinese/Japanese translation output (output transcript)

In practice, expected CJK text is sometimes corrupted, for example into fragments such as �果.

Environment

  • Service: Azure OpenAI
  • Mode: Realtime API via WebSocket
  • Endpoints used:
    • Translation: /openai/v1/realtime/translations?model=<deployment>
    • Transcription intent: /openai/v1/realtime?intent=transcription
  • Audio format: PCM 24kHz (streamed chunks)
  • Client path:
    • Browser -> backend WebSocket bridge -> Azure Realtime WebSocket
    • (We also tested direct browser-to-Azure WebSocket; same symptom)
  • Language focus: Chinese (zh), Japanese (ja)

Detailed Observations

1) Replacement characters appear in input transcript deltas

In session.input_transcript.delta (or equivalent transcription events), U+FFFD appears intermittently.

Example (simplified):


{
  "type": "session.input_transcript.delta",
  "delta": "...�果..."
}

2) Replacement characters can also appear in translation output

We see similar issues in session.output_transcript.delta and sometimes in done/final text.

Example (simplified):


{
  "type": "session.output_transcript.delta",
  "delta": "...�..."
}

3) Decrypted network payload already contains corrupted text

After decrypting WebSocket traffic using Wireshark + TLS key log, the payload itself already contains corrupted text, which suggests this is not only a frontend rendering issue.

Example snippet from decrypted payload text:

  • "delta":"�果"
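One way to narrow down where the corruption is introduced is to look at the raw bytes of a captured text frame before any decoding takes place. The following is a minimal sketch, assuming the frame payload has been exported from the capture as bytes; `classify_payload` is a hypothetical helper name, not part of any SDK. If the payload is valid UTF-8 and already contains the encoded form of U+FFFD (bytes EF BF BD, or a JSON `\ufffd` escape), the substitution happened before the bytes were sent. If the payload is not valid UTF-8, the replacement character is being introduced by whatever decodes it.

```python
REPLACEMENT_BYTES = "\ufffd".encode("utf-8")  # b'\xef\xbf\xbd'


def classify_payload(payload: bytes) -> str:
    """Classify one captured WebSocket text-frame payload (hypothetical helper)."""
    try:
        payload.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Invalid or truncated UTF-8 on the wire: a lenient decoder on the
        # receiving side would substitute U+FFFD here.
        return "invalid-utf8"
    # Valid UTF-8 that already carries U+FFFD, either as raw bytes or as a
    # JSON escape sequence (\ufffd / \uFFFD).
    if REPLACEMENT_BYTES in payload or b"\\ufffd" in payload.lower():
        return "encoded-replacement-char"
    return "clean"


if __name__ == "__main__":
    good = '{"delta":"如果"}'.encode("utf-8")
    print(classify_payload(good))        # clean
    print(classify_payload(good[:-4]))   # invalid-utf8 (cut inside a character)
    print(classify_payload('{"delta":"\ufffd果"}'.encode("utf-8")))  # encoded-replacement-char
```

A result of "encoded-replacement-char" on frames taken directly from the decrypted capture would support the observation that the text is already corrupted when it leaves the service.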

Reproduction Steps

  1. Open a Realtime WebSocket session (translation or transcription).
  2. Stream continuous Chinese/Japanese speech (normal speaking pace, multiple phrases).
  3. Capture and log events:
    • session.input_transcript.delta
    • session.input_transcript.done
    • session.output_transcript.delta
    • session.output_transcript.done
  4. After running for several minutes, observe intermittent � or mojibake-like segments.
  5. Decrypt traffic with TLS key log; the issue is still visible in WebSocket payload text.
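The logging in step 3 can be sketched as a small scanner over the captured messages. This is an illustrative sketch only: it assumes each captured message is a JSON object serialized as a string, and `find_replacement_chars` is a hypothetical helper name. It checks every string field of the four event types listed above for U+FFFD, so it does not depend on which field carries the text in the `done` events.

```python
import json
from collections import Counter

REPLACEMENT_CHAR = "\ufffd"

# Event types taken from the reproduction steps above.
WATCHED_TYPES = {
    "session.input_transcript.delta",
    "session.input_transcript.done",
    "session.output_transcript.delta",
    "session.output_transcript.done",
}


def find_replacement_chars(raw_messages):
    """Return (hit count per event type, list of hits) for events containing U+FFFD.

    Each hit is a tuple (message index, event type, field name, field value).
    """
    counts = Counter()
    hits = []
    for index, raw in enumerate(raw_messages):
        event = json.loads(raw)
        event_type = event.get("type")
        if event_type not in WATCHED_TYPES:
            continue
        for field, value in event.items():
            if field != "type" and isinstance(value, str) and REPLACEMENT_CHAR in value:
                counts[event_type] += 1
                hits.append((index, event_type, field, value))
    return counts, hits
```

Recording the message index alongside each hit makes it easier to correlate a corrupted delta with the matching frame in the Wireshark capture.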

Expected Behavior

  • Chinese/Japanese text should be returned as valid and stable UTF-8 without U+FFFD (�).
  • No mojibake-like fragments such as � should appear in normal transcript/translation output.

Actual Behavior

  • U+FFFD appears intermittently in ongoing sessions and breaks sentence meaning.
  • The issue appears in both input transcription and translation output.
  • Reproducible in both direct WebSocket and backend-bridged WebSocket access patterns.

What We Already Checked

  • Frontend rendering is not the only cause (decrypted payload already contains corrupted text).
  • The issue is intermittent, not tied to a single fixed sentence.

1 answer

  1. Sina Salam 30,166 Reputation points Volunteer Moderator

    Hello YangQi,

    Welcome to Microsoft Q&A, and thank you for posting your question here.

    I understand that your Azure OpenAI Realtime WebSocket returns replacement characters (U+FFFD) in Chinese/Japanese transcripts and translations.

    From what I can see, the cause is incorrect client-side decoding of Realtime WebSocket payloads. Do not call Encoding.UTF8.GetString() on audio payloads, and do not work around the problem by trimming or suppressing �; that only hides the symptom and does not fix the broken audio/message handling.

    References:

      • https://learn.microsoft.com/en-us/dotnet/api/system.net.websockets.clientwebsocket.receiveasync?view=net-10.0
      • https://learn.microsoft.com/en-us/dotnet/api/system.net.websockets.websocketmessagetype?view=net-10.0
      • https://learn.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/9.0/binaryreader

    The recommended resolution, step by step:

      1. Read the complete WebSocket message before decoding it.
      2. Parse the JSON event.
      3. Handle text deltas as text.
      4. Base64-decode audio deltas as audio bytes.

    After correcting the WebSocket receive loop and separating text handling from audio-byte handling, the replacement-character issue is resolved because the client no longer attempts to decode audio or incomplete frames as UTF-8 text.
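The handling described in this answer can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's documented client: the text event types are the ones quoted in the question, while `session.output_audio.delta` is a hypothetical audio event name used only for the example, and `assemble_message` and `handle_event` are hypothetical helper names. The key point is that the fragments of one message are joined as bytes and decoded once; in .NET this corresponds to reading until the receive result reports EndOfMessage.

```python
import base64
import json

# Text event types as quoted in the question.
TEXT_DELTA_TYPES = {
    "session.input_transcript.delta",
    "session.output_transcript.delta",
}
# Hypothetical audio event name, assumed for illustration only.
AUDIO_DELTA_TYPES = {"session.output_audio.delta"}


def assemble_message(fragments):
    """Join all fragments of one message as bytes, then decode once (strict UTF-8)."""
    return b"".join(fragments).decode("utf-8")


def handle_event(message_text, transcript_parts, audio_chunks):
    """Route one complete JSON event: text stays text, audio is base64-decoded."""
    event = json.loads(message_text)
    event_type = event.get("type")
    if event_type in TEXT_DELTA_TYPES:
        transcript_parts.append(event["delta"])
    elif event_type in AUDIO_DELTA_TYPES:
        audio_chunks.append(base64.b64decode(event["delta"]))


if __name__ == "__main__":
    raw = json.dumps(
        {"type": "session.output_transcript.delta", "delta": "如果"},
        ensure_ascii=False,
    ).encode("utf-8")
    # Split the message in the middle of a multi-byte character.
    cut = raw.index("如".encode("utf-8")) + 1
    fragments = [raw[:cut], raw[cut:]]

    # Decoding each fragment separately injects U+FFFD at the split point.
    naive = "".join(f.decode("utf-8", errors="replace") for f in fragments)
    print("\ufffd" in naive)  # True

    parts, audio = [], []
    handle_event(assemble_message(fragments), parts, audio)
    print(parts)  # ['如果']
```

Note that this only removes replacement characters introduced on the client side. If the captured bytes already contain an encoded U+FFFD, as the decrypted payload in the question suggests, client-side changes alone will not remove it.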

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or need clarification on any of the steps.

    Please remember to close the thread by upvoting and accepting this as the answer if it was helpful.
