Azure OpenAI Realtime WebSocket returns replacement characters (U+FFFD) in Chinese/Japanese transcripts and translations

YangQi 0 Reputation points

Summary

When using Azure OpenAI Realtime over WebSocket for live speech transcription and translation, we intermittently receive replacement characters (U+FFFD, rendered as �) in Chinese and Japanese text.

This affects:

- Chinese speech transcription (input transcript)
- Japanese speech transcription (input transcript)
- Chinese/Japanese translation output (output transcript)

In practice, expected CJK text is sometimes corrupted, producing fragments such as �果 or ï¿½æžœ.

Environment

Service: Azure OpenAI
Mode: Realtime API via WebSocket
Endpoints used:
- Translation: /openai/v1/realtime/translations?model=<deployment>
- Transcription intent: /openai/v1/realtime?intent=transcription
Audio format: PCM 24kHz (streamed chunks)
Client path:
- Browser -> backend WebSocket bridge -> Azure Realtime WebSocket
- (We also tested direct browser-to-Azure WebSocket; same symptom)
Language focus: Chinese (zh), Japanese (ja)

Detailed Observations

1) Replacement characters appear in input transcript deltas

In session.input_transcript.delta (or equivalent transcription events), U+FFFD appears intermittently.

Example (simplified):


{
  "type": "session.input_transcript.delta",
  "delta": "...�果..."
}

2) Replacement characters can also appear in translation output

We see similar issues in session.output_transcript.delta and sometimes in done/final text.

Example (simplified):


{
  "type": "session.output_transcript.delta",
  "delta": "...�..."
}

3) Decrypted network payload already contains corrupted text

After decrypting WebSocket traffic using Wireshark + TLS key log, the payload itself already contains corrupted text, which suggests this is not only a frontend rendering issue.

Example snippet from decrypted payload text:

"delta":"ï¿½æžœ"
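
The mojibake in this snippet can be checked offline. The following is a minimal Python sketch, assuming the capture tool displayed the UTF-8 payload bytes in a single-byte code page such as Windows-1252 (an assumption about the viewer, not something the capture itself confirms). Under that assumption, encoding �果 as UTF-8 and misreading the bytes as Windows-1252 reproduces ï¿½æžœ. The helper name `as_cp1252_view` is hypothetical.

```python
def as_cp1252_view(text: str) -> str:
    """Encode text as UTF-8, then misread those bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


# U+FFFD (replacement character) followed by U+679C (果)
wire_text = "\ufffd\u679c"

# Show the raw UTF-8 bytes that such a payload would carry.
print(wire_text.encode("utf-8").hex(" "))  # ef bf bd e6 9e 9c

# Show how a Windows-1252 viewer would render those same bytes.
print(as_cp1252_view(wire_text))
```

If the assumption holds, the bytes on the wire are a well-formed UTF-8 encoding of U+FFFD (EF BF BD) followed by 果 (E6 9E 9C), so the ï¿½ sequence is the same replacement character seen through a different code page rather than a second kind of corruption.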

Reproduction Steps

1. Open a Realtime WebSocket session (translation or transcription).
2. Stream continuous Chinese/Japanese speech (normal speaking pace, multiple phrases).
3. Capture and log events:
- session.input_transcript.delta
- session.input_transcript.done
- session.output_transcript.delta
- session.output_transcript.done
4. After running for several minutes, observe intermittent � or mojibake-like segments.
5. Decrypt traffic with TLS key log; the issue is still visible in WebSocket payload text.
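
To make the capture-and-log step easier to repeat, each received event can be scanned for U+FFFD as it arrives. The following is a minimal Python sketch, assuming each WebSocket text message is a single JSON object; the function name `find_replacement_chars` is hypothetical, and it checks every top-level string field instead of assuming which field a given event type uses.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def find_replacement_chars(raw_message: str) -> dict:
    """Map each top-level string field containing U+FFFD to its offsets."""
    event = json.loads(raw_message)
    hits = {}
    for key, value in event.items():
        if isinstance(value, str) and REPLACEMENT in value:
            # Record every character offset where U+FFFD occurs.
            hits[key] = [i for i, ch in enumerate(value) if ch == REPLACEMENT]
    return hits


# Example with an event shaped like the simplified samples in this post.
sample = json.dumps(
    {"type": "session.input_transcript.delta", "delta": "a\ufffd\u679c"}
)
print(find_replacement_chars(sample))
```

Logging the event type alongside these offsets and a timestamp would show whether the corruption clusters at delta boundaries or appears mid-string.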

Expected Behavior

Chinese/Japanese text should be returned as valid and stable UTF-8 without U+FFFD (�).
No mojibake-like fragments such as ï¿½ should appear in normal transcript/translation output.

Actual Behavior

U+FFFD appears intermittently in ongoing sessions and breaks sentence meaning.
The issue appears in both input transcription and translation output.
Reproducible in both direct WebSocket and backend-bridged WebSocket access patterns.

What We Already Checked

Frontend rendering is not the only cause (decrypted payload already contains corrupted text).
The issue is intermittent, not tied to a single fixed sentence.

1 answer

Sina Salam 30,166 Reputation points • Volunteer Moderator
Hello YangQi,

Welcome to Microsoft Q&A, and thank you for posting your question here.

I understand that your Azure OpenAI Realtime WebSocket session returns replacement characters (U+FFFD) in Chinese/Japanese transcripts and translations.

What I observe is incorrect client-side decoding of Realtime WebSocket payloads. To fix it, read the complete WebSocket message, parse the JSON event, handle text deltas as text, and base64-decode audio deltas into audio bytes. Do not call Encoding.UTF8.GetString() on audio payloads, and do not work around the problem by trimming or suppressing �; that only hides the symptom and leaves the broken audio/message handling in place.

References:
- https://learn.microsoft.com/en-us/dotnet/api/system.net.websockets.clientwebsocket.receiveasync?view=net-10.0
- https://learn.microsoft.com/en-us/dotnet/api/system.net.websockets.websocketmessagetype?view=net-10.0
- https://learn.microsoft.com/en-us/dotnet/core/compatibility/core-libraries/9.0/binaryreader
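
The separation of text deltas from audio deltas described above can be sketched as follows. This is a minimal Python illustration, not the service's reference client: the function name `handle_event` is hypothetical, and it assumes that audio events carry their base64 payload in a `delta` field and that transcript delta event types end in `_transcript.delta`, as in the event names quoted in this thread.

```python
import base64
import json


def handle_event(raw_message: str, transcript_parts: list, audio_chunks: list) -> str:
    """Route one complete JSON text message by its event type."""
    event = json.loads(raw_message)
    event_type = event.get("type", "")

    if event_type == "response.audio.delta":
        # Audio arrives as base64 text; decode to raw bytes, never to a string.
        audio_chunks.append(base64.b64decode(event["delta"]))
    elif event_type.endswith("_transcript.delta"):
        # Transcript deltas are already text once the JSON is parsed.
        transcript_parts.append(event["delta"])

    return event_type
```

The caller would pass each complete text message in order and join `transcript_parts` for display, while `audio_chunks` goes to the audio pipeline untouched.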

The recommended resolution, step by step:
- Use the correct Azure OpenAI Realtime endpoint format for the selected API version.
- Accumulate WebSocket frames until EndOfMessage before decoding.
- Decode only WebSocket text messages as UTF-8 JSON.
- Parse the event type before processing the payload.
- Treat response.audio.delta as base64 audio data, not text.
- Configure matching input_audio_format and output_audio_format.
- Use WebRTC instead of WebSocket for low-latency browser or mobile client audio, while keeping WebSocket for server-to-server or middleware scenarios.

References:
- https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/realtime-audio-websockets
- https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/realtime-audio

After correcting the WebSocket receive loop and separating text handling from audio-byte handling, the replacement-character issue is resolved because the client no longer attempts to decode audio or incomplete frames as UTF-8 text.
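
The incomplete-frame failure mode mentioned above can be reproduced without a network connection. The following is a minimal Python sketch (the answer's references are .NET; Python is used here only for illustration) that splits the UTF-8 bytes of a two-character CJK string in the middle of a character. All three function names are hypothetical. Decoding each fragment on its own yields U+FFFD, while joining the fragments first, or using an incremental decoder, recovers the original text.

```python
import codecs


def decode_per_fragment(fragments: list) -> str:
    """Problem pattern: decode every fragment independently."""
    return "".join(f.decode("utf-8", errors="replace") for f in fragments)


def decode_whole_message(fragments: list) -> str:
    """Accumulate all fragments of one message, then decode once."""
    return b"".join(fragments).decode("utf-8")


def decode_incrementally(fragments: list) -> str:
    """Streaming alternative: the decoder buffers partial characters."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(f) for f in fragments]
    parts.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over
    return "".join(parts)


payload = "\u7ed3\u679c".encode("utf-8")   # 结果, three bytes per character
fragments = [payload[:4], payload[4:]]     # split inside the second character

print(repr(decode_per_fragment(fragments)))   # contains U+FFFD
print(repr(decode_whole_message(fragments)))  # original text
print(repr(decode_incrementally(fragments)))  # original text
```

In a .NET receive loop, the equivalent of `decode_whole_message` is appending each received buffer until EndOfMessage is true and only then decoding the accumulated bytes.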

I hope this is helpful! Do not hesitate to let me know if you have any further questions or need clarification.

If this answer helped, please close out the thread by upvoting it and accepting it as the answer.

URL: https://learn.microsoft.com/en-us/answers/questions/5921323/azure-openai-realtime-websocket-returns-replacemen