Single Document Translation API does not pick up some words in glossary file
We use Azure's single document translation REST API and use TSV file based glossaries. I have the following glossary:
| English | Italian |
|---|---|
| Remittance | Rimesse |
However, the response from Azure applies the glossary entry only once (and keeps it in a different case from the one given in the glossary file). Other occurrences are not translated with the glossary term (e.g., the second line in the following screenshot, where the source has the exact same case). [User's image]
Please can you urgently suggest how to resolve this problem?
2 answers
-
Thanmayi Godithi 10,655 Reputation points • Microsoft External Staff • Moderator
Hi @II, this behavior is expected with Azure Document Translation when using TSV‑based glossaries. A glossary entry (for example, Remittance → Rimesse) is treated as a strong hint, not a guaranteed replacement, so it may apply to the first occurrence but not every subsequent one, even when the casing looks identical. This is due to how the neural translation engine and segment‑level processing work, not an issue with your TSV alone.
That said, here are the key things to double‑check and ways to improve consistency:
- Glossary file formatting
  - Ensure the TSV has exactly two columns (source, target) and no header row.
  - Save the file as UTF‑8 without BOM.
  - Use real tab characters (not spaces).
  - Make sure the glossary format is explicitly set to `"format": "tsv"` in the request.
- Case and punctuation handling
  - Glossary matching is token‑ and segment‑based.
  - If a term appears with punctuation (for example, `Remittance.`), that variant may not match.
  - To improve coverage, add explicit casing variants to the TSV, one tab‑separated pair per row: `Remittance → Rimesse`, `remittance → rimesse`, `REMITTANCE → RIMESSE`.
- Segment‑level behavior
  - Each glossary entry is typically applied once per segment.
  - If multiple occurrences are in the same sentence/segment, only the first may be replaced.
  - Splitting long sentences or adding clearer sentence boundaries can help.
- Validation and workaround
  - Test with a minimal document containing the same word repeated across multiple lines to confirm segment behavior.
  - If strict terminology enforcement is required, a post‑translation find/replace step is a common workaround.
  - You can also test whether the Batch Document Translation API gives more consistent results for your content.
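The formatting checks in the list above can be automated before uploading a glossary. The following is only a minimal sketch: `check_glossary_tsv` is a hypothetical helper name, and the rules it applies (no BOM, valid UTF‑8, exactly two tab-separated columns, no stray whitespace) are taken from the checklist in this answer, not from any Azure validation logic.

```python
import codecs


def check_glossary_tsv(raw: bytes) -> list[str]:
    """Return human-readable problems found in the raw bytes of a TSV glossary."""
    problems = []
    # A UTF-8 byte order mark at the start of the file is reported, then skipped.
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are ignored in this sketch
        columns = line.split("\t")
        if len(columns) != 2:
            problems.append(
                f"line {number}: expected 2 tab-separated columns, found {len(columns)}"
            )
        elif any(not col or col != col.strip() for col in columns):
            problems.append(
                f"line {number}: empty cell or stray whitespace around a term"
            )
    return problems
```

Calling `check_glossary_tsv(open("glossary.tsv", "rb").read())` (the file name is a placeholder) returns an empty list when none of these problems is found. A header row cannot be detected this way, so that item still needs a manual look.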
If the issue persists even with identical casing and simplified input, please let us know.
Hope this helps clarify the behavior and next steps.
-
Jerald Felix 13,500 Reputation points • Volunteer Moderator
Hello II,
Greetings!
Thanks for raising this question in the Q&A forum.
This is a known behaviour with Azure's Document Translation API when using glossary files. Here's what's happening: the glossary in Azure Translator is not guaranteed to apply to every single occurrence of a term in the document. The translation engine uses the glossary as a strong hint, but it still runs a neural machine translation model underneath. This means it may apply the glossary term for the first occurrence but then make its own decision for subsequent ones — especially if the surrounding sentence context or casing differs. This is a design limitation of how glossaries interact with the neural translation engine, not a bug in your TSV file.
Here are some practical steps to improve glossary consistency:
Step 1: Verify your TSV file formatting
Make sure your TSV file has no extra spaces, BOM characters, or hidden formatting. The columns must be separated by a proper tab character, not spaces. You can open the file in Notepad++ and enable "Show All Characters" to confirm tabs are clean.
Step 2: Add case variations to your glossary
Add multiple rows in your TSV to cover different casings of the same word. For example:
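- `Remittance → Rimesse`
- `remittance → rimesse`
- `REMITTANCE → RIMESSE`

In the TSV file, each pair is one row with a tab between the source and target term. This gives the engine explicit instructions for each casing variant it might encounter.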
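For a glossary with many terms, the casing variants could be generated rather than typed by hand. This is a rough sketch under stated assumptions: `with_case_variants` and `to_tsv` are hypothetical helper names, and the choice of three variants (as written, lowercase, uppercase) simply mirrors the example in this step.

```python
def with_case_variants(pairs):
    """Expand (source, target) pairs into as-written, lowercase and uppercase rows."""
    rows = []
    seen = set()
    for source, target in pairs:
        variants = [
            (source, target),
            (source.lower(), target.lower()),
            (source.upper(), target.upper()),
        ]
        for row in variants:
            if row not in seen:  # skip duplicates, e.g. when the term is already lowercase
                seen.add(row)
                rows.append(row)
    return rows


def to_tsv(rows):
    """Serialize rows as tab-separated lines with no header row."""
    return "".join(f"{source}\t{target}\n" for source, target in rows)
```

For example, `to_tsv(with_case_variants([("Remittance", "Rimesse")]))` produces three rows. When writing the result to disk, `open(path, "w", encoding="utf-8", newline="")` avoids adding a BOM.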
Step 3: Use the `caseSensitive` glossary option if available
When making your API call, check whether you are passing the glossary with case sensitivity settings. Ensuring case-insensitive matching can help catch all variations of a term.
Step 4: Keep sentences short and consistent in the source document
The neural model is more likely to override glossary entries in long, complex sentences where it decides a different phrasing fits better. Where possible, break up long sentences in your source document to give the glossary a better chance of being applied.
Step 5: Test with a minimal document
Create a simple test document with just 3 to 5 lines all containing the word "Remittance" in the same casing, and run the translation again. This helps confirm whether the issue is casing-related, sentence context-related, or a more specific API behaviour.
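The minimal test in this step can be scripted so that skipped occurrences show up as a simple count mismatch. This is only a sketch: `build_test_document` and `count_term` are hypothetical helpers, the sentence template is arbitrary, and the translation call itself is left out.

```python
import re


def build_test_document(term: str, lines: int = 5) -> str:
    """Build a plain-text document that repeats the term once per line, same casing."""
    body = [f"Line {i}: The {term} was processed today." for i in range(1, lines + 1)]
    return "\n".join(body) + "\n"


def count_term(text: str, term: str, ignore_case: bool = False) -> int:
    """Count whole-word occurrences of term in text."""
    flags = re.IGNORECASE if ignore_case else 0
    # Lookarounds prevent matches inside longer words.
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return len(re.findall(pattern, text, flags))
```

After translating the generated document, compare `count_term(source, "Remittance")` with `count_term(translated, "Rimesse")`. A lower count on the translated side means some occurrences were skipped, and repeating the count with `ignore_case=True` shows whether the term was applied but with different casing.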
Step 6: Consider Batch Document Translation for more control
If consistent glossary application is critical for your use case, the Batch Document Translation API sometimes handles glossaries more reliably across large documents compared to the single document endpoint. It may be worth testing the same document through the batch endpoint to compare results.
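To compare against the batch endpoint, the glossary is referenced by URL in the request body instead of being uploaded with the document. The sketch below only builds that body. The field names reflect my understanding of the batch request schema and should be verified against the current REST reference; `build_batch_request` is a hypothetical helper, and all URLs are placeholders for your own storage container locations.

```python
import json


def build_batch_request(source_url, target_url, glossary_url,
                        source_lang="en", target_lang="it"):
    """Build a batch translation request body that attaches one TSV glossary."""
    return {
        "inputs": [
            {
                "source": {"sourceUrl": source_url, "language": source_lang},
                "targets": [
                    {
                        "targetUrl": target_url,
                        "language": target_lang,
                        # The glossary format is declared explicitly as TSV.
                        "glossaries": [
                            {"glossaryUrl": glossary_url, "format": "tsv"}
                        ],
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    body = build_batch_request(
        "https://example.invalid/source-container",   # placeholder
        "https://example.invalid/target-container",   # placeholder
        "https://example.invalid/glossary/terms.tsv",  # placeholder
    )
    print(json.dumps(body, indent=2))
```

Running the same source file through both endpoints with the same glossary makes it easier to tell whether the skipped occurrences are specific to the single document endpoint.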
Step 7: Raise a support ticket if the issue persists
If after trying all the above the glossary is still being skipped for identical terms in the same casing, this warrants a proper bug report. Open an Azure Support request with your TSV file, the source document, and the translated output as attachments so the Azure Translator team can investigate the specific skipped occurrences.
As a general expectation to set: Azure Translator's glossary is designed to be a best-effort override, not a guaranteed 100% replacement for every occurrence. For strict terminology enforcement across all occurrences, a post-processing find-and-replace step on the translated output is often used as a practical workaround alongside the glossary.
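A post-processing pass of this kind might look like the sketch below. It assumes you have already identified which renderings the engine produced instead of the glossary term. The `replacements` mapping is something you supply yourself, `enforce_terms` is a hypothetical helper, and the Italian words in the usage note are illustrative only, not observed Azure output.

```python
import re


def enforce_terms(translated: str, replacements: dict[str, str]) -> str:
    """Replace whole-word unwanted renderings with the exact glossary target."""
    if not replacements:
        return translated
    lookup = {wrong.lower(): right for wrong, right in replacements.items()}
    # Longer alternatives first so multi-word terms win over their prefixes.
    alternatives = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(a) for a in alternatives) + r")(?!\w)",
        re.IGNORECASE,
    )

    def swap(match):
        found = match.group(1)
        # Always emit the glossary casing; fall back to the original text if unmapped.
        return lookup.get(found.lower(), found)

    return pattern.sub(swap, translated)
```

For instance, `enforce_terms("La rimessa è arrivata.", {"rimessa": "Rimesse"})` returns `"La Rimesse è arrivata."`. As that output shows, a blind replacement can leave articles and agreement wrong in the target language, so the result is worth a review pass before delivery.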
If this answer helps, kindly accept it, which will help others who have similar questions.
Best Regards,
Jerald Felix.
