Custom translator model is generating additional/unnecessary characters

Tyler Chlumecky 20 Reputation points

I have noticed an issue where the training mechanism seems to be adding unexpected characters to the test-output/trained model, but only for one of my models.

I have two separate Custom Models that I am training via Azure Translator.

There is one for each language (es-MX and fr-CA).

There are 3 documents I have prepared and provided in order to train each of these models:

1. Human-translated phrases from English to [target language]
2. Vernacular Dictionary (domain-specific phrases for proper nouns commonly used throughout our system)
3. Token-based Dictionary (phrases throughout our website use tokens for singular/plural representations of commonly used proper nouns, as well as additional use cases where we do string interpolation on portions of phrases that we translate)
  1. For example: "There are {0} course(s) in this Curriculum" or "You have {submissionCount} submissions available for review"
  2. We do not want to translate the placeholders enclosed in these braces (or similar ones), so they have been defined within this dictionary.

The Spanish model seems to be correct every time and I have no issues when training/translating tokens as needed.

The French model, however, produces incorrect output in both the Test output and the translations I request via the model's API call.

Instead of obtaining the representation in French for these sentences:
"There are {0} course(s) in this Curriculum" or "You have {submissionCount}"

They wind up with an extra curly brace for the tokens like this:
"There are {{0} course(s) in this Curriculum" or "You have {{submissionCount}"

Please Advise

kothapally Snigdha 1,185 Reputation points Microsoft Vendor

2025-01-15T17:06:45.8033333+00:00
Hi Tyler Chlumecky

Greetings & Welcome to the Microsoft Q&A forum! Thank you for sharing your query.

The issue you're experiencing with the French model in Azure Translator, where tokens are output with an extra curly brace (e.g., "{{0}" instead of "{0}"), could stem from several factors. Here are some troubleshooting steps to follow.

Ensure that your token-based dictionary is correctly aligned. The Custom Translator does not sentence-align dictionary files, so it's crucial that there are equal numbers of source and target phrases, and that they are precisely aligned. If there is any misalignment, it can lead to unexpected behavior during translation.

Double-check the format of your token-based dictionary. It should explicitly define how tokens are treated. For example, if your dictionary includes entries for tokens, they should be formatted correctly without any additional characters. Make sure that the entries for tokens like {0} and {submissionCount} are correctly specified without extra braces.

The quality and quantity of your training data can significantly impact the model's performance. Ensure that you have enough parallel sentences (at least 10,000) in your training set to allow for effective learning. If the model lacks sufficient training data, it may not handle tokens as expected.

If adjustments have been made to your dictionaries or training data, consider retraining the model to see if the issue persists. Sometimes, changes in the training setup can lead to improvements in how the model interprets tokens.

Test the model with various inputs to see if the issue is consistent across different sentences or specific to certain phrases. This can help identify if it's a broader issue with the model or isolated to specific cases.
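The alignment and token checks described above can be sketched in a few lines of Python. This is only an illustrative helper, not part of Custom Translator: the function names `placeholder_issues` and `check_pairs` are hypothetical, and it assumes placeholders are single-brace tokens made of letters, digits, or underscores (such as {0} or {submissionCount}). It can be run over dictionary rows or over source/test-output pairs.

```python
from __future__ import annotations

import re

# Assumption: placeholders look like {0} or {submissionCount}.
PLACEHOLDER = re.compile(r"\{[A-Za-z0-9_]+\}")


def placeholder_issues(source: str, target: str) -> list[str]:
    """Return human-readable problems found in one source/target pair."""
    issues = []
    src_tokens = sorted(PLACEHOLDER.findall(source))
    tgt_tokens = sorted(PLACEHOLDER.findall(target))
    if src_tokens != tgt_tokens:
        issues.append(f"placeholder mismatch: {src_tokens} vs {tgt_tokens}")
    # Remove well-formed placeholders; any brace left over is stray.
    leftover = PLACEHOLDER.sub("", target)
    if "{" in leftover or "}" in leftover:
        issues.append("stray brace in target")
    return issues


def check_pairs(sources: list[str], targets: list[str]) -> dict[int, list[str]]:
    """Check that rows are aligned and that each row keeps its placeholders."""
    if len(sources) != len(targets):
        raise ValueError(
            f"row count differs: {len(sources)} source vs {len(targets)} target"
        )
    report = {}
    for row, (src, tgt) in enumerate(zip(sources, targets)):
        issues = placeholder_issues(src, tgt)
        if issues:
            report[row] = issues
    return report
```

For instance, calling `placeholder_issues("You have {submissionCount}", "You have {{submissionCount}")` flags a stray brace in the target, which is the symptom reported in this thread.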

I hope this helps you. Thank you.
Tyler Chlumecky 20 Reputation points

2025-01-15T21:37:52+00:00
Hi, I will address the relevant bullet points you have listed out as I have already taken these steps to avoid this issue:

Ensuring a clean/aligned dictionary set

Here's a small example of the dictionary set I've provided and used for training. Again, this is no different from my other model, which doesn't experience this issue. It is Excel-based, so all of the columns are guaranteed to be in alignment.

Ensuring minimum training data requirement is met

As you can see below, we are over the minimum requirement for training data, so this should not be an issue either. It is worth noting that there are several instances of these types of phrases with {SomeToken} within both the English and target-language training data. However, once the training is run, I receive the output I have mentioned in this ticket when viewing the test runs.
kothapally Snigdha 1,185 Reputation points Microsoft Vendor

2025-01-18T04:48:23.51+00:00

Hi Tyler Chlumecky,

Sorry for the delay.

Thank you for providing the details of your training process. Here are a few suggestions to address the issue you're encountering with the test run output:

While your dictionary set is aligned, please ensure there are no hidden characters or extra spaces in your Excel data, as these can sometimes cause issues.

For phrases containing tokens like {SomeToken}, double-check that they are consistently defined and handled properly in both your training data and the model configuration.

Since you're meeting the minimum data requirement, try running smaller subsets of the data to identify if any specific portion is causing the issue.

Review your training logs and outputs carefully to see if there’s any pattern or misprocessing of the placeholders during test runs.
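One way to look for hidden characters or extra spaces is to scan each dictionary cell after exporting it as text. The sketch below is a minimal illustration using only Python's standard `unicodedata` module; the function name `find_hidden_chars` is hypothetical, and it assumes the cell values have already been read from the Excel file into strings.

```python
from __future__ import annotations

import unicodedata


def find_hidden_chars(cell: str) -> list[tuple[int, str]]:
    """Return (index, description) for characters that are easy to miss in Excel."""
    findings = []
    for index, ch in enumerate(cell):
        category = unicodedata.category(ch)
        label = unicodedata.name(ch, f"U+{ord(ch):04X}")
        if category in ("Cc", "Cf"):
            # Control and format characters, e.g. zero-width space or BOM.
            findings.append((index, label))
        elif category == "Zs" and ch != " ":
            # Space look-alikes such as the no-break space.
            findings.append((index, label))
    if cell != cell.strip():
        # Index -1 marks a whole-cell finding rather than a single character.
        findings.append((-1, "leading or trailing whitespace"))
    return findings
```

An empty result for every cell in both the source and target columns would suggest the dictionary itself is clean and the extra brace is introduced elsewhere.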

I hope this helps you. Thank you.
Tyler Chlumecky 20 Reputation points

2025-01-20T14:42:54.3933333+00:00
Hi @kothapally Snigdha, can you clarify what you mean by training logs?

Since I have duplicated the columns [for my dictionary mapping] and I am essentially using the same data set for both languages, this is a non-issue. There are no additional/accidental characters within my dictionary mapping.

I'm not sure what you mean by either of the following here:

Smaller subsets of data - If the minimum is 10k, I don't know how I would get around that?

Regarding logs, the closest thing I can see to that is the test results from the training itself. As explained, I can see that the test results immediately start producing the undesired output I've stated within this ticket. Is this what you're referring to or is there some area within the portal that I can see training log output instead?
kothapally Snigdha 1,185 Reputation points Microsoft Vendor

2025-01-22T05:13:46.8033333+00:00
Hi Tyler Chlumecky,

Sorry for the delay.

The term "training logs" typically refers to records of the training process, which may include details about the training iterations, performance metrics, and any errors encountered during training. In the context of the Custom Translator, the closest equivalent you might find is the test results that show how well the model is performing after training. Unfortunately, there may not be a dedicated section for detailed training logs within the portal.

The minimum requirement of 10,000 sentences for training is a guideline to ensure that the model has enough data to learn effectively. If you are struggling to meet this requirement, you might consider gathering more data or ensuring that your training documents are sufficiently diverse and representative of the language pair you are working with.

If you are using the same dataset for both languages and have verified that there are no additional characters in your dictionary mapping, you may want to review the alignment and ensure that the sentences are correctly paired without any discrepancies. Could you please refer to this document: https://learn.microsoft.com/en-us/azure/ai-services/translator/custom-translator/concepts/sentence-alignment#next-steps

I hope this helps you. Thank you.
Tyler Chlumecky 20 Reputation points

2025-01-22T18:00:28.0933333+00:00

Hi, I do not feel like I am getting anywhere with these responses.

In the 1st bullet of your last response, what I gathered was that you are asking me to look at something that isn't available within the system? I'm unsure of how this helps me with what I'm doing?

To your 2nd bullet point, I am well aware that 10k is the minimum number of records needed for training. Without having this, I wouldn't have been able to do the training in the first place so I'm unsure as to why this is worth mentioning? The only reason I had mentioned running less than 10k was in response to your other suggestion stated here as it did not make sense to me: "Since you're meeting the minimum data requirement, try running smaller subsets of the data to identify if any specific portion is causing the issue."

So maybe you can clarify that point please?
kothapally Snigdha 1,185 Reputation points Microsoft Vendor

2025-01-23T04:23:35.96+00:00

Hi Tyler Chlumecky,

I agree that this issue looks strange, and I wasn't able to reproduce it. If you have a support plan, could you please file a support ticket for deeper investigation and share the SR# with us? If you don't have a support plan, please let us know here.
Tyler Chlumecky 20 Reputation points

2025-01-23T20:46:52.4133333+00:00

Hi @kothapally Snigdha , at this time we do not have a support plan as we are still attempting to prove this out for the use case we would like to leverage this for.
kothapally Snigdha 1,185 Reputation points Microsoft Vendor

2025-01-27T17:57:00.0033333+00:00
Hi Tyler Chlumecky,

Could you please try the following?

Go to the Azure portal and navigate to your Translator resource.

Click on the "Support + troubleshooting" tab.

Fill out the required information, including a detailed description of the issue and any steps you have taken to troubleshoot it.

Submit the support request. https://portal.azure.com/#view/Microsoft_Azure_Support/HelpAndSupportBlade/~/overview

I hope this helps you. Thank you.
