Problem translating HTML text

FRJohan 1 Reputation point
2025-02-13T08:57:40.6966667+00:00

We use the Azure Translator 3.0 API to translate from Swedish to English, but have encountered cases where HTML tags degrade the translation. If we send the same content as plain text, the translation is correct.

This is an excerpt from a longer text. Example:

HTML: <strong>Däremot ersätter</strong> Tjava inte lätt topptursutrustning, vilket en del tror.
Azure translate result: <strong>However, the</strong> Don't lightly fuss about ski touring equipment, as some people think.

PLAIN: Däremot ersätter Tjava inte lätt topptursutrustning, vilket en del tror.
Azure translate result: However, Tjava does not replace light ski touring equipment, as some people think.

Azure Translator

1 answer

  1. Manas Mohanty 465 Reputation points Microsoft Vendor
    2025-02-13T15:27:00.2866667+00:00

    Hi FRJohan!

    It seems that the HTML tags are causing the Azure Translator 3.0 API to misinterpret the structure of the sentence, leading to incorrect translations. To mitigate this issue, you can preprocess the text to remove or handle HTML tags before sending it to the translation API.

    Here are a few approaches you can consider:

    1. Remove HTML Tags:

    Strip all HTML tags from the text before sending it to the translator. This ensures that the translation process focuses solely on the plain text.

    Here's a simple example in Python:

       # Requires the third-party package beautifulsoup4 (pip install beautifulsoup4)
       from bs4 import BeautifulSoup

       def strip_html_tags(text):
           # Parse the markup and return only its text content
           soup = BeautifulSoup(text, "html.parser")
           return soup.get_text()

       html_text = "<strong>Däremot ersätter</strong> Tjava inte lätt topptursutrustning, vilket en del tror."
       plain_text = strip_html_tags(html_text)
       print(plain_text)
    
      Output:

       Däremot ersätter Tjava inte lätt topptursutrustning, vilket en del tror.

    2. Preserve HTML Tags:

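    Translate the text while preserving the HTML tags. One way is to split the markup into tag segments and text segments, translate only the text segments, and then recombine them with the tags in their original positions. Note that each text segment is then translated on its own, so a sentence that spans a tag is sent to the translator in pieces.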

    Here's an example approach using Python:

       
       import re

       TAG_PATTERN = re.compile(r'(<[^>]+>)')

       def translate_preserving_tags(html_text, translator_client):
           # The capturing group makes split() keep the tags as their own segments
           segments = TAG_PATTERN.split(html_text)
           translated_segments = []
           for segment in segments:
               if TAG_PATTERN.fullmatch(segment) or not segment.strip():
                   translated_segments.append(segment)  # HTML tag or empty segment: keep as-is
               else:
                   # translator_client is a placeholder for your own wrapper around the
                   # Translator API; translate() is assumed to take and return a string
                   translated_segments.append(translator_client.translate(segment))
           return ''.join(translated_segments)

       html_text = "<strong>Däremot ersätter</strong> Tjava inte lätt topptursutrustning, vilket en del tror."

       translated_text = translate_preserving_tags(html_text, translator_client)
       print(translated_text)
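
To check what this segment-by-segment approach actually sends to the translator, here is a minimal runnable sketch with a stand-in client. `FakeTranslator` is hypothetical and not part of any Azure SDK: it uppercases the text instead of translating it and records every segment it receives.

```python
import re

TAG_PATTERN = re.compile(r"(<[^>]+>)")


class FakeTranslator:
    """Hypothetical stand-in for a real Translator wrapper; not an Azure API."""

    def __init__(self):
        self.calls = []  # every segment passed to translate()

    def translate(self, text):
        self.calls.append(text)
        return text.upper()  # placeholder for a real translation


def translate_preserving_tags(html_text, translator_client):
    parts = []
    for segment in TAG_PATTERN.split(html_text):
        if TAG_PATTERN.fullmatch(segment) or not segment.strip():
            parts.append(segment)  # tag or empty/whitespace: keep unchanged
        else:
            parts.append(translator_client.translate(segment))
    return "".join(parts)


fake = FakeTranslator()
html_text = "<strong>Däremot ersätter</strong> Tjava inte lätt topptursutrustning, vilket en del tror."
result = translate_preserving_tags(html_text, fake)
print(result)      # tags are in their original positions
print(fake.calls)  # the individual fragments the translator received
```

The recorded calls show that the sentence reaches the translator as two separate fragments, so the engine never sees it as a whole. That may matter for sentences like this one, where the tag sits in the middle of a clause.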
    

    By preprocessing the text or adjusting how the translation request is handled, you should be able to get the desired translation.
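
One way to adjust how the request is handled, if it is not already doing so, is to tell the service that the input is HTML through the `textType=html` query parameter of the v3.0 `translate` endpoint (the default is `plain`). The sketch below only builds the request with the standard library and does not send it. The key and region values are placeholders, and `build_translate_request` is a name invented for this example. Whether this setting fixes the specific mistranslation above would need to be tested against the service.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://api.cognitive.microsofttranslator.com/translate"


def build_translate_request(html_text, key, region, source="sv", target="en"):
    """Build (but do not send) a Translator v3.0 request that marks the input as HTML."""
    query = urllib.parse.urlencode({
        "api-version": "3.0",
        "from": source,
        "to": target,
        "textType": "html",  # the service default is "plain"
    })
    body = json.dumps([{"Text": html_text}]).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,        # placeholder credential
        "Ocp-Apim-Subscription-Region": region,  # placeholder resource region
        "Content-Type": "application/json; charset=UTF-8",
    }
    return urllib.request.Request(
        f"{ENDPOINT}?{query}", data=body, headers=headers, method="POST"
    )


request = build_translate_request(
    "<strong>Däremot ersätter</strong> Tjava inte lätt topptursutrustning, vilket en del tror.",
    key="YOUR_KEY",
    region="YOUR_REGION",
)
print(request.full_url)
# To send it: urllib.request.urlopen(request), then json.load() the response.
```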

    If you have any other questions, please let me know. Thank you again for your time and patience throughout this issue.

    Please don’t forget to click Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you; this can be beneficial to other community members.

    Thank You.

