Getting overlapping text and extra spaces after translating a PDF. How can I fix this?

Aparna 0 Reputation points
2025-02-10T11:06:28.4533333+00:00

After translating a PDF document containing Japanese text to English, I am getting overlapping text and extra spaces in the output. How can I fix this issue?

Azure Translator
An Azure service to easily conduct machine translation with a simple REST API call.

1 answer

  1. Manas Mohanty 295 Reputation points Microsoft Vendor
    2025-02-10T12:43:16.4666667+00:00

    Hi Aparna!

    Welcome to the Azure AI Q&A forum. Thank you for sharing your query.

    We tried to reproduce the scenario with sample Japanese documents and were able to get translated text without overlapping text, using the script below.

    import json
    import uuid

    import requests
    from PyPDF2 import PdfReader

    # Add your key and endpoint
    key = "<endpointkey>"
    endpoint = "<endpointurlfortext>"
    path = '/translate'
    constructed_url = endpoint + path

    params = {
        'api-version': '3.0',
        'from': 'ja',
        'to': ['en']
    }

    headers = {
        'Ocp-Apim-Subscription-Key': key,
        # Location is required if you're using a multi-service or regional (not global) resource.
        'Ocp-Apim-Subscription-Region': 'northeurope',
        'Content-type': 'application/json',
        'X-ClientTraceId': str(uuid.uuid4())
    }

    # Read the text content of the local PDF document, page by page
    document_path = "/content/X_L02.pdf"
    page_texts = []

    with open(document_path, 'rb') as file:
        pdf_reader = PdfReader(file)
        for page in pdf_reader.pages:
            # Guard against pages with no extractable text (e.g. scanned images)
            page_texts.append(page.extract_text() or "")

    # Join pages with a newline so the last word of one page does not run into the next
    pdf_text = "\n".join(page_texts)

    # You can pass more than one object in body.
    body = [{
        'text': pdf_text
    }]

    response = requests.post(constructed_url, params=params, headers=headers, json=body)
    response.raise_for_status()  # Fail early on HTTP errors (bad key, wrong region, oversized request)
    translated_response = response.json()

    print(json.dumps(translated_response, sort_keys=True, ensure_ascii=False, indent=4, separators=(',', ': ')))
    

    Please try the options below and let us know the result.

    1. Extract the text with PyPDF2 or another PDF library that keeps the document text intact, so the input sent to the translator is not corrupted.

    2. Use OCR:

    2.1. Run OCR on your documents and save the extracted text as text files.

    2.2. Translate the text files to English.
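
With either option, it may help to clean the extracted text before sending it for translation, since PDF extraction and OCR often leave stray spaces between Japanese characters. The following is only a minimal sketch of that idea, not part of the tested repro: `clean_extracted_text` and `split_into_chunks` are hypothetical helper names, the character ranges are an assumption about what counts as Japanese text, and the `max_chars` default is a placeholder that should be set from the request limits documented for your Translator resource.

```python
import re

# Assumed ranges: CJK punctuation, hiragana, katakana, and common kanji.
_JA = r"[\u3000-\u30ff\u4e00-\u9fff]"


def clean_extracted_text(raw: str) -> str:
    """Remove whitespace artifacts commonly left by PDF extraction or OCR."""
    text = raw.replace("\u3000", " ")          # full-width space -> regular space
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    # Drop a single space sitting between two Japanese characters (assumed artifact)
    text = re.sub(rf"(?<={_JA}) (?={_JA})", "", text)
    text = re.sub(r" *\n *", "\n", text)       # trim spaces around line breaks
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep at most one blank line
    return text.strip()


def split_into_chunks(text: str, max_chars: int = 10000) -> list[str]:
    """Group paragraphs into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        # Hard-split a paragraph that is itself over the budget
        while len(paragraph) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(paragraph[:max_chars])
            paragraph = paragraph[max_chars:]
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

In the script above, this could be used by building the request body as `[{'text': chunk} for chunk in split_into_chunks(clean_extracted_text(pdf_text))]`. If the extra spaces are already present in the source PDF's text layer, cleaning them here should reduce them in the translated output, but it will not change how a layout-preserving tool places text on the page.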

    Thank you
