Getting overlapping text and extra spaces after translating a PDF. How can I fix this?

Question

After translating a PDF document containing Japanese text to English, I am getting overlapping text and extra spaces. How can I fix this issue?

Answer

Hi Aparna!

Welcome to the Azure AI Q&A forum. Thank you for sharing your query.

We tried to reproduce the scenario with sample Japanese documents and were able to get translated text without any overlap, using the script below.

import requests, uuid, json
from PyPDF2 import PdfReader

# Add your key and endpoint
# (the global Translator endpoint is https://api.cognitive.microsofttranslator.com)
key = ""
endpoint = ""
path = '/translate'
constructed_url = endpoint + path

params = {
    'api-version': '3.0',
    'from': 'ja',
    'to': ['en']
}

headers = {
    'Ocp-Apim-Subscription-Key': key,
    # Location is required if you're using a multi-service or regional (not global) resource.
    # Replace 'northeurope' with the region of your own resource.
    'Ocp-Apim-Subscription-Region': 'northeurope',
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# Read the content of the local PDF document
document_path = "/content/X_L02.pdf"
pdf_text = ""

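with open(document_path, 'rb') as file:
    pdf_reader = PdfReader(file)
    for page in pdf_reader.pages:
        # Image-only pages may yield no text, so fall back to an empty string.
        # The newline keeps text from adjacent pages from running together.
        pdf_text += (page.extract_text() or "") + "\n"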

# You can pass more than one object in body.
body = [{
    'text': pdf_text
}]

response = requests.post(constructed_url, params=params, headers=headers, json=body)
translated_response = response.json()

print(json.dumps(translated_response, sort_keys=True, ensure_ascii=False, indent=4, separators=(',', ': ')))
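If you only need the translated strings rather than the full JSON dump, a small helper like the one below may be useful. It is a minimal sketch, not part of the original repro: the function name `extract_translations` is hypothetical, and it assumes the usual Translator v3 response shape (a list of items, each with a `translations` list) and the `{"error": {...}}` shape for failures.

```python
def extract_translations(payload):
    """Return the translated strings from a Translator v3 /translate response.

    Assumes a successful response is a list of items, each holding a
    'translations' list, and that a failure is a dict with an 'error' key.
    """
    if isinstance(payload, dict) and "error" in payload:
        err = payload["error"]
        raise RuntimeError(f"Translator error {err.get('code')}: {err.get('message')}")
    return [t["text"] for item in payload for t in item.get("translations", [])]


# Example with a hand-written payload in the assumed response shape.
sample = [{"translations": [{"text": "Hello", "to": "en"}]}]
print(extract_translations(sample))
```

With the script above, you would call `extract_translations(translated_response)` after the request completes.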

Please try the options below and let us know how it goes.

1. Use PyPDF2 or another advanced PDF tool to extract the document text, so that its formatting stays intact and the text is not corrupted.

2. Use OCR:

2.1. Run OCR on your documents to get the text data and save it as text files.

2.2. Translate the text files to English.
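Whichever option you pick, it may help to clean the extracted text before sending it for translation, since stray spaces and hard line wraps in the source often show up as extra spaces in the output. The sketch below is an illustration only: the helper names `normalize_extracted_text` and `split_into_chunks` are hypothetical, the 40,000-character chunk size is an assumed safety margin (please check the Translator request limits for your tier), and joining wrapped lines without a space assumes Japanese source text.

```python
import re


def normalize_extracted_text(text: str) -> str:
    """Collapse stray whitespace in text extracted from a PDF or OCR output."""
    cleaned = []
    # Blank lines are treated as paragraph boundaries.
    for para in re.split(r"\n\s*\n", text):
        # Collapse runs of spaces, tabs and full-width spaces inside each line.
        lines = [re.sub(r"[ \t\u3000]+", " ", line).strip() for line in para.splitlines()]
        # Japanese has no inter-word spaces, so wrapped lines are joined directly.
        joined = "".join(line for line in lines if line)
        if joined:
            cleaned.append(joined)
    return "\n\n".join(cleaned)


def split_into_chunks(text: str, max_chars: int = 40000) -> list[str]:
    """Split text on paragraph boundaries so each chunk fits in one request."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        # Hard-split a paragraph that is itself longer than the limit.
        pieces = [para[i:i + max_chars] for i in range(0, len(para), max_chars)] or [""]
        for piece in pieces:
            candidate = piece if not current else current + "\n\n" + piece
            if len(candidate) <= max_chars:
                current = candidate
            else:
                chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent as its own `{'text': chunk}` entry in the request body, and the translated pieces joined back together in order.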

Thank you
