Hi Aparna!
Welcome to the Azure AI Q&A forum, and thank you for sharing your query.
We tried to reproduce the scenario with sample Japanese documents and were able to get the translated text without any overlapping text, using the following code:
import requests, uuid, json
from PyPDF2 import PdfReader
# Add your key and endpoint
key = "<endpointkey>"
endpoint = "<endpointurlfortext>"
path = '/translate'
constructed_url = endpoint + path
params = {
    'api-version': '3.0',
    'from': 'ja',
    'to': ['en']
}
headers = {
    'Ocp-Apim-Subscription-Key': key,
    # location required if you're using a multi-service or regional (not global) resource.
    'Ocp-Apim-Subscription-Region': 'northeurope',
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}
# Read the content of the local PDF document
document_path = "/content/X_L02.pdf"
pdf_text = ""
with open(document_path, 'rb') as file:
    pdf_reader = PdfReader(file)
    # Concatenate the extracted text of every page
    for page in pdf_reader.pages:
        pdf_text += page.extract_text()
# You can pass more than one object in body.
body = [{
    'text': pdf_text
}]
response = requests.post(constructed_url, params=params, headers=headers, json=body)
translated_response = response.json()
print(json.dumps(translated_response, sort_keys=True, ensure_ascii=False, indent=4, separators=(',', ': ')))
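If you only need the translated strings rather than the full JSON, a small helper along these lines may help. This is a minimal sketch: `extract_translations` is a hypothetical name, not part of any SDK, and it assumes the v3.0 response shape (a list of items, each with a `translations` list) and the `error` object the service returns on failure. The sample payload is made up for illustration.

```python
def extract_translations(translated_response):
    """Return the translated strings from a Translator v3.0 /translate response."""
    # On failure the service returns a dict with an "error" key instead of a list.
    if isinstance(translated_response, dict) and "error" in translated_response:
        err = translated_response["error"]
        raise RuntimeError(
            f"Translation failed: {err.get('code')} {err.get('message')}"
        )
    texts = []
    for item in translated_response:
        for translation in item.get("translations", []):
            texts.append(translation["text"])
    return texts


# Hypothetical sample payload, shaped like a successful response
sample = [{"translations": [{"text": "Hello", "to": "en"}]}]
print(extract_translations(sample))
```

With the code above, you would call it as `extract_translations(translated_response)`.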
Please try the options below and let us know the results.
1. Use PyPDF2 or another advanced PDF tool to extract the document text with its format intact and without corrupting it.
2. Use OCR:
2.1. Use OCR to extract the text from your documents and save it as text files.
2.2. Translate the text files to English.
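For step 2.2, a long text file may need to be sent in several requests, because the Translator service limits the number of characters per request (50,000 to our knowledge; please check the current service limits). The sketch below shows one possible way to split the text on line boundaries. `split_text` and the default `max_chars` value are assumptions for illustration, not part of the Translator API.

```python
def split_text(text, max_chars=50000):
    """Split text into chunks of at most max_chars, preferring line boundaries."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        # Hard-split a single line that is longer than the limit on its own.
        while len(line) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(line[:max_chars])
            line = line[max_chars:]
        # Start a new chunk when adding this line would exceed the limit.
        if len(current) + len(line) > max_chars:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks


# One request body per chunk, in the same shape as `body` in the code above
sample_text = "aaaa\nbbbb\ncccc\n"
bodies = [[{'text': chunk}] for chunk in split_text(sample_text, max_chars=10)]
print(len(bodies))
```

Each element of `bodies` can then be posted with `requests.post(constructed_url, params=params, headers=headers, json=body)` as in the code above, and the translated chunks joined in order.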
Thank you