Azure OpenAI: adding own data fails with error "Cracking and chunking - Data ingestion failed"

Kelvin Shee 55 Reputation points
2025-02-27T09:46:30.2766667+00:00

Hi experts,

I am using Azure AI Foundry and have created a project and hub. The model I am using is GPT-4o along with text-embedding-ada-002 on the Chat Playground. I added my data from a storage account using Azure AI Services and want it to detect my storage files and respond accordingly.

Below is the error I encountered when adding data to Azure OpenAI.

crack_chunk_embed_log.txt

May I know how I can fix this issue?

Thanks.

Azure AI services

1 answer

  1. Saideep Anchuri 3,140 Reputation points Microsoft Vendor
    2025-02-27T10:55:42.6533333+00:00

    Hi Kelvin Shee

    It looks like you're encountering a UnicodeDecodeError while trying to process data in Azure OpenAI.

    Here are some steps to try:

    1. Verify that the data you are attempting to decode is actually encoded in gb2312.
    2. If the data may include characters that gb2312 does not support, switch to GB18030, which is a superset of gb2312 and supports additional characters. For example:
    with open('yourfile.txt', 'r', encoding='GB18030') as file:
        content = file.read()
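
If the source encoding is not known in advance, one option is to normalize each file to UTF-8 before uploading it. The following is only a minimal sketch: it assumes the ingestion step handles UTF-8 text, which this thread does not confirm, and the helper names `decode_with_fallback` and `convert_to_utf8` and the candidate list are hypothetical, not part of any Azure SDK. Trial decoding is a heuristic, so check the converted output before relying on it.

```python
from pathlib import Path

# Candidate encodings, tried in order. This list is an assumption; adjust it to your data.
# gb18030 goes last because it accepts far more byte sequences than the others.
CANDIDATE_ENCODINGS = ("utf-8", "gb2312", "gb18030")


def decode_with_fallback(raw: bytes, encodings=CANDIDATE_ENCODINGS):
    """Return (text, encoding) for the first encoding that decodes raw without error."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode the data")


def convert_to_utf8(src: str, dst: str) -> str:
    """Re-save the file at src as UTF-8 at dst and return the encoding that matched."""
    text, used = decode_with_fallback(Path(src).read_bytes())
    Path(dst).write_text(text, encoding="utf-8")
    return used
```

For example, `convert_to_utf8('yourfile.txt', 'yourfile_utf8.txt')` writes a UTF-8 copy and returns the name of the encoding that decoded the original, which also tells you whether the file was really gb2312.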
    

    Azure OpenAI On Your Data supports the following file types:

    • .txt
    • .md
    • .html
    • .docx
    • .pptx
    • .pdf

    For preprocessing longer text or mixed data types, kindly refer to this link: https://github.com/microsoft/sample-app-aoai-chatGPT/tree/main/scripts#optional-crack-pdfs-to-text

    supported-data-sources

    Thank You.

