Hi @Shree Hima Bindu Maganti
Thanks for providing the solution approach.
This is similar to the approach we implemented earlier. It works, but it is rather time-consuming.
I am looking for a solution that targets only the updated content of a PDF.
Consider our document training process, which works like this:
- Pass the document files (PDFs) to Azure Doc Intelligence to get the text chunks.
- Generate vector embeddings for the text chunks and doc metadata using Azure OpenAI embeddings.
- Store these embeddings in the vector DB (Azure AI Search).
This is a pretty standard training process.
For updated document files, I want to replace the vector embeddings for the updated content only.
For example:
Suppose I have a 5-page PDF and I update the content of page 2 while keeping the other pages the same. When re-processing the file, I expect it to update only the vector embeddings of the text chunks from the updated part of the PDF, and to keep the embeddings of all other text chunks as they are, since nothing changed in them.
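To make the expectation concrete, here is a minimal sketch of the kind of change detection I have in mind. It is not tied to any Azure SDK: it assumes each chunk has a stable ID (the hypothetical `doc1-p2` style IDs below) and that a content hash was saved alongside each chunk on the previous run. The names `chunk_hash` and `plan_update` are made up for illustration.

```python
import hashlib


def chunk_hash(text: str) -> str:
    """Fingerprint a chunk's text, ignoring differences in whitespace."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_update(stored_hashes: dict, new_chunks: dict):
    """Compare the previous run's hashes with freshly extracted chunks.

    stored_hashes: chunk_id -> hash saved during the previous run
    new_chunks:    chunk_id -> text extracted from the updated file
    Returns (to_embed, unchanged, to_delete).
    """
    to_embed = {}   # new or changed chunks that need fresh embeddings
    unchanged = []  # chunks whose existing embeddings can be kept
    for chunk_id, text in new_chunks.items():
        if stored_hashes.get(chunk_id) == chunk_hash(text):
            unchanged.append(chunk_id)
        else:
            to_embed[chunk_id] = text
    # Chunks that existed before but are gone from the updated file.
    to_delete = [cid for cid in stored_hashes if cid not in new_chunks]
    return to_embed, unchanged, to_delete


# Hypothetical 5-page PDF with one chunk per page; only page 2 changes.
old_chunks = {f"doc1-p{n}": f"content of page {n}" for n in range(1, 6)}
stored = {cid: chunk_hash(text) for cid, text in old_chunks.items()}

new_chunks = dict(old_chunks)
new_chunks["doc1-p2"] = "updated content of page 2"

to_embed, unchanged, to_delete = plan_update(stored, new_chunks)
print(sorted(to_embed))  # ['doc1-p2']
```

With something like this, only the chunks in `to_embed` would go to the embedding model and be upserted into the index, and the IDs in `to_delete` would be removed. One caveat I am aware of: if the edit shifts text across chunk boundaries, position-based IDs will no longer line up, so using the hash itself as the chunk key may be more robust.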
I also expect the same behavior for CSV document files.
Thanks in advance, I appreciate your efforts.
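For the CSV case, a rough sketch of what I mean is below. It assumes every row has a unique key column (the hypothetical `id` column here), that rows are well-formed, and that each row maps to one chunk. The helper names `row_hashes` and `changed_keys` are illustrative only.

```python
import csv
import hashlib
import io


def row_hashes(csv_text: str, key_column: str) -> dict:
    """Map each row's key to a hash of its full content."""
    hashes = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Sort columns so that column order does not affect the hash.
        canonical = "\x1f".join(f"{col}={row[col]}" for col in sorted(row))
        hashes[row[key_column]] = hashlib.sha256(
            canonical.encode("utf-8")
        ).hexdigest()
    return hashes


def changed_keys(old: dict, new: dict) -> list:
    """Keys of rows that are new or whose content differs from before."""
    return sorted(key for key, h in new.items() if old.get(key) != h)


old_csv = "id,name,price\n1,apple,1.0\n2,pear,2.0\n"
new_csv = "id,name,price\n1,apple,1.0\n2,pear,2.5\n3,plum,3.0\n"

print(changed_keys(row_hashes(old_csv, "id"), row_hashes(new_csv, "id")))
# ['2', '3']
```

Only the rows returned by `changed_keys` would need new embeddings; row `1` would keep its existing vector.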