How to Update Changes in a Vector Database for PDF Content?

Question


I have a PDF file whose content has already been embedded into vectors and stored in a vector database. Recently, there were some changes made to the PDF. I want to update the corresponding vectors in the vector database to reflect these changes.

What would be the best approach to efficiently update or replace the existing vectors in the database without causing inconsistencies? Are there any specific APIs, tools, or best practices available for this purpose when working with Azure services?

Answer

Hi @Shree Hima Bindu Maganti
Thanks for providing the solution approach.
This is similar to the approach we have already implemented. It works, but it is a bit time-consuming.

I was looking for a solution that deals specifically with the updated content of the PDF.
Consider how the documents are trained initially, which follows a process like this:

Pass the document files (PDFs) to Azure Document Intelligence to extract the text chunks.
Generate vector embeddings for the text chunks and document metadata using Azure OpenAI embeddings.
Store these embeddings in the vector DB (Azure AI Search).

This is a pretty standard training process.
For updated document files, I am looking to replace the vector embeddings of the updated content only.
For example:
Suppose I have a 5-page PDF and I update the content of page 2 while leaving the other pages unchanged. When the file is re-processed, I expect only the vector embeddings of the text chunks from the updated part to be replaced. The embeddings of all other text chunks should stay as they are, since nothing in them has changed.
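
One way to get this behavior, as a minimal sketch rather than a built-in Azure feature, is to derive each chunk's key from a hash of its text and compare the keys of the re-processed chunks with the keys already stored. Only chunks whose key is new need to be re-embedded and uploaded, and keys that no longer appear can be deleted. The names `chunk_key` and `plan_chunk_updates` are hypothetical, and the embedding call and the upload to the index are deliberately left out.

```python
import hashlib


def chunk_key(doc_id: str, text: str) -> str:
    """Deterministic key: same document id + same chunk text -> same key."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return f"{doc_id}-{digest}"


def plan_chunk_updates(doc_id, new_chunks, stored_keys):
    """Compare freshly extracted chunks with the keys already in the index.

    Returns (to_embed, to_delete, unchanged):
      to_embed  - dict of key -> text for chunks that must be embedded and uploaded
      to_delete - set of stored keys that no longer match any chunk
      unchanged - set of keys whose stored embeddings can be kept as they are
    """
    stored_keys = set(stored_keys)
    new_by_key = {chunk_key(doc_id, text): text for text in new_chunks}
    to_embed = {k: t for k, t in new_by_key.items() if k not in stored_keys}
    to_delete = stored_keys - set(new_by_key)
    unchanged = stored_keys & set(new_by_key)
    return to_embed, to_delete, unchanged


# Hypothetical 5-page document with one chunk per page; only page 2 changes.
old_pages = ["page 1 text", "page 2 text", "page 3 text", "page 4 text", "page 5 text"]
stored = {chunk_key("report", text) for text in old_pages}

new_pages = list(old_pages)
new_pages[1] = "page 2 text (revised)"

to_embed, to_delete, unchanged = plan_chunk_updates("report", new_pages, stored)
print(len(to_embed), len(to_delete), len(unchanged))  # 1 1 4
```

This only saves work when chunk boundaries are stable. If an edit shifts text across chunk boundaries, every chunk after the edit hashes differently and is re-embedded. Identical chunk texts within one document also collapse to a single key in this sketch.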

I am also expecting the same behavior for CSV document files.
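
For CSV files, a comparable sketch is to hash each row and compare the hashes per row identifier, so that only new or modified rows are re-embedded. This assumes the file has a column that uniquely identifies each row (called `id` here, which is an assumption). The function names `row_hashes` and `diff_rows` are hypothetical.

```python
import csv
import hashlib
import io


def row_hashes(csv_text, key_column="id"):
    """Map each row's key to a SHA-256 hash of its full content."""
    reader = csv.DictReader(io.StringIO(csv_text))
    hashes = {}
    for row in reader:
        # Sort the columns so the hash does not depend on column order.
        canonical = "|".join(f"{col}={row[col]}" for col in sorted(row))
        hashes[row[key_column]] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return hashes


def diff_rows(old_hashes, new_hashes):
    """Return (changed_or_added_keys, removed_keys), each sorted."""
    changed = sorted(k for k, h in new_hashes.items() if old_hashes.get(k) != h)
    removed = sorted(k for k in old_hashes if k not in new_hashes)
    return changed, removed


# Hypothetical data: row 2 is edited, row 3 is dropped, row 4 is new.
old_csv = "id,name,price\n1,apple,1.00\n2,pear,2.00\n3,plum,3.00\n"
new_csv = "id,name,price\n1,apple,1.00\n2,pear,2.50\n4,fig,4.00\n"

changed, removed = diff_rows(row_hashes(old_csv), row_hashes(new_csv))
print(changed, removed)  # ['2', '4'] ['3']
```

Rows listed in `changed` would be re-embedded and uploaded, and rows in `removed` would be deleted from the index. Row 1 is left untouched.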

Thanks in advance. I appreciate your efforts.
