How to Update Changes in a Vector Database for PDF Content?

Archana Chaudhary 0 Reputation points
2024-11-18T09:56:48.7833333+00:00

I have a PDF file whose content has already been embedded into vectors and stored in a vector database. Recently, there were some changes made to the PDF. I want to update the corresponding vectors in the vector database to reflect these changes.

What would be the best approach to efficiently update or replace the existing vectors in the database without causing inconsistencies? Are there any specific APIs, tools, or best practices available for this purpose when working with Azure services?

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

1 answer

  1. Archana Chaudhary 0 Reputation points
    2024-11-20T06:13:56.26+00:00

    Hi @Shree Hima Bindu Maganti
    Thanks for providing the solution approach.
    We had already implemented a similar approach, and it works, but it is somewhat time-consuming.

    I am looking for a solution that targets only the updated content of the PDF.
    Our document ingestion ("training") process works as follows:

    1. Pass the document files (PDFs) to Azure AI Document Intelligence to get the text chunks.
    2. Generate vector embeddings for the text chunks and document metadata using Azure OpenAI embeddings.
    3. Store these embeddings in the vector DB (Azure AI Search).

    This is a fairly standard ingestion process.
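
    The three steps above can be sketched roughly as follows. This is only an illustration: `extract_chunks` and `embed` are hypothetical stand-ins for the Azure AI Document Intelligence and Azure OpenAI calls, and the record fields (`id`, `doc_id`, `content`, `vector`) are assumed names, not a required schema.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def build_records(
    doc_id: str,
    extract_chunks: Callable[[str], List[str]],
    embed: Callable[[str], List[float]],
) -> List[Record]:
    """Run steps 1 and 2 and return records ready to be stored in step 3."""
    records: List[Record] = []
    for position, text in enumerate(extract_chunks(doc_id)):
        records.append({
            "id": f"{doc_id}-{position}",  # deterministic key per chunk (assumed scheme)
            "doc_id": doc_id,
            "content": text,
            "vector": embed(text),
        })
    return records


# Toy stand-ins so the sketch runs without any Azure dependency.
def fake_extract(doc_id: str) -> List[str]:
    return ["page one text", "page two text"]


def fake_embed(text: str) -> List[float]:
    return [float(len(text)), float(text.count(" "))]


records = build_records("manual", fake_extract, fake_embed)
```

    Giving every chunk a deterministic `id` is what makes it possible to overwrite a single chunk later instead of re-uploading the whole document.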
    For updated document files, I would like to replace the vector embeddings for the changed content only.
    For example:
    Suppose I have a 5-page PDF and I update the content of page 2, leaving all other pages unchanged. When the file is re-processed, I expect only the embeddings of the text chunks from the updated part to be regenerated and replaced, while the embeddings of all other chunks stay as they are, since their content has not changed.
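
    To make the expected behavior concrete, here is a minimal sketch of one possible way to detect which chunks changed: keep a content hash next to each stored chunk and compare it with the hash of the freshly extracted text. The helper names (`content_hash`, `plan_incremental_update`) and the chunk id scheme (`doc1-p2`) are hypothetical, and the sketch assumes one chunk per page.

```python
import hashlib
from typing import Dict, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text (whitespace-normalized)."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_incremental_update(
    stored: Dict[str, str],
    new_chunks: Dict[str, str],
) -> Tuple[List[str], List[str], List[str]]:
    """Compare stored hashes with freshly extracted chunks.

    stored:     chunk_id -> content hash already saved with the chunk
    new_chunks: chunk_id -> chunk text from re-processing the document
    Returns (to_embed, to_delete, unchanged) as sorted lists of chunk ids.
    """
    to_embed: List[str] = []
    unchanged: List[str] = []
    for chunk_id, text in new_chunks.items():
        if stored.get(chunk_id) == content_hash(text):
            unchanged.append(chunk_id)
        else:
            to_embed.append(chunk_id)  # new or modified chunk
    to_delete = [cid for cid in stored if cid not in new_chunks]
    return sorted(to_embed), sorted(to_delete), sorted(unchanged)


# Example: a 5-page PDF, one chunk per page, where only page 2 changed.
old_pages = {f"doc1-p{n}": f"text of page {n}" for n in range(1, 6)}
stored_hashes = {cid: content_hash(t) for cid, t in old_pages.items()}
new_pages = dict(old_pages, **{"doc1-p2": "revised text of page 2"})

to_embed, to_delete, unchanged = plan_incremental_update(stored_hashes, new_pages)
```

    With this plan, only the ids in `to_embed` would need new embeddings and an upsert (for example via `merge_or_upload_documents` on the `SearchClient` from the `azure-search-documents` Python SDK), and the ids in `to_delete` would be removed with `delete_documents`. One caveat of position-based ids: if an edit shifts chunk boundaries, later chunks will also look changed, so the savings depend on how stable the chunking is.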

    I would expect the same behavior for other document types, such as CSV files.

    Thanks in advance; I appreciate your efforts.
