How can I retrieve images in a chunked Azure AI Search index?

Nuno Rodrigues 20 Reputation points
2024-10-25T21:26:45.6433333+00:00

My issue is relatively simple.

I am using Azure AI Search to index a set of documents. These documents are mostly PDFs, but there are other formats as well. Most of them have embedded images. I know that in the document cracking phase these images are split from the document and stored in

/document/normalized_images/*/data

However, now I would like to store the base64 encoding of the image in the index. So I would like to get the chunk, its respective vector, and the base64 encoding of the respective images (if they exist) in the index. For that I need to be able to process the normalized images and connect them to the right chunk.

From what I can see, this is not easy. First, I expected the data field in

/document/normalized_images/*/data

to contain the base64 encoding, but it only holds the name of the file, like image3.jpg, and I don't see any encoding in the chunk itself either.

I have created a knowledge store to host the extracted images, but I don't get any encoding of them there. I only get these fields:

[Screenshot of the knowledge store fields]

None of these fields is an encoding of the image. ImageId is only an encoding of the name of the file.

Is there a better way to do this? I am willing to store the images in a knowledge store and retrieve them from there, as long as I can connect them in the index to the right chunk. The final objective is to retrieve meaningful text and images when querying the index.

Azure AI Search

2 answers

  1. Grmacjon-MSFT 18,471 Reputation points
    2024-10-26T01:24:59.1566667+00:00

    Hi @Nuno Rodrigues, what are the typical sizes and formats of the images you'll be indexing? Are there any specific image formats you need to support (e.g., JPEG, PNG, TIFF)?

    One way you can achieve your scenario is by creating a skillset with a custom skill that extracts the base64-encoded data for each image chunk during indexing. Here's a general breakdown:

    • Define a skillset that includes your custom skill.
    • Develop a custom skill that takes the image data for a chunk as input and outputs the base64-encoded data for that chunk (see the sketch after this list).
    • Associate the skillset with your Azure AI Search index during creation or update. This ensures the custom skill runs on each chunk during indexing.
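
    For illustration, here is a minimal sketch of such a custom Web API skill as a small Flask app. The values/recordId/data envelope is the documented custom-skill request/response contract; everything else is an assumption: it presumes the images are reachable by URL (for example via a knowledge-store file projection), and the field names imageUrl and base64Data are placeholders you would wire up through the skill's inputs/outputs in the skillset definition.

    ```python
    # Hypothetical custom Web API skill: fetches an image by URL and returns
    # its base64 encoding in the documented custom-skill response envelope.
    import base64

    import requests
    from flask import Flask, jsonify, request

    app = Flask(__name__)

    @app.route("/api/encode-image", methods=["POST"])
    def encode_image():
        results = []
        for record in request.get_json().get("values", []):
            try:
                url = record["data"]["imageUrl"]  # placeholder input name
                image_bytes = requests.get(url, timeout=30).content
                encoded = base64.b64encode(image_bytes).decode("ascii")
                results.append({
                    "recordId": record["recordId"],
                    "data": {"base64Data": encoded},  # placeholder output name
                    "errors": None,
                    "warnings": None,
                })
            except Exception as exc:
                # Per-record errors are reported back to the indexer.
                results.append({
                    "recordId": record.get("recordId"),
                    "data": {},
                    "errors": [{"message": str(exc)}],
                    "warnings": None,
                })
        return jsonify({"values": results})

    if __name__ == "__main__":
        app.run(port=7071)
    ```

    You would reference this endpoint from a WebApiSkill in the skillset, with its input mapped from the image source and its output projected into the index or knowledge store.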

    Another option is to leverage Azure AI Search's built-in vectorization capabilities. Here's a high-level process:

    • Configure your indexer to use integrated vectorization.
    • Consider using the built-in Text Split skill to split large documents by content boundaries before vectorization.
    • Choose an appropriate embedding model for image vectorization.
    • The indexing process then automatically chunks the content, extracts features using the embedding model, and indexes them for efficient retrieval (see the sketch after this list).
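
    As a sketch of the text side of that pipeline, the following creates a skillset via the REST API that pairs the built-in Text Split skill with the Azure OpenAI embedding skill, so each page/chunk gets its own vector. The service endpoint, API key, and Azure OpenAI resource/deployment names are placeholders, and authentication details (managed identity vs. apiKey) are omitted.

    ```python
    # Hypothetical skillset: split documents into pages, then embed each page.
    import requests

    SEARCH_SERVICE = "https://<your-service>.search.windows.net"  # placeholder
    API_KEY = "<admin-api-key>"                                   # placeholder

    skillset = {
        "name": "chunk-and-embed",
        "skills": [
            {
                "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
                "context": "/document",
                "textSplitMode": "pages",
                "maximumPageLength": 2000,
                "inputs": [{"name": "text", "source": "/document/content"}],
                "outputs": [{"name": "textItems", "targetName": "pages"}],
            },
            {
                "@odata.type": "#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill",
                "context": "/document/pages/*",
                "resourceUri": "https://<your-openai>.openai.azure.com",  # placeholder
                "deploymentId": "text-embedding-ada-002",                 # placeholder
                "modelName": "text-embedding-ada-002",
                "inputs": [{"name": "text", "source": "/document/pages/*"}],
                "outputs": [{"name": "embedding", "targetName": "vector"}],
            },
        ],
    }

    resp = requests.put(
        f"{SEARCH_SERVICE}/skillsets/{skillset['name']}?api-version=2024-07-01",
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        json=skillset,
    )
    resp.raise_for_status()
    ```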

    Best,

    -Grace


  2. Nuno Rodrigues 20 Reputation points
    2024-10-28T08:01:16.66+00:00

    I finally managed to accomplish this. The key is to split the mappings between the ones coming from the /document/pages source and the ones coming from the /document/normalized_images source. Once you create the separate mappings, you can populate each image/chunk object in the index however you want; a sketch is below.
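
    For anyone landing here, this is a minimal sketch of what splitting the mappings can look like using the skillset's indexProjections (REST API schema). One selector projects the text chunks from /document/pages/*, the other projects the images from /document/normalized_images/*, so each index document is populated from the right source context. The index and field names here (docs-index, parent_id, chunk, vector, image_data) are placeholders; substitute your own schema.

    ```python
    # Hypothetical index projections: one selector per source context, both
    # targeting the same index and sharing a parent key field.
    index_projections = {
        "selectors": [
            {
                # Text chunks and their vectors.
                "targetIndexName": "docs-index",       # placeholder
                "parentKeyFieldName": "parent_id",
                "sourceContext": "/document/pages/*",
                "mappings": [
                    {"name": "chunk", "source": "/document/pages/*"},
                    {"name": "vector", "source": "/document/pages/*/vector"},
                ],
            },
            {
                # Extracted images, with their base64 data.
                "targetIndexName": "docs-index",
                "parentKeyFieldName": "parent_id",
                "sourceContext": "/document/normalized_images/*",
                "mappings": [
                    {"name": "image_data",
                     "source": "/document/normalized_images/*/data"},
                ],
            },
        ],
        "parameters": {"projectionMode": "skipIndexingParentDocuments"},
    }
    ```

    Attach this under the skillset's indexProjections property when creating or updating the skillset, and the indexer will write the text chunks and the images as separate documents linked by the parent key.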

