How to count the distinct number of source documents in Azure AI Search Index

petermcnally 25 Reputation points
2024-12-03T14:46:54.9733333+00:00

Hello,

I would like to be able to count the number of source documents in my Index. Of course, the index shows the number of 'documents,' but this is really document chunks. I am using the UI in the portal to 'Import and Vectorize.' I see a discrepancy in the count between the number of documents in my Azure blob storage container and the number of documents processed by the Indexer.

I would like a way to be able to validate this with a distinct count of source documents in my Index. I have the metadata_storage_path exposed as a field. It seems pretty simple. I just want a distinct count. However, I cannot find a way to do this.

Thank you

Azure AI Search
An Azure search service with built-in artificial intelligence capabilities that enrich information to help identify and explore relevant content at scale.

2 answers

  1. petermcnally 25 Reputation points
    2024-12-12T17:31:20.4233333+00:00

    @Shree Hima Bindu Maganti, thanks again for another response. That didn't quite work, but it pointed me in the right direction. Below is the code block I used to get the unique count of the values in metadata_storage_path.

    Hopefully this can help others. The count of unique source documents in the Azure AI Search UI would be much more valuable than the count of document chunks that is currently there.

        import json

        # response_json is assumed to be the parsed JSON body of a search
        # request that facets on metadata_storage_path.

        # Write the response JSON to a file
        with open('response.json', 'w') as json_file:
            json.dump(response_json, json_file, indent=4)

        # Extract and count distinct values in metadata_storage_path
        facets = response_json.get('@search.facets', {})
        if 'metadata_storage_path' in facets:
            # Each facet bucket has a 'value' (the path) and a 'count' (chunks for that path)
            distinct_values = {item['value'] for item in facets['metadata_storage_path']}
            print("Distinct metadata_storage_path count:", len(distinct_values))
        else:
            print("metadata_storage_path facet not found in the response.")
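
The snippet above assumes response_json already holds the result of a faceted query. Here is a minimal sketch of how such a request might be built and counted, not a definitive implementation. The service name, index name, key, and API version are hypothetical placeholders, and the helper names are made up for illustration. It also assumes metadata_storage_path is marked facetable in the index. Facets return only 10 values by default, so the sketch sets count:N on the facet to raise that limit.

```python
import json
import urllib.request

# Hypothetical placeholders: replace with your own service, index, and key.
SERVICE = "my-search-service"
INDEX = "my-index"
API_KEY = "<query-key>"
API_VERSION = "2023-11-01"  # assumed; use the version your service supports


def build_facet_request(max_values=10000):
    """Return (url, body) for a search that asks only for facet buckets."""
    url = (
        f"https://{SERVICE}.search.windows.net/indexes/{INDEX}"
        f"/docs/search?api-version={API_VERSION}"
    )
    body = {
        "search": "*",
        "top": 0,  # no documents needed, only the facet buckets
        # Facets return 10 values by default; count:N raises that limit.
        "facets": [f"metadata_storage_path,count:{max_values}"],
    }
    return url, body


def count_distinct_paths(response_json):
    """Count distinct metadata_storage_path values among the facet buckets."""
    buckets = response_json.get("@search.facets", {}).get("metadata_storage_path", [])
    return len({bucket["value"] for bucket in buckets})


def fetch(url, body):
    """POST the query and return the parsed JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


if __name__ == "__main__":
    url, body = build_facet_request()
    response_json = fetch(url, body)
    print("Distinct metadata_storage_path count:", count_distinct_paths(response_json))
```

If the printed count equals max_values, the facet list was probably truncated and the limit should be raised before trusting the number.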
    

  2. Shree Hima Bindu Maganti 1,620 Reputation points Microsoft Vendor
    2024-12-12T17:46:22.5433333+00:00

    Hi petermcnally,

    Thank you for your response.

    I'm glad that you were able to resolve your issue, and thank you for posting your solution so that others experiencing the same thing can easily reference it. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others," I'll repost your solution in case you'd like to accept the answer.

    Issue: How to count the distinct number of source documents in Azure AI Search Index

    I would like to be able to count the number of source documents in my Index. Of course, the index shows the number of 'documents,' but this is really document chunks. I am using the UI in the portal to 'Import and Vectorize.' I see a discrepancy in the count between the number of documents in my Azure blob storage container and the number of documents processed by the Indexer.

    I would like a way to be able to validate this with a distinct count of source documents in my Index. I have the metadata_storage_path exposed as a field. It seems pretty simple. I just want a distinct count. However, I cannot find a way to do this.

    Solution: Below is the code block used to get the unique count of the values in metadata_storage_path. The count of unique source documents in the Azure AI Search UI would be much more valuable than the count of document chunks that is currently there.

        import json

        # response_json is assumed to be the parsed JSON body of a search
        # request that facets on metadata_storage_path.

        # Write the response JSON to a file
        with open('response.json', 'w') as json_file:
            json.dump(response_json, json_file, indent=4)

        # Extract and count distinct values in metadata_storage_path
        facets = response_json.get('@search.facets', {})
        if 'metadata_storage_path' in facets:
            # Each facet bucket has a 'value' (the path) and a 'count' (chunks for that path)
            distinct_values = {item['value'] for item in facets['metadata_storage_path']}
            print("Distinct metadata_storage_path count:", len(distinct_values))
        else:
            print("metadata_storage_path facet not found in the response.")
    

    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful".

