Getting the size of parquet files from Azure Blob Storage

Keerthana J 66 Reputation points
2024-07-09T11:57:50.94+00:00

I have a blob container abcd

The folder structure is like below:

abcd/Folder1/Folder a, Folder b, …, Folder z

Inside a particular folder: Folder a/v1/full/20230505/part12344.parquet

Similarly: Folder b/v1/full/20230505/part9385795.parquet

The scenario is that I need to get the size of each parquet file present in each of the folders a to z. I can see that there is no longer a "get data size" option in the Get Metadata activity in ADF. What else can be done here using ADF or ADB code?

Tags: Azure Data Lake Storage, Azure Blob Storage, Azure Databricks, Azure Data Factory

2 answers

  1. Amrinder Singh 5,155 Reputation points Microsoft Employee
    2024-07-09T13:07:13.4+00:00

    Hi KEERTHANA JAYADEVAN - Thanks for reaching out.

    I would recommend enabling a blob inventory report for this scenario. You need to ensure that the "Content-Length" field is included in the report.

    https://learn.microsoft.com/en-us/azure/storage/blobs/blob-inventory

    You can then leverage Synapse or Databricks to parse the report and extract the required details.

    https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-inventory-report-analytics

    https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blob-calculate-container-statistics-databricks
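
    For illustration, here is a minimal PySpark sketch of parsing a Parquet-format inventory report in Databricks. The inventory output path is an assumption (it depends on the destination container and rule name you configure); the Name and Content-Length fields are part of the standard inventory schema:

        # Minimal sketch: read a Parquet-format blob inventory report and
        # list the size of every .parquet data file it describes.
        # The path below is an assumption; substitute your inventory
        # destination container, storage account, and rule name.
        from pyspark.sql.functions import col

        inventory_path = (
            "abfss://inventory@<storageaccount>.dfs.core.windows.net/"
            "2024/07/09/*/<ruleName>/*.parquet"
        )

        df = spark.read.parquet(inventory_path)  # `spark` is the Databricks session

        # Keep only the parquet data files and report their sizes in bytes.
        (df.filter(col("Name").endswith(".parquet"))
           .select("Name", "Content-Length")
           .orderBy(col("Content-Length").desc())
           .show(truncate=False))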

    Hope that helps!

    Let me know if there are any further queries/concerns, will be glad to assist.


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you, as this can be beneficial to other community members.


  2. Nehruji R 8,146 Reputation points Microsoft Vendor
    2024-07-10T10:04:36.8133333+00:00

    Hello KEERTHANA JAYADEVAN,

    Greetings! Welcome to Microsoft Q&A Platform.

    As noted above, to get the size of each parquet file present in each of the folders a to z, you can use either Azure Data Factory (ADF) or Azure Databricks (ADB).

    On using Azure Data Factory (ADF):

    1. Get Metadata activity: configure it to list the files in each folder.
    2. ForEach activity: use the output of the Get Metadata activity to iterate over each file.
    3. Get Metadata activity: inside the ForEach activity, use another Get Metadata activity to get the size of each file.
    4. Store results: use a Copy Data activity or another appropriate activity to store the file sizes.

    This approach should help you get the size of each Parquet file in your blob container; a rough sketch of the key settings follows below.
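
    As an illustrative sketch of the key settings (the activity names and the dataset path are placeholders, not from the original question):

        ListFiles (Get Metadata):  field list = ["childItems"]; dataset points at the folder, e.g. Folder a/v1/full/20230505
        ForEach:                   items = @activity('ListFiles').output.childItems
        FileSize (Get Metadata):   field list = ["itemName", "size"]; dataset file name = @item().name

    Note that the size field is only returned when the dataset points at an individual file, not a folder, so the inner Get Metadata must receive the file name from @item().name.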

    When using Azure Databricks (ADB), you can use PySpark to list the files and get the size of each Parquet file, in the same way as above; see the sketch below.
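
    For example, here is a minimal sketch for a Databricks notebook (the abfss URL is an assumption; substitute your own storage account and access configuration):

        # Minimal sketch: recursively walk the container with dbutils and
        # collect the size of every .parquet file. `dbutils` is available
        # in Databricks notebooks; the root path below is an assumption.
        root = "abfss://abcd@<storageaccount>.dfs.core.windows.net/Folder1"

        def parquet_sizes(path):
            # Yield (path, size in bytes) for every .parquet file under `path`.
            for entry in dbutils.fs.ls(path):
                if entry.isDir():
                    yield from parquet_sizes(entry.path)
                elif entry.path.endswith(".parquet"):
                    yield entry.path, entry.size

        for file_path, size in parquet_sizes(root):
            print(f"{file_path}\t{size} bytes")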

    If you need to calculate the size/capacity of a storage account and its services (Blob/Table), see: How to get the total size allocated to a Storage account and for types like queues, tables, blobs, and files.

    The following article uses the Azure Blob Storage inventory feature and Azure Synapse to calculate the blob count and total size of blobs per container (values that are useful when optimizing blob usage per container): Calculate blob count and total size per container using Azure Storage inventory.

    You can also use the Azure CLI. For example (a sketch only; substitute your own account, container, and prefix, and note that the exact query shown is an illustration):

        az storage blob list \
            --account-name <account> \
            --container-name abcd \
            --prefix "Folder1/" \
            --query "[?ends_with(name, '.parquet')].{name:name, size:properties.contentLength}" \
            --output table

    For more details, see: Get report of file sizes from Azure Blob Storage and How to get Azure Blob file size.

    You can use Azure Storage Analytics to identify the largest files in your Blob storage. Storage Analytics provides detailed metrics and logs that you can use to monitor and troubleshoot your storage account. Here are the steps to enable Storage Analytics and view the metrics:

    1. Enable Storage Analytics: in the Azure portal, navigate to your storage account. Select the "Monitoring" tab, then select "Storage Analytics". Click "Add policy" to create a new policy, choose the metrics and logs you want to collect, and click "Save".
    2. View the metrics: in the Azure portal, navigate to your storage account. Select the "Monitoring" tab, then select "Metrics". Choose the metrics you want to view and the time range. You can view metrics for the entire storage account or for individual containers.
    3. Identify the largest files: in the metrics view, you can see the total size of your Blob storage and the number of blobs in each container, which helps identify the containers using the most storage. To identify the largest files within a container, use a tool like Azure Storage Explorer or the Azure CLI to sort the blobs by size, or a short script as sketched below.
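
    For instance, a minimal Python sketch using the azure-storage-blob SDK (the connection string, container name, and prefix are placeholders):

        # Minimal sketch: list blobs under a prefix and print the largest
        # .parquet files first. Requires `pip install azure-storage-blob`.
        # Connection string, container, and prefix are placeholders.
        from azure.storage.blob import ContainerClient

        container = ContainerClient.from_connection_string(
            conn_str="<connection-string>",
            container_name="abcd",
        )

        blobs = [
            (b.name, b.size)
            for b in container.list_blobs(name_starts_with="Folder1/")
            if b.name.endswith(".parquet")
        ]

        for name, size in sorted(blobs, key=lambda x: x[1], reverse=True):
            print(f"{size:>12} bytes  {name}")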

    Hope this answer helps! Please let us know if you have any further queries. I’m happy to assist you further.


    Please "Accept the answer” and “up-vote” wherever the information provided helps you, this can be beneficial to other community members.

