How to read a large (50GB+) file in an Azure Function app

PS 0 Reputation points
2025-01-21T17:39:25.39+00:00

I have several 50GB+ files in Azure Blob Storage and Azure File share, and I would like to read them from an Azure Function. What is the best approach for performance? The files are read-only; there are no write operations. To get the highest performance, do I need the file share mounted so Python can read it quickly, or is reading over the network fast enough?


2 answers

Sort by: Most helpful
  1. hossein jalilian 9,700 Reputation points
    2025-01-21T17:47:55.6633333+00:00

    Thanks for posting your question in the Microsoft Q&A forum.

    If using Azure Blob Storage:

    • Use the Azure Blob Storage SDK for efficient blob access.
    • Implement parallel reading of blob segments to improve performance (see the sketch after this list).
    • When reading from an Azure File Share, consider a buffer size of 16 MiB, which has shown good performance for large files.
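
    As a minimal sketch of parallel segment reads, assuming the `azure-storage-blob` package, a connection string in an `AZURE_STORAGE_CONNECTION_STRING` environment variable, and hypothetical container/blob names:

    ```python
    import os
    from concurrent.futures import ThreadPoolExecutor

    from azure.storage.blob import BlobClient

    CHUNK_SIZE = 16 * 1024 * 1024  # 16 MiB per ranged read

    blob = BlobClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],  # assumed env var
        container_name="data",       # hypothetical container
        blob_name="big-file.bin",    # hypothetical blob
    )
    size = blob.get_blob_properties().size

    def read_range(offset: int) -> bytes:
        # Ranged download: only this segment crosses the network.
        length = min(CHUNK_SIZE, size - offset)
        return blob.download_blob(offset=offset, length=length).readall()

    total = 0
    with ThreadPoolExecutor(max_workers=8) as pool:
        for segment in pool.map(read_range, range(0, size, CHUNK_SIZE)):
            total += len(segment)  # replace with your read-only processing
    print(f"read {total} bytes")
    ```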

    Use Event Grid triggers instead of Blob triggers for better scalability with large files, and consider preprocessing large files into smaller, more manageable chunks before analysis. If possible, use a Premium plan or Dedicated (App Service) plan for Azure Functions to get more powerful hardware and longer execution times.
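
    For illustration, a minimal Event Grid-triggered function using the Python v2 programming model (the function name and the processing step are hypothetical):

    ```python
    import logging

    import azure.functions as func

    app = func.FunctionApp()

    @app.event_grid_trigger(arg_name="event")
    def on_blob_created(event: func.EventGridEvent):
        # For a Microsoft.Storage.BlobCreated event, the payload carries the blob URL.
        data = event.get_json()
        logging.info("Blob created: %s", data.get("url"))
        # Start chunked/streamed processing of the blob from here.
    ```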


    Please don't forget to close out the thread here by upvoting and accepting this as an answer if it is helpful.


  2. Keshavulu Dasari 3,095 Reputation points Microsoft Vendor
    2025-01-21T18:32:00.4966667+00:00

    Hi PS,

    Greetings & Welcome to Microsoft Q&A forum! Thanks for posting your query!

    Adding more information to the above response!

    To get the best performance when reading large files (50GB+) from Azure Blob Storage and Azure File Share with Azure Functions:

    For Azure Blob Storage:

    Instead of downloading the entire file, use streaming to read the data in chunks. This reduces memory consumption and improves performance. In the Azure Storage Blob SDK for Python, `BlobClient.download_blob()` returns a `StorageStreamDownloader` that you can consume chunk by chunk. Adjust the chunk size to balance memory usage against performance: larger chunks reduce the number of network calls but increase memory usage.
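
    A sketch of chunked streaming (the connection string, container, and blob names are assumptions; `max_chunk_get_size` tunes the download chunk size):

    ```python
    import os

    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],  # assumed env var
        container_name="data",                 # hypothetical container
        blob_name="big-file.bin",              # hypothetical blob
        max_chunk_get_size=16 * 1024 * 1024,   # chunk size vs. memory trade-off
    )

    # download_blob() returns a StorageStreamDownloader; chunks() yields the
    # blob piece by piece, so a 50GB file never has to fit in memory.
    bytes_seen = 0
    for chunk in blob.download_blob().chunks():
        bytes_seen += len(chunk)  # replace with your read-only processing
    print(f"streamed {bytes_seen} bytes")
    ```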

    Use parallelism to download multiple chunks concurrently; this can significantly speed up reading. Keep your Azure Function and the storage account in the same region to minimize latency and make the most of the available network bandwidth, and implement retry policies with exponential backoff to handle transient errors and throttling (see the sketch below).
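
    A sketch combining both points, assuming the same hypothetical blob as above: `max_concurrency` lets the SDK fetch ranges of the blob in parallel, and `ExponentialRetry` (exported by `azure-storage-blob`) configures the backoff; the specific numbers are assumptions to tune:

    ```python
    import os

    from azure.storage.blob import BlobClient, ExponentialRetry

    blob = BlobClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"],  # assumed env var
        container_name="data",        # hypothetical container
        blob_name="big-file.bin",     # hypothetical blob
        # Exponential backoff for transient errors and throttling (tune values).
        retry_policy=ExponentialRetry(initial_backoff=2, increment_base=2, retry_total=5),
    )

    # max_concurrency downloads multiple ranges of the blob in parallel;
    # readinto() streams them into any writable binary stream.
    with open("/tmp/big-file.bin", "wb") as out:  # illustrative sink
        blob.download_blob(max_concurrency=8).readinto(out)
    ```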

    For Azure File Share:

    Choose the right performance tier: use Premium file shares for high-IOPS, low-latency workloads. Premium shares are backed by SSDs and offer better performance than standard shares. Mount the Azure File Share to your Azure Function over SMB or NFS so that your Python code can read files directly from the mounted share (a sketch follows).
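
    A minimal sketch of reading from a mounted share, assuming it has already been mounted on the Function App (for example with `az webapp config storage-account add`); the mount path and file name are hypothetical:

    ```python
    BUF_SIZE = 16 * 1024 * 1024  # 16 MiB reads, per the buffer-size suggestion above

    total = 0
    # /mounts/files is a hypothetical mount path configured on the Function App.
    with open("/mounts/files/big-file.bin", "rb", buffering=BUF_SIZE) as f:
        while block := f.read(BUF_SIZE):
            total += len(block)  # replace with your read-only processing
    print(f"read {total} bytes from the mounted share")
    ```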

    Adjust the I/O size and queue depth to match your workload requirements. Larger I/O sizes can improve throughput, while an appropriate queue depth can handle more concurrent requests.
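
    From Python, one way to approximate a larger I/O size and queue depth is to issue several large reads concurrently at different offsets (same hypothetical mount path as above):

    ```python
    import os
    from concurrent.futures import ThreadPoolExecutor

    PATH = "/mounts/files/big-file.bin"   # hypothetical mounted file
    IO_SIZE = 16 * 1024 * 1024            # large I/O size for throughput
    QUEUE_DEPTH = 8                       # concurrent outstanding reads

    size = os.path.getsize(PATH)

    def read_at(offset: int) -> int:
        # Each worker opens its own handle so seek positions don't interfere.
        with open(PATH, "rb") as f:
            f.seek(offset)
            return len(f.read(IO_SIZE))

    with ThreadPoolExecutor(max_workers=QUEUE_DEPTH) as pool:
        total = sum(pool.map(read_at, range(0, size, IO_SIZE)))
    print(f"read {total} bytes with queue depth {QUEUE_DEPTH}")
    ```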

    Continuously monitor the performance of your file share and log any issues. This helps in identifying bottlenecks and optimizing performance.

    For more information:
    https://learn.microsoft.com/en-us/azure/storage/blobs/storage-performance-checklist
    https://learn.microsoft.com/en-us/azure/storage/files/understand-performance


    Please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

    If you have any other questions or are still running into issues, let me know in the comments and I would be happy to help you.

