Hi Arianne Chung,
Welcome to the Microsoft Q&A platform, and thanks for posting your query here.
Here are some alternative approaches to capture the size of your data lake more efficiently:
Parallel Processing:
- Horizontal Scaling: Distribute the workload across multiple nodes so that different parts of the data lake are processed simultaneously. Tools like Apache Spark can help with this (see the PySpark sketch below).
- Serverless Architectures: Use serverless compute services such as Azure Functions to run multiple instances of your script in parallel.
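If it helps, here is a minimal PySpark sketch of the horizontal-scaling idea: the driver hands each top-level folder to an executor task, and each task sums the file sizes under its folder with the azure-storage-file-datalake SDK. The account URL, container, SAS token, and folder names below are placeholders, and the SDK would need to be installed on the cluster workers.

```python
# Minimal sketch: distribute the size calculation across Spark executors.
from pyspark.sql import SparkSession
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
FILESYSTEM = "<container>"                                      # placeholder
SAS_TOKEN = "<sas-token>"                                       # placeholder

def folder_size_bytes(folder: str) -> int:
    """Sum the size of every file under one folder (runs on an executor)."""
    client = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=SAS_TOKEN)
    fs = client.get_file_system_client(FILESYSTEM)
    return sum(p.content_length or 0
               for p in fs.get_paths(path=folder, recursive=True)
               if not p.is_directory)

spark = SparkSession.builder.getOrCreate()
top_level_folders = ["raw/2024", "raw/2025", "curated"]  # placeholders

# Each executor task scans one folder, so the folders are listed in parallel.
total_bytes = (spark.sparkContext
               .parallelize(top_level_folders, len(top_level_folders))
               .map(folder_size_bytes)
               .sum())
print(f"Approximate lake size: {total_bytes / 1024**3:.2f} GiB")
```

The key design point is that only file metadata is listed; no file content is read, so the cost is driven by the number of files rather than the amount of data.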
Optimized Data Storage:
- Partitioning: Organize your data into partitions based on criteria such as date or region, so each script instance only needs to scan the partitions it is responsible for (see the partition-scoped sketch below).
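As an illustration, a partition-scoped size check only lists the folder for one partition instead of the whole lake. This is a rough sketch; the storage account, container, and the `date=YYYY-MM-DD` folder layout are assumptions you would adapt to your own naming.

```python
# Minimal sketch: measure a single partition folder instead of the full lake.
from azure.storage.filedatalake import DataLakeServiceClient

client = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder
    credential="<sas-token>")                                      # placeholder
fs = client.get_file_system_client("<container>")                  # placeholder

partition = "sales/date=2024-06-01"  # hypothetical partition folder
size = sum(p.content_length or 0
           for p in fs.get_paths(path=partition, recursive=True)
           if not p.is_directory)
print(f"{partition}: {size / 1024**2:.1f} MiB")
```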
Incremental Updates:
- Instead of scanning the entire data lake each time, track changes and only process new or modified data. This can be achieved with tools like Apache Hudi or Delta Lake, which maintain their own file-level metadata (see the Delta Lake sketch below).
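For example, if the data already lives in a Delta table, the table's own metadata can report its size without listing storage at all. This is a minimal sketch; the abfss table path is a placeholder.

```python
# Minimal sketch: read size and file count from Delta table metadata.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
detail = spark.sql(
    "DESCRIBE DETAIL delta.`abfss://<container>@<account>.dfs.core.windows.net/tables/<table>`"
).select("numFiles", "sizeInBytes").first()

print(f"files: {detail['numFiles']}, "
      f"size: {detail['sizeInBytes'] / 1024**3:.2f} GiB")
```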
Performance Monitoring:
- Regularly monitor and optimize the performance of your data lake operations, including identifying and addressing bottlenecks in your current script.
For more guidance, see: https://learn.microsoft.com/en-us/azure/databricks/lakehouse-architecture/performance-efficiency/best-practices
Implementing these strategies can significantly reduce the time required to capture the size of your data lake.
Please do not forget to "Accept the answer" and "Up-vote" wherever the information provided helps you, as this can be beneficial to other community members.
If you have any other questions or are still running into issues, let me know in the comments and I would be happy to help.