Best storage option for multiple csv files on blob storage to be consumed in a ML training job

Question

Hello, I have a question about the best storage option for multiple CSV files that are created daily and will later be consumed in an Azure Machine Learning training job.

I am looking into two options. The first is to create an MLTable data asset by specifying, in the YAML file, the folder where the CSV files are stored. The second is to periodically append the CSV files to a single consolidated store (the easiest might be one CSV file).

My concern with the first approach is whether loading multiple CSV files through the MLTable is computationally intensive.

For the second option, appending the CSV files to one storage location, which file format would handle large data while still being read quickly from the AML pipeline?

Thanks in advance.

Answer

Hi @EL Jawad, Mohammad,

Thank you for reaching out to Microsoft Q&A forum!

For handling multiple CSV files in Azure Machine Learning (AML), a URI folder (uri_folder) is the recommended approach rather than an MLTable data asset. A URI folder points directly to a folder in an Azure Blob container, allowing AML to process multiple files efficiently without manual merging.

Azure ML’s data runtime also supports multi-process (parallel) loading and background data prefetching, which reduces computational overhead and improves performance with minimal configuration on your end.

While a single-file reference (uri_file) suits one file, a uri_folder is more scalable when dealing with multiple CSVs. You can find more details in the Azure Machine Learning documentation under Data concepts in Azure Machine Learning.
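
As a rough illustration of the consuming side: when a folder input is mounted or downloaded for a job, the training script receives an ordinary local folder path. The sketch below, using only the Python standard library, walks such a folder and streams rows from every CSV file it finds. The function name iter_csv_rows is hypothetical, and the sketch assumes every file is UTF-8 and carries its own header row; it is not the method AML itself uses internally.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator


def iter_csv_rows(folder: str) -> Iterator[Dict[str, str]]:
    """Yield rows (as dicts) from every .csv file under `folder`, in sorted path order.

    Assumption: each file is UTF-8 encoded and starts with a header row.
    """
    for csv_path in sorted(Path(folder).rglob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # DictReader uses each file's own header row as the dict keys.
            yield from csv.DictReader(handle)
```

In a training script, the folder path would typically arrive as a command-line argument supplied by the job, and the rows could be fed into whatever loader the model uses. Because the function is a generator, the daily files are read one at a time rather than being merged in memory first.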

I hope this helps. Do let us know if you have any further queries.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".
