How to improve speed of data transfer of RO_MOUNT Data asset in Azure ML Job?

Rahul Kurian Jacob 20 Reputation points
2024-09-03T11:52:26.11+00:00

Is there a reason for this slow transfer speed during the job run? The slowdown causes GPU utilization to drop from 85-90% (when the files are on local disk during test runs) to 12-15% during actual training, when the complete dataset is read through RO_MOUNT.

Context:

I am invoking a PyTorch training job that takes its data from an RO_MOUNT data asset pointing to a blob in Azure Data Lake Storage. I am using Standard_NC4as_T4_v3 (4 cores, 28 GB RAM, 176 GB disk) and have confirmed, by mounting this VM as a (non-cluster) compute for an Azure ML Notebook, that azcopy can reach ~500 MB/s (half of the expected network bandwidth listed in the docs).

But during the job run, the transfer rate starts at 60 MB/s and steadily drops to an average of 15 MB/s in the monitoring view (see the attached monitoring screenshot).

Increasing the number of workers in the DataLoader causes an error due to lack of /dev/shm space:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
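
For reference, the data-loading side of the training script looks roughly like this (the dataset path and batch size are placeholders, not my exact code):

    import os
    from torch.utils.data import DataLoader
    from torchvision.datasets import ImageFolder
    from torchvision.transforms import ToTensor

    data_dir = "/path/to/ro_mount/data"  # placeholder; the actual path comes from the job's mounted input

    dataset = ImageFolder(data_dir, transform=ToTensor())
    loader = DataLoader(
        dataset,
        batch_size=64,                 # placeholder
        num_workers=os.cpu_count(),    # raising this is what triggers the /dev/shm error above
        pin_memory=True,
    )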

Is there a solution to this issue?

P.S. The increase in RAM usage from shm filling up is not reflected in the "CPU Memory Usage" chart in the Monitoring tab, but that is a separate issue.


Accepted answer
  1. santoshkc 8,100 Reputation points Microsoft Vendor
    2024-09-16T11:45:30.8133333+00:00

    Hi @Rahul Kurian Jacob,

    I'm glad to hear that your issue has been resolved, and thank you for sharing the details, which may benefit other community members reading this thread. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'm reposting your response as an answer so you can accept it. This will help other users with a similar query find the solution more easily.

    Query: How to improve speed of data transfer of RO_MOUNT Data asset in Azure ML Job?

    Solution: The issue is resolved. I don't know exactly why, but the problem described in the previous comment was fixed by setting pin_memory in the DataLoader to False.

    So, a TL;DR for this problem is as follows:

    1. In the command function of the v2 SDK, set the shm_size parameter to roughly half of the VM's RAM.
    2. Increase num_workers in the DataLoader to the CPU core count or CPU core count - 1.
    3. Disable pin_memory when using larger VMs (CPU core count > 4).
    4. If the data is too large to fit on the VM's local disk, set the data asset's mode parameter to ro_mount and, in the command function, set the environment_variables parameter to the following dictionary (a fuller job sketch follows after the source link below):

    Python

    # parameter in `command` function 
    environment_variables=dict( 
    	DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED=True, # enable block-based caching 
    	DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED=False, # disable caching on disk 
    	DATASET_MOUNT_MEMORY_CACHE_SIZE=0, # disabling in-memory caching 
    )
    

    This GitHub link was my source for most of these parameter settings: best-practices ViT-Pretrain
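
    Putting items 1 and 4 together, a minimal sketch of the job definition might look like the following (the compute, environment, and data asset names are placeholders, not the values from the original job):

    Python

    # Minimal sketch of the v2 SDK job submission (placeholder names throughout).
    from azure.ai.ml import MLClient, command, Input
    from azure.ai.ml.constants import AssetTypes, InputOutputModes
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    job = command(
        code="./src",                                # folder containing train.py (placeholder)
        command="python train.py --data ${{inputs.training_data}}",
        inputs={
            "training_data": Input(
                type=AssetTypes.URI_FOLDER,
                path="azureml:my-data-asset:1",      # placeholder data asset
                mode=InputOutputModes.RO_MOUNT,      # item 4: keep the data as a read-only mount
            )
        },
        environment="azureml:my-training-env:1",     # placeholder environment
        compute="my-t4-cluster",                     # placeholder compute target
        shm_size="14g",                              # item 1: ~1/2 of the 28 GB RAM on Standard_NC4as_T4_v3
        environment_variables=dict(
            DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED=True,   # enable block-based caching
            DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED=False,   # disable caching on disk
            DATASET_MOUNT_MEMORY_CACHE_SIZE=0,              # disable in-memory caching
        ),
    )

    ml_client.jobs.create_or_update(job)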

    If you have any further questions or concerns, please don't hesitate to ask. We're always here to help.


    Please click Accept Answer and Yes if this answer was helpful.


1 additional answer

  1. Rahul Kurian Jacob 20 Reputation points
    2024-09-16T03:55:47.1333333+00:00

    Hello @santoshkc ,

    The issue is resolved. I don't know exactly why, but the problem mentioned in my previous comment was fixed by setting pin_memory in the DataLoader to False.

    So, a TL;DR for this problem is as follows:

    1. In the command function of the v2 SDK, set the shm_size parameter to roughly half of the VM's RAM.
    2. Increase num_workers in the DataLoader to the CPU core count or CPU core count - 1.
    3. Disable pin_memory when using larger VMs (CPU core count > 4); a short DataLoader sketch illustrating items 2 and 3 follows after the source link below.
    4. If the data is too large to fit on the VM's local disk, set the data asset's mode parameter to ro_mount and, in the command function, set the environment_variables parameter to the following dictionary:
    # parameter in `command` function
    environment_variables=dict(
    	DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED=True,  # enable block-based caching
    	DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED=False,  # disable caching on disk
    	DATASET_MOUNT_MEMORY_CACHE_SIZE=0,  # disabling in-memory caching
    )
    

    This GitHub link was my source for most of these parameter settings: best-practices ViT-Pretrain
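
    For items 2 and 3, the DataLoader settings on the training-script side end up looking roughly like this (the dataset and batch size are stand-ins, not my exact code):

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; in the real job this is built from the mounted files.
    dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))

    loader = DataLoader(
        dataset,
        batch_size=64,                            # placeholder
        num_workers=max(os.cpu_count() - 1, 1),   # item 2: roughly CPU core count - 1
        pin_memory=False,                         # item 3: disabling this resolved the slowdown
    )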

