How to improve speed of data transfer of RO_MOUNT Data asset in Azure ML Job?

Rahul Kurian Jacob 20 Reputation points
2024-09-03T11:52:26.11+00:00

Is there a reason for this slow transfer speed during the job run? The slowdown causes GPU utilization to drop from 85-90% (when the files are on local disk during test runs) to 12-15% during actual training, when the complete dataset is read through RO_MOUNT.

Context:

I am invoking a PyTorch training job that takes its data from an RO_MOUNT data asset pointing to a blob in Azure Data Lake Storage. I am using Standard_NC4as_T4_v3 (4 cores, 28 GB RAM, 176 GB disk) and have confirmed, by mounting this VM as a (non-cluster) compute for an Azure ML Notebook, that azcopy can reach ~500 MB/s (half of the expected network bandwidth listed in the docs).

But during the job run, the transfer rate starts at 60 MB/s and steadily drops to an average of 15 MB/s in the monitoring view (see the attached monitoring screenshot).

Increasing the number of workers in the DataLoader causes an error due to lack of /dev/shm space:

ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
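
For reference, the data-loading side of the training script looks roughly like this (the dataset path and batch size are placeholders, not my exact code):

    import os
    from torch.utils.data import DataLoader
    from torchvision.datasets import ImageFolder
    from torchvision.transforms import ToTensor

    data_dir = "/path/to/ro_mount/data"  # placeholder; the actual path comes from the job's mounted input

    dataset = ImageFolder(data_dir, transform=ToTensor())
    loader = DataLoader(
        dataset,
        batch_size=64,                 # placeholder
        num_workers=os.cpu_count(),    # raising this is what triggers the /dev/shm error above
        pin_memory=True,
    )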

Is there a solution to this issue?

P.S. The increase in RAM usage from shm filling up is not reflected in the "CPU Memory Usage" chart in the Monitoring tab, but that is a separate issue.


Accepted answer
  1. santoshkc 8,100 Reputation points Microsoft Vendor
    2024-09-16T11:45:30.8133333+00:00

    Hi @Rahul Kurian Jacob,

    I'm glad to hear that your issue has been resolved, and thank you for sharing the details, which may benefit other community members reading this thread. Since the Microsoft Q&A community has a policy that "The question author cannot accept their own answer. They can only accept answers by others", I'm reposting your response as an answer so you can accept it. This will help other users with a similar query find the solution more easily.

    Query: How to improve speed of data transfer of RO_MOUNT Data asset in Azure ML Job?

    Solution: The issue is resolved. I don't know exactly why, but the problem described in the previous comment was fixed by setting pin_memory in the DataLoader to False.

    So, a TL;DR for this problem is as follows:

    1. In the command function of the v2 SDK, set the shm_size parameter to roughly half of the VM's RAM.
    2. Increase num_workers in the DataLoader to the CPU core count or CPU core count - 1.
    3. Disable pin_memory when using larger VMs (CPU core count > 4).
    4. If the data is too large to fit on the VM's local disk, set the data asset's mode parameter to ro_mount and, in the command function, set the environment_variables parameter to the following dictionary (a fuller job sketch follows after the source link below):

    Python

    # parameter in `command` function 
    environment_variables=dict( 
    	DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED=True, # enable block-based caching 
    	DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED=False, # disable caching on disk 
    	DATASET_MOUNT_MEMORY_CACHE_SIZE=0, # disabling in-memory caching 
    )
    

    This GitHub link was my source for most of these parameter settings: best-practices ViT-Pretrain
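
    Putting items 1 and 4 together, a minimal sketch of the job definition might look like the following (the compute, environment, and data asset names are placeholders, not the values from the original job):

    Python

    # Minimal sketch of the v2 SDK job submission (placeholder names throughout).
    from azure.ai.ml import MLClient, command, Input
    from azure.ai.ml.constants import AssetTypes, InputOutputModes
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    job = command(
        code="./src",                                # folder containing train.py (placeholder)
        command="python train.py --data ${{inputs.training_data}}",
        inputs={
            "training_data": Input(
                type=AssetTypes.URI_FOLDER,
                path="azureml:my-data-asset:1",      # placeholder data asset
                mode=InputOutputModes.RO_MOUNT,      # item 4: keep the data as a read-only mount
            )
        },
        environment="azureml:my-training-env:1",     # placeholder environment
        compute="my-t4-cluster",                     # placeholder compute target
        shm_size="14g",                              # item 1: ~1/2 of the 28 GB RAM on Standard_NC4as_T4_v3
        environment_variables=dict(
            DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED=True,   # enable block-based caching
            DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED=False,   # disable caching on disk
            DATASET_MOUNT_MEMORY_CACHE_SIZE=0,              # disable in-memory caching
        ),
    )

    ml_client.jobs.create_or_update(job)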

    If you have any further questions or concerns, please don't hesitate to ask. We're always here to help.


    Please click Accept Answer and Yes if this answer was helpful.


1 additional answer

  1. Rahul Kurian Jacob 20 Reputation points
    2024-09-16T03:55:47.1333333+00:00

    Hello @santoshkc ,

    The issue is resolved. I don't know exactly why, but the problem mentioned in my previous comment was fixed by setting pin_memory in the DataLoader to False.

    So, a TL;DR for this problem is as follows:

    1. In the command function of the v2 SDK, set the shm_size parameter to roughly half of the VM's RAM.
    2. Increase num_workers in the DataLoader to the CPU core count or CPU core count - 1.
    3. Disable pin_memory when using larger VMs (CPU core count > 4); a short DataLoader sketch illustrating items 2 and 3 follows after the source link below.
    4. If the data is too large to fit on the VM's local disk, set the data asset's mode parameter to ro_mount and, in the command function, set the environment_variables parameter to the following dictionary:
    # parameter in `command` function
    environment_variables=dict(
    	DATASET_MOUNT_BLOCK_BASED_CACHE_ENABLED=True,  # enable block-based caching
    	DATASET_MOUNT_BLOCK_FILE_CACHE_ENABLED=False,  # disable caching on disk
    	DATASET_MOUNT_MEMORY_CACHE_SIZE=0,  # disabling in-memory caching
    )
    

    This GitHub link was my source for most of these parameter settings: best-practices ViT-Pretrain
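
    For items 2 and 3, the DataLoader settings on the training-script side end up looking roughly like this (the dataset and batch size are stand-ins, not my exact code):

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Stand-in dataset; in the real job this is built from the mounted files.
    dataset = TensorDataset(torch.randn(1024, 3, 224, 224), torch.randint(0, 10, (1024,)))

    loader = DataLoader(
        dataset,
        batch_size=64,                            # placeholder
        num_workers=max(os.cpu_count() - 1, 1),   # item 2: roughly CPU core count - 1
        pin_memory=False,                         # item 3: disabling this resolved the slowdown
    )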

