Resolving GPU memory errors in Databricks with PyTorch on g4dn.2xlarge instances

Question

I'm encountering persistent memory issues while training and testing a PyTorch model on Databricks. In my Databricks job configuration, I’ve specified node_type_id and driver_node_type_id as g4dn.2xlarge. According to the documentation, this instance type should provide up to 32GB of memory. However, when I run the job, I receive a CUDA out-of-memory error indicating that only 14GB of GPU memory is available:

'CUDA out of memory. Tried to allocate 126.00 MiB (GPU 0; 14.76 GiB total capacity; 12.97 GiB already allocated; 99.75 MiB free; 13.20 GiB reserved in total by PyTorch). If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.'

Is there a specific parameter I need to modify in the configuration JSON to ensure that the job utilizes the full capabilities of the g4dn.2xlarge instance?

Accepted Answer

Hi @Gabriel-2005
Welcome to Microsoft Q&A platform and thanks for posting your query here.

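Databricks is using the instance type you specified. The mismatch comes from two different kinds of memory: the g4dn.2xlarge instance offers 32 GiB of system (CPU) memory, but its single NVIDIA T4 GPU provides only 16 GiB of GPU memory. Part of that GPU memory is reserved by the system and is not available to PyTorch, which is why the error reports a total capacity of 14.76 GiB. No setting in the job configuration JSON will raise this limit, because it is fixed by the GPU hardware.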
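To make the numbers in the error message easier to read, here is a minimal sketch that parses the message from the question and compares the figures. The function name `parse_oom_message` and the regular expressions are illustrative assumptions based on this one message; they are not part of PyTorch, and the message format may differ between PyTorch versions.

```python
import re

# Error text copied from the question (trailing advice omitted).
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 126.00 MiB (GPU 0; 14.76 GiB total capacity; "
    "12.97 GiB already allocated; 99.75 MiB free; 13.20 GiB reserved in total by PyTorch)."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def parse_oom_message(message):
    """Return the sizes quoted in the OOM message, all converted to MiB.

    Hypothetical helper: the patterns below match the wording of the
    message in this thread only.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    sizes = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        sizes[name] = float(match.group(1)) * UNIT_TO_MIB[match.group(2)]
    return sizes


sizes = parse_oom_message(OOM_MESSAGE)
# Memory PyTorch holds in its cache but has not handed out to tensors.
cached_unused = sizes["reserved"] - sizes["allocated"]

print(f"requested: {sizes['requested']:.2f} MiB, free: {sizes['free']:.2f} MiB")
print(f"reserved but unallocated: {cached_unused:.2f} MiB")
print(f"allocated share of GPU capacity: {sizes['allocated'] / sizes['total']:.0%}")
```

In this message the requested 126 MiB exceeds the 99.75 MiB that is free, and reserved memory is only slightly larger than allocated memory. That suggests the GPU is simply full rather than fragmented, so tuning max_split_size_mb is unlikely to help much here.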

To resolve this issue, I suggest choosing a different instance type that satisfies the minimum GPU memory needs for your workload. Since AWS frequently introduces new instances and updates existing ones, I won't suggest a specific type. However, you can use https://instances.vantage.sh/ to filter and find suitable instances based on your requirements.
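Once a cluster is running on a candidate instance type, you can check from a notebook how much GPU memory PyTorch actually sees before launching the full job. The sketch below is one way to do that; `gpu_memory_report` and `bytes_to_gib` are hypothetical helper names, and the function returns None when PyTorch or a CUDA device is not available.

```python
def bytes_to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 1024**3 bytes)."""
    return num_bytes / 1024**3


def gpu_memory_report(device_index=0):
    """Return the GPU name plus total and free memory in GiB.

    Returns None if PyTorch is not installed or no CUDA GPU is visible.
    """
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    props = torch.cuda.get_device_properties(device_index)
    # mem_get_info returns (free_bytes, total_bytes) for the device.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device_index)
    return {
        "name": props.name,
        "total_gib": bytes_to_gib(total_bytes),
        "free_gib": bytes_to_gib(free_bytes),
    }


report = gpu_memory_report()
if report is None:
    print("No CUDA GPU visible to PyTorch")
else:
    print(
        f"{report['name']}: {report['total_gib']:.2f} GiB total, "
        f"{report['free_gib']:.2f} GiB free"
    )
```

The total reported this way is the usable GPU memory, which is typically somewhat lower than the nominal figure in the instance specification.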

For reference: https://learn.microsoft.com/en-us/azure/databricks/machine-learning/train-model/huggingface/fine-tune-model#troubleshoot-common-cuda-errors

Hope this helps. If this answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.
