Resolving GPU memory errors in Databricks with PyTorch on g4dn.2xlarge instances

Gabriel-2005 365 Reputation points
2024-11-18T07:29:13.7+00:00

I'm encountering persistent memory issues while training and testing a PyTorch model on Databricks. In my Databricks job configuration, I’ve specified node_type_id and driver_node_type_id as g4dn.2xlarge. According to the documentation, this instance type should provide up to 32GB of memory. However, when I run the job, I receive a CUDA out-of-memory error indicating that only 14GB of GPU memory is available:

'CUDA out of memory. Tried to allocate 126.00 MiB (GPU 0; 14.76 GiB total capacity; 12.97 GiB already allocated; 99.75 MiB free; 13.20 GiB reserved in total by PyTorch). If reserved memory is >> allocated memory, try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF.'

Is there a specific parameter I need to modify in the configuration JSON to ensure that the job utilizes the full capabilities of the g4dn.2xlarge instance?

 

Azure Databricks

Accepted answer
  1. Smaran Thoomu 17,520 Reputation points Microsoft Vendor
    2024-11-18T09:17:31.0233333+00:00

    Hi @Gabriel-2005
    Welcome to Microsoft Q&A platform and thanks for posting your query here.

    Databricks is using the instance type you specified. The 32 GiB quoted in the documentation for g4dn.2xlarge is system (CPU) memory; the GPU has its own, separate 16 GiB. Part of that GPU memory is held back by the system (driver and hardware overhead) and is not available to applications, which is why the error reports a total capacity of 14.76 GiB. This is a hardware limit, so no parameter in the job configuration JSON will raise it.
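
To make the numbers in the error concrete, here is a minimal sketch that parses the figures out of the message quoted in the question and converts them to GiB. The helper name `parse_oom` is hypothetical, and the regular expressions assume the message wording shown above, which can differ between PyTorch versions.

```python
import re

# The figures from the error message quoted in the question.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 126.00 MiB (GPU 0; 14.76 GiB total capacity; "
    "12.97 GiB already allocated; 99.75 MiB free; 13.20 GiB reserved in total by PyTorch)."
)

UNIT_TO_GIB = {"MiB": 1 / 1024, "GiB": 1.0}


def parse_oom(message):
    """Return the memory figures from a PyTorch OOM message, all in GiB.

    Hypothetical helper: assumes the message format shown in the question.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in message")
        value, unit = match.groups()
        figures[name] = float(value) * UNIT_TO_GIB[unit]
    return figures


figures = parse_oom(OOM_MESSAGE)
# Memory PyTorch has cached from the GPU but is not currently using for tensors.
cached_but_unused = figures["reserved"] - figures["allocated"]
print(f"requested: {figures['requested']:.3f} GiB, free: {figures['free']:.3f} GiB")
print(f"reserved minus allocated: {cached_but_unused:.2f} GiB")
```

In this message the gap between reserved and allocated memory is only about 0.23 GiB, so the `max_split_size_mb` hint in the error is unlikely to recover much: the job is already using nearly all of the GPU's memory.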

    To resolve this issue, I suggest choosing a different instance type that satisfies the minimum GPU memory needs for your workload. Since AWS frequently introduces new instances and updates existing ones, I won't suggest a specific type. However, you can use https://instances.vantage.sh/ to filter and find suitable instances based on your requirements.
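
When sizing a replacement instance, it can help to check what the current GPU exposes to PyTorch and how much the job has used at its peak. The following is a rough sketch, assuming PyTorch is installed on the cluster; `gpu_memory_report` is a hypothetical helper name, and it returns `None` when no CUDA device is visible.

```python
try:
    import torch  # assumption: PyTorch is installed on the cluster
except ImportError:
    torch = None

GIB = 1024 ** 3


def gpu_memory_report(device=0):
    """Return GPU memory figures in GiB, or None when no CUDA device is usable."""
    if torch is None or not torch.cuda.is_available():
        return None
    # mem_get_info returns (free, total) in bytes for the given device.
    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return {
        "name": torch.cuda.get_device_name(device),
        "total_gib": total_bytes / GIB,
        "free_gib": free_bytes / GIB,
        # Peak memory occupied by tensors in this process so far.
        "peak_allocated_gib": torch.cuda.max_memory_allocated(device) / GIB,
    }


report = gpu_memory_report()
print(report if report is not None else "No CUDA device visible to PyTorch")
```

Calling it after a training step that fits in memory (for example with a smaller batch size) gives a peak figure to compare against the GPU memory of candidate instance types.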

    For reference: https://learn.microsoft.com/en-us/azure/databricks/machine-learning/train-model/huggingface/fine-tune-model#troubleshoot-common-cuda-errors

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, please click Accept Answer and select Yes for "Was this answer helpful?".

    1 person found this answer helpful.
