MPI error in Azure Batch

Green, Jim 55 Reputation points
2024-10-31T15:30:54.37+00:00

In MPI-enabled tasks, we occasionally get this error:

Aborting: smpd on a3315966100000A failed to communicate with child smpd manager

If we restart the exact same task, it completes.

This seems to suggest that a node has terminated and is no longer communicating. The application normally catches and reports exceptions, but nothing is generated in these cases. I've tried various ways to terminate a node prematurely, but in those tests MPI still manages to report the failure.

What could we do to track down the cause of this? I don't know whether this is an application problem or just a Batch glitch that we have to live with.
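
One thing we could try on our side, as a rough sketch: collect the output of the failed tasks and tally which host names appear in the error, to see whether the failures cluster on particular nodes. This assumes the error text ends up in a task output file we can download and that it always has the form quoted above. The function names are made up for illustration and are not part of any Batch or MPI API.

```python
import re
from collections import Counter

# Pattern built from the error text quoted above. The host token is
# assumed to be the single whitespace-free word following "smpd on".
SMPD_FAILURE = re.compile(
    r"Aborting: smpd on (?P<host>\S+) failed to communicate "
    r"with child smpd manager"
)


def find_smpd_failures(log_text):
    """Return the host names named in smpd failures, in order of appearance."""
    return [m.group("host") for m in SMPD_FAILURE.finditer(log_text)]


def count_failures_by_host(log_texts):
    """Tally failures per host across several task logs (iterable of strings)."""
    counts = Counter()
    for text in log_texts:
        counts.update(find_smpd_failures(text))
    return counts
```

If the same host shows up repeatedly, that would point toward a node problem rather than the application, though a handful of samples would not prove it either way.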

Thanks.

Azure Batch
An Azure service that provides cloud-scale job scheduling and compute management.