MPI error in Azure Batch
In MPI-enabled tasks, we occasionally get this error:
Aborting: smpd on a3315966100000A failed to communicate with child smpd manager
If we restart the exact same task, it completes.
This seems to suggest that a node has terminated and is no longer communicating. The application normally catches and reports exceptions, but nothing is generated in these cases. I've tried various ways of terminating a node prematurely, but in those tests MPI still manages to report the failure.
What could we do to track down the cause of this? I don't know whether this is an application problem or just a Batch glitch that we have to live with.
Thanks.