Hi @Green, Jim
Reprovisioning can indeed disrupt MPI runs. When a node is reprovisioned, every process on that node is killed, and because an MPI job typically aborts as soon as any rank dies, losing one node would explain the abrupt end of your run. To determine whether reprovisioning is the cause, check the logs for node reprovisioning events: look for messages related to node health, maintenance, or scaling in your cloud provider's monitoring tools.
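If your pool runs on Azure VMs, one way to catch impending reprovisioning from inside the job is to poll the Instance Metadata Service's scheduled-events endpoint, which lists upcoming events such as Reboot, Redeploy, or Preempt. Here is a minimal C sketch using libcurl; the endpoint and Metadata header are standard IMDS conventions, everything else is illustrative:

```c
/* Sketch: poll Azure IMDS scheduled events from inside a node.
 * Build with: gcc check_events.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

/* libcurl write callback: dump the JSON response to stdout */
static size_t on_data(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)userdata;
    return fwrite(ptr, size, nmemb, stdout);
}

int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* IMDS is a link-local service, only reachable from the VM itself */
    curl_easy_setopt(curl, CURLOPT_URL,
        "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01");
    curl_easy_setopt(curl, CURLOPT_NOPROXY, "*");  /* IMDS rejects proxied requests */

    /* IMDS requires the Metadata: true header */
    struct curl_slist *hdrs = curl_slist_append(NULL, "Metadata: true");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_data);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "query failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```

Since IMDS is link-local, this only works when run on the node itself; an empty Events array in the response means nothing is scheduled.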
Regarding the thought on higher SKU options: the Fsv2 series is indeed compute-optimized and well suited to floating-point-intensive work. If you want faster performance without paying for RAM, storage, or GPUs you don't need, you might consider the following:
F32s_v2 SKU: This could be a good strategy. Consolidating onto a single high-core-count SKU like F32s_v2 offers several benefits:
Faster Start Times: A single node should allocate faster and has a lower chance of allocation failure than a pool of small nodes.
Reduced Risk: Only one node is exposed to reprovisioning.
Faster In-Memory MPI Transfers: Ranks on the same node communicate through shared memory instead of the network, which can improve overall MPI performance (see the ping-pong sketch after this list).
More MPI Processes: You can fit more MPI ranks on one 32-core node than with separate 2-core machines.
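To make the in-memory transfer point concrete, here is a minimal ping-pong sketch using only standard MPI calls, nothing pool-specific: run it once with both ranks on the same node and once across two nodes, and compare the numbers.

```c
/* Minimal ping-pong between ranks 0 and 1: measures average round-trip
 * time and effective bandwidth for a 1 MiB message. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 100
#define MSG_BYTES (1 << 20)  /* 1 MiB payload */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    static char buf[MSG_BYTES];
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);  /* start all ranks together */
    double t0 = MPI_Wtime();

    if (rank == 0) {
        for (int i = 0; i < ITERS; i++) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        double rtt = (MPI_Wtime() - t0) / ITERS;
        printf("avg round trip: %.1f us (%.2f GB/s)\n",
               rtt * 1e6, 2.0 * MSG_BYTES / rtt / 1e9);
    } else if (rank == 1) {
        for (int i = 0; i < ITERS; i++) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Build with `mpicc pingpong.c -o pingpong` and run with `mpirun -np 2 ./pingpong`; intra-node transfers go through shared memory and typically show far lower latency than the network path.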
That said, the F32s_v2 strategy has potential downsides:
Single Point of Failure: If that one node fails or gets reprovisioned, the entire run goes down with it.
Resource Contention: With more processes on a single node, ranks compete for shared caches and memory bandwidth, which can affect performance. I would recommend benchmarking a few configurations; the affinity report sketched below can help verify where each rank lands.
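When you try different configurations, it helps to see where each rank actually runs. This small sketch has every rank report its host and current CPU; it assumes glibc's sched_getcpu() extension, so it is Linux-specific:

```c
/* Sketch: each rank reports the host and CPU it is running on, useful
 * for verifying process-binding settings. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    /* sched_getcpu() returns the CPU this thread is currently on */
    printf("rank %d -> host %s, cpu %d\n", rank, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}
```

Run it under different binding options (for example, Open MPI's `--bind-to core`) and compare the output before and after to confirm the placement you intended.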