Hi @Green, Jim
Reprovisioning can indeed disrupt MPI runs. When a node is reprovisioned, every process on that node is killed, and because an MPI job typically aborts as soon as any rank dies, losing one node would explain the abrupt end of your run. To determine whether reprovisioning is the cause, check the logs for node reprovisioning events: look for messages related to node health, maintenance, or scaling in your cloud provider's monitoring tools.
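If your pool runs on Azure VMs, one way to catch impending reprovisioning from inside the job is to poll the Instance Metadata Service's scheduled-events endpoint, which lists upcoming events such as Reboot, Redeploy, or Preempt. Here is a minimal C sketch using libcurl; the endpoint and Metadata header are standard IMDS conventions, everything else is illustrative:

```c
/* Sketch: poll Azure IMDS scheduled events from inside a node.
 * Build with: gcc check_events.c -lcurl */
#include <stdio.h>
#include <curl/curl.h>

/* libcurl write callback: dump the JSON response to stdout */
static size_t on_data(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    (void)userdata;
    return fwrite(ptr, size, nmemb, stdout);
}

int main(void)
{
    CURL *curl = curl_easy_init();
    if (!curl) return 1;

    /* IMDS is a link-local service, only reachable from the VM itself */
    curl_easy_setopt(curl, CURLOPT_URL,
        "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01");
    curl_easy_setopt(curl, CURLOPT_NOPROXY, "*");  /* IMDS rejects proxied requests */

    /* IMDS requires the Metadata: true header */
    struct curl_slist *hdrs = curl_slist_append(NULL, "Metadata: true");
    curl_easy_setopt(curl, CURLOPT_HTTPHEADER, hdrs);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, on_data);

    CURLcode rc = curl_easy_perform(curl);
    if (rc != CURLE_OK)
        fprintf(stderr, "query failed: %s\n", curl_easy_strerror(rc));

    curl_slist_free_all(hdrs);
    curl_easy_cleanup(curl);
    return rc == CURLE_OK ? 0 : 1;
}
```

Since IMDS is link-local, this only works when run on the node itself; an empty Events array in the response means nothing is scheduled.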
Regarding the thought on higher SKU options: the Fsv2 series is indeed compute-optimized and well suited to floating-point-intensive work. If you want faster performance without paying for RAM, storage, or GPUs you don't need, you might consider the following:
F32s_v2 SKU: This could be a good strategy. Consolidating onto a single high-core-count SKU like F32s_v2 offers several benefits:
Faster Start Times: A single node should allocate faster and has a lower chance of allocation failure than a pool of small nodes.
Reduced Risk: Only one node is exposed to reprovisioning.
Faster In-Memory MPI Transfers: Ranks on the same node communicate through shared memory instead of the network, which can improve overall MPI performance (see the ping-pong sketch after this list).
More MPI Processes: You can fit more MPI ranks on one 32-core node than with separate 2-core machines.
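To make the in-memory transfer point concrete, here is a minimal ping-pong sketch using only standard MPI calls, nothing pool-specific: run it once with both ranks on the same node and once across two nodes, and compare the numbers.

```c
/* Minimal ping-pong between ranks 0 and 1: measures average round-trip
 * time and effective bandwidth for a 1 MiB message. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 100
#define MSG_BYTES (1 << 20)  /* 1 MiB payload */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {
        if (rank == 0) fprintf(stderr, "needs at least 2 ranks\n");
        MPI_Finalize();
        return 1;
    }

    static char buf[MSG_BYTES];
    memset(buf, 0, sizeof buf);

    MPI_Barrier(MPI_COMM_WORLD);  /* start all ranks together */
    double t0 = MPI_Wtime();

    if (rank == 0) {
        for (int i = 0; i < ITERS; i++) {
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        double rtt = (MPI_Wtime() - t0) / ITERS;
        printf("avg round trip: %.1f us (%.2f GB/s)\n",
               rtt * 1e6, 2.0 * MSG_BYTES / rtt / 1e9);
    } else if (rank == 1) {
        for (int i = 0; i < ITERS; i++) {
            MPI_Recv(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, MSG_BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }

    MPI_Finalize();
    return 0;
}
```

Build with `mpicc pingpong.c -o pingpong` and run with `mpirun -np 2 ./pingpong`; intra-node transfers go through shared memory and typically show far lower latency than the network path.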
That said, the F32s_v2 strategy has potential downsides:
Single Point of Failure: If that one node fails or gets reprovisioned, the entire run goes down with it.
Resource Contention: With more processes on a single node, ranks compete for shared caches and memory bandwidth, which can affect performance. I would recommend benchmarking a few configurations; the affinity report sketched below can help verify where each rank lands.
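When you try different configurations, it helps to see where each rank actually runs. This small sketch has every rank report its host and current CPU; it assumes glibc's sched_getcpu() extension, so it is Linux-specific:

```c
/* Sketch: each rank reports the host and CPU it is running on, useful
 * for verifying process-binding settings. */
#define _GNU_SOURCE
#include <mpi.h>
#include <sched.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    /* sched_getcpu() returns the CPU this thread is currently on */
    printf("rank %d -> host %s, cpu %d\n", rank, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}
```

Run it under different binding options (for example, Open MPI's `--bind-to core`) and compare the output before and after to confirm the placement you intended.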