MPI_Barrier fails when mixing Intel and AMD machines
The following code works fine with only Intel machines or only AMD machines, but fails when the two are mixed. I launch from an Intel machine with two worker nodes: the Intel localhost and an AMD machine.
#include <iostream>
#include <mpi.h>

int main()
{
    // argc/argv from main are not forwarded to MPI_Init
    int argc = 0;
    MPI_Init(&argc, nullptr);

    // Synchronize all ranks 100 times, logging before and after each barrier
    const int count = 100;
    for (int i = 0; i < count; ++i)
    {
        std::cout << " Attempting Barrier " << i + 1 << std::endl;
        MPI_Barrier(MPI_COMM_WORLD);
        std::cout << " Completed Barrier " << i + 1 << std::endl;
    }

    MPI_Finalize();
}
The launch command is:

mpiexec -l -hosts 2 localhost amd_machine -wdir "\network\path" \path-to-exe
It fails consistently after the third iteration, with the following output:
[0] Attempting Barrier 1
[1] Attempting Barrier 1
[0] Completed Barrier 1
[0] Attempting Barrier 2
[1] Completed Barrier 1
[0] Completed Barrier 2
[1] Attempting Barrier 2
[0] Attempting Barrier 3
[0] Completed Barrier 3
[0] Attempting Barrier 4
[1] Completed Barrier 2
[1] Attempting Barrier 3
[1] Completed Barrier 3
[1] Attempting Barrier 4

job aborted:
[ranks] message

[0] terminated

[1] fatal error
Fatal error in MPI_Barrier: Other MPI error, error stack:
MPI_Barrier(MPI_COMM_WORLD) failed
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (errno 10060)
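As a sanity check, a minimal sketch like the one below (standard MPI_Comm_rank, MPI_Comm_size, and MPI_Get_processor_name calls only; this is not my failing program) can be launched with the same command to confirm which host each rank actually lands on:

#include <iostream>
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // MPI_MAX_PROCESSOR_NAME and MPI_Get_processor_name are part of the MPI standard
    char name[MPI_MAX_PROCESSOR_NAME] = {};
    int len = 0;
    MPI_Get_processor_name(name, &len);

    std::cout << "Rank " << rank << " of " << size << " on " << name << std::endl;

    MPI_Finalize();
}

Since -l already prefixes each output line with the rank number, any mismatch between the prefix and the reported host name would show up immediately.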