Parallel CycleCloud Slurm process stuck while reading input file
I am running a parallel job with Open MPI on CycleCloud, submitted through Slurm (sbatch), on 5 nodes with 120 cores each.
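The submission script is roughly along these lines (the partition, solver, and case names below are placeholders, not my exact setup):

    #!/bin/bash
    #SBATCH --job-name=mesh_run
    #SBATCH --nodes=5
    #SBATCH --ntasks-per-node=120
    #SBATCH --partition=hpc        # placeholder partition name
    #SBATCH --exclusive

    # 5 nodes x 120 cores = 600 MPI ranks; solver and case are placeholders
    mpirun -np 600 ./solver case/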
The job starts by having each core read the computational mesh. While reading, a mesh lock file appears for each core and disappears once that core finishes reading. The problem: a new mesh lock file then appears and the process hangs, even though squeue shows the job as still running.
Running on a single node with 120 cores using mpirun directly (without sbatch) works fine, so it looks like a Slurm issue. This setup worked for many runs, and it actually worked yesterday, but now the process gets stuck even though I have restarted the scheduler several times. Does anyone have an idea how to resolve this? For comparison, the direct single-node command that works is shown below.
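Roughly like this (again, solver and case names are placeholders):

    # run directly on one node, outside sbatch; this completes without hanging
    mpirun -np 120 ./solver case/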