I have successfully connected to pvserver via an ssh tunnel from my Windows 10 desktop client when running in serial. I am now attempting to run pvserver in parallel.
The cluster is running CentOS 7.9.2009 with Slurm 22.05.2. I request compute resources with "srun --ntasks=2 … --pty bash".
I then run "mpiexec -np 2 pvserver --sp 12029" using the mpiexec that ships with the ParaView binaries. The terminal hangs for far longer than it should, and then Slurm reports: "srun: Job … step creation temporarily disabled, retrying (Requested nodes are busy)", followed periodically by "srun: Job … step creation still disabled, retrying (Requested nodes are busy)".
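For reference, the full sequence I am attempting looks roughly like this (the ParaView install path, port number, and host names are placeholders for my actual values):

```bash
# 1. On the login node: request an interactive allocation with 2 tasks
srun --ntasks=2 --pty bash

# 2. On the allocated compute node: launch pvserver with the mpiexec that
#    ships with the ParaView binaries (install path is a placeholder)
/path/to/ParaView-5.10.0/bin/mpiexec -np 2 \
    /path/to/ParaView-5.10.0/bin/pvserver --server-port=12029

# 3. On my Windows desktop: forward the port through the login node, then
#    point the ParaView client at localhost:12029 (same as my working serial setup)
ssh -L 12029:<compute-node>:12029 <user>@<login-node>
```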
Of course, this could be a Slurm issue and not related to ParaView. However, I have attempted to run other applications in parallel (e.g. an OpenFOAM CFD solver) using the ParaView mpiexec binary and encountered the same problem. For clarity, these other applications run fine with my system version of OpenMPI (the mpirun command).
Unfortunately, pvserver does not run with my system MPI (I get the "Socket error in call to bind. Address already in use." error documented elsewhere on this forum).
All this makes me suspect the problem is with the mpiexec binary (or how I am using it!)
tbh, in an HPC context you want to use your system MPI, so if the MPICH compatibility does not work for you, I would consider compiling pvserver against the HPC-provided MPI for better performance.
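Roughly, a server-only build against the cluster's MPI could look something like this (module names are site-specific, the exact CMake option names can vary between ParaView versions, and the OSMesa flags are only needed for headless rendering, so treat this as a sketch):

```bash
# Load the site toolchain and MPI (module names are cluster-specific)
module load gcc openmpi/4.1.1 cmake

# Fetch ParaView sources (submodules are required)
git clone --recursive https://gitlab.kitware.com/paraview/paraview.git

# Configure a Qt-free, MPI-enabled, headless (OSMesa) server build
cmake -S paraview -B paraview-build \
    -DCMAKE_BUILD_TYPE=Release \
    -DPARAVIEW_USE_MPI=ON \
    -DPARAVIEW_USE_QT=OFF \
    -DVTK_USE_X=OFF \
    -DVTK_OPENGL_HAS_OSMESA=ON

cmake --build paraview-build -j8
```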
No good reason. I already had 5.10.0 on my local machine, and I was told I should use the same version on the server. I have now tried using 5.11.0, but encounter the same issue.
Make sure you are running the right executable by specifying its full path.
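For example, something along these lines (the install path is wherever you extracted the ParaView binaries):

```bash
# Which mpiexec is first on PATH, and which MPI does it belong to?
command -v mpiexec
mpiexec --version

# Call the one bundled with the ParaView binaries explicitly by full path
/path/to/ParaView-5.10.0/bin/mpiexec --version
```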
Yes, I am specifying the path to the executables correctly. I was just being concise in my question!
What is your system MPI?
Our HPC has a bunch of different OpenMPI versions that I can load. I have tried our system default (OpenMPI 4.0.0) and the newest version we have installed (OpenMPI 4.1.1), with and without the "--system-mpi" flag. Neither works: in both cases I get an "Address already in use" error for every process, which has been documented elsewhere on this forum (e.g. ParaView Parallel pvserver (MPI) - Problems when running on a Cluster) and is what led me to try the downloaded precompiled binaries.
I get the same errors when I try some preinstalled versions of ParaView on our cluster. I am not sure whether these were compiled on the system, but I think they might have been, because I was not able to use remote (mesa) rendering with them (whereas I am able to with the downloaded binaries).
I’ve confirmed that mpiexec with pvserver runs fine on my local machine (Ubuntu 22.04). I will get in touch with my IT team and update this topic if we resolve the issue.
I have resolved this issue. It seems that MPICH does not play well with Slurm's "srun" command, though I can't say why (OpenMPI works fine).
Using MPICH with the "sbatch" command, or alternatively ssh-ing directly into a compute node and running mpiexec there, works as expected.
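For anyone else who hits this, my sbatch script looks roughly like the following (install path, port, and resource requests are placeholders for my actual values):

```bash
#!/bin/bash
#SBATCH --job-name=pvserver
#SBATCH --ntasks=2
#SBATCH --time=02:00:00

# Launch pvserver with the mpiexec bundled with the ParaView binaries,
# instead of going through srun
/path/to/ParaView-5.10.0/bin/mpiexec -np ${SLURM_NTASKS} \
    /path/to/ParaView-5.10.0/bin/pvserver --server-port=12029
```

I submit this with "sbatch", then set up the ssh tunnel to the compute node the job landed on and connect the client as before.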
So in summary:
OpenMPI and MPICH are two different implementations of MPI. The precompiled ParaView binaries are built against MPICH, not OpenMPI (though ParaView can also be compiled against a system MPI).
On our cluster, the MPICH mpiexec does not work inside an interactive srun allocation; it needs to be launched from an sbatch job (or directly on a compute node).