I have successfully connected to pvserver via an ssh tunnel from my Windows 10 desktop client when running in serial. I am now attempting to run pvserver in parallel.
The cluster is running CentOS 7.9.2009 with Slurm 22.05.2. I request compute resources with "srun --ntasks=2 … --pty bash".
I then run "mpiexec -np 2 pvserver --sp 12029" using the mpiexec that ships with the ParaView binaries. The terminal hangs for far longer than it should, and then Slurm reports: "srun: Job … step creation temporarily disabled, retrying (Requested nodes are busy)", followed periodically by "srun: Job … step creation still disabled, retrying (Requested nodes are busy)".
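For reference, the full sequence I am attempting looks roughly like this (the ParaView install path, port number, and host names are placeholders for my actual values):

```bash
# 1. On the login node: request an interactive allocation with 2 tasks
srun --ntasks=2 --pty bash

# 2. On the allocated compute node: launch pvserver with the mpiexec that
#    ships with the ParaView binaries (install path is a placeholder)
/path/to/ParaView-5.10.0/bin/mpiexec -np 2 \
    /path/to/ParaView-5.10.0/bin/pvserver --server-port=12029

# 3. On my Windows desktop: forward the port through the login node, then
#    point the ParaView client at localhost:12029 (same as my working serial setup)
ssh -L 12029:<compute-node>:12029 <user>@<login-node>
```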
Of course, this could be a Slurm issue and not related to ParaView. However, I have attempted to run other applications in parallel (e.g. an OpenFOAM CFD solver) using the ParaView mpiexec binary and encountered the same problem. For clarity, these other applications run fine with my system version of OpenMPI (the mpirun command).
Unfortunately, pvserver does not run with my system MPI (I get the "Socket error in call to bind. Address already in use." error documented elsewhere on this forum).
All this makes me suspect the problem is with the mpiexec binary (or how I am using it!)
tbh, in an HPC context you want to use your system MPI, so if the MPICH compatibility does not work for you, I would consider compiling pvserver against the HPC-provided MPI for better performance.
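Roughly, a server-only build against the cluster's MPI could look something like this (module names are site-specific, the exact CMake option names can vary between ParaView versions, and the OSMesa flags are only needed for headless rendering, so treat this as a sketch):

```bash
# Load the site toolchain and MPI (module names are cluster-specific)
module load gcc openmpi/4.1.1 cmake

# Fetch ParaView sources (submodules are required)
git clone --recursive https://gitlab.kitware.com/paraview/paraview.git

# Configure a Qt-free, MPI-enabled, headless (OSMesa) server build
cmake -S paraview -B paraview-build \
    -DCMAKE_BUILD_TYPE=Release \
    -DPARAVIEW_USE_MPI=ON \
    -DPARAVIEW_USE_QT=OFF \
    -DVTK_USE_X=OFF \
    -DVTK_OPENGL_HAS_OSMESA=ON

cmake --build paraview-build -j8
```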
No good reason. I already had 5.10.0 on my local machine, and I was told I should use the same version on the server. I have now tried using 5.11.0, but encounter the same issue.
Make sure you are running the right executable by specifying its full path.
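For example, something along these lines (the install path is wherever you extracted the ParaView binaries):

```bash
# Which mpiexec is first on PATH, and which MPI does it belong to?
command -v mpiexec
mpiexec --version

# Call the one bundled with the ParaView binaries explicitly by full path
/path/to/ParaView-5.10.0/bin/mpiexec --version
```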
Yes, I am specifying the path to the executables correctly. I was just being concise in my question!
What is your system MPI?
Our HPC has a bunch of different OpenMPI versions that I can load. I have tried our system default (OpenMPI 4.0.0) and the newest version we have installed (OpenMPI 4.1.1), with and without the "--system-mpi" flag. Neither works: in both cases I get an "Address already in use" error for every process, which has been documented elsewhere on this forum (e.g. ParaView Parallel pvserver (MPI) - Problems when running on a Cluster) and is what led me to try the downloaded precompiled binaries.
I get the same errors when I try some preinstalled versions of ParaView on our cluster. I am not sure whether these were compiled on the system, but I think they might have been, because I was not able to use remote (mesa) rendering with them (whereas I am able to with the downloaded binaries).
I’ve confirmed that mpiexec with pvserver runs fine on my local machine (Ubuntu 22.04). I will get in touch with my IT team and update this topic if we resolve the issue.
I have resolved this issue. It seems that MPICH does not play well with Slurm's "srun" command, though I can't say why (OpenMPI works fine).
Using MPICH with the "sbatch" command, or alternatively ssh-ing directly into a compute node and running mpiexec there, works as expected.
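For anyone else who hits this, my sbatch script looks roughly like the following (install path, port, and resource requests are placeholders for my actual values):

```bash
#!/bin/bash
#SBATCH --job-name=pvserver
#SBATCH --ntasks=2
#SBATCH --time=02:00:00

# Launch pvserver with the mpiexec bundled with the ParaView binaries,
# instead of going through srun
/path/to/ParaView-5.10.0/bin/mpiexec -np ${SLURM_NTASKS} \
    /path/to/ParaView-5.10.0/bin/pvserver --server-port=12029
```

I submit this with "sbatch", then set up the ssh tunnel to the compute node the job landed on and connect the client as before.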
So in summary:
OpenMPI and MPICH are two different implementations of MPI. The precompiled ParaView binaries are built against MPICH, not OpenMPI (though ParaView can also be compiled against a system MPI).
On our cluster, the MPICH mpiexec does not work inside an interactive srun allocation; it needs to be launched from an sbatch job (or directly on a compute node).