Reverse client-server connection fails with more than one MPI process

Situation

I am working with ParaView 5.7.0.

I have managed to set up a reverse client-server connection that works as expected when I launch the command

mpirun -np 1 ${PVDIR}/bin/pvserver -rc --client-host=<client ip> --hostname=<server name>

Possibly worth noting: to get it running on my system, I built the server-side ParaView slightly differently from the client, namely with an EGL backend, to work around this error:

X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)

After much searching over the internet, I borrowed the idea from this contribution to the thread pvserver rendering on "Quadro P4000" and it has worked. I can indeed open the connection and the client visualizes data stored and processed at the server side.

Exceptions

Things go wrong when I raise the number of MPI processes, for example:

mpirun -np 4 ${PVDIR}/bin/pvserver -rc --client-host=<client ip> --hostname=<server name>

When I launch the server before the client, I get essentially the same messages four times:

Connecting to client (reverse connection requested)…
Connection URL: csrc://:11111

However, when I connect client to server and start a task, say an animation, there is only one process running at the server side (pgrep pvserver). One connection has exited with the message:

Connection failed during handshake. vtkSocketCommunicator::GetVersion()
returns different values on the two connecting processes
(Current value: 100).

The other two connections time out.

Questions

Obviously I would like to harness the work of all four processes, but I have no idea what else to check. Any suggestions for actions, fixes, or workarounds?

The basic configuration, reverse connecting from an MPI-spawned pvserver, is known to work elsewhere.

It seems like your mpirun command is spawning 4 independent copies of pvserver rather than one collective session. Make sure the MPI you are running pvserver with matches the MPI you built pvserver with. I’ve seen similar problems when mixing Intel and Cray MPIs, for example.
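
One way to check this on the server is to compare the MPI library that pvserver is dynamically linked against with the mpirun found first on the PATH. This is a minimal sketch, assuming a Linux system with ldd and a dynamically linked pvserver; the helper name mpi_libs_from_ldd is hypothetical, and the path reuses ${PVDIR} from the commands above.

```shell
# Read `ldd` output on stdin and print the resolved path of each library
# whose name starts with "libmpi" (libmpi.so, libmpich.so, libmpi_cxx.so, ...).
mpi_libs_from_ldd() {
    awk '$1 ~ /^libmpi/ && $2 == "=>" { print $3 }'
}

# Typical use on the server (not run here; paths depend on the installation):
#   ldd "${PVDIR}/bin/pvserver" | mpi_libs_from_ldd
#   command -v mpirun     # which launcher is first on the PATH
#   mpirun --version      # which MPI implementation that launcher belongs to
```

If the library path printed by the first command does not belong to the same MPI installation as the mpirun launcher, each pvserver may start as its own single-rank job, which would be consistent with four independent connection attempts.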

1. MPI compatibility

Following the suggestion of @Dave_DeMarle, I have made sure that both client and server sides are built with the flag USE_SYSTEM_mpi=ON so that when I launch a command

mpirun -np 4 $PV570DIR/bin/pvserver --hostname=$(hostname -a) -rc --client-host=[client-url] --disable-xdisplay-test --timeout=1

the mpirun version at run time is the same as at compile time. Also, client and server draw on the same pool of programs and shared libraries.


2. CMake settings

The external cache variables in CMakeCache.txt are identical in one configuration. In another configuration they differ because the client is built with qt5 and the server with egl. Problems occur in both configurations.
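
To see exactly which cache entries differ between the two builds, the two CMakeCache.txt files can be compared directly. This is a minimal sketch; the function name diff_cmake_cache and the build-directory paths in the usage comment are hypothetical.

```shell
# Compare the variable entries of two CMakeCache.txt files.
# Comment lines (starting with "#" or "//") and blank lines are dropped,
# and the remaining NAME:TYPE=VALUE entries are sorted before diffing.
# Returns the exit status of diff (0 = identical, 1 = differences found).
diff_cmake_cache() {
    dcc_a=$(mktemp)
    dcc_b=$(mktemp)
    grep -v -E '^(#|//|$)' "$1" | sort > "$dcc_a"
    grep -v -E '^(#|//|$)' "$2" | sort > "$dcc_b"
    diff "$dcc_a" "$dcc_b"
    dcc_status=$?
    rm -f "$dcc_a" "$dcc_b"
    return "$dcc_status"
}

# Typical use (hypothetical build directories):
#   diff_cmake_cache client-build/CMakeCache.txt server-build/CMakeCache.txt
```

Entries that hold absolute paths will differ whenever the build directories differ, so the entries of interest are the ones selecting MPI, Qt, and the rendering backend.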


3. setting qt5+egl with --disable-xdisplay-test

3.1 server starts

I receive the following messages

Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111

3.2 client connects

At the client side I see the notifications

Accepting connection(s): [client-hostname]:11111
Accepting connection(s): [client-hostname]:11111

At the server side

( 26.150s) [pvserver ]vtkSocketCommunicator.c:808 ERR| vtkSocketCommunicator (0x56537cb0bff0): Could not receive tag. 1010580540
( 26.150s) [pvserver ]vtkSocketCommunicator.c:557 ERR| vtkSocketCommunicator (0x56537cb0bff0): Endian handshake failed.
Client connected.
( 26.150s) [pvserver ]vtkTCPNetworkAccessMana:333 ERR| vtkTCPNetworkAccessManager (0x56537c975b60):

Connection failed during handshake. vtkSocketCommunicator::GetVersion()
returns different values on the two connecting processes
(Current value: 100).
Exiting…

So one connection works and another exits. The two remaining idle connections keep looping over the following three messages (source lines 472, 51, and 396):

( 25.176s) [pvserver ] vtkSocket.cxx:472 ERR| vtkClientSocket (0x5622bf734610): Socket error in call to connect. Connection refused.
( 25.176s) [pvserver ] vtkClientSocket.cxx:51 ERR| vtkClientSocket (0x5622bf734610): Failed to connect to server [client-url]:11111
( 25.176s) [pvserver ]vtkTCPNetworkAccessMana:396 WARN| vtkTCPNetworkAccessManager (0x5622becadb60): Connect failed. Retrying for 34.9704 more seconds.

until time-out, while the client operates.
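
To tally how the individual pvserver processes fared, the mpirun output can be captured to a file and counted by message type. This is a minimal sketch, assuming the server output was redirected to a file; the function name summarize_pvserver_log and the file name server.log are hypothetical, and the matched strings are the ones quoted in the logs above.

```shell
# Count the connection-related messages in a captured pvserver log.
# grep -F treats the patterns as fixed strings; -c prints the number of matching lines.
summarize_pvserver_log() {
    spl_log="$1"
    printf 'handshake_failures=%s\n' "$(grep -F -c 'Connection failed during handshake' "$spl_log")"
    printf 'refused_retries=%s\n' "$(grep -F -c 'Connect failed. Retrying' "$spl_log")"
    printf 'clients_connected=%s\n' "$(grep -F -c 'Client connected' "$spl_log")"
}

# Typical use (hypothetical file name):
#   mpirun -np 4 ... pvserver ... > server.log 2>&1
#   summarize_pvserver_log server.log
```

In a healthy parallel session one would expect a single "Client connected" line and no handshake failures, since only one connection to the client should be attempted for the whole MPI job.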

3.3 client works

I can operate on the client side and load data from the server side. On the server there is only one pvserver process at work, so this is effectively a sequential job.


4. setting qt5+egl without --disable-xdisplay-test

At the point of connection I have 1 working connection (hence a pvserver process), 3 rapid exits and 0 timeouts. The threefold exit message is again

Connection failed during handshake. vtkSocketCommunicator::GetVersion() returns different values on the two connecting processes (Current value: 100).

and pvserver works sequentially as in Sec. 3.3.


5. setting qt5+qt5 with --disable-xdisplay-test

In this case I also set the cache variable Qt5_DIR to where I could locate the file sb-qt5-configure.cmake.

5.1 server starts

At the server side I see that pgrep pvserver lists four process IDs.

5.2 client connects

However, of the four processes only one connects to the client and the other three keep on trying.

5.3 client does not work

When I try to load data from the server the connection at the server side crashes with the message

X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 52
Current serial number in output stream: 53

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

followed by three SIGTERMs and finally:

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[64131,1],3]
Exit code: 1

These errors are why I moved to the qt5+egl configuration; see the top post.


6. setting qt5+qt5 without --disable-xdisplay-test

One connection works. The other one exits with the same connection-failed-during-handshake as in Sec 4, and then the system crashes with the X-error, 3 SIGTERMs and the MPI job termination as in Sec 5.3.


7. similar issues reported?


8. questions

Where could the problem be?
How do I get pvserver working in parallel in full glory?
Any tips or directions?

The endian-handshake problem was solved in the thread vtkSocketCommunicator errors: "Could not receive tag" and "Endian handshake failed"