Reverse client-server connection fails with more than one MPI process

1. MPI compatibility

Following the suggestion of @Dave_DeMarle, I have made sure that both the client and the server are built with the flag USE_SYSTEM_mpi=ON, so that when I launch the command

mpirun -np 4 $PV570DIR/bin/pvserver --hostname=$(hostname -a) -rc --client-host=[client-url] --disable-xdisplay-test --timeout=1

the mpirun version at run time is the same as at compile time. Also, client and server draw on the same pool of programs and shared libraries.


2. cmake settings

In one setting the external cache variables in CMakeCache.txt are identical on client and server. In another setting they differ because the client is built with qt5 and the server with egl. Problems occur in both settings.


3. setting qt5+egl with --disable-xdisplay-test

3.1 server starts

I receive the following messages

Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
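The message pair appears four times, once per process started by mpirun -np 4. A minimal Python sketch for counting these announcements in captured server output is shown here; the helper name count_connect_attempts and the SERVER_LOG sample are illustrative assumptions, not part of ParaView. A count equal to the number of ranks may hint that every process tries to open its own connection, but that interpretation is a hypothesis and is not confirmed anywhere in this post.

```python
import re

# Sample server output copied from the log above (four identical pairs).
SERVER_LOG = """\
Connecting to client (reverse connection requested)…
Connection URL: csrc://[client-url]:11111
""" * 4


def count_connect_attempts(log_text: str) -> int:
    """Count lines that announce a reverse-connection attempt."""
    pattern = re.compile(
        r"^Connecting to client \(reverse connection requested\)",
        re.MULTILINE,
    )
    return len(pattern.findall(log_text))


attempts = count_connect_attempts(SERVER_LOG)
print(f"reverse-connection attempts announced: {attempts}")
```

Running the same count on the output of a single-process pvserver would give a baseline to compare against.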

3.2 client connects

At the client side I see the notifications

Accepting connection(s): [client-hostname]:11111
Accepting connection(s): [client-hostname]:11111

At the server side

( 26.150s) [pvserver ]vtkSocketCommunicator.c:808 ERR| vtkSocketCommunicator (0x56537cb0bff0): Could not receive tag. 1010580540
( 26.150s) [pvserver ]vtkSocketCommunicator.c:557 ERR| vtkSocketCommunicator (0x56537cb0bff0): Endian handshake failed.
Client connected.
( 26.150s) [pvserver ]vtkTCPNetworkAccessMana:333 ERR| vtkTCPNetworkAccessManager (0x56537c975b60):

Connection failed during handshake. vtkSocketCommunicator::GetVersion()
returns different values on the two connecting processes
(Current value: 100).
Exiting…

So one connection works and a second one exits. The remaining two processes never connect and keep cycling through the three messages below (source lines 472, 51 and 396)

( 25.176s) [pvserver ] vtkSocket.cxx:472 ERR| vtkClientSocket (0x5622bf734610): Socket error in call to connect. Connection refused.
( 25.176s) [pvserver ] vtkClientSocket.cxx:51 ERR| vtkClientSocket (0x5622bf734610): Failed to connect to server [client-url]:11111
( 25.176s) [pvserver ]vtkTCPNetworkAccessMana:396 WARN| vtkTCPNetworkAccessManager (0x5622becadb60): Connect failed. Retrying for 34.9704 more seconds.

until they time out, while the client remains usable.
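The integer in the "Could not receive tag" line can be inspected as raw bytes. The sketch below is only an aid for reading the log, and the variable names are illustrative. The value turns out to be the same byte repeated four times, which would be consistent with a fixed handshake constant rather than corrupted payload data; that reading is an assumption and is not confirmed in this post.

```python
# Value printed in the "Could not receive tag" line of the server log.
TAG = 1010580540

# Show the value as 32-bit hex and as raw bytes in both byte orders.
tag_hex = f"0x{TAG:08X}"
tag_bytes_big = TAG.to_bytes(4, byteorder="big")
tag_bytes_little = TAG.to_bytes(4, byteorder="little")

# A value that reads the same in both byte orders is byte-symmetric.
is_byte_symmetric = tag_bytes_big == tag_bytes_little

print(tag_hex)
print(tag_bytes_big, tag_bytes_little)
print("byte-symmetric:", is_byte_symmetric)
```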

3.3 client works

I can work on the client side and load data from the server side. On the server only one pvserver process is doing work, so it is effectively a sequential job.


4. setting qt5+egl without --disable-xdisplay-test

At the point of connection I have 1 working connection (hence a pvserver process), 3 rapid exits and 0 timeouts. The threefold exit message is again

Connection failed during handshake. vtkSocketCommunicator::GetVersion() returns different values on the two connecting processes (Current value: 100).

and pvserver works sequentially as in Sec. 3.3.


5. setting qt5+qt5 with --disable-xdisplay-test

In this case I also set the cache variable Qt5_DIR to where I could locate the file sb-qt5-configure.cmake.

5.1 server starts

At the server side I see that pgrep pvserver lists four process IDs.

5.2 client connects

However, of the four processes only one connects to the client and the other three keep on trying.

5.3 client does not work

When I try to load data from the server the connection at the server side crashes with the message

X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 154 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 52
Current serial number in output stream: 53

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.

with three SIGTERMs and finally with

mpirun detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:
Process name: [[64131,1],3]
Exit code: 1

These errors are why I moved to the qt5+egl configuration, see the top post.


6. setting qt5+qt5 without --disable-xdisplay-test

One connection works. Another one exits with the same connection-failed-during-handshake message as in Sec. 4, and then the server crashes with the X error, three SIGTERMs and the MPI job termination as in Sec. 5.3.


7. similar issues reported?


8. questions

Where could the problem be?
How can I get pvserver working in parallel in its full glory?
Any tips or directions?