cannot connect live to simulation on cluster

Rigel · February 25, 2020, 12:28pm

Hi,

I am having trouble connecting live to a simulation running on a Linux cluster. Say that I have a simulation job 9876 running on one of the compute nodes of the cluster, and a pvserver job 1234 running on another compute node. The pvserver being used is the headless (osmesa) 5.7.0, as downloaded from ParaView’s website. I start from a shell on my local machine:

local:~$ ssh -XL 11111:node1234.$SERVER:11111 alves@$SERVER

From this point on, I am able to connect my local ParaView GUI (version 5.7.0, as downloaded from ParaView’s website) to the pvserver running on compute node 1234 of the cluster. Then I run the following commands:

cluster:~> ssh node1234.$SERVER
cluster:~> ssh -R 22222:localhost:22222 node9876.$SERVER

But then the following line is printed on the shell, repeatedly, until the simulation finishes:

connect_to localhost port 22222: failed.

The ParaView GUI doesn’t crash, doesn’t disconnect from the pvserver, and doesn’t output any further message.

Could you help me figure out what is going on? The simulation is built with Catalyst 5.7.0 (built from source) and successfully generates the post-mortem VTK files.

Thank you very much,

utkarsh.ayachit · February 25, 2020, 12:50pm

Why is this needed? Can’t node9876 (assuming that’s the one running the simulation) directly connect back to node1234 (node running pvserver)? If so, you should just edit the Catalyst Python script to change the host name for the ParaView server to be node1234.

I am not entirely sure where this error is coming from. I cannot find any error message in the ParaView (or VTK) code base that begins with “connect_to”.

Rigel · February 25, 2020, 1:06pm

Thanks Utkarsh for your prompt reply. Indeed, it is possible to change the “localhost” in the adapter python script:

coprocessor.DoLiveVisualization(datadescription, "localhost", 22222)

to the name of the compute node, but we use Slurm on our cluster: every time you submit a job (be it the simulation or the pvserver), it goes to a different compute node. We don’t want to have to manually modify the Python script every time we want to see the simulation live. Furthermore, we would like to be able to start the pvserver after the simulation (which itself can last for days), connect to it live, see how it is going, disconnect, repeat the following day, etc. So we came up with the approach above, which worked for ParaView 5.5.2. We would like to upgrade everything related to ParaView / Catalyst, but it fails to connect live to the simulation with the message above (printed on the terminal, in the same shell where you run the ssh commands).
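As an illustration of avoiding a hand edit per Slurm job, here is a minimal sketch in which the adapter script reads the pvserver host and port from a small text file and falls back to the current hard-coded values. The helper name `live_endpoint`, the file name, and the `host:port` file format are all assumptions made for this example; none of them is part of ParaView or Catalyst. Whether Catalyst picks up a changed host on a later time step is not confirmed anywhere in this thread.

```python
def live_endpoint(path, default=("localhost", 22222)):
    """Read "host:port" from a text file; return `default` if that fails.

    Hypothetical helper: the file and its format are assumptions for
    this sketch, not something Catalyst defines.
    """
    try:
        with open(path) as f:
            text = f.read().strip()
    except OSError:
        # File missing or unreadable: keep the hard-coded behaviour.
        return default

    # Split on the last colon so the port is always the trailing field.
    host, sep, port = text.rpartition(":")
    if not sep or not host or not port.isdigit():
        return default
    return host, int(port)


# In the adapter script, the hard-coded call could then become
# (file name is hypothetical):
#   host, port = live_endpoint("pvserver_endpoint.txt")
#   coprocessor.DoLiveVisualization(datadescription, host, port)
```

The pvserver job script could write the file (for example, the node name followed by `:22222`) when it starts, so the simulation script itself would not need to change between jobs.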

utkarsh.ayachit · February 25, 2020, 1:26pm

I’d still debug this in steps. First, set the pvserver hostname explicitly in the Python script and make sure that that works. If so, then the issue is with the ssh -R … command.

FWIW, the pvserver will open a server-socket on port 22222 on node1234 and await connections back from the Catalyst script.
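For the step-by-step debugging, a quick way to check from the simulation node whether anything is accepting TCP connections on the pvserver node is a plain socket test. This is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name, and the host and port in the usage comment are the ones mentioned in this thread. The check is only meaningful once the pvserver is actually listening for Catalyst.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and opens the socket;
        # the `with` block closes it again immediately.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Example, run on the simulation node (node name taken from this thread):
#   print(can_connect("node1234", 22222))
```

If this returns False when pointed directly at the pvserver node, the problem lies before any ssh forwarding is involved; if it returns True there but False through the tunnel endpoint, the forwarding setup is the more likely suspect.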

Rigel · February 25, 2020, 2:31pm

Thanks Utkarsh. Doing it your way, I can see the simulation’s input appear in ParaView’s pipeline browser; but as soon as I click on it, the pvserver crashes and ParaView’s GUI closes with the following output:

( 380.357s) [paraview ]vtkSocketCommunicator.c:808 ERR| vtkSocketCommunicator (0x17159a0): Could not receive tag. 1

In turn, the following is printed on the shell where I started the pvserver:

Loguru caught a signal: SIGSEGV
Stack trace:
9             0x401799 /home/alves/ParaView/5.7.0/bin/pvserver() [0x401799]
8       0x2b7fe52f0545 __libc_start_main + 245
7             0x40171e /home/alves/ParaView/5.7.0/bin/pvserver() [0x40171e]
6       0x2b7fea493205 vtkMultiProcessController::ProcessRMIs(int, int) + 869
5       0x2b7fea492b3b vtkMultiProcessController::BroadcastProcessRMIs(int, int) + 123
4       0x2b7fea492990 vtkMultiProcessController::ProcessRMI(int, void*, int, int) + 336
3       0x2b7fe6b04830 /home/h9/alves/ParaView/5.7.0/bin/../lib/libvtkPVServerManagerCore-pv5.7.so.1(+0x7b830) [0x2b7fe6b04830]
2       0x2b7fe6b03bca vtkLiveInsituLink::OnInsituPostProcess(double, long long) + 74
1       0x2b7fe712beb4 vtkExtractsDeliveryHelper::Update() + 1396
0       0x2b7fe53043f0 /lib64/libc.so.6(+0x363f0) [0x2b7fe53043f0]
( 378.086s) [pvserver.1      ]                       :0     FATL| Signal: SIGSEGV

simulation.py (6.2 KB) is the python script being used.