[Solved] Debugging multi-process pvserver issue relating to X display

paulmelis · October 30, 2020, 10:33am

I’m running ParaView 5.8.0 in client-server mode on a node with 4 GPUs (configured as separate X screens, see info below). This is under OpenMPI, using the following launch script:

#!/bin/sh
module load 2020 ParaView
export DISPLAY=:0.0

V=9
mpirun \
	--map-by node \
	-np 1 `which pvserver` -display :0:0 --force-offscreen-rendering -v=$V : \
	-np 1 `which pvserver` -display :0.1 --force-offscreen-rendering -v=$V : \
	-np 1 `which pvserver` -display :0.2 --force-offscreen-rendering -v=$V : \
	-np 1 `which pvserver` -display :0.3 --force-offscreen-rendering -v=$V

After connecting to the server from the GUI, it reports “Display is not accessible on the server side.
Remote rendering will be disabled.”, but I fail to see why. I have tried adding --disable-xdisplay-test to the MPI processes, but then it crashes as soon as remote rendering is used:

(  26.246s) [pvserver.1      ]    vtkPVRenderView.cxx:1505     9| .   .   use_lod=0, use_distributed_rendering=1, use_ordered_compositing=0
(  26.253s) [pvserver.0      ]vtkXOpenGLRenderWindow.:450    ERR| .   .   vtkXOpenGLRenderWindow (0xd233e0): bad X server connection. DISPLAY=:0:0. Aborting.


Loguru caught a signal: SIGABRT
Stack trace:
27            0x402d7a /sw/arch/Debian10/EB_production/2020/software/ParaView/5.8.0-foss-2020a-Python-3.8.2-mpi/bin/pvserver() [0x402d7a]
26      0x7f0d59dd409b __libc_start_main + 235
25            0x402c99 /sw/arch/Debian10/EB_production/2020/software/ParaView/5.8.0-foss-2020a-Python-3.8.2-mpi/bin/pvserver() [0x402c99]
24      0x7f0d58601ef1 vtkTCPNetworkAccessManager::ProcessEventsInternal(unsigned long, bool) + 609
23      0x7f0d583309a1 vtkMultiProcessController::ProcessRMIs(int, int) + 705
22      0x7f0d583301c0 vtkMultiProcessController::ProcessRMI(int, void*, int, int) + 272
21      0x7f0d58a0e494 vtkPVSessionServer::OnClientServerMessageRMI(void*, int) + 276
20      0x7f0d58a059d4 vtkPVSessionBase::ExecuteStream(unsigned int, vtkClientServerStream const&, bool) + 52
19      0x7f0d58a068a0 vtkPVSessionCore::ExecuteStream(unsigned int, vtkClientServerStream const&, bool) + 64
18      0x7f0d58a06a53 vtkPVSessionCore::ExecuteStreamInternal(vtkClientServerStream const&, bool) + 243
17      0x7f0d5851e4fd vtkClientServerInterpreter::ProcessStream(vtkClientServerStream const&) + 29
16      0x7f0d5851e037 vtkClientServerInterpreter::ProcessOneMessage(vtkClientServerStream const&, int) + 167
15      0x7f0d5851df35 vtkClientServerInterpreter::ProcessCommandInvoke(vtkClientServerStream const&, int) + 1189
14      0x7f0d5851d92d vtkClientServerInterpreter::CallCommandFunction(char const*, vtkObjectBase*, char const*, vtkClientServerStream const&, vtkClientServerStream&) + 477
13      0x7f0d58f4cef0 vtkPVRenderViewCommand(vtkClientServerInterpreter*, vtkObjectBase*, char const*, vtkClientServerStream const&, vtkClientServerStream&, void*) + 7040
12      0x7f0d569d42f2 vtkPVRenderView::StillRender() + 98
11      0x7f0d569de0d3 vtkPVRenderView::Render(bool, bool) + 1699
10      0x7f0d569f31d5 /sw/arch/Debian10/EB_production/2020/software/ParaView/5.8.0-foss-2020a-Python-3.8.2-mpi/lib/libvtkRemotingViews-pv5.8.so.1(+0x1dd1d5) [0x7f0d569f31d5]
9       0x7f0d569caa22 vtkPVProcessWindow::PrepareForRendering() + 34
8       0x7f0d543b92d0 vtkXOpenGLRenderWindow::Render() + 32
7       0x7f0d543296ee vtkOpenGLRenderWindow::Render() + 14
6       0x7f0d533e50b2 vtkRenderWindow::Render() + 178
5       0x7f0d543b8f5b vtkXOpenGLRenderWindow::Start() + 91
4       0x7f0d543b4a83 vtkXOpenGLRenderWindow::WindowInitialize() + 19
3       0x7f0d543b82e9 vtkXOpenGLRenderWindow::CreateAWindow() + 2825
2       0x7f0d59dd2535 abort + 289
1       0x7f0d59de77bb gsignal + 267
0       0x7f0d59de7840 /lib/x86_64-linux-gnu/libc.so.6(+0x37840) [0x7f0d59de7840]
(  26.254s) [pvserver.0      ]                       :0     FATL| Signal: SIGABRT

So apparently, as far as ParaView is concerned, there is indeed something going wrong when using OpenGL on display :0.0. But this doesn’t make sense, as display :0.0 is definitely available (as are all the other screens) and supports the required OpenGL version:

$ DISPLAY=:0.0 glxinfo -B
name of display: :0.0
display: :0  screen: 0
direct rendering: Yes
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 24576 MB
    Total available memory: 24576 MB
    Currently available dedicated video memory: 24118 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: TITAN RTX/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 450.80.02
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6.0 NVIDIA 450.80.02
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)

OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 450.80.02
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20



display: :0  screen: 1
direct rendering: Yes
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 24576 MB
    Total available memory: 24576 MB
    Currently available dedicated video memory: 24208 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: TITAN RTX/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 450.80.02
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6.0 NVIDIA 450.80.02
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)

OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 450.80.02
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20



display: :0  screen: 2
direct rendering: Yes
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 24576 MB
    Total available memory: 24576 MB
    Currently available dedicated video memory: 24208 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: TITAN RTX/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 450.80.02
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6.0 NVIDIA 450.80.02
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)

OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 450.80.02
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20



display: :0  screen: 3
direct rendering: Yes
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 24576 MB
    Total available memory: 24576 MB
    Currently available dedicated video memory: 24208 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: TITAN RTX/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 450.80.02
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6.0 NVIDIA 450.80.02
OpenGL shading language version string: 4.60 NVIDIA
OpenGL context flags: (none)
OpenGL profile mask: (none)

OpenGL ES profile version string: OpenGL ES 3.2 NVIDIA 450.80.02
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Also, if I start only a single pvserver process and let it use :0.0, all is well for each of these different sets of options:

paulm@r28n3 11:26 ~/datasets/coraltest/pv$ mpirun -np 1 `which pvserver` -display :0.0 --force-offscreen-rendering
Waiting for client...
Connection URL: cs://r28n3.lisa.surfsara.nl:11111
Accepting connection(s): r28n3.lisa.surfsara.nl:11111
Client connected.
^Cpaulm@r28n3 11:28 ~/datasets/coraltest/pv$ mpirun --map-by node -np 1 `which pvserver` -display :0.0 --force-offscreen-rendering
Waiting for client...
Connection URL: cs://r28n3.lisa.surfsara.nl:11111
Accepting connection(s): r28n3.lisa.surfsara.nl:11111
Client connected.
Exiting...
paulm@r28n3 11:28 ~/datasets/coraltest/pv$ mpirun --map-by node -np 1 `which pvserver` -display :0.0                            
Waiting for client...
Connection URL: cs://r28n3.lisa.surfsara.nl:11111
Accepting connection(s): r28n3.lisa.surfsara.nl:11111
Client connected.
Exiting...

In these 3 situations I can see remote rendering at work when I interact (after lowering the remote render threshold to 0).

Any suggestions on what to try in order to figure out what’s going on here?

paulmelis · October 30, 2020, 10:59am

Hmmm, something fishy is going on, will need to check further:

paulm@r28n3 11:59 ~/datasets/coraltest/pv$ cat run_glxinfo.sh 
#!/bin/sh
mpirun \
	--map-by node \
	-np 1 `which glxinfo` -display :0:0 : \
	-np 1 `which glxinfo` -display :0.1 : \
	-np 1 `which glxinfo` -display :0.2 : \
	-np 1 `which glxinfo` -display :0.3 
paulm@r28n3 11:59 ~/datasets/coraltest/pv$ ./run_glxinfo.sh 
Error: unable to open display :0:0
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[59014,1],0]
  Exit code:    255
--------------------------------------------------------------------------

So glxinfo also runs into the same issue.

paulmelis · October 30, 2020, 11:00am

Oh man, I used :0:0 instead of :0.0

paulmelis · October 30, 2020, 11:01am

Okay, that was it. That’s as subtle a typo as I’ve made in a long time, doh!
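
For anyone hitting the same thing: an X display name has the form [host]:display[.screen], so :0:0 is malformed while :0.0 is fine. Below is a minimal sketch of a pre-flight check that could be run on the display strings before handing them to mpirun. It assumes plain POSIX sh plus grep; the function name is_valid_x_display is made up for this example, and the pattern is deliberately simplified (it rejects IPv6 literals and DECnet-style host::display names). It only checks the syntax, not whether the display can actually be opened.

```sh
#!/bin/sh
# Hypothetical helper, not part of ParaView or X11.
# Returns 0 if the argument looks like [host]:display[.screen], 1 otherwise.
# Simplification: IPv6 literals and DECnet "host::display" names are rejected.
is_valid_x_display() {
    printf '%s\n' "$1" | grep -Eq '^[^:]*:[0-9]+(\.[0-9]+)?$'
}

# Check the four display strings from the launch script above,
# including the mistyped first one.
for d in :0:0 :0.1 :0.2 :0.3; do
    if is_valid_x_display "$d"; then
        echo "ok:  $d"
    else
        echo "bad: $d (expected [host]:display[.screen])"
    fi
done
```

With the display strings from the original script this flags :0:0 as bad and accepts the other three, which would have pointed at the typo before any pvserver process was started.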