I’m using the 5.10.1 Linux binaries on a compute/render node with four NVIDIA A100 GPUs. The X server setup on the node is working correctly: I can use GPU-accelerated rendering in ParaView and e.g. Blender when running under VirtualGL, with both reporting an NVIDIA OpenGL device with the correct version, etc. The output of glxinfo for display :0.0 is also as expected:
name of display: :0.0
display: :0 screen: 0
direct rendering: Yes
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
server glx extensions:
GLX_ARB_context_flush_control, GLX_ARB_create_context,
...
client glx vendor string: NVIDIA Corporation, NVIDIA Corporation, NVIDIA Corporation, NVIDIA Corporation
client glx version string: 1.4
client glx extensions:
GLX_ARB_context_flush_control, GLX_ARB_create_context,
...
GLX version: 1.4
GLX extensions:
GLX_ARB_context_flush_control, GLX_ARB_create_context,
GLX_ARB_create_context_no_error, GLX_ARB_create_context_profile,
Memory info (GL_NVX_gpu_memory_info):
Dedicated video memory: 40960 MB
Total available memory: 40960 MB
Currently available dedicated video memory: 40360 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: NVIDIA A100-SXM4-40GB/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 515.43.04
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
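As a sanity check on output like the above, the server and client GLX vendor strings can be compared programmatically. A minimal sketch (fed a captured glxinfo excerpt rather than running glxinfo itself) that would flag a Mesa/NVIDIA mismatch of the kind that causes GLX trouble:

```python
# Sketch: compare server vs. client GLX vendor strings from glxinfo output.
# SAMPLE is a captured excerpt; in practice you would feed in real glxinfo output.

SAMPLE = """\
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
client glx vendor string: NVIDIA Corporation, NVIDIA Corporation
client glx version string: 1.4
"""

def glx_vendors(text):
    """Return (server_vendor, client_vendor) parsed from glxinfo output."""
    vendors = {}
    for line in text.splitlines():
        for side in ("server", "client"):
            prefix = f"{side} glx vendor string:"
            if line.startswith(prefix):
                vendors[side] = line[len(prefix):].strip()
    return vendors.get("server"), vendors.get("client")

server, client = glx_vendors(SAMPLE)
print("server:", server)
print("client:", client)
# A client vendor of e.g. "Mesa Project and SGI" while the server reports
# NVIDIA would suggest libGL dispatch is going through Mesa rather than the
# NVIDIA client library.
```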
But when I run pvserver on :0.0, it fails quite some time (roughly 35 seconds) after the client has connected, and the client is unresponsive the whole time:
snellius paulm@gcn9 10:16 ~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvserver
Waiting for client...
Connection URL: cs://gcn9:11111
Accepting connection(s): gcn9:11111
Client connected.
<roughly 35 seconds>
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 150 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 89
Current serial number in output stream: 90
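For reference, the minor opcode in these GLX errors maps to a protocol request. A small lookup sketch (request names taken from the GLX protocol headers; the major opcode, 150 here, is assigned per-server when the extension loads, so only the minor opcode identifies the failing request):

```python
# Sketch: decode the GLX minor opcode from an X error report.
# Request names as in glxproto.h (first few entries only).
GLX_REQUESTS = {
    1: "X_GLXRender",
    2: "X_GLXRenderLarge",
    3: "X_GLXCreateContext",
    4: "X_GLXDestroyContext",
    5: "X_GLXMakeCurrent",
    6: "X_GLXIsDirect",
    7: "X_GLXQueryVersion",
}

def describe(minor_opcode, value):
    """Map an X error's GLX minor opcode and failing value to a readable string."""
    name = GLX_REQUESTS.get(minor_opcode, "unknown")
    return f"{name}, failing value {value:#x}"

# The error above: minor opcode 3, value 0x0 -- i.e. GLXCreateContext was
# handed a zero, which looks like an invalid/unresolved visual or fbconfig.
print(describe(3, 0x0))
```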
Using --force-offscreen-rendering does not make a difference (in case this was an issue with no monitor attached to the node, even though fake EDID info is set explicitly in the xorg.conf). It sounds like something is going wrong when setting up the OpenGL context. To figure out further what is going on, I’m using a few snippets of Python found on this forum to report OpenGL info:
snellius paulm@gcn9 10:14 ~$ cat pv_opengl_info.py
# https://discourse.paraview.org/t/is-egl-enabled-in-any-of-the-kitware-paraview-binaries/5247
from paraview.simple import *
o = GetOpenGLInformation()
print('vendor :', o.GetVendor())
print('version :', o.GetVersion())
print('renderer :', o.GetRenderer())
#print('capabilities', o.GetCapabilities())
from paraview.modules.vtkRemotingViews import *
renInfo = vtkPVRenderingCapabilitiesInformation()
renInfo.GetCapabilities()
renInfo.CopyFromObject(None)
print('headless rendering using EGL:', renInfo.Supports(vtkPVRenderingCapabilitiesInformation.HEADLESS_RENDERING_USES_EGL))
Running this under different conditions gives:
# Directly on the X server: FAILS, same error as above
snellius paulm@gcn9 10:16 ~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython pv_opengl_info.py
vendor : NVIDIA Corporation
version : 4.5.0 NVIDIA 515.43.04
renderer : NVIDIA A100-SXM4-40GB/PCIe/SSE2
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 150 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 89
Current serial number in output stream: 90
# Under VirtualGL: SUCCESS
snellius paulm@gcn9 10:25 ~$ vglrun ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython pv_opengl_info.py
vendor : NVIDIA Corporation
version : 4.5.0 NVIDIA 515.43.04
renderer : NVIDIA A100-SXM4-40GB/PCIe/SSE2
headless rendering using EGL: False
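The EGL: False result is expected here, since these are the GLX-based Linux binaries (the thread linked in the script discusses this). As a side check, one can at least probe whether a system libEGL is resolvable at all, independent of ParaView; a minimal sketch using only the dynamic-linker lookup:

```python
# Sketch: check which GL-related libraries the dynamic linker can resolve.
# This only probes library availability on the system; it says nothing about
# whether the ParaView build in use was actually compiled with EGL support.
from ctypes.util import find_library

for name in ("EGL", "OpenGL", "GLX", "GL"):
    path = find_library(name)
    print(f"lib{name}:", path if path else "not found")
```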
So this appears to be a case where direct access to the X server causes issues, but I’m at a loss as to what is going on. I tried increasing the verbosity of pvserver to 9, but that doesn’t really give me anything useful:
( 43.004s) [pvserver ] vtkPVRenderView.cxx:1301 9| { RenderView1: Update
( 43.004s) [pvserver ] vtkPVView.cxx:445 9| . { RenderView1: update view
( 43.004s) [pvserver ] vtkPVView.cxx:445 9| . } 0.000 s: RenderView1: update view
( 43.004s) [pvserver ] vtkPVView.cxx:769 9| . { all-reduce (op=2)
( 43.012s) [pvserver ] vtkPVView.cxx:838 9| . . source=0, result=0
( 43.012s) [pvserver ] vtkPVView.cxx:769 9| . } 0.008 s: all-reduce (op=2)
( 43.012s) [pvserver ] vtkPVView.cxx:661 9| . { all-reduce-bounds
( 43.020s) [pvserver ] vtkPVView.cxx:760 9| . . source=(invalid), result=(invalid)
( 43.020s) [pvserver ] vtkPVView.cxx:661 9| . } 0.008 s: all-reduce-bounds
( 43.020s) [pvserver ] vtkPVRenderView.cxx:1301 9| } 0.016 s: RenderView1: Update
( 43.020s) [pvserver ] vtkPVRenderView.cxx:1451 9| { RenderView1: InteractiveRender
( 43.020s) [pvserver ] vtkPVRenderView.cxx:1475 9| . { Render(interactive=true, skip_rendering=false)
( 43.020s) [pvserver ] vtkPVRenderView.cxx:1555 9| . . use_lod=0, use_distributed_rendering=1, use_ordered_compositing=0
X Error of failed request: BadValue (integer parameter out of range for operation)
Major opcode of failed request: 150 (GLX)
Minor opcode of failed request: 3 (X_GLXCreateContext)
Value in failed request: 0x0
Serial number of failed request: 89
Current serial number in output stream: 90
A further test, running pvserver under VirtualGL (to see if that could be a workaround), does work, but establishing the client-server connection still takes much longer than I’m used to from earlier ParaView usage (although that might be the VPN). So I still feel something is off. Also, the VirtualGL test is a bit shaky, in that (due to our modules system and EasyBuild software stack) it also loads our Mesa module, which might have some weird interactions.
Any ideas what’s going on here, or where to look further?
P.S. Apparently you can trigger a segfault in pvpython by (mistakenly) passing an unknown option:
snellius paulm@gcn9 10:26 ~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython --verbose=9 pv_opengl_info.py
unknown option --verbose=9
usage: /gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/../lib/vtkpython [option] ... [-c cmd | -m mod | file | -] [arg] ...
Try `python -h' for more information.
Loguru caught a signal: SIGSEGV
Stack trace:
9 0x401f7f /gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython-real() [0x401f7f]
8 0x14e249092493 __libc_start_main + 243
7 0x402614 /gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython-real() [0x402614]
6 0x14e247c8199d vtkInitializationHelper::Finalize() + 93
5 0x14e246bb6ec2 vtkProcessModule::Finalize() + 258
4 0x14e243e4456a Py_FinalizeEx + 250
3 0x14e243e47682 _PyThreadState_DeleteExcept + 34
2 0x14e243e5b82e PyThread_acquire_lock_timed + 430
1 0x14e242ee8d1d sem_wait + 13
0 0x14e2490a6400 /lib64/libc.so.6(+0x37400) [0x14e2490a6400]
( 0.072s) [paraview ] :0 FATL| Signal: SIGSEGV
error: exception occurred: Segmentation fault
Edit: removed a test output, as I was running under TurboVNC, which (confusingly) these days provides GLX through Mesa.