Trying to figure out pvserver OpenGL issue

I’m using the ParaView 5.10.1 Linux binaries on a compute/render node with four NVIDIA A100 GPUs. The X server setup on the node is working correctly: I can use GPU-accelerated rendering in ParaView and e.g. Blender when running under VirtualGL, with both reporting an NVIDIA OpenGL device with the correct version, etc. The output of glxinfo for display :0.0 is also as expected:

name of display: :0.0
display: :0  screen: 0
direct rendering: Yes
server glx vendor string: NVIDIA Corporation
server glx version string: 1.4
server glx extensions:
    GLX_ARB_context_flush_control, GLX_ARB_create_context, 
    ...
client glx vendor string: NVIDIA Corporation, NVIDIA Corporation, NVIDIA Corporation, NVIDIA Corporation
client glx version string: 1.4
client glx extensions:
    GLX_ARB_context_flush_control, GLX_ARB_create_context, 
    ...
GLX version: 1.4
GLX extensions:
    GLX_ARB_context_flush_control, GLX_ARB_create_context, 
    GLX_ARB_create_context_no_error, GLX_ARB_create_context_profile, 
Memory info (GL_NVX_gpu_memory_info):
    Dedicated video memory: 40960 MB
    Total available memory: 40960 MB
    Currently available dedicated video memory: 40360 MB
OpenGL vendor string: NVIDIA Corporation
OpenGL renderer string: NVIDIA A100-SXM4-40GB/PCIe/SSE2
OpenGL core profile version string: 4.6.0 NVIDIA 515.43.04
OpenGL core profile shading language version string: 4.60 NVIDIA
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

But when I run pvserver on :0.0 it fails roughly 35 seconds after the client connects, and the client is unresponsive the whole time:

snellius paulm@gcn9 10:16 ~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvserver
Waiting for client...
Connection URL: cs://gcn9:11111
Accepting connection(s): gcn9:11111
Client connected.
<roughly 35 seconds>
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  150 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  89
  Current serial number in output stream:  90

Using --force-offscreen-rendering does not make a difference (in case this is an issue with no monitor attached to the node, even though fake EDID info is set explicitly in the xorg.conf). It sounds like something is going wrong when setting up the OpenGL context. To figure out further what’s going on, I’m using a few snippets of Python found on the forum here to report OpenGL info:

snellius paulm@gcn9 10:14 ~$ cat pv_opengl_info.py 
# https://discourse.paraview.org/t/is-egl-enabled-in-any-of-the-kitware-paraview-binaries/5247
from paraview.simple import *
o = GetOpenGLInformation()
print('vendor   :', o.GetVendor())
print('version  :', o.GetVersion())
print('renderer :', o.GetRenderer())
#print('capabilities', o.GetCapabilities())

from paraview.modules.vtkRemotingViews import *
renInfo = vtkPVRenderingCapabilitiesInformation()
renInfo.GetCapabilities()
renInfo.CopyFromObject(None)
print('headless rendering using EGL:', renInfo.Supports(vtkPVRenderingCapabilitiesInformation.HEADLESS_RENDERING_USES_EGL))

Running this under different conditions gives:

# Directly on the X server: FAILS, same error as above
snellius paulm@gcn9 10:16 ~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython pv_opengl_info.py 
vendor   : NVIDIA Corporation
version  : 4.5.0 NVIDIA 515.43.04
renderer : NVIDIA A100-SXM4-40GB/PCIe/SSE2
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  150 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  89
  Current serial number in output stream:  90

# Under VirtualGL: SUCCESS
snellius paulm@gcn9 10:25 ~$ vglrun ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython pv_opengl_info.py 
vendor   : NVIDIA Corporation
version  : 4.5.0 NVIDIA 515.43.04
renderer : NVIDIA A100-SXM4-40GB/PCIe/SSE2
headless rendering using EGL: False

So this appears to be a case where direct access to the X server is causing issues, but I’m at a loss as to what is going on. I tried increasing the pvserver verbosity to 9, but that doesn’t really give me anything useful:

(  43.004s) [pvserver        ]    vtkPVRenderView.cxx:1301     9| { RenderView1: Update
(  43.004s) [pvserver        ]          vtkPVView.cxx:445      9| .   { RenderView1: update view
(  43.004s) [pvserver        ]          vtkPVView.cxx:445      9| .   } 0.000 s: RenderView1: update view
(  43.004s) [pvserver        ]          vtkPVView.cxx:769      9| .   { all-reduce (op=2)
(  43.012s) [pvserver        ]          vtkPVView.cxx:838      9| .   .   source=0, result=0
(  43.012s) [pvserver        ]          vtkPVView.cxx:769      9| .   } 0.008 s: all-reduce (op=2)
(  43.012s) [pvserver        ]          vtkPVView.cxx:661      9| .   { all-reduce-bounds
(  43.020s) [pvserver        ]          vtkPVView.cxx:760      9| .   .   source=(invalid), result=(invalid)
(  43.020s) [pvserver        ]          vtkPVView.cxx:661      9| .   } 0.008 s: all-reduce-bounds
(  43.020s) [pvserver        ]    vtkPVRenderView.cxx:1301     9| } 0.016 s: RenderView1: Update
(  43.020s) [pvserver        ]    vtkPVRenderView.cxx:1451     9| { RenderView1: InteractiveRender
(  43.020s) [pvserver        ]    vtkPVRenderView.cxx:1475     9| .   { Render(interactive=true, skip_rendering=false)
(  43.020s) [pvserver        ]    vtkPVRenderView.cxx:1555     9| .   .   use_lod=0, use_distributed_rendering=1, use_ordered_compositing=0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  150 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  89
  Current serial number in output stream:  90

A further test, running pvserver under VirtualGL (to see if that could be a workaround), does work, but making the client-server connection still takes a very long time, much longer than I’m used to from earlier ParaView usage (might be the VPN, though). So I still feel something is off. Also, the VirtualGL test is a bit iffy, in that (due to our modules system and EasyBuild software stack) it also loads our Mesa module, which might have some weird interactions.

Any ideas what’s going on here, or where to look further?

P.S. Apparently you can trigger a segfault in pvpython by (mistakenly) passing an unknown option:

snellius paulm@gcn9 10:26 ~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython --verbose=9 pv_opengl_info.py 
unknown option --verbose=9
usage: /gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/../lib/vtkpython [option] ... [-c cmd | -m mod | file | -] [arg] ...
Try `python -h' for more information.

Loguru caught a signal: SIGSEGV
Stack trace:
9             0x401f7f /gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython-real() [0x401f7f]
8       0x14e249092493 __libc_start_main + 243
7             0x402614 /gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython-real() [0x402614]
6       0x14e247c8199d vtkInitializationHelper::Finalize() + 93
5       0x14e246bb6ec2 vtkProcessModule::Finalize() + 258
4       0x14e243e4456a Py_FinalizeEx + 250
3       0x14e243e47682 _PyThreadState_DeleteExcept + 34
2       0x14e243e5b82e PyThread_acquire_lock_timed + 430
1       0x14e242ee8d1d sem_wait + 13
0       0x14e2490a6400 /lib64/libc.so.6(+0x37400) [0x14e2490a6400]
(   0.072s) [paraview        ]                       :0     FATL| Signal: SIGSEGV
error: exception occurred: Segmentation fault

Edit: removed a test output, as I was running under TurboVNC, which (confusingly) these days provides GLX through Mesa.

I tried to find out what parameters get passed to glXCreateContext, to see if that shows why it fails, but when running under apitrace it actually calls glXCreateContextAttribsARB and then doesn’t fail…

// process.name = "/gpfs/home4/paulm/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/paraview-real"
0 glXChooseFBConfig(dpy = 0x1515f390, screen = 0, attribList = {GLX_DRAWABLE_TYPE, GLX_WINDOW_BIT, GLX_RENDER_TYPE, GLX_RGBA_BIT, GLX_RED_SIZE, 1, GLX_GREEN_SIZE, 1, GLX_BLUE_SIZE, 1, GLX_DEPTH_SIZE, 1, GLX_ALPHA_SIZE, 1, GLX_DOUBLEBUFFER, True, 0}, nitems = &30) = {0x154ad100, 0x15f05980, 0x15f059a0, 0x15f059c0, 0x15f05a20, 0x15f05a80, 0x15f05ae0, 0x15f05b40, 0x154ac7b0, 0x154ac7d0, 0x15f05300, 0x15f05320, 0x15285800, 0x15285820, 0x15f05480, 0x15f054a0, 0x152bd350, 0x152bd370, 0x15f053c0, 0x15f053e0, 0x154ac770, 0x154ac790, 0x15f05540, 0x15f055a0, 0x15f05600, 0x15f05660, 0x15f056c0, 0x15f05720, 0x15f05780, 0x15f057a0}
1 glXGetVisualFromFBConfig(dpy = 0x1515f390, config = 0x154ad100) = &{visual = 0x155191e0, visualid = 40, screen = 0, depth = 24, c_class = 4, red_mask = 16711680, green_mask = 65280, blue_mask = 255, colormap_size = 256, bits_per_rgb = 11}
4 glXCreateContextAttribsARB(dpy = 0x1515f390, config = 0x154ad100, share_context = NULL, direct = True, attrib_list = {GLX_CONTEXT_MAJOR_VERSION_ARB, 4, GLX_CONTEXT_MINOR_VERSION_ARB, 5, 0}) = 0x1601cc28
6 glXMakeCurrent(dpy = 0x1515f390, drawable = 4194309, ctx = 0x1601cc28) = True
7 glViewport(x = 0, y = 0, width = 300, height = 300) // fake
8 glScissor(x = 0, y = 0, width = 300, height = 300) // fake
12 glGetIntegerv(pname = GL_NUM_EXTENSIONS, params = &405)
2501 glGetFloatv(pname = GL_ALIASED_LINE_WIDTH_RANGE, params = {1, 10})
2503 glGetIntegerv(pname = GL_MAX_TEXTURE_IMAGE_UNITS, params = &32)
2505 glEnable(cap = GL_BLEND)
2506 glDisable(cap = GL_CULL_FACE)
2507 glEnable(cap = GL_DEPTH_TEST)
2508 glDisable(cap = GL_LINE_SMOOTH)
2510 glDisable(cap = GL_STENCIL_TEST)
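
To poke at the legacy entry point in isolation (i.e. outside of ParaView, and without apitrace in between), something like the following rough ctypes sketch could be used; the constants and call signatures are taken from glx.h, so treat this as illustrative rather than a polished test:

# glx_legacy_test.py - call the legacy glXCreateContext directly on $DISPLAY
import ctypes
from ctypes.util import find_library

x11 = ctypes.CDLL(find_library('X11') or 'libX11.so.6')
gl  = ctypes.CDLL(find_library('GL') or 'libGL.so.1')

x11.XOpenDisplay.restype = ctypes.c_void_p
x11.XOpenDisplay.argtypes = [ctypes.c_char_p]
x11.XSync.argtypes = [ctypes.c_void_p, ctypes.c_int]
dpy = x11.XOpenDisplay(None)            # honours $DISPLAY, e.g. :0.0
assert dpy, 'could not open display'

# glXChooseVisual(dpy, screen, attribs); GLX_RGBA=4, GLX_DOUBLEBUFFER=5, GLX_DEPTH_SIZE=12
attribs = (ctypes.c_int * 6)(4, 12, 24, 5, 0, 0)
gl.glXChooseVisual.restype = ctypes.c_void_p
gl.glXChooseVisual.argtypes = [ctypes.c_void_p, ctypes.c_int, ctypes.POINTER(ctypes.c_int)]
vis = gl.glXChooseVisual(dpy, 0, attribs)
assert vis, 'no suitable visual found'

# The legacy call: glXCreateContext(dpy, vis, shareList, direct)
gl.glXCreateContext.restype = ctypes.c_void_p
gl.glXCreateContext.argtypes = [ctypes.c_void_p, ctypes.c_void_p, ctypes.c_void_p, ctypes.c_int]
ctx = gl.glXCreateContext(dpy, vis, None, 1)
x11.XSync(dpy, 0)                       # flush, so any X error is reported here
print('glXCreateContext ->', hex(ctx) if ctx else None)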

Okay, to make sure this isn’t some issue with the official binaries, I did a superbuild (config below) to compare with. With the superbuild pvpython I get the same X error when running on one of our GPU nodes:

# Test script
snellius paulm@gcn8 12:40 ~/c/paraview-superbuild-build/install/bin$ cat ~/pv_cone_shrink.py
from paraview.simple import *
Cone()
SetProperties(Resolution=32)
Shrink() 
Show() 
Render() 
WriteImage('cone_shrink.png')

# X error
snellius paulm@gcn8 12:30 ~/c/paraview-superbuild-build/install/bin$ DISPLAY=:0.0 ./pvpython --force-offscreen-rendering ~/pv_cone_shrink.py 
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  150 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  90
  Current serial number in output stream:  91

Build configuration:

#!/bin/bash
cmake \
    -DCMAKE_BUILD_TYPE=Release \
    -DCMAKE_INSTALL_PREFIX=$HOME/software/paraview-superbuild-5.10.1 \
    -DPARAVIEW_BUILD_EDITION=RENDERING \
    -DENABLE_hdf5=ON \
    -DENABLE_qt5=ON \
    -DENABLE_python3=ON \
    -GNinja \
    ../paraview-superbuild-git

The ENABLE_qt5 and ENABLE_hdf5 options are probably not needed, but I had planned to do some other tests as well. Plus, I’m still trying to figure out how to get the GUI to build.

More data points: trying the cone rendering test (DISPLAY=:0.0 pvpython --force-offscreen-rendering pv_cone_shrink.py) with the official binaries on a different HPC system gives the same error. So I’m seeing this on:

  • Snellius GPU node gcn8: RHEL 8.4, NVIDIA A100 (driver 515.43.04), X.org 1.20.10
  • Lisa GPU node r34n7: Debian 10, NVIDIA Titan RTX (driver 470.103.01), X.org 1.20.4

But for some reason I cannot reproduce the issue on my local workstations:

  • Juggle workstation, Arch Linux, NVIDIA Geforce GTX970 (driver 515.57), X.org 1.21.1.4
  • Laptop, Arch Linux, NVIDIA Quadro T2000 (driver 515.65.01), X.org 1.21.1.4

Apart from differences in the Linux distro, one difference could be the lack of a physical display attached to the GPU nodes. We fake a connected monitor using an EDID file of a high-res Dell monitor in the X.org config:

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "ConnectedMonitor" "DFP-0"
    Option         "CustomEDID" "DFP-0: /etc/X11/dell-3008wfp.bin"
    SubSection     "Display"
        Depth       24
        Modes      "2560x1600"
    EndSubSection
EndSection

This allows several high resolutions to be used by applications that aren’t aware they’re not actually outputting to a real display:

snellius paulm@gcn8 13:17 ~/c/paraview-superbuild-build$ DISPLAY=:0.0 xrandr -q
Screen 0: minimum 8 x 8, current 2560 x 1600, maximum 2560 x 1600
DVI-D-0 connected primary 2560x1600+0+0 (normal left inverted right x axis y axis) 641mm x 400mm
   2560x1600     59.86*+  59.99    59.97  
   2560x1440     59.99    59.99    59.96    59.95  
   2048x1536     85.00    75.00    60.00  
   2048x1152     59.99    59.98    59.91    59.90  
   1920x1440     85.00    75.00    60.00  
   1920x1200     59.95    59.88  
   1920x1080     60.01    59.97    59.96    59.93  
   1856x1392     75.00    60.01  
   1792x1344     75.00    60.01  
   1680x1050     59.95    59.88  
   1600x1200     85.00    75.00    70.00    65.00    60.00  
   1600x900      59.99    59.94    59.95    59.82  
   1440x810      60.00    59.97  
   1400x1050     74.76    59.98  
   1400x900      59.96    59.88  
...

Hmmm, maybe that EDID info is somehow causing issues. If I copy the Screen section, including the custom EDID file, to my laptop’s X config, I indeed get the same error there:

paulm@l0420007 13:30:~$ DISPLAY=:0.0 ~/software/ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvpython --force-offscreen-rendering ~/pv_cone_shrink.py 
VisRTX 0.1.6, using devices:
 0: Quadro T2000 with Max-Q Design (Total: 4.1 GB, Available: 4.0 GB)
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  152 (GLX)
  Minor opcode of failed request:  3 (X_GLXCreateContext)
  Value in failed request:  0x0
  Serial number of failed request:  75
  Current serial number in output stream:  76

Testing some more with leaving out the Display subsection, the ConnectedMonitor option and/or the CustomEDID option shows that none of those are the problem. If I leave them all out, i.e. go back to the original X config, and simply log into the laptop remotely (which I needed to do to get around the attached monitor), ParaView’s pvpython already has trouble getting an OpenGL context.

More tests; I still don’t have a clear picture (pun intended). I’m only testing on my Arch Linux laptop with the official 5.10.1 binaries here:

  • When logged in through lightdm (i.e. a regular desktop session), running DISPLAY=:0.0 pvpython --force-offscreen-rendering pv_cone_shrink.py in a terminal window succeeds, as expected.
  • While leaving that lightdm session active, logging in over SSH and running the script also succeeds, interestingly.
  • After stopping lightdm, logging in on a console and starting X as a regular user (no image on the laptop display), running pvpython through SSH gives the X error. Comparing what the two sessions see is probably the next step, see the sketch below.
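
A quick way to compare what the working and failing sessions actually see is dumping the X-related environment on both sides; a minimal sketch (the variable names are just the usual suspects):

# Dump the X-related environment; run in both the desktop and SSH sessions
import os
for var in ('DISPLAY', 'XAUTHORITY', 'XDG_RUNTIME_DIR', 'XDG_SESSION_TYPE'):
    print(f'{var:16s} = {os.environ.get(var, "<unset>")}')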

This is starting to feel more and more like some kind of authentication/permissions error. The strange thing I don’t understand is why only ParaView seems to have this issue. Running some other OpenGL applications directly on :0.0 gives no errors, e.g. glxinfo, glxgears, Blender (although Blender does get stuck when trying to exit after rendering), or the VTK example here.

Even paraview --script manages to execute the cone render script and write the output image (but it also gets stuck on exit), while pvbatch, pvpython and pvserver all give the X error.

Btw, on a Snellius GPU node paraview --script also gets stuck on exit with the cone (and OpenGL info) scripts, so at least that’s consistent.

@utkarsh.ayachit Sorry to ping you directly, but would you have any guidance on figuring out what’s going wrong here?

Hi @paulmelis

Why use VirtualGL when you could use the EGL binary release?

Best,

The vglrun test above was merely to see if that also showed the X error (and I was surprised to see that the error did not occur in that case). It seems the non-GUI ParaView binaries do something funky during OpenGL context creation, as I see three different connections to the X server being made when tracing the calls.

Regarding EGL, we haven’t set up the necessary permissions on the relevant device files (e.g. /dev/dri/card?) on the GPU nodes yet. I still need to investigate whether there are any security issues there when multiple users share the same GPU node. Plus, it makes using the correct device file slightly more involved, as those files can’t be hidden with cgroups (on a shared node we only hand out the GPU devices you allocate through SLURM).
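
For what it’s worth, the permissions check I have in mind is nothing more than listing ownership and mode of the DRI device nodes that EGL would need to open; a minimal sketch, assuming the standard /dev/dri layout:

# List ownership/permissions of the DRI device nodes (EGL needs to open these)
import glob, os, stat
for dev in sorted(glob.glob('/dev/dri/*')):
    st = os.stat(dev)
    print(f'{dev:22s} {stat.filemode(st.st_mode)}  uid={st.st_uid}  gid={st.st_gid}')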

Tbh, ParaView is not tested in a VirtualGL context afaik ( @ben.boeckel ), so you may be in uncharted territory.

EGL is the preferred way to handle this use case.

@paulmelis
We also use ParaView in a VGL context. Unfortunately, I don’t have a lot more to contribute, other than to say we’re interested in the same sorts of problems.

In our case, we got “display not accessible” messages, even though we were launching pvserver from an OSMesa build. Our working theory was that running the ParaView client through VGL was causing something odd to be inherited by the spawned pvserver processes when launched from within the client. We haven’t yet figured out a resolution.