pvserver nvidia docker - connection issue

I am trying to troubleshoot the following error and would appreciate suggestions on where to start.

I have ensured that the OS is the same on both the client and the server (Ubuntu 22.04).

I have ensured that both paraview and pvserver are the same version, 5.11.0.

Thank you in advance.

Status: Downloaded newer image for nvcr.io/nvidia-hpcvis/paraview:egl-py3-5.11.0
level=info msg="Setting CUDA_MODULE_LOADING=LAZY. Please see https://docs.nvidia.com for more information."
(   0.031s) [pvserver        ]SurfaceLICPluginAutoloa:25    WARN| SurfaceLIC is now built-in ParaView. There is no need to load this plugin.
-----------------------------------------------------
  By loading the 'pvNVIDIAIndeX' plugin you have accepted the EULA shipped with it.
  If that is not acceptable, please restart the application without loading 
  the 'pvNVIDIAIndeX' plugin.
-----------------------------------------------------
Waiting for client...
Connection URL: cs://1857fc7c1e13:11111
Accepting connection(s): 1857fc7c1e13:11111
Client connected.
Exiting...
( 558.143s) [pvserver        ]vtkSocketCommunicator.c:783    ERR| vtkSocketCommunicator (0x5636a966f050): Could not receive tag. 1
( 558.143s) [pvserver        ]vtkTCPNetworkAccessMana:295    ERR| vtkTCPNetworkAccessManager (0x5636a9483650): Some error in socket processing.

Where is this ParaView coming from? Is that a docker image? Do you use the same image on the client side?

ParaView came from here

I use nvidia docker on the server side because the machine has an NVIDIA GPU.

I use a standard ParaView on the client side, as it is just a normal Ubuntu 22.04 box.

Does this image ship a ParaView binary, or do you use the official ParaView binary there too?

The docker image I obtained from NVIDIA only has pvserver; for paraview, I use the official binaries.

I have now repeated the test with both pvserver (EGL build) and paraview taken from the official binaries, at the same version, to help isolate the problem.

pvserver accepted the client connection (after a very long pause), but pvserver is having trouble obtaining EGL.

Please check this is working:

It is not working

ubuntu@vm-c-bethelsight-trial-gpu:~/systems/ParaView/ParaView-5.13.1-egl-MPI-Linux-Python3.10-x86_64/bin$ ./pvpython
Python 3.10.13 (main, Sep 27 2024, 19:28:01) [GCC 10.2.1 20210130 (Red Hat 10.2.1-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from paraview.simple import *
GetOpenGLInformation()

>>> libEGL warning: egl: failed to create dri2 screen
libEGL warning: egl: failed to create dri2 screen
(  24.606s) [paraview        ] vtkEGLRenderWindow.cxx:340   WARN| vtkEGLRenderWindow (0x4937700): EGL device index: 0 could not be initialized. Trying other devices...
<paraview.modules.vtkRemotingViews.vtkPVOpenGLInformation(0x46c6c90) at 0x7faa697f9d20>
>>> >>>
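
For anyone repeating this check: the object returned by GetOpenGLInformation() is more useful once its fields are printed. Below is a minimal sketch, assuming the vtkPVOpenGLInformation accessors GetVendor(), GetVersion() and GetRenderer() are available in your ParaView version. The classify_renderer helper and its keyword list are my own heuristic (hypothetical, not part of ParaView): Mesa software rasterizers usually report names such as llvmpipe, while a working NVIDIA EGL setup should report an NVIDIA vendor string.

```python
def classify_renderer(vendor, renderer):
    """Heuristic (an assumption, not a ParaView API): guess the rendering backend."""
    text = f"{vendor} {renderer}".lower()
    # Mesa software rasterizers typically identify themselves with these names.
    if "llvmpipe" in text or "softpipe" in text or "swrast" in text:
        return "software"
    if "nvidia" in text:
        return "gpu"
    return "unknown"


def report_opengl():
    """Print the OpenGL vendor/version/renderer seen by pvpython."""
    # Imported lazily so classify_renderer can be used without ParaView installed.
    from paraview.simple import GetOpenGLInformation

    info = GetOpenGLInformation()
    vendor, renderer = info.GetVendor(), info.GetRenderer()
    print("Vendor:  ", vendor)
    print("Version: ", info.GetVersion())
    print("Renderer:", renderer)
    print("Verdict: ", classify_renderer(vendor, renderer))
```

Inside pvpython, calling report_opengl() should make it obvious whether the GPU is actually being used or whether rendering fell back to something else.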

Please check eglinfo is working as expected.

Although nvidia-smi reports an RTX A5000, eglinfo seems to be unaware that a GPU is present.

In particular, eglinfo cannot find a GPU device:

Device platform:
eglinfo: eglInitialize failed

I have attached both output files:

nvidia-smi.log (1.7 KB)

eglinfo.log (6.5 KB)

I think it is an issue with the nvidia driver on the host.

Can you try to install a 535.x version instead?

I have downgraded the driver to 535 but am still getting EGL errors.

0x43 48  0 16 16 16  0 16  0  4 1 0x48344258--         y  y  y     win
0x44 48  0 16 16 16  0 24  0  4 1 0x48344258--         y  y  y     win
0x45 48  0 16 16 16  0 24  8  4 1 0x48344258--         y  y  y     win
0x46 48  0 16 16 16  0 32  0  4 1 0x48344258--         y  y  y     win

Wayland platform:
eglinfo: eglInitialize failed

X11 platform:
eglinfo: eglInitialize failed

Device platform:
eglinfo: eglInitialize failed
Thu Jan 23 00:35:48 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:00:10.0 Off |                  Off |
| 30%   35C    P0              75W / 230W |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Then I’m afraid this is not a ParaView issue

This is strange indeed. Did you set the environment variable NVIDIA_DRIVER_CAPABILITIES=all?

Setting the environment variable makes no difference.
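
For reference, when pvserver runs in a container, that variable is normally passed on the docker command line. A rough sketch of how the command could be assembled follows. The helper name pvserver_docker_argv is hypothetical, and the thread does not show the actual docker invocation, so the flags here (--gpus all, -e, -p) are only the standard docker options I would expect to need. Any command required after the image name to start pvserver is not shown in the thread and is left out.

```python
import shlex


def pvserver_docker_argv(image, port=11111, capabilities="all"):
    """Build a docker argv that exposes the GPU and the pvserver port.

    Assumption: the image's default entrypoint is enough to start pvserver;
    append the container command to the returned list if it is not.
    """
    return [
        "docker", "run", "--rm",
        "--gpus", "all",                                    # expose the host GPUs
        "-e", f"NVIDIA_DRIVER_CAPABILITIES={capabilities}",  # "all" includes graphics
        "-p", f"{port}:{port}",                             # publish the pvserver port
        image,
    ]


argv = pvserver_docker_argv("nvcr.io/nvidia-hpcvis/paraview:egl-py3-5.11.0")
print(shlex.join(argv))
```

Note that this only controls which driver libraries get mounted into the container. If EGL is already broken on the host, as it was here, the variable cannot fix it.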

@jourdain, is there some OS version and kernel version combination that is important for pvserver to operate successfully?

Not really, you just need to make sure the system can get an EGL device. And that is provided by your nvidia driver and OS.

We have some notes here on a similar issue, but that is about it. I know that NVIDIA is looking into the driver issue, but I don’t have a timeline, as they need to figure out the origin of the problem, which could be Linux distribution related.

Thank you @jourdain, the note is very useful. I will give the docker route a try and report back.

Thank you both, @jourdain and @mwestphal, for your patience.

Still getting the eglinfo error.

I will work with the Cloud GPU provider to see if they have other possibilities I could explore.

Processing triggers for libc-bin (2.35-0ubuntu3.1) ...
EGL client extensions string:
    EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query
    EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses
    EGL_EXT_client_extensions EGL_KHR_debug EGL_EXT_platform_device
    EGL_EXT_platform_wayland EGL_KHR_platform_wayland
    EGL_EXT_platform_x11 EGL_KHR_platform_x11 EGL_EXT_platform_xcb
    EGL_MESA_platform_gbm EGL_KHR_platform_gbm
    EGL_MESA_platform_surfaceless

GBM platform:
eglinfo: eglInitialize failed

Wayland platform:
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
eglinfo: eglInitialize failed

X11 platform:
eglinfo: eglInitialize failed

Device platform:
eglinfo: eglInitialize failed
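
When comparing eglinfo runs across driver versions, it can help to reduce the output to a per-platform summary. Here is a small sketch that does this by scanning for the "... platform:" headings and the "eglInitialize failed" message seen in the output above. The function name platform_status is hypothetical, and the parsing assumes eglinfo keeps this output layout. As far as I understand, the "Device platform" entry is the one that matters for headless EGL rendering, since the pvpython warning earlier in the thread refers to an EGL device index.

```python
import re


def platform_status(eglinfo_text):
    """Map each '<Name> platform:' section to True if eglInitialize failed in it."""
    results = {}
    current = None
    for line in eglinfo_text.splitlines():
        match = re.match(r"^(\w+) platform:\s*$", line)
        if match:
            # Start of a new platform section; assume success until proven otherwise.
            current = match.group(1)
            results[current] = False
        elif current and "eglInitialize failed" in line:
            results[current] = True
    return results


# Sample text copied from the eglinfo output in this thread.
sample = """\
GBM platform:
eglinfo: eglInitialize failed

Wayland platform:
error: XDG_RUNTIME_DIR not set in the environment.
eglinfo: eglInitialize failed

X11 platform:
eglinfo: eglInitialize failed

Device platform:
eglinfo: eglInitialize failed
"""

for name, failed in platform_status(sample).items():
    print(f"{name:8s} {'FAILED' if failed else 'ok'}")
```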

This has been resolved. There is an additional NVIDIA GL package, libnvidia-gl-535-server, that needs to be installed for the EGL part to work.
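
For others landing here with the same symptoms, a quick way to check for the package on a Debian/Ubuntu host is to ask dpkg. This is a minimal sketch: is_installed is a hypothetical helper, and the package name is the one from this thread. I would assume, but have not verified, that the suffix has to match the driver branch you actually have installed.

```python
import subprocess


def is_installed(package, run=subprocess.run):
    """Return True if dpkg reports the package as fully installed."""
    try:
        proc = run(
            ["dpkg-query", "-W", "-f=${Status}", package],
            capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:
        # dpkg-query is not available (not a Debian-based system).
        return False
    return proc.returncode == 0 and proc.stdout.strip() == "install ok installed"


# Package name taken from the resolution above.
print("libnvidia-gl-535-server installed:", is_installed("libnvidia-gl-535-server"))
```

If this prints False, installing the package with apt and re-running eglinfo would be the next thing to try.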

Thank you once again to @mwestphal and @jourdain for your help and patience.

Out of curiosity, did you manage to get things working with newer NVIDIA drivers?