pvserver nvidia docker - connection issue

Nicholas_Yue · January 21, 2025, 5:32am

I am trying to troubleshoot the following error and would appreciate suggestions on where to start.

I have ensured that the OS on both the client and server ends is the same (Ubuntu 22.04).

I have ensured that both paraview and pvserver are the same version, 5.11.0.

Thank you in advance.

Status: Downloaded newer image for nvcr.io/nvidia-hpcvis/paraview:egl-py3-5.11.0
level=info msg="Setting CUDA_MODULE_LOADING=LAZY. Please see https://docs.nvidia.com for more information."
(   0.031s) [pvserver        ]SurfaceLICPluginAutoloa:25    WARN| SurfaceLIC is now built-in ParaView. There is no need to load this plugin.
-----------------------------------------------------
  By loading the 'pvNVIDIAIndeX' plugin you have accepted the EULA shipped with it.
  If that is not acceptable, please restart the application without loading 
  the 'pvNVIDIAIndeX' plugin.
-----------------------------------------------------
Waiting for client...
Connection URL: cs://1857fc7c1e13:11111
Accepting connection(s): 1857fc7c1e13:11111
Client connected.
Exiting...
( 558.143s) [pvserver        ]vtkSocketCommunicator.c:783    ERR| vtkSocketCommunicator (0x5636a966f050): Could not receive tag. 1
( 558.143s) [pvserver        ]vtkTCPNetworkAccessMana:295    ERR| vtkTCPNetworkAccessManager (0x5636a9483650): Some error in socket processing.

mwestphal · January 21, 2025, 5:56am

Where is this ParaView coming from? Is that a docker image? Do you use the same image on the client side?

Nicholas_Yue · January 21, 2025, 6:03am

ParaView came from here

I use nvidia docker on the server side because the machine has an NVIDIA GPU.

I use a standard ParaView on the client side, as it is just a normal Ubuntu 22.04 box.

mwestphal · January 21, 2025, 6:07am

Does this image ship a ParaView binary, or do you use the official ParaView binary there too?

Nicholas_Yue · January 21, 2025, 6:10am

The docker image I obtained from NVIDIA only has pvserver; for paraview, I use the official binaries.

Nicholas_Yue · January 21, 2025, 6:16am

I have now repeated the test with both pvserver (EGL build) and paraview taken from the official binaries, at the same version, to help isolate the problem.

pvserver accepted the client connection (after a very long pause), but pvserver is having trouble obtaining EGL.

mwestphal · January 21, 2025, 10:15am

Please check this is working:

Nicholas_Yue · January 22, 2025, 12:09am

It is not working

ubuntu@vm-c-bethelsight-trial-gpu:~/systems/ParaView/ParaView-5.13.1-egl-MPI-Linux-Python3.10-x86_64/bin$ ./pvpython
Python 3.10.13 (main, Sep 27 2024, 19:28:01) [GCC 10.2.1 20210130 (Red Hat 10.2.1-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from paraview.simple import *
GetOpenGLInformation()

>>> libEGL warning: egl: failed to create dri2 screen
libEGL warning: egl: failed to create dri2 screen
(  24.606s) [paraview        ] vtkEGLRenderWindow.cxx:340   WARN| vtkEGLRenderWindow (0x4937700): EGL device index: 0 could not be initialized. Trying other devices...
<paraview.modules.vtkRemotingViews.vtkPVOpenGLInformation(0x46c6c90) at 0x7faa697f9d20>
>>> >>>
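
For reference, the session above only shows the repr of the returned object, which says little about which driver ended up providing the context. A minimal sketch of a more readable check follows. It assumes the object returned by `GetOpenGLInformation()` exposes `GetVendor()`, `GetVersion()` and `GetRenderer()`; the helper names `summarize_opengl` and `looks_like_nvidia` are made up for illustration, and the "vendor string contains NVIDIA" test is a heuristic, not something ParaView guarantees.

```python
def summarize_opengl(info):
    """Collect vendor, version and renderer strings from an OpenGL information object."""
    return {
        "vendor": info.GetVendor(),
        "version": info.GetVersion(),
        "renderer": info.GetRenderer(),
    }


def looks_like_nvidia(summary):
    """Heuristic (assumption): an NVIDIA context reports 'NVIDIA' in its vendor string."""
    return "nvidia" in (summary["vendor"] or "").lower()


if __name__ == "__main__":
    try:
        # Only importable when run through pvpython (or with ParaView on PYTHONPATH).
        from paraview.simple import GetOpenGLInformation
    except ImportError:
        print("Run this with pvpython so that paraview.simple is importable.")
    else:
        summary = summarize_opengl(GetOpenGLInformation())
        for key, value in summary.items():
            print(f"{key}: {value}")
        print("NVIDIA context:", looks_like_nvidia(summary))
```

Saved as, say, `check_gl.py` (a hypothetical file name), it could be run as `./pvpython check_gl.py` from the same `bin` directory. If the vendor is not NVIDIA, or the strings come back empty, the EGL device was probably not picked up.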

mwestphal · January 22, 2025, 3:23pm

Please check eglinfo is working as expected.

Nicholas_Yue · January 22, 2025, 4:00pm

Although nvidia-smi reports an RTX A5000, eglinfo seems to be unaware that a GPU is present.

In particular, eglinfo cannot find a GPU device

Device platform:
eglinfo: eglInitialize failed

I have attached both output files:

nvidia-smi.log (1.7 KB)

eglinfo.log (6.5 KB)

jourdain · January 23, 2025, 12:00am

I think it is an issue with the nvidia driver on the host.

Can you try to install a 535.x version instead?

Nicholas_Yue · January 23, 2025, 12:36am

I have downgraded the driver to 535, but I am still getting EGL errors.

0x43 48  0 16 16 16  0 16  0  4 1 0x48344258--         y  y  y     win
0x44 48  0 16 16 16  0 24  0  4 1 0x48344258--         y  y  y     win
0x45 48  0 16 16 16  0 24  8  4 1 0x48344258--         y  y  y     win
0x46 48  0 16 16 16  0 32  0  4 1 0x48344258--         y  y  y     win

Wayland platform:
eglinfo: eglInitialize failed

X11 platform:
eglinfo: eglInitialize failed

Device platform:
eglinfo: eglInitialize failed

Thu Jan 23 00:35:48 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.216.03             Driver Version: 535.216.03   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A5000               Off | 00000000:00:10.0 Off |                  Off |
| 30%   35C    P0              75W / 230W |      0MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

mwestphal · January 23, 2025, 8:40am

Then I’m afraid this is not a ParaView issue.

jourdain · January 23, 2025, 3:52pm

This is strange indeed. Did you set the environment variable NVIDIA_DRIVER_CAPABILITIES=all?
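
A quick way to confirm the variable actually reached the process is to check it from inside the container. The sketch below assumes the NVIDIA container runtime's convention that the value is a comma-separated list in which either `all` or `graphics` enables the OpenGL/EGL libraries. Treating an unset or empty value as "no graphics" is an assumption made here, and the helper name `allows_graphics` is made up for illustration.

```python
import os


def allows_graphics(capabilities):
    """Return True if an NVIDIA_DRIVER_CAPABILITIES value includes 'graphics' or 'all'."""
    if not capabilities:
        # Assumption: an unset or empty value does not expose the graphics libraries.
        return False
    items = {item.strip().lower() for item in capabilities.split(",")}
    return "all" in items or "graphics" in items


if __name__ == "__main__":
    value = os.environ.get("NVIDIA_DRIVER_CAPABILITIES")
    print(f"NVIDIA_DRIVER_CAPABILITIES={value!r}")
    print("graphics allowed:", allows_graphics(value))
```

If this prints `False` inside the container, the variable was probably not passed through on the `docker run` command line (for example with `-e NVIDIA_DRIVER_CAPABILITIES=all`).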

Nicholas_Yue · January 23, 2025, 4:06pm

Setting the environment variable makes no difference.

@jourdain, is there some combination of OS version and kernel version that pvserver needs in order to operate successfully?

jourdain · January 23, 2025, 10:31pm

Not really, you just need to make sure the system can get an EGL device. And that is provided by your nvidia driver and OS.

We have some notes here on a similar issue, but that is about it. I know that NVIDIA is looking into the driver issue, but I don’t have a timeline, as they need to figure out the origin of the problem, which could be related to the Linux distribution.

Nicholas_Yue · January 23, 2025, 10:53pm

Thank you @jourdain, the note is very useful. I will give the docker route a try and report back.

Thank you both, @jourdain and @mwestphal, for your patience.

Nicholas_Yue · January 23, 2025, 11:02pm

Still getting the eglinfo error.

I will work with the Cloud GPU provider to see if they have other possibilities I could explore.

Processing triggers for libc-bin (2.35-0ubuntu3.1) ...
EGL client extensions string:
    EGL_EXT_device_base EGL_EXT_device_enumeration EGL_EXT_device_query
    EGL_EXT_platform_base EGL_KHR_client_get_all_proc_addresses
    EGL_EXT_client_extensions EGL_KHR_debug EGL_EXT_platform_device
    EGL_EXT_platform_wayland EGL_KHR_platform_wayland
    EGL_EXT_platform_x11 EGL_KHR_platform_x11 EGL_EXT_platform_xcb
    EGL_MESA_platform_gbm EGL_KHR_platform_gbm
    EGL_MESA_platform_surfaceless

GBM platform:
eglinfo: eglInitialize failed

Wayland platform:
error: XDG_RUNTIME_DIR not set in the environment.
error: XDG_RUNTIME_DIR not set in the environment.
eglinfo: eglInitialize failed

X11 platform:
eglinfo: eglInitialize failed

Device platform:
eglinfo: eglInitialize failed

Nicholas_Yue · January 24, 2025, 8:47pm

This has been resolved. An additional NVIDIA GL package, libnvidia-gl-535-server, needs to be installed for the EGL part to work.
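
For anyone hitting the same symptom, here is a minimal sketch that derives the package name from the driver version reported by nvidia-smi and asks dpkg whether it is installed. It assumes a Debian/Ubuntu host and the `libnvidia-gl-<major>[-server]` naming seen in this thread; whether the `-server` suffix applies depends on which driver flavor is installed. The helper names `gl_package_name` and `is_installed` are made up for illustration.

```python
import shutil
import subprocess


def gl_package_name(driver_version, server=True):
    """Build the GL package name for a driver version such as '535.216.03'."""
    major = driver_version.split(".")[0]
    suffix = "-server" if server else ""
    return f"libnvidia-gl-{major}{suffix}"


def is_installed(package):
    """Ask dpkg-query whether a package is installed (Debian/Ubuntu only)."""
    if shutil.which("dpkg-query") is None:
        return False  # not a dpkg-based system, so nothing to report
    result = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Status}", package],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and "install ok installed" in result.stdout


if __name__ == "__main__":
    package = gl_package_name("535.216.03")  # driver version taken from the nvidia-smi output above
    print(package, "installed:", is_installed(package))
```

If the package is reported as missing, installing it with apt and re-running eglinfo should show whether the Device platform now initializes.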

Thank you once again to @mwestphal and @jourdain for your help and patience.

jourdain · January 26, 2025, 5:57pm

Out of curiosity, did you manage to get things working with newer NVIDIA drivers?