pvbatch with EGL device on HPC

Hi,
I compiled ParaView in headless mode with EGL support. However, I can use it only if the PBS scheduler assigns me the very first GPU in the node (index 0). If it assigns me any other card, the EGL device cannot be initialized, even though that card is visible within the job as the device with index 0.

Do you have any idea what might be wrong?

Additionally, I have found no way to specify the EGL device index manually. I tried the $CUDA_VISIBLE_DEVICES variable as well as the index reported by nvidia-smi:

pvbatch --egl-device-index=$CUDA_VISIBLE_DEVICES test.py
pvbatch --egl-device-index=2 test.py
Error: unknown option --egl-device-index=…

pvbatch test.py -display=$CUDA_VISIBLE_DEVICES
Not working

pvbatch --displays=$CUDA_VISIBLE_DEVICES test.py
Error No script specified. Please specify a batch script

If I am assigned the very first GPU, the following runs without any problems:

pvbatch test.py

ParaView was compiled successfully with:

cmake -GNinja \
  -DCMAKE_BUILD_TYPE=Release \
  -DPARAVIEW_USE_PYTHON=ON \
  -DPARAVIEW_USE_MPI=ON \
  -DVTK_SMP_IMPLEMENTATION_TYPE=TBB \
  -DVTK_OPENGL_HAS_EGL=ON \
  -DVTK_OPENGL_HAS_OSMESA=OFF \
  -DPARAVIEW_USE_QT=OFF \
  -DVTK_USE_X=OFF \
  -DEGL_opengl_LIBRARY:FILEPATH=/usr/lib/x86_64-linux-gnu/libOpenGL.so.0 \
  -DEGL_INCLUDE_DIR=/usr/include/EGL/ \
  -DEGL_LIBRARY=/usr/lib/x86_64-linux-gnu/nvidia/current/libEGL_nvidia.so.0 \
  -DTBB_ROOT=$TBBROOT \
  -DPARAVIEW_ENABLE_FFMPEG=ON \
  ../../paraview
ninja -j4

Thank you for your help.
Jiri

Try pvbatch --displays=$CUDA_VISIBLE_DEVICES -- test.py.

Now it can see the script, but the resulting EGL error is still the same:

( 1.100s) [pvbatch ] vtkEGLRenderWindow.cxx:353 WARN| vtkEGLRenderWindow (0x556bb0976040): EGL device index: 0 could not be initialized. Trying other devices…
( 1.105s) [pvbatch ] vtkEGLRenderWindow.cxx:386 WARN| vtkEGLRenderWindow (0x555f5634a070): Setting an EGL display to device index: -1 require EGL_EXT_device_base EGL_EXT_platform_device EGL_EXT_platform_base extensions
( 1.105s) [pvbatch ] vtkEGLRenderWindow.cxx:388 WARN| vtkEGLRenderWindow (0x555f5634a070): Attempting to use EGL_DEFAULT_DISPLAY…
( 1.105s) [pvbatch ] vtkEGLRenderWindow.cxx:393 ERR| vtkEGLRenderWindow (0x555f5634a070): Could not initialize a device. Exiting…
( 1.105s) [pvbatch ]vtkOpenGLRenderWindow.c:511 ERR| vtkEGLRenderWindow (0x555f5634a070): GLEW could not be initialized: Missing GL version

Regardless of what I supply as the display, it attempts to use device index 0. That index should be correct: PBS gives the job its resources in a separate namespace, so in the nvidia-smi output inside the job only one GPU is visible, with index 0. But when I ssh directly to the cluster node and run nvidia-smi, there are 8 different GPUs, and what the PBS job sees as index 0 is in fact the GPU with index 7 on the whole machine.

I assume the problem is that ParaView selects the EGL device by index rather than by a persistent address, e.g. the Bus-Id, or by the ID of the GPU listed in the $CUDA_VISIBLE_DEVICES variable.

According to the Bus-Id (00000000:C2:00.0), I was assigned the GPU with node-wide index 7.
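The remapping can be checked by matching Bus-Ids between the job and the node, e.g. from nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader run inside and outside the job. A minimal sketch of that lookup, with the Bus-Ids from this node hard-coded for illustration:

```python
# Resolve the node-wide index of the GPU a PBS job was granted by matching
# PCI Bus-Ids. The lists below are the ones reported by nvidia-smi on this
# node (in practice they would come from running nvidia-smi inside and
# outside the job).
node_bus_ids = [
    "00000000:01:00.0", "00000000:21:00.0", "00000000:22:00.0",
    "00000000:41:00.0", "00000000:81:00.0", "00000000:A1:00.0",
    "00000000:C1:00.0", "00000000:C2:00.0",
]
job_bus_id = "00000000:C2:00.0"  # the only GPU visible inside the job

# The job-local index 0 maps to this node-wide index:
node_index = node_bus_ids.index(job_bus_id)
print(node_index)  # 7
```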

nvidia-smi (in PBS job namespace)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  RTX A4000           On   | 00000000:C2:00.0 Off |                  Off |
| 41%   26C    P8    15W / 140W |      1MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

nvidia-smi for the whole node

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.73.01    Driver Version: 460.73.01    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  RTX A4000           On   | 00000000:01:00.0 Off |                  Off |
| 41%   42C    P2    42W / 140W |   2260MiB / 16117MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  RTX A4000           On   | 00000000:21:00.0 Off |                  Off |
| 41%   25C    P8    14W / 140W |      1MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  RTX A4000           On   | 00000000:22:00.0 Off |                  Off |
| 41%   27C    P8    15W / 140W |      1MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  RTX A4000           On   | 00000000:41:00.0 Off |                  Off |
| 47%   65C    P2    86W / 140W |  14882MiB / 16117MiB |     23%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  RTX A4000           On   | 00000000:81:00.0 Off |                  Off |
| 41%   26C    P8    13W / 140W |      1MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   5  RTX A4000           On   | 00000000:A1:00.0 Off |                  Off |
| 41%   42C    P2    45W / 140W |   2258MiB / 16117MiB |     13%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  RTX A4000           On   | 00000000:C1:00.0 Off |                  Off |
| 41%   28C    P8    13W / 140W |      1MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  RTX A4000           On   | 00000000:C2:00.0 Off |                  Off |
| 41%   25C    P8    15W / 140W |      1MiB / 16117MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

@danlipsa, do you know if this is indeed the case?

Indeed, that is the case. We get a list of available devices from EGL and we use the one with the specified index.
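In other words, the selection logic amounts to something like the following sketch (illustrative only, not the actual VTK code; the device handles are made-up placeholders):

```python
# Sketch of index-based EGL device selection: the render window takes the
# N-th entry of the devices the driver enumerates, regardless of which GPU
# the scheduler actually granted to the job.
def pick_egl_device(devices, index=0):
    """Return the device at the requested position in EGL's enumeration."""
    return devices[index]

# EGL enumerates all physical GPUs on the node, not just the job's GPU:
all_node_gpus = ["gpu0", "gpu1", "gpu2", "gpu3",
                 "gpu4", "gpu5", "gpu6", "gpu7"]  # placeholder handles

# The default index 0 picks the first physical GPU, even when the PBS job
# was actually granted (and is only allowed to use) node-wide GPU 7:
chosen = pick_egl_device(all_node_gpus)
print(chosen)  # gpu0
```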

Well, I really don’t know what is happening; this is just my guess. From what I tested, on the same machine EGL could be initialized when I was assigned the first GPU card, and it could not when I was assigned any other one. I have verified this on three different machines, with the same behavior.

So, after consulting our HPC admin, it seems the problem is a bug in the GPU driver.

Problematic machines:

Driver Version: 460.73.01 CUDA Version: 11.2

Newer HPC machines with no issues:

Driver Version: 470.103.01 CUDA Version: 11.4