GPU selection / use in ParaView 5.11.0 with EGL support

I’m trying to control which GPU is used when ParaView is operating in server mode. The node I’m running pvserver on has two P100 GPUs, as indicated by nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           Off | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0              28W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off | 00000000:81:00.0 Off |                    0 |
| N/A   36C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, no matter how I try to use the --displays switch on the command line (which I presume is the replacement for the deprecated --egl-device-index switch), only GPU 0 ever seems to be used, e.g.:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           Off | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0              34W / 250W |      3MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off | 00000000:81:00.0 Off |                    0 |
| N/A   36C    P0              26W / 250W |      0MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      7539      G   ...-Python3.9-x86_64/bin/pvserver-real        3MiB |
+---------------------------------------------------------------------------------------+

What is the correct way to notify pvserver to use a specific EGL device?
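For illustration, the kind of invocation I expected to work looks something like this (the exact --displays syntax here is a guess on my part, which is part of what I’m asking about):

# Hypothetical invocation: two pvserver ranks, requesting EGL devices 0 and 1
mpiexec -np 2 bin/pvserver --displays=0,1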

We recently had a bug fix for this which isn’t in 5.11:

https://gitlab.kitware.com/paraview/paraview/-/merge_requests/6307/diffs

Can you try a recent nightly instead?

Hi Dan,

Does that recent nightly work with the Windows 5.11.1 client that can be downloaded from the ParaView website?

For the record, I got this working via setting environment variables that are passed into separate processes through MPI, e.g.:

mpiexec -env VTK_DEFAULT_EGL_DEVICE_INDEX=0 bin/pvserver : -env VTK_DEFAULT_EGL_DEVICE_INDEX=1 bin/pvserver

jgrime:~/paraview_egl$ nvidia-smi
Mon Oct  2 20:33:05 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01             Driver Version: 535.113.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P100-PCIE-16GB           Off | 00000000:07:00.0 Off |                    0 |
| N/A   35C    P0              28W / 250W |    187MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P100-PCIE-16GB           Off | 00000000:81:00.0 Off |                    0 |
| N/A   36C    P0              27W / 250W |    187MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      8269      G   ...-Python3.9-x86_64/bin/pvserver-real      187MiB |
|    1   N/A  N/A      8270      G   ...-Python3.9-x86_64/bin/pvserver-real      187MiB |
+---------------------------------------------------------------------------------------+
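In case it’s useful to anyone else, here is a rough sketch of how that launch line could be generalized to one pvserver rank per GPU. It assumes an MPICH-style mpiexec (the same -env and ":" syntax as above) and that nvidia-smi is on the PATH:

#!/bin/sh
# Count the GPUs on this node, then build one "-env ... pvserver" block per GPU,
# separated by ":" so mpiexec launches them as a single MPMD job.
NGPUS=$(nvidia-smi --query-gpu=index --format=csv,noheader | wc -l)
ARGS=""
i=0
while [ "$i" -lt "$NGPUS" ]; do
    [ -n "$ARGS" ] && ARGS="$ARGS :"
    ARGS="$ARGS -env VTK_DEFAULT_EGL_DEVICE_INDEX=$i bin/pvserver"
    i=$((i + 1))
done
# Word splitting of $ARGS is intentional here.
exec mpiexec $ARGS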

Is there any tooling in pvserver to examine how the two GPUs are being used in the different processes? It would be nice to identify any bottlenecks for various operations!

Hi Dan,

Does that recent nightly work with the Windows 5.11.1 client that can be downloaded from the ParaView website?

No, you’ll have to use both the server and the client from the nightly download; the server and client have to match on the major version.

For the record, I got this working via setting environment variables that are passed into separate processes through MPI, e.g.:

mpiexec -env VTK_DEFAULT_EGL_DEVICE_INDEX=0 bin/pvserver : -env VTK_DEFAULT_EGL_DEVICE_INDEX=1 bin/pvserver

Great! Seems this is a different code path that was not affected by the bug.

Is there any tooling in pvserver to examine how the two GPUs are being used in the different processes? It would be nice to identify any bottlenecks for various operations!

No, there isn’t, AFAIK. Even for VTK-m code that uses the GPU for computation, we use nvidia-smi and nsys-ui to check GPU usage.
@Vicente Bolea
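For a quick look at how each GPU is loaded while a session is running, something along these lines works (the column selection is just an example):

# Poll per-GPU utilization and memory once a second
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.used --format=csv -l 1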

Hi Dan,

You may want to also check out nvtop - I started using it a few hours ago and I’ve found it rather useful so far!

J.

Thanks for the suggestion - looks very cool. I’ll check it out.

Dan

Yes, we normally use the nvidia-smi and nsys utilities. Another possibility, as a last resort, is to use GDB: if the build has symbols, you can set breakpoints on host routines that call CUDA device routines. We have done this at times.
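As a rough sketch of that last approach (the PID placeholder and the choice of cudaLaunchKernel as the breakpoint are only illustrative):

# Attach to a running pvserver rank and stop whenever the host asks the
# CUDA runtime to launch a kernel.
gdb -p <pvserver-pid> -ex 'break cudaLaunchKernel' -ex 'continue'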
