Why is OSMesa faster than EGL?

Hi,

I am supporting a user on an HPC system, and we finally managed to run ParaView in a way that uses GPU acceleration.
I’ve done some benchmarks, but I don’t understand why the “osmesa” version loads a 1 GB file faster than the “egl” version. According to this page, it should be the other way around: Difference between EGL and OSMESA - ParaView Support - ParaView

The system uses NVIDIA A100 and P100 GPUs. I ran the server with `srun ./pvserver --server-port=7755 --force-offscreen-rendering` (more details about the machine can be found here: Saga — Sigma2 documentation).

Local (no server, loaded directly on my laptop): 10 seconds
Installed version on the server (5.10.1 MPI): 1m49s

A100 GPU
osmesa (no extra flags): 20 seconds
osmesa (with --force-offscreen-rendering): 18 seconds
egl (with --force-offscreen-rendering): 24 seconds

P100 GPU
Installed version: 1m42s
osmesa (no extra flags): 51 seconds
osmesa (with --force-offscreen-rendering): 48 seconds
egl (with --force-offscreen-rendering): 48 seconds

As you can see, the difference between “osmesa” and “egl” is not that big, but still, the latter should be considerably faster because it uses the NVIDIA driver instead of CPU rendering, no?

Or is the “osmesa” version using the open-source nouveau driver, and that explains why the results are similar?

Here’s a screenshot running on an A100 GPU partition with the “osmesa” package:

[screenshot]

Please, let me know if there’s any other details I need to provide.

Thank you!

EGL is not guaranteed to use your NVIDIA GPU. For a long time, NVIDIA provided the only implementation, which meant that if EGL was working, it was using the GPU. But now it can fall back to a Mesa implementation, which is obviously not as fast.

I personally ran into an issue where a combination of Docker and an NVIDIA driver version broke their EGL, which led to exactly that fallback. The worst part is that you don’t realize it right away, since everything is still “working”.

Either way, that does not answer your question, but it may help re-assess the assumption that EGL = GPU.
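
One way to check which implementation glvnd actually picks is to look at the registered vendor files and at the EGL vendor string (a sketch; the paths below are the usual defaults and may differ on your system):

```
# Which EGL vendor libraries can glvnd choose from? (typical locations)
ls /usr/share/glvnd/egl_vendor.d/ /etc/glvnd/egl_vendor.d/ 2>/dev/null

# If eglinfo (from the mesa-utils/egl-utils package) is available, the
# vendor string shows which implementation was actually selected:
eglinfo | grep -i vendor

# Forcing the NVIDIA vendor library rules out a silent Mesa fallback:
export __EGL_VENDOR_LIBRARY_FILENAMES=/usr/share/glvnd/egl_vendor.d/10_nvidia.json
```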

Hi Sebastien,

Yes, and the thing is: it is using the GPU for sure, because when I run the same package (osmesa) on a node without GPU acceleration, it takes more than a minute to load the same file (a 1 GB file that renders a 3D cube).

It even differs between the two GPUs. We also tried increasing the number of CPU tasks and reserving more GPUs, but the results were the same, which means the CPU is not really making a difference and more GPUs do not change the loading time.

A colleague from the GPU team was also wondering why it works at all, because I am not loading any CUDA modules on the system, only running pvserver from a node with GPUs.
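
One check I still want to run is whether the pvserver process actually maps any NVIDIA libraries while it renders (a sketch, assuming a single pvserver is running on the node):

```
# list NVIDIA/CUDA shared objects mapped into the running pvserver process
grep -i -E 'nvidia|cuda' /proc/$(pidof pvserver)/maps
```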

Anyway, I will keep searching and doing some tests.

Thanks!

Wait, what do you mean by “when running (osmesa) with a GPU it is faster”?

OSMesa is CPU-only and therefore does not rely on the GPU at all. The only thing that could be using the GPU at that point is some VTK-m filter, assuming ParaView has been compiled with CUDA on the VTK-m side (which I don’t think is the case by default, but I could be wrong since it is not my area).
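
If you want to rule that out, you could check whether anything shipped in the binary release links against CUDA at all (a rough sketch, run from the unpacked ParaView directory; if nothing prints, there is no CUDA in the build):

```
# look for CUDA runtime/driver dependencies in the shipped libraries
cd ParaView-5.10.1-osmesa-MPI-Linux-Python3.9-x86_64
ldd lib/*.so 2>/dev/null | grep -i -E 'libcuda|libcudart'
```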

Also, what do those timings represent? First loading time? Time needed to render each frame?
Are you using the default rendering backend via remote rendering, or OSPRay, IndeX, …?

Hi Sebastien,

I mean, when I run the “osmesa” package on a GPU node, it runs much faster than the same package on a node without a GPU. I did some new benchmarks:

1.
Version: 5.10.1-MPI
CPU: Intel Xeon 6138 (20 cores, 40 threads) or Intel Xeon 6230R (26 cores, 52 threads)
GPU: None
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "Display is not accessible on the server side. Remote rendering will be disabled."
Time: 1m57s

2.
Version: 5.10.1-MPI
CPU: Intel Xeon 6126 (12 cores, 24 threads)
GPU: NVIDIA P100
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=accel --gpus=1 --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "Display is not accessible on the server side. Remote rendering will be disabled."
Time: 1m46s

3.
Version: 5.10.1-MPI
CPU: AMD EPYC 7542 (32 cores, 64 threads)
GPU: NVIDIA A100
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=a100 --gpus=1 --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "Display is not accessible on the server side. Remote rendering will be disabled."
Time: 1m11s

With the MPI package, the time is always more than a minute and the message about remote rendering being disabled appears.
The AMD CPU performs much better than both Xeon CPUs.

4.
Version: 5.10.1-osmesa
CPU: Intel Xeon 6138 (20 cores, 40 threads) or Intel Xeon 6230R (26 cores, 52 threads)
GPU: None
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "None"
Time: 59s

5.
Version: 5.10.1-osmesa
CPU: Intel Xeon 6126 (12 cores, 24 threads)
GPU: NVIDIA P100
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=accel --gpus=1 --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "None"
Time: 48s

6.
Version: 5.10.1-osmesa
CPU: AMD EPYC 7542 (32 cores, 64 threads)
GPU: NVIDIA A100
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=a100 --gpus=1 --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "None"
Time: 20s

With the OSMESA package, even though the CPUs are the same as in the MPI cases, having GPUs reserved makes a huge difference.
Compared with the first category (MPI package), it is faster even without a GPU than the MPI package was on the AMD CPU.
With the P100 and A100 it performs much better still, with the A100 being faster (it is also the newest and most powerful GPU we have available).

7.
Version: 5.10.1-egl
CPU: Intel Xeon 6138 (20 cores, 40 threads) or Intel Xeon 6230R (26 cores, 52 threads)
GPU: None
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "error while loading shared libraries: libOpenGL.so.0: cannot open shared object file: No such file or directory"
Time: did not run

8.
Version: 5.10.1-egl
CPU: Intel Xeon 6126 (12 cores, 24 threads)
GPU: NVIDIA P100
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=accel --gpus=1 --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "None"
Time: 47s

9.
Version: 5.10.1-egl
CPU: AMD EPYC 7542 (32 cores, 64 threads)
GPU: NVIDIA A100
Allocation command: `salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=a100 --gpus=1 --account=nn9999k`
PV Server command: `srun ./pvserver --server-port=7755 --force-offscreen-rendering`
Message: "None"
Time: 19s

With the EGL package, we do not see much difference from the OSMESA version. On the node without a GPU it did not run at all and, even though I managed to make “libOpenGL.so.0” available, it then gave me an error when connecting to the PV server.


In summary, we can see CPU differences in the first category, but the GPU really seems to make a difference. Also, I was expecting EGL to be much faster than OSMESA, but this is not the case.

As for your questions: I am loading a 1 GB file that renders a 3D cube, and I measure from the moment I click “Apply” until the cube is rendered on the screen and I can move it with my mouse.
The user I am supporting provided the file, and I don’t have deeper ParaView knowledge, so I am not sure about the default rendering backend you mentioned.

What I did was basically download the “MPI”, “OSMESA” and “EGL” archives from the ParaView website, unpack the tar files on the server, and run pvserver as described above. All versions are 5.10.1, including the one I am running on my laptop (and connecting to the PV server from) on Windows 11.

I tried to play a bit with ntasks-per-node and cpus-per-task, but it did not make a lot of difference. Also, the --force-offscreen-rendering flag always seemed to make it run slightly faster, but again, a very marginal difference.
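
One more combination I still want to try: as far as I understand, the osmesa build rasterizes in software with Mesa’s llvmpipe, which is multithreaded, so --cpus-per-task=1 may be pinning it to a single core. Something like this (a sketch; LP_NUM_THREADS is llvmpipe’s thread-count variable, and the core count here is arbitrary):

```
# give the software rasterizer more cores and tell llvmpipe to use them
salloc --nodes=1 --ntasks-per-node=1 --cpus-per-task=8 --time=00:30:00 --mem=20G --account=nn9999k
LP_NUM_THREADS=8 srun ./pvserver --server-port=7755 --force-offscreen-rendering
```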

Finally, the user complained that he tried to load a 53 GB file and, after 37 minutes, ParaView just crashed and he never managed to load his file (that was with the MPI version). Once he switched to the OSMESA version as in benchmark 6, the same file loaded in less than 7 minutes.

Sorry for the long post; I just wanted to provide more context and try to understand why the GPU node makes a difference, especially for updating our documentation.

Best Regards,
Rafael

Is it possible to get more detailed logs about what is faster in each case? Frames per second? Triangles per second? I’m not too familiar with available perf tools for these kinds of things, but maybe even perf would give some insight here.

Hi @rsantos

Do you see anything in nvidia-smi when using the OSMESA release?
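
For example, something along these lines while the file is loading:

```
# sample GPU utilization and memory use once per second during the load
nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 1
```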

Hello,

I will check whether we have tools available for monitoring performance, but it is an HPC system and installing things is a bit bureaucratic, so I am not hopeful if a tool is not already there. I am also not sure how useful it would be, as the simulation runs very fast and there is not much I can do with it, only flip the 3D cube and zoom in and out.

I did see higher memory utilization on the A100 GPU than on the P100. When first loading the simulation, the P100 used 3 MiB of memory, while the A100 used 78 MiB. Then, after spinning the cube a bit, memory utilization went to 84 MiB and 83 MiB, respectively.

I ran on the P100 and A100 with both the “osmesa” and the “egl” packages. Only “egl” showed up in nvidia-smi.

Apparently “osmesa” really does not benefit from the GPU, and the difference between the P100 and A100 nodes could be due to the different CPUs.

P100 with Intel Xeon and osmesa: 48s
P100 with Intel Xeon and egl: 47s

A100 with AMD EPYC and osmesa: 20s
A100 with AMD EPYC and egl: 19s

However, is it plausible that the difference between using the CPU and the GPU is so small? A 1 s difference could even be within the margin of error, so the simulation does not seem to benefit from the GPU even when it shows up in nvidia-smi.

I also increased the number of P100 GPUs in the allocation and ran the “osmesa” and “egl” versions. The times were 48s for the former (so, exactly the same) and 45s for the latter (slightly faster, but again a very small difference).

When I tried to increase the number of A100 GPUs, as soon as I allocated more than one and tried to connect with pvserver, it threw lots of errors and ParaView closed.

I am not a specialist in the subject, but this is probably related to the newer architecture (AMD) and libOpenGL.so.0 not being available (I have to install libglvnd with EasyBuild and load the module manually: EasyBuild - building software with ease).

Hope it is helpful. Thanks!

Drive-by comment: if the majority of the time is spent loading data, note that neither version of ParaView uses the GPU to load data. Perhaps the 1 s difference is the rendering time, and the rest of the workload is dominated by CPU code.
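
A quick sanity check could be to time the raw read of the file on the node; if that alone accounts for most of the total, you are measuring I/O, not rendering (the path below is a placeholder):

```
# lower bound for pure file I/O from this node
dd if=/path/to/your/1gb-file of=/dev/null bs=1M
```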

Your comment is most probably right. I am asking the user I am supporting to perform some tests and see if there are any differences with bigger simulations.

I also managed to bypass the error I mentioned by running (with the EGL package):

`salloc --nodes=1 --ntasks-per-node=2 --cpus-per-task=1 --time=00:30:00 --mem=20G --partition=a100 --gpus=2`

and

`CUDA_VISIBLE_DEVICES=0 srun --ntasks=1 --gres=gpu:1 ./pvserver --force-offscreen-rendering &`
`CUDA_VISIBLE_DEVICES=1 srun --ntasks=1 --gres=gpu:1 ./pvserver --force-offscreen-rendering &`

I will report back if he sees any improvements with bigger simulations.

I would then rephrase my initial question: apart from the different CPUs, which contribute to faster loading times, there must also be something different between the packages ParaView-5.10.1-MPI-Linux-Python3.9-x86_64.tar.gz and ParaView-5.10.1-osmesa-MPI-Linux-Python3.9-x86_64.tar.gz.

What does “osmesa” do differently that so drastically helps the loading of the simulation? Searching a bit, I only found answers about the difference between “osmesa” and “egl”, which is basically software versus GPU rendering, respectively.
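
In case it helps anyone answer, this is how I plan to compare what the two pvserver binaries actually link against (a sketch; the paths are just where I unpacked the tarballs, and the GL link may also live in the VTK rendering libraries under lib/ rather than in pvserver itself):

```
# which OpenGL implementation does each package pull in?
ldd ParaView-5.10.1-MPI-Linux-Python3.9-x86_64/bin/pvserver | grep -E 'libGL|libEGL|libOpenGL|libOSMesa'
ldd ParaView-5.10.1-osmesa-MPI-Linux-Python3.9-x86_64/bin/pvserver | grep -E 'libGL|libEGL|libOpenGL|libOSMesa'
```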

Thanks again!