Linux X issues


(Shawn Waldon) #21

I have noticed that vall seems to slow down as more builds run… I’ll check the memory next time it breaks. What command did you use to get that table?


(Mathieu Westphal) #22

free -m


(Cory Quammen (Kitware)) #23

So… something on the system is leaking memory? The graphics driver?


(Mathieu Westphal) #24

Could be. Next time I see it, I will stop services and unload modules until the memory usage drops.


(Cory Quammen (Kitware)) #25

What do you think about reverting the change to the new Qt widget to see if that helps the dashboards stay alive? Right now we can’t trust anything vall or luigi report, and they are taking a long time to build, so other development is being slowed down.


(Mathieu Westphal) #26

Reverting would not be necessary; we could just change a single line instead.

However, I’m afraid that doing that will prevent us from correcting the issue, since it requires multiple runs before showing up.


(Cory Quammen (Kitware)) #27

However, I’m afraid that doing that will prevent us from correcting the issue, since it requires multiple runs before showing up.

Okay.

It looks like Luigi is experiencing difficulties now, so next time you can access it, could you try stopping services and modules?


(Mathieu Westphal) #28

It was indeed the nvidia module eating up all the memory and probably causing the issues.
It is easy to fix temporarily by stopping the display manager and reloading the nvidia modules:

sudo service lightdm stop
sudo rmmod nvidia_uvm
sudo rmmod nvidia
sudo modprobe nvidia
sudo modprobe nvidia_uvm
sudo service lightdm start

We will investigate where it is coming from. In the meantime, I will make sure luigi does not get clogged up, so that other merge requests can run.


(Mathieu Westphal) #29

I’m now monitoring memory usage on luigi with a simple script.

Tylo seems to be experiencing issues that may be related, with errors such as the following:

ERROR: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkShaderProgram.cxx, line 491
vtkSMTestDriver: ***** Test will fail, because the string: "ERROR:"
vtkSMTestDriver: ***** was found in the following output from the client:
"ERROR: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkShaderProgram.cxx, line 491"
vtkShaderProgram (000000000B4EE7E0): Shader object was not initialized, cannot attach it.

Generic Warning: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkOpenGLRenderTimer.cxx, line 124
vtkOpenGLRenderTimer::Stop called before vtkOpenGLRenderTimer::Start. Ignoring.

ERROR: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkOpenGLVertexArrayObject.cxx, line 311
vtkOpenGLVertexArrayObject (000000001650DD50): attempt to add attribute when not ready for attribute ndCoordIn

Generic Warning: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkOpenGLQuadHelper.cxx, line 60
Error binding ndCoords to VAO.

Generic Warning: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkOpenGLRenderTimer.cxx, line 124
vtkOpenGLRenderTimer::Stop called before vtkOpenGLRenderTimer::Start. Ignoring.

Generic Warning: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkOpenGLRenderTimer.cxx, line 124
vtkOpenGLRenderTimer::Stop called before vtkOpenGLRenderTimer::Start. Ignoring.

Generic Warning: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkOpenGLRenderTimer.cxx, line 124
vtkOpenGLRenderTimer::Stop called before vtkOpenGLRenderTimer::Start. Ignoring.

ERROR: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkShaderProgram.cxx, line 491
vtkShaderProgram (0000000016304B50): Shader object was not initialized, cannot attach it.

ERROR: In C:\dashboards\buildbot\2000d565\source\VTK\Rendering\OpenGL2\vtkShaderProgram.cxx, line 446
vtkShaderProgram (0000000016304B50): 1: #version 150
2: #ifndef GL_ES
3: #define highp
4: #define mediump
5: #define lowp
6: #endif // GL_ES
7: #define attribute in
8: #define varying out
9: 
10: in vec2 vertexMC;
11: uniform mat4 WCDCMatrix;
12: uniform mat4 MCWCMatrix;
13: #ifdef haveColors
14: in vec4 vertexScalar;
15: out vec4 vertexColor;
16: #endif
17: #ifdef haveTCoords
18: in vec2 tcoordMC;
19: out vec2 tcoord;
20: #endif
21: #ifdef haveLines
22: in vec2 tcoordMC;
23: out float ldistance;
24: #endif
25: void main() {
26: #ifdef haveColors
27: vertexColor = vertexScalar;
28: #endif
29: #ifdef haveTCoords
30: tcoord = tcoordMC;
31: #endif
32: #ifdef haveLines
33: ldistance = tcoordMC.x;
34: #endif
35: vec4 vertex = vec4(vertexMC.xy, 0.0, 1.0);
36: gl_Position = vertex*MCWCMatrix*WCDCMatrix; }

Note that valgrind also reports memory errors involving vtkOpenGLRenderTimer:

[glow@violet ~/work/paraview/paraviewThird/paraview_build]$ optirun valgrind ./bin/paraview
==13686== Memcheck, a memory error detector
==13686== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==13686== Using Valgrind-3.13.0 and LibVEX; rerun with -h for copyright info
==13686== Command: ./bin/paraview
==13686== 
--13686-- WARNING: unhandled amd64-linux syscall: 332
--13686-- You may be able to write your own handler.
--13686-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--13686-- Nevertheless we consider this a bug.  Please report
--13686-- it at http://valgrind.org/support/bug_reports.html.
==13686== Conditional jump or move depends on uninitialised value(s)
==13686==    at 0x473E2D1E: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0x473E4AA1: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0x46EFDCA3: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0xE933705: vtkOpenGLRenderTimer::Start() (vtkOpenGLRenderTimer.cxx:101)
==13686==    by 0xE8A074A: vtkOpenGLFXAAFilter::StartTimeQuery(vtkOpenGLRenderTimer*) (vtkOpenGLFXAAFilter.cxx:422)
==13686==    by 0xE89EA52: vtkOpenGLFXAAFilter::Execute(vtkOpenGLRenderer*) (vtkOpenGLFXAAFilter.cxx:103)
==13686==    by 0xE94B724: vtkOpenGLRenderer::UpdateGeometry() (vtkOpenGLRenderer.cxx:339)
==13686==    by 0xE94AAB0: vtkOpenGLRenderer::DeviceRender() (vtkOpenGLRenderer.cxx:247)
==13686==    by 0x165E0B75: vtkRenderer::Render() (vtkRenderer.cxx:351)
==13686==    by 0x165DE37B: vtkRendererCollection::Render() (vtkRendererCollection.cxx:51)
==13686==    by 0x165FC952: vtkRenderWindow::DoStereoRender() (vtkRenderWindow.cxx:330)
==13686==    by 0x165FC71D: vtkRenderWindow::Render() (vtkRenderWindow.cxx:291)
==13686== 
==13686== Conditional jump or move depends on uninitialised value(s)
==13686==    at 0x473E39B4: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0x473E3EC2: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0x46EFDCA3: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0xE9338D4: vtkOpenGLRenderTimer::Stop() (vtkOpenGLRenderTimer.cxx:129)
==13686==    by 0xE8A077C: vtkOpenGLFXAAFilter::EndTimeQuery(vtkOpenGLRenderTimer*) (vtkOpenGLFXAAFilter.cxx:433)
==13686==    by 0xE89EA81: vtkOpenGLFXAAFilter::Execute(vtkOpenGLRenderer*) (vtkOpenGLFXAAFilter.cxx:106)
==13686==    by 0xE94B724: vtkOpenGLRenderer::UpdateGeometry() (vtkOpenGLRenderer.cxx:339)
==13686==    by 0xE94AAB0: vtkOpenGLRenderer::DeviceRender() (vtkOpenGLRenderer.cxx:247)
==13686==    by 0x165E0B75: vtkRenderer::Render() (vtkRenderer.cxx:351)
==13686==    by 0x165DE37B: vtkRendererCollection::Render() (vtkRendererCollection.cxx:51)
==13686==    by 0x165FC952: vtkRenderWindow::DoStereoRender() (vtkRenderWindow.cxx:330)
==13686==    by 0x165FC71D: vtkRenderWindow::Render() (vtkRenderWindow.cxx:291)
==13686== 
==13686== Conditional jump or move depends on uninitialised value(s)
==13686==    at 0x473E2D1E: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0x473E4AA1: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0x46EFDCA3: ??? (in /usr/lib/libnvidia-glcore.so.390.48)
==13686==    by 0xE9338D4: vtkOpenGLRenderTimer::Stop() (vtkOpenGLRenderTimer.cxx:129)
==13686==    by 0xE8A077C: vtkOpenGLFXAAFilter::EndTimeQuery(vtkOpenGLRenderTimer*) (vtkOpenGLFXAAFilter.cxx:433)
==13686==    by 0xE89EA81: vtkOpenGLFXAAFilter::Execute(vtkOpenGLRenderer*) (vtkOpenGLFXAAFilter.cxx:106)
==13686==    by 0xE94B724: vtkOpenGLRenderer::UpdateGeometry() (vtkOpenGLRenderer.cxx:339)
==13686==    by 0xE94AAB0: vtkOpenGLRenderer::DeviceRender() (vtkOpenGLRenderer.cxx:247)
==13686==    by 0x165E0B75: vtkRenderer::Render() (vtkRenderer.cxx:351)
==13686==    by 0x165DE37B: vtkRendererCollection::Render() (vtkRendererCollection.cxx:51)
==13686==    by 0x165FC952: vtkRenderWindow::DoStereoRender() (vtkRenderWindow.cxx:330)
==13686==    by 0x165FC71D: vtkRenderWindow::Render() (vtkRenderWindow.cxx:291)

I’m trying to tag TJ Corona so he can look into tylo and let us know about its current memory usage.


(Shawn Waldon) #30

I think tylo’s issue has to do with the power outage last night. It is not on a UPS and it doesn’t handle reboots well.


(Mathieu Westphal) #31

Ok, thanks for the info.

I’ve characterized the problem.

There is one missing run at the beginning, but it has the same profile.
During the build, there is no problem at all.
During testing, GPU VRAM leaks a lot, while RAM does not actually leak.
During testing, once GPU VRAM is full, RAM starts to leak.
(I used nvidia-smi and free to generate this data; please ask if you need the script.)
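
For anyone who wants to collect similar data, here is a minimal sketch of such a sampler. It is not the script used for the data above: the function names, the 60-second interval and the CSV output layout are assumptions made for illustration. It relies only on `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` and `free -m`.

```python
import subprocess
import time


def parse_vram_used_mib(nvidia_smi_csv):
    """Sum used VRAM (MiB) over all GPUs.

    Expects the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`,
    which prints one integer per GPU, one GPU per line.
    """
    return sum(int(line) for line in nvidia_smi_csv.splitlines() if line.strip())


def parse_ram_used_mib(free_m_output):
    """Return the 'used' column of the 'Mem:' row printed by `free -m`."""
    for line in free_m_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Mem:":
            return int(fields[2])  # columns: Mem: total used free ...
    raise ValueError("no 'Mem:' row found in free output")


def sample():
    """Take one (vram_used_mib, ram_used_mib) sample from the live system."""
    smi = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    free = subprocess.run(
        ["free", "-m"], capture_output=True, text=True, check=True
    ).stdout
    return parse_vram_used_mib(smi), parse_ram_used_mib(free)


def monitor(interval_s=60):
    """Print one CSV line per sample: unix timestamp, VRAM used, RAM used (MiB)."""
    while True:
        vram, ram = sample()
        print(f"{int(time.time())},{vram},{ram}", flush=True)
        time.sleep(interval_s)
```

Calling `monitor()` on the dashboard machine and redirecting the output to a file gives a log that can be plotted against the build and test phases.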

So we consider the source of the problem to be a VRAM leak, which points to a leak in our usage of OpenGL.

We identified how to reproduce the leak with current ParaView master :

  1. Open ParaView
  2. Split the view, create a render view
  3. Destroy the new view
  4. Close ParaView

Comparing nvidia-smi output before and after these steps usually shows 12 MB of VRAM leaked.

Step 3 is very important, because:

  • GPU VRAM usage jumps suspiciously when closing (by around 20 MB)
  • No leak happens without it, that is, when closing ParaView directly

We tried tracing OpenGL calls with apitrace, but it does not report any leaks.

A careful examination of the new Widget/Window code has not yielded any results yet.

We will keep investigating.


(Utkarsh Ayachit) #32

BTW, I noticed that the cursor no longer changes to cross-hair when I end selection mode to make selection (on Linux, at least). Is that a new QVTKOpenGLWidget related issue too?


(Mathieu Westphal) #33

Yes it is. I’m adding it to the ever growing pile.


(Utkarsh Ayachit) #34

Sorry :). Such changes are often fraught with such smallish (and some biggish) issues. I am just glad we didn’t rush these changes in for 5.5. It’s always good to give ourselves plenty of time to stabilize.


(Mathieu Westphal) #35

Important news, guys. It is not the driver leaking VRAM; it is probably what @martink suggested by mail about shared OpenGL resources between processes.

I say that because restarting Xorg seems to free all the leaked memory, without the need to unload the driver module as I stated previously:

sudo service lightdm stop
sudo service lightdm start

We do see the memory used by Xorg increasing over time. Just killing Xorg also works as a fix:

sudo kill -9 xorg_pid

(Mathieu Westphal) #36

Yet another breakthrough !

The leak is fixed by disabling desktop effects (I suppose compositing is the main one).

@shawn.waldon : could you confirm that on vall please ?


(Mathieu Westphal) #37

OK guys, we have a bit of a problem here.
With the fix from @Joachim_Pouderoux, most issues are gone.

However, we still experience leaks on some computers and not on others!

Anyone with access to a Linux machine with the nvidia driver and a build of ParaView master can help us identify the issue more clearly:

  • Enable compositing if you have disabled it.
  • Build ParaView master with VTK master (at least ff29c3dcc03650347168ccf82a4344a4521143a4) and with testing enabled.
  • nvidia-smi: note the global memory usage
  • ctest -R pv.UndoRedo2 --repeat-until-fail 20
  • nvidia-smi: check the global memory usage and see if it has gone up
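
The before/after comparison in the steps above can be automated. The following is a minimal sketch under stated assumptions: the helper names `vram_used_mib`, `measure_growth` and `run_undo_redo_test` are hypothetical, and the script is meant to be run from the ParaView build directory on a machine where `nvidia-smi` and `ctest` are on the PATH.

```python
import subprocess


def vram_used_mib():
    """Global used VRAM in MiB, summed over all GPUs reported by nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return sum(int(line) for line in out.splitlines() if line.strip())


def measure_growth(workload, read_usage=vram_used_mib):
    """Run `workload()` and return (before, after, after - before) in MiB."""
    before = read_usage()
    workload()
    after = read_usage()
    return before, after, after - before


def run_undo_redo_test():
    """Same ctest invocation as in the list above (hypothetical wrapper)."""
    # check=False: a failing test should not abort the measurement.
    subprocess.run(
        ["ctest", "-R", "pv.UndoRedo2", "--repeat-until-fail", "20"],
        check=False,
    )
```

Calling `measure_growth(run_undo_redo_test)` returns the two readings and their difference. A clearly positive difference that keeps growing over repeated calls suggests the leak is present on that machine.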

Please respond to this message and include:

  • OS and OS version (please say whether it was installed at the current version or upgraded)
  • Graphics card model
  • nvidia driver version
  • Kernel version
  • X -version
  • Qt version
  • Window manager and desktop environment used, with versions
  • ParaView’s git tag
  • Of course, the memory results

@shawn.waldon @cory.quammen @utkarsh.ayachit

Many thanks!


(Cory Quammen (Kitware)) #38

ParaView: 3494fe78
VTK: 437188aa

Memory before test: 1856MiB
/usr/lib/xorg/Xorg 1127MiB
compiz 388MiB
-token=… 330MiB
Memory after test: 2011MiB
/usr/lib/xorg/Xorg 1229MiB
compiz 441MiB
-token=… 330MiB

System specs

  • Ubuntu 16.04
  • Quadro M4000
  • Driver: 384.90
  • Kernel: #46~16.04.1-Ubuntu SMP Mon Dec 4 15:57:59 UTC 2017
  • X.Org X Server: 1.19.5
  • Qt: 5.9.1
  • Compiz 0.9.12.2, Unity 7.4.0

Some more data: each time I run ctest -R pv.UndoRedo2 --repeat-until-fail 20, Xorg increases memory usage by about 110MB. I’m seeing that growth consistently after 4 invocations of this command.
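
As a quick check of the figures above, the per-process growth can be computed directly from the reported numbers. This is only an arithmetic sketch: the values are copied from this post, and the per-repetition figure assumes that all 20 repetitions actually ran (`--repeat-until-fail` stops early on a failure).

```python
# VRAM figures (MiB) copied from the nvidia-smi readings reported above.
before = {"/usr/lib/xorg/Xorg": 1127, "compiz": 388, "-token=…": 330}
after = {"/usr/lib/xorg/Xorg": 1229, "compiz": 441, "-token=…": 330}
global_before, global_after = 1856, 2011
repeats = 20  # assumption: the test did not fail, so all repetitions ran


def per_process_growth(before, after):
    """Return the growth in MiB for every process listed in `before`."""
    return {name: after[name] - before[name] for name in before}


growth = per_process_growth(before, after)
total = global_after - global_before

for name, delta in growth.items():
    print(f"{name}: +{delta} MiB")
print(f"global: +{total} MiB, about {total / repeats:.2f} MiB per repetition")
```

For this single measurement, Xorg grows by 102 MiB and compiz by 53 MiB, which together account for the whole 155 MiB of global growth.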


(Shawn Waldon) #39

This is my local machine, not a dashboard.

Xubuntu 18.04, installed in its current version
Quadro K620
Driver version 390.48
Kernel 4.15.0-23
Xorg: 1.19.6
Qt: 5.10.1 (from Qt5 website)
Xfce 4.12

Memory usage from nvidia-smi:
Before:
Global: 228MiB / 1993MiB
/usr/lib/xorg/Xorg 180MiB
/usr/lib/firefox/firefox 43MiB

After
Global: 270MiB / 1993MiB
/usr/lib/xorg/Xorg 222MiB
/usr/lib/firefox/firefox 43MiB


(Joachim Pouderoux) #40

@cory.quammen @shawn.waldon Just to be sure: did you perform this test on master, or after applying my MR vtk!4431 or Mathieu’s paraview!2511?