Linux X issues


(Utkarsh Ayachit) #1

X-based dashboards are running into issues quite regularly of late. Is this related to the QVTKOpenGLWidget changes?


(Mathieu Westphal) #2

The luigi issues are unrelated, as VTK was broken as well.

The vall issues look like something we had during development, but @shawn.waldon fixed that by reinstalling the nvidia driver, as far as I know.


(Shawn Waldon) #3

The drivers look fine right now as far as I can tell. I’ve unpaused the machine and we can see if the failures continue. I had to reboot the machine Friday since X was messed up. It was fine for a few builds and then these errors started to crop up again. At some point it got paused, but I’ve unpaused it and we’ll see how it handles the queue.

What Utkarsh is saying is that since the merge of the new widget the “X is messed up” type of dashboard error seems to be more common on both vall and luigi. Once X is broken neither VTK nor ParaView will work, but is the ParaView dashboard causing X to break?


(Mathieu Westphal) #4

Good question.
I will monitor luigi to see whether the X breakage I am seeing is related to the testing.


(Shawn Waldon) #5

ParaView builds have been going for a while. I happened to look over and see a “Your OpenGL drivers don’t support the features required for basic rendering” on one of the ParaView tests.


(Shawn Waldon) #6

So the pattern I see over the last day is that the tests were fine, then a few started failing on each build and finally everything failed. You can see this on my preset dialog branch if you look at the history on buildbot (the last 3 were all fine on other dashboards and locally).

Looking into the first build that failed and broke the streak of good builds on vall I tried to identify the first test failure that didn’t look related to the branch. Here is the build, and here is (according to buildbot) the first (temporally) test that failed that looks like a machine issue: https://open.cdash.org/testDetails.php?test=660266841&build=5406847

Test: pvcs.LagrangianParticleTrackerParallel
Output:

ERROR: In /home/kitware/dashboards/buildbot/paraview-vall-linux-shared-debug_doc_extdeps_gui_mpi_python_python3/source/VTK/Rendering/OpenGL2/vtkXOpenGLRenderWindow.cxx, line 796
vtkXOpenGLRenderWindow (0x2a70ff0): failed to create offscreen window

ERROR: In /home/kitware/dashboards/buildbot/paraview-vall-linux-shared-debug_doc_extdeps_gui_mpi_python_python3/source/VTK/Rendering/OpenGL2/vtkOpenGLRenderWindow.cxx, line 725
vtkXOpenGLRenderWindow (0x2a70ff0): GLEW could not be initialized.

ERROR: In /home/kitware/dashboards/buildbot/paraview-vall-linux-shared-debug_doc_extdeps_gui_mpi_python_python3/source/VTK/Rendering/OpenGL2/vtkXOpenGLRenderWindow.cxx, line 796
vtkXOpenGLRenderWindow (0x2579590): failed to create offscreen window

ERROR: In /home/kitware/dashboards/buildbot/paraview-vall-linux-shared-debug_doc_extdeps_gui_mpi_python_python3/source/VTK/Rendering/OpenGL2/vtkOpenGLRenderWindow.cxx, line 725
vtkXOpenGLRenderWindow (0x2579590): GLEW could not be initialized.
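
For triage, here is a minimal sketch of how failures like these could be separated from branch-related ones, assuming the test output is available as plain text. The signature strings are the ones quoted in this thread; the function names are hypothetical and not part of buildbot or CDash.

```python
from typing import List

# Error strings quoted in this thread that point at a broken X/driver state
# rather than at the branch under test.
MACHINE_SIGNATURES = (
    "failed to create offscreen window",
    "GLEW could not be initialized",
    "double free or corruption",
)


def matched_signatures(test_output: str) -> List[str]:
    """Return the known machine-level signatures found in a test's output."""
    return [sig for sig in MACHINE_SIGNATURES if sig in test_output]


def looks_like_machine_issue(test_output: str) -> bool:
    """True when at least one known signature appears in the output."""
    return bool(matched_signatures(test_output))
```

A plain substring match is deliberately crude; it only flags output worth a second look and does not prove the machine is at fault.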

(Mathieu Westphal) #7

I’m seeing similar behavior on luigi.

Serial tests are fine.
Parallel tests start failing, especially the pvcs tests, and then all tests fail with a segfault.

Somehow they sometimes recover.

I was able to catch the segfault in gdb (it is reported as a SIGABRT raised on a double free); here is the backtrace:

Thread 1 "paraview" received signal SIGABRT, Aborted.
0x00007ffff3d6c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff3d6c428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff3d6e02a in __GI_abort () at abort.c:89
#2  0x00007ffff3dae7ea in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7ffff3ec7ed8 "*** Error in `%s': %s: 0x%s ***\n") at ../sysdeps/posix/libc_fatal.c:175
#3  0x00007ffff3db737a in malloc_printerr (ar_ptr=<optimized out>, ptr=<optimized out>, str=0x7ffff3ec8008 "double free or corruption (!prev)", action=3) at malloc.c:5006
#4  _int_free (av=<optimized out>, p=<optimized out>, have_lock=0) at malloc.c:3867
#5  0x00007ffff3dbb53c in __GI___libc_free (mem=<optimized out>) at malloc.c:2968
#6  0x00007fffc095bcbc in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#7  0x00007fffc095bd03 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#8  0x00007fffc095bfa4 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#9  0x00007fffc084e650 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#10 0x00007fffc082881c in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#11 0x00007fffc07276ad in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#12 0x00007fffc07275a8 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
#13 0x00007fffc1d945c2 in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#14 0x00007fffc1d96632 in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#15 0x00007fffc1d94b52 in clGetPlatformIDs () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#16 0x00007fffc1f992f5 in ?? () from /usr/lib/x86_64-linux-gnu/hwloc/hwloc_opencl.so
#17 0x00007fffd0f7f011 in hwloc_backends_notify_new_object () from /usr/lib/x86_64-linux-gnu/libhwloc.so.5
#18 0x00007fffd0f831fc in ?? () from /usr/lib/x86_64-linux-gnu/libhwloc.so.5
#19 0x00007fffd0f832b9 in hwloc_insert_pci_device_list () from /usr/lib/x86_64-linux-gnu/libhwloc.so.5
#20 0x00007fffc1b8f9a5 in ?? () from /usr/lib/x86_64-linux-gnu/hwloc/hwloc_pci.so
#21 0x00007fffd0f7923d in hwloc_topology_load () from /usr/lib/x86_64-linux-gnu/libhwloc.so.5
#22 0x00007fffd5bc33d3 in opal_hwloc_base_get_topology () from /usr/lib/libopen-pal.so.13
#23 0x00007fffea6b4e65 in ompi_mpi_init () from /usr/lib/libmpi.so.12
#24 0x00007fffea6d354d in PMPI_Init () from /usr/lib/libmpi.so.12
#25 0x00007ffff6057fb1 in vtkProcessModule::Initialize(vtkProcessModule::ProcessTypes, int&, char**&) () from /home/kitware/Dashboards/buildslave/paraview-luigi-linux-shared-release_gui_mpi_python/build/lib/libvtkPVClientServerCoreCore-pv5.5.so.1
#26 0x00007ffff2b6387a in vtkInitializationHelper::Initialize(int, char**, int, vtkPVOptions*) () from /home/kitware/Dashboards/buildslave/paraview-luigi-linux-shared-release_gui_mpi_python/build/lib/libvtkPVServerManagerApplication-pv5.5.so.1
#27 0x00007ffff6c1d9df in pqApplicationCore::pqApplicationCore(int&, char**, pqOptions*, QObject*) () from /home/kitware/Dashboards/buildslave/paraview-luigi-linux-shared-release_gui_mpi_python/build/lib/libvtkpqCore-pv5.5.so.1
#28 0x00007ffff7a8eb83 in pqPVApplicationCore::pqPVApplicationCore(int&, char**, pqOptions*) () from /home/kitware/Dashboards/buildslave/paraview-luigi-linux-shared-release_gui_mpi_python/build/lib/libvtkpqApplicationComponents-pv5.5.so.1
#29 0x000000000040961d in pqparaviewInitializer::Initialize(int, char**) ()
#30 0x0000000000409220 in main ()
(gdb) 

This seems to be related to MPI and nvidia, but this trace is from a non-parallel test.
I will try to get more info.
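
To make the path through the libraries easier to read, here is a small sketch that collapses a gdb backtrace into the ordered list of shared libraries it passes through. It assumes the usual gdb frame format with a trailing `from /path/to/lib.so`; the helper name `library_chain` is hypothetical.

```python
import os
import re
from typing import List

# Matches the trailing "from /path/to/lib.so" part of a gdb frame line.
FRAME_LIB = re.compile(r"^#\d+\s+.*\sfrom\s+(\S+)\s*$")


def library_chain(backtrace: str) -> List[str]:
    """List shared-library basenames from the innermost frame outward,
    collapsing consecutive frames that belong to the same library."""
    chain: List[str] = []
    for line in backtrace.splitlines():
        match = FRAME_LIB.match(line.strip())
        if not match:
            # Frames with source info ("at malloc.c:2968") carry no "from" part.
            continue
        name = os.path.basename(match.group(1))
        if not chain or chain[-1] != name:
            chain.append(name)
    return chain
```

Applied to the trace above, the chain should start at libnvidia-opencl.so.1 and pass through libOpenCL, the hwloc plugins, libopen-pal, and libmpi before reaching the ParaView libraries. The trace also shows PMPI_Init being called from vtkProcessModule::Initialize, which may be why MPI shows up even in a non-parallel test.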


(Mathieu Westphal) #8

After a full update and a switch to the Ubuntu-provided nvidia driver, I no longer see the issue. I will keep monitoring.


(Cory Quammen (Kitware)) #9

Thanks, @mwestphal. The trace definitely suggested a problem related to the graphics driver.


(Mathieu Westphal) #10

After a night of testing, the same issue has shown up again. I will try to investigate further.


(Mathieu Westphal) #11

@Joachim_Pouderoux


(Mathieu Westphal) #12

Hi guys!

I have tested a branch that fixes a few memory errors I found with valgrind, but vall still does not seem happy with it. We will see how luigi handles it.

https://gitlab.kitware.com/paraview/paraview/merge_requests/2511

I’m a bit out of my depth here. The fact that this issue is so elusive and only shows up in specific situations does not help much.

I have no idea how to move this forward. Could you take it from here?


(Mathieu Westphal) #13

Here is all the information I have gathered:

  • After starting up the buildbot, there are no errors.
  • After a few buildbot runs (between 4 and 10), nvidia driver errors start to appear.
  • The error can be “GLEW could not be initialized” or the segfault mentioned earlier.
  • Changing the nvidia driver version does not affect the bug (tested 384, 390, 340).
  • Changing the Qt version does not affect the bug (tested 5.9, 5.11).
  • Rebooting fixes the bug until it comes back.
  • I could not test the nouveau driver.
  • I have not tested without MPI yet.

(Utkarsh Ayachit) #14

Another data point: the issue is indeed a “master”-only issue. I had to run a couple of buildbots based off the release branch, and all dashboards seemed totally happy:

e.g.

  1. https://gitlab.kitware.com/paraview/paraview/merge_requests/2513
  2. https://gitlab.kitware.com/paraview/paraview/merge_requests/2514

(Mathieu Westphal) #15

Indeed, as release does not contain the new QOpenGLWindow-based code.


(Utkarsh Ayachit) #16

Here’s an interesting call stack from quitting the application (this is from a custom app). I see a whole lot of makeCurrent calls when quitting this custom app, and it takes quite some time to exit. I wonder if this is a clue.

....
#12 0x00007fffd9c91a51 in ?? () from /usr/lib/nvidia-384/libnvidia-glcore.so.384.130
#13 0x00007fffd9c99c78 in ?? () from /usr/lib/nvidia-384/libnvidia-glcore.so.384.130
#14 0x00007fffeab8b140 in ?? () from /usr/lib/nvidia-384/libGL.so.1
#15 0x00007fffeab89352 in ?? () from /usr/lib/nvidia-384/libGL.so.1
#16 0x00007fffeabbcf73 in ?? () from /usr/lib/nvidia-384/libGL.so.1
#17 0x00007ffff7fe69b9 in QGLXContext::makeCurrent (this=0x2ff6ae0, surface=0x40a7fc0) at qglxintegration.cpp:501
#18 0x00007ffff66fa3a2 in QOpenGLContext::makeCurrent (this=0x2080140, surface=0x4997260) at kernel/qopenglcontext.cpp:986
#19 0x00007ffff4131706 in QVTKOpenGLWindow::MakeCurrent (this=0x27179e0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/QVTKOpenGLWindow.cxx:227
#20 0x00007ffff41341d3 in QVTKOpenGLWindow::qt_static_metacall (_o=0x27179e0, _c=QMetaObject::InvokeMetaMethod, _id=1, _a=0x7fffffffd620) at VTK/GUISupport/Qt/moc_QVTKOpenGLWindow.cpp:121
#21 0x00007ffff62d9c47 in QMetaObject::activate (sender=sender@entry=0x2725720, signalOffset=<optimized out>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x7fffffffd620)
    at kernel/qobject.cpp:3767
#22 0x00007ffff62d9f7f in QMetaObject::activate (sender=0x2725720, m=<optimized out>, local_signal_index=0, argv=0x7fffffffd620) at kernel/qobject.cpp:3629
#23 0x00007ffff4133104 in vtkQtConnection::EmitExecute (this=0x2725720, _t1=0x264eb10, _t2=89, _t3=0x0, _t4=0x0, _t5=0x27256d0) at VTK/GUISupport/Qt/moc_vtkQtConnection.cpp:140
#24 0x00007ffff411a00f in vtkQtConnection::Execute (this=0x2725720, caller=0x264eb10, e=89, call_data=0x0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/vtkQtConnection.cxx:72
#25 0x00007ffff4119f9e in vtkQtConnection::DoCallback (vtk_obj=0x264eb10, event=89, client_data=0x2725720, call_data=0x0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/vtkQtConnection.cxx:62
#26 0x00007ffff20d46bb in vtkCallbackCommand::Execute (this=0x27256d0, caller=0x264eb10, event=89, callData=0x0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Common/Core/vtkCallbackCommand.cxx:42
#27 0x00007ffff23bbc87 in vtkSubjectHelper::InvokeEvent (this=0x23bb9b0, event=89, callData=0x0, self=0x264eb10) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Common/Core/vtkObject.cxx:616
#28 0x00007ffff23bc1ab in vtkObject::InvokeEvent (this=0x264eb10, event=89, callData=0x0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Common/Core/vtkObject.cxx:785
#29 0x00007fffecb51888 in vtkGenericOpenGLRenderWindow::MakeCurrent (this=0x264eb10) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Rendering/OpenGL2/vtkGenericOpenGLRenderWindow.cxx:106
#30 0x00007fffecb9d69e in vtkOpenGLHelper::ReleaseGraphicsResources (this=0x34a6bf8, win=0x264eb10) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Rendering/OpenGL2/vtkOpenGLHelper.cxx:43
#31 0x00007fffecbd3279 in vtkOpenGLPolyDataMapper::ReleaseGraphicsResources (this=0x34a6980, win=0x264eb10) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Rendering/OpenGL2/vtkOpenGLPolyDataMapper.cxx:196
#32 0x00007fffecbf1b45 in vtkOpenGLResourceFreeCallback<vtkOpenGLPolyDataMapper>::Release (this=0x34a86c0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Rendering/OpenGL2/vtkOpenGLResourceFreeCallback.h:82
#33 0x00007fffecc107e7 in vtkOpenGLRenderWindow::ReleaseGraphicsResources (this=0x264eb10, renWin=0x264eb10) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Rendering/OpenGL2/vtkOpenGLRenderWindow.cxx:313
#34 0x00007fffecb5183c in vtkGenericOpenGLRenderWindow::Finalize (this=0x264eb10) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/Rendering/OpenGL2/vtkGenericOpenGLRenderWindow.cxx:96
#35 0x00007ffff413072b in QVTKOpenGLWindow::SetRenderWindow (this=0x27179e0, w=0x0) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/QVTKOpenGLWindow.cxx:74
#36 0x00007ffff413055f in QVTKOpenGLWindow::~QVTKOpenGLWindow (this=0x27179e0, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/QVTKOpenGLWindow.cxx:46
#37 0x00007ffff41305cc in QVTKOpenGLWindow::~QVTKOpenGLWindow (this=0x27179e0, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/QVTKOpenGLWindow.cxx:47
#38 0x00007ffff6d27007 in QWindowContainer::~QWindowContainer (this=0x2704d70, __in_chrg=<optimized out>) at kernel/qwindowcontainer.cpp:256
#39 0x00007ffff6d27029 in QWindowContainer::~QWindowContainer (this=0x2704d70, __in_chrg=<optimized out>) at kernel/qwindowcontainer.cpp:257
#40 0x00007ffff62dfdc0 in QObjectPrivate::deleteChildren (this=this@entry=0x2704e30) at kernel/qobject.cpp:1993
#41 0x00007ffff6d02e53 in QWidget::~QWidget (this=0x2721850, __in_chrg=<optimized out>) at kernel/qwidget.cpp:1703
#42 0x00007ffff412f8d8 in QVTKOpenGLWidget::~QVTKOpenGLWidget (this=0x2721850, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/VTK/GUISupport/Qt/QVTKOpenGLWidget.cxx:84
#43 0x00007ffff73621fa in pqQVTKWidget::~pqQVTKWidget (this=0x2721850, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/Qt/Core/pqQVTKWidget.cxx:76
#44 0x00007ffff7362232 in pqQVTKWidget::~pqQVTKWidget (this=0x2721850, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/Qt/Core/pqQVTKWidget.cxx:78
#45 0x00007ffff62dfdc0 in QObjectPrivate::deleteChildren (this=this@entry=0x1d83220) at kernel/qobject.cpp:1993
#46 0x00007ffff6d02e53 in QWidget::~QWidget (this=0x1d88d10, __in_chrg=<optimized out>) at kernel/qwidget.cpp:1703
#47 0x00007ffff6dbe272 in QFrame::~QFrame (this=<optimized out>, __in_chrg=<optimized out>) at widgets/qframe.cpp:262
#48 0x00007ffff6dbe287 in QFrame::~QFrame (this=0x1d88d10, __in_chrg=<optimized out>) at widgets/qframe.cpp:264
#49 0x00007ffff62dfdc0 in QObjectPrivate::deleteChildren (this=this@entry=0x22ffb20) at kernel/qobject.cpp:1993
#50 0x00007ffff6d02e53 in QWidget::~QWidget (this=0x2721070, __in_chrg=<optimized out>) at kernel/qwidget.cpp:1703
#51 0x00007ffff466c886 in pqViewFrame::~pqViewFrame (this=0x2721070, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/Qt/Components/pqViewFrame.cxx:122
#52 0x00007ffff466c8be in pqViewFrame::~pqViewFrame (this=0x2721070, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/Qt/Components/pqViewFrame.cxx:124
#53 0x00007ffff62dfdc0 in QObjectPrivate::deleteChildren (this=this@entry=0xf9b3d0) at kernel/qobject.cpp:1993
#54 0x00007ffff6d02e53 in QWidget::~QWidget (this=0x1d7ab80, __in_chrg=<optimized out>) at kernel/qwidget.cpp:1703
#55 0x00007ffff4582dde in pqMultiViewWidget::~pqMultiViewWidget (this=0x1d7ab80, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/Qt/Components/pqMultiViewWidget.cxx:232
#56 0x00007ffff4582e1c in pqMultiViewWidget::~pqMultiViewWidget (this=0x1d7ab80, __in_chrg=<optimized out>) at /home/utkarsh/Kitware/ParaView3/ParaView/Qt/Components/pqMultiViewWidget.cxx:236
#57 0x00007ffff62dfdc0 in QObjectPrivate::deleteChildren (this=this@entry=0xca6b60) at kernel/qobject.cpp:1993
#58 0x00007ffff6d02e53 in QWidget::~QWidget (this=0xca6b20, __in_chrg=<optimized out>) at kernel/qwidget.cpp:1703
#59 0x00007ffff6dbe272 in QFrame::~QFrame (this=<optimized out>, __in_chrg=<optimized out>) at widgets/qframe.cpp:262
#60 0x00007ffff6ea167a in QStackedWidget::~QStackedWidget (this=<optimized out>, __in_chrg=<optimized out>) at widgets/qstackedwidget.cpp:145
#61 0x00007ffff6ea168f in QStackedWidget::~QStackedWidget (this=0xca6b20, __in_chrg=<optimized out>) at widgets/qstackedwidget.cpp:147

(Mathieu Westphal) #17

That is indeed a bug in the implementation. Good find! I have a fix coming. We will see if it improves things.
Pushed, and tests are running:
https://gitlab.kitware.com/paraview/paraview/merge_requests/2511
https://gitlab.kitware.com/vtk/vtk/merge_requests/4400


(Mathieu Westphal) #18

I’ve updated my post, nvidia 340 did not help.

The makeCurrent fix seems to work so far. It may need a little more testing to be sure.

Vall is still broken, even with master, so we can’t really take its results into account. @shawn.waldon
https://open.cdash.org/testDetails.php?test=662192932&build=5414805
https://open.cdash.org/testDetails.php?test=662072612&build=5414805


(Mathieu Westphal) #19

I’ve noticed that on luigi, used memory is very high after a round of testing, with free output like this:

              total        used        free      shared  buff/cache   available
Mem:          11976       10234         953          45         788         999
Swap:         12246           0       12246

and no process is using the memory, while the computer is very slow if I run tests again.

It is of course fixed by a reboot. I wonder if this is related.
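
To track this across test rounds, here is a rough sketch that parses `free` output of the shape shown above and reports how much of the total is marked as used. The helper names are hypothetical, and the units are whatever `free` was asked to print (not stated here).

```python
from typing import Dict


def parse_free_mem(free_output: str) -> Dict[str, int]:
    """Map the column names of `free` output to the values on the "Mem:" row."""
    lines = [ln for ln in free_output.splitlines() if ln.strip()]
    columns = lines[0].split()
    for line in lines[1:]:
        fields = line.split()
        if fields[0] == "Mem:":
            return dict(zip(columns, (int(v) for v in fields[1:])))
    raise ValueError("no 'Mem:' row found")


def used_fraction(mem: Dict[str, int]) -> float:
    """Fraction of total memory reported as used."""
    return mem["used"] / mem["total"]
```

Logging this value after each buildbot run, next to the per-process totals, would show whether the unaccounted memory grows steadily or jumps at the point where the tests start failing.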


(Shawn Waldon) #20

Vall has been rebooted.