Strange slice filter update behaviour

With ParaView 5.11.1 (official binaries) on a 72-core (Xeon(R) Platinum 8360Y CPU @ 2.40GHz) Linux 4.18 x86_64 node with 512 GB memory I get extremely sluggish slice plane filter performance, or at least it doesn’t seem to do anything for a long time (see below). The dataset is a uniform rectilinear grid of dimensions 862x759x1526, containing 4 float32 arrays, loaded from a 41GB .vti file.

Applying a slice filter, touching no options and pressing Apply takes roughly 19 seconds. The weird thing is that for more or less the first 18 seconds no updates happen in the UI, and only in the final second is the progress bar updated and the slice plane shown. The same behaviour happens when moving the slice plane and pressing Apply.

Looking at some other operations on the data, applying a contour filter on the same dataset takes roughly 11 seconds, with the progress bar continuously updating from the moment Apply is pressed. I find it bizarre that the contour is so much faster, given that it computes a whole lot more than a simple slice.

A Clip to half the dataset takes around 5 seconds.

Any idea why the slice filter can be so slow? Would it matter that one of the data arrays is a 6-dimensional float array?

I’m the only user on the node, btw. I am running this in a VNC server under VirtualGL (as I don’t have any other GPU server with the required amount of memory), but the OpenGL info in Help > About is correct and shows an A100 being used, and the four A100s in the node are properly detected judging from the console output. Also, when I use a Python trace to apply a slice filter and run it under pvbatch I see the same slow times.

What is the size of your data?

You mean spatial dimensions? 0.1 x 0.09 x 0.18 units in XYZ. Other sizes are listed above.

Right I missed that.

What do you have in Tools → Timer Log?

Local Process

Still Render,  0.011861 seconds

    Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)
    OpenGL Dev Render,  0.011118 seconds
PropertiesPanel::Apply,  18.2156 seconds
    ParaViewPipelineControllerWithRendering::UpdatePipelineBeforeDisplay,  17.9364 seconds
        Execute Slice1 id: 9633,  17.9225 seconds
    ParaViewPipelineControllerWithRendering::Show,  0.163894 seconds
        ParaViewPipelineControllerWithRendering::Show::CreatingRepresentation,  0.15002 seconds
    RenderView::Update,  0.013708 seconds
        vtkPVView::Update,  0.013531 seconds
    Still Render,  0.099001 seconds
        Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)
        OpenGL Dev Render,  0.097776 seconds
    Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)

Looks like the slice is indeed slow, no idea why though.

How big is the dataset? Can you post it here?

I managed to generate a simple dataset that replicates the behaviour. Basically the same dataset specs, but with all zero values, in .vtk format (based on saving the .vti from ParaView to .vtk):

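#!/usr/bin/env python
# Writes a ~4 GB legacy .vtk file holding one all-zero float32 point array.
N_POINTS = 998397708  # 862 * 759 * 1526

header = b"""# vtk DataFile Version 5.1
vtk output
BINARY
DATASET STRUCTURED_POINTS
DIMENSIONS 862 759 1526
SPACING 0.00011814 0.00011814 0.00011814
ORIGIN 0 0 0
POINT_DATA 998397708
FIELD FieldData 1
density 1 998397708 float
"""

# Write the zeros in chunks so the full payload never sits in memory at once.
CHUNK_VALUES = 1 << 24  # float32 values per write (64 MiB)
zeros = b'\x00\x00\x00\x00' * CHUNK_VALUES

# The context manager makes sure the file is flushed and closed.
with open('slice.vtk', 'wb') as out:
    out.write(header)
    remaining = N_POINTS
    while remaining > 0:
        n = min(remaining, CHUNK_VALUES)
        out.write(zeros[:4 * n])
        remaining -= n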
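As a quick sanity check on the numbers in that header, the point count and the size of a single float32 array follow directly from the dimensions. This is only arithmetic on the values quoted above; the variable names are made up for illustration.

```python
import math

# Grid dimensions as given in the DIMENSIONS line of the header.
dims = (862, 759, 1526)

# Total number of points; should match the POINT_DATA count.
n_points = math.prod(dims)

# One single-component float32 array takes 4 bytes per point.
bytes_per_array = n_points * 4

print(n_points)                                  # 998397708
print(f"{bytes_per_array / 2**30:.2f} GiB per array")
```

So each scalar array is close to 4 GB on its own, and any filter that visits every point has almost a billion points to go through.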

When I slice this in ParaView it takes the reported 17+ seconds:

PropertiesPanel::Apply,  17.8729 seconds
    RenderView::Update,  17.7833 seconds
        vtkPVView::Update,  17.7831 seconds
            Execute Slice1 id: 13465,  17.7697 seconds
    Still Render,  0.080111 seconds
        Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)
        OpenGL Dev Render,  0.077939 seconds
    Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)

One thing I really do not understand is that if I use a Wavelet source with the exact same dimensions the Slice filter is really quick:

PropertiesPanel::Apply,  0.309493 seconds
    ParaViewPipelineControllerWithRendering::UpdatePipelineBeforeDisplay,  0.032021 seconds
        Execute Slice1 id: 13999,  0.0283 seconds
    ParaViewPipelineControllerWithRendering::Show,  0.173746 seconds
        ParaViewPipelineControllerWithRendering::Show::CreatingRepresentation,  0.160004 seconds
    Still Render,  0.092106 seconds
        Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)
        OpenGL Dev Render,  0.090023 seconds
    Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)
    Render (use_lod: 0), (use_distributed_rendering: 0), (use_ordered_compositing: 0)

And it’s not the small spacing: I tried setting it to 1.0 in a separate .vtk file and got the same slow slicing behaviour.

Edit: the spacing does matter, but not much for the Slice Execute step: spacing 1.0 → 16.3953s, spacing 0.001… → 17.6362s

@berkgeveci @cory.quammen
I took a quick peek at this dataset. By the way, spectacular replicator. I have never seen a vtk file created this way. Nice.

Anyway, when I run the slice filter on this dataset we are spending LOTS of time in vtkElevationFilter, which just feels weird. Further, it is SLOW. It’s doing a TransformCoordinates on every one of the almost billion points, and doing a dot product on each point. Ouch. Adding Cory and Berk in case there is either a mistake, a smarter algorithm or a better file format that Paul should use…

Thanks @wascott for looking into this. Note that the original file format was a .vti (and not created by me). It would seem to be the best suited for this type of data, no?

I detected the same problem in some of my performance evaluations, and already have a fix for it. I don’t know if it’s going to get into 5.12, but I will try.

The elevation filter is used to create a dummy point scalar field if none is present. Scalars are needed by vtkFlyingEdgesPlaneCutter to operate. In my fix, I just create a constant scalar array instead of running the elevation filter.
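To make the difference concrete, here is a rough NumPy sketch (not VTK code) of the two ways to produce a dummy scalar array. The function names are hypothetical, and the projection-and-clamp formula is an assumption about what an elevation-style filter computes per point: project the point onto the low-to-high axis, clamp to [0, 1], then map into the scalar range.

```python
import numpy as np

def elevation_scalars(points, low, high, scalar_range=(0.0, 1.0)):
    """Elevation-style scalar: one subtraction and one dot product per point."""
    points = np.asarray(points, dtype=np.float64)
    low = np.asarray(low, dtype=np.float64)
    axis = np.asarray(high, dtype=np.float64) - low
    # Normalised projection of each point onto the low -> high axis.
    t = (points - low) @ axis / np.dot(axis, axis)
    t = np.clip(t, 0.0, 1.0)
    lo, hi = scalar_range
    return lo + t * (hi - lo)

def constant_scalars(n_points, value=0.0):
    """Dummy scalar that needs no point coordinates at all."""
    return np.full(n_points, value, dtype=np.float32)

# Four sample points along z; the last one lies beyond the high point.
pts = np.array([[0, 0, 0], [0, 0, 0.5], [0, 0, 1], [0, 0, 2]], dtype=float)
elev = elevation_scalars(pts, low=(0, 0, 0), high=(0, 0, 1))
const = constant_scalars(len(pts))
print(elev)   # [0.  0.5 1.  1. ]
print(const)  # [0. 0. 0. 0.]
```

The constant version never touches the coordinates, which is why it avoids the per-point cost when the scalar values themselves are irrelevant to the slice.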

Elevation should not take 17s anyway.

The elevation filter has a slow path and a fast path (which uses SMP). For non-pointset subclasses the slow path is used, which is why it’s so slow here.
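A rough illustration of why the dataset type matters: image data stores no explicit point array, so each point has to be derived from its index, origin and spacing. The sketch below is plain NumPy, not VTK's implementation, and the function names are hypothetical. It assumes the x-fastest point ordering that VTK uses for structured data and contrasts a serial one-point-at-a-time lookup with a bulk computation over all points.

```python
import numpy as np

def structured_point(idx, dims, origin, spacing):
    """Coordinates of one implicit point from its flat index (x varies fastest)."""
    nx, ny, _ = dims
    i = idx % nx
    j = (idx // nx) % ny
    k = idx // (nx * ny)
    return (origin[0] + i * spacing[0],
            origin[1] + j * spacing[1],
            origin[2] + k * spacing[2])

def all_points(dims, origin, spacing):
    """Same coordinates for every point, computed in bulk."""
    nx, ny, nz = dims
    idx = np.arange(nx * ny * nz)
    ijk = np.stack([idx % nx, (idx // nx) % ny, idx // (nx * ny)], axis=1)
    return np.asarray(origin, dtype=float) + ijk * np.asarray(spacing, dtype=float)

# Tiny grid so the serial loop stays cheap.
dims, origin, spacing = (3, 2, 2), (0.0, 0.0, 0.0), (0.5, 1.0, 2.0)
n = dims[0] * dims[1] * dims[2]

# Serial variant: one lookup per point.
serial = np.array([structured_point(p, dims, origin, spacing) for p in range(n)])

# Bulk variant: one vectorised pass.
bulk = all_points(dims, origin, spacing)
print(np.allclose(serial, bulk))  # True
```

Both variants give the same coordinates; the difference is only in how the work is organised, which is what separates a serial per-point path from one that processes points in parallel batches.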

Sounds like a bug to fix.