Processing very large `vtkMultiBlockDataSet` - performance & pipeline suggestions // working example attached

M1kol4j · September 17, 2023, 11:16am

Dear ParaView experts,

Context: I am working on a very large astrophysical simulation. Our parallel code splits the computational domain into hundreds of thousands of small, independent, structured grids (typical size 24^3 gridpoints). In one variant of that simulation setup, these meshes are slightly tilted with respect to each other (see first figure). This is to cover the surface of the star.

Pipeline: To visualise such a setup, I first convert all tiny meshes from native format to individual *.vts files using the PyVista StructuredGrid() & save() functions. Next, I manually write the content of the *.vtm file from the list of all the *.vts files. The content of this file looks something like this:

<?xml version="1.0"?>
<VTKFile type="vtkMultiBlockDataSet" version="1.0" byte_order="LittleEndian" header_type="UInt32" compressor="vtkZLibDataCompressor">
	<vtkMultiBlockDataSet>
	<Block index="1">
			<DataSet index="10513" time="84.400000" name="Block-10513" file="VB96f_0211/VB96f_0211_10513.vts"/>
			<DataSet index="10519" time="84.400000" name="Block-10519" file="VB96f_0211/VB96f_0211_10519.vts"/>
			....
			<DataSet index="10543" time="84.400000" name="Block-10543" file="VB96f_0211/VB96f_0211_10543.vts"/>
	</Block>
	</vtkMultiBlockDataSet>
</VTKFile>

I have no problem reading this file in ParaView; despite this setup being made out of small cubes, a spherical cut makes a nice, smooth preview of the surface (see second figure), so

Kudos to ParaView!!

Issues and questions: The only problem this pipeline has is performance. By default I am firing up ./mpiexec -np 8 -map-by numa ./pvserver --disable-xdisplay-test --force-offscreen-rendering and I see that data is nicely distributed, but initial I/O takes forever (~40 min for our small case i.e., 10k of small meshes, until I see initial outline frame), and during readout, it looks like only one core is working.

For comparison, I have also exported small grids (still using PyVista) to multiple *.vtk files and I fussed them into one big *.vtk with this script:

import glob
from vtk import *

reader = vtkStructuredGridReader()
append = vtkAppendFilter()

filenames = glob.glob('VB6*.vtk')
# filenames = filenames[0:2]

for file in filenames:
    reader.SetFileName(file)
    reader.Update()
    structured = vtkStructuredGrid()
    structured.ShallowCopy(reader.GetOutput())
    append.AddInputData(structured)

append.Update()

# writer = vtkPolyDataWriter()
writer = vtkUnstructuredGridWriter()
writer.SetFileName('output.vtk')
writer.SetInputData(append.GetOutput())
writer.Write()

This seems to read in a bit faster (even if on one core!). Also slicing, it is faster in that case. This leads me to believe that maybe I am using the wrong format and there is some obvious flaw in my pipeline.

Questions: Is there a better format for what I aim to do? I was trying to fuse only some groups of my small meshes into N *.vtk files; next, I was trying to build an XML/*.vts file with their list, but ParaView didn’t like it. I have also tried to add more blocks i.e,. <Block index="n"> to my *.vts but that made I/O even slower.

Any advice and guidance is welcome since, with every iteration, we are adding more and more small elements, and visualisation is more and more troublesome with the pipeline we have right now.

nicolas.vuaille · September 18, 2023, 10:05am

Hi!

This is a really nice use case, thanks for sharing!
As you noted, the MultiBlock data structure does not have a good scaling.

I think grouping them into one unstructured grid is the way to go.
Some thoughts that come to my mind:

For best I/O performance, you may want to have a look to the vtk hdf file format. You will need to write the file by yourself, though (no vtk hdf writer)
otherwise, prefer vtkXMLUnstructuredGridWriter to create a .vtu file (over the vtk writer)
If not already the case, a RedistributeDataSet filter as first pipeline item will ensure your load balancing.

nicolas.vuaille · September 18, 2023, 11:34am

Oh, and in your aggregator script, instead of a AppendDataSet, you may try to use a vtkGroupDataSetFilter with SetOutputTypeToPartitionedDataSet() property.

Then you can write the result with vtkXMLPartitionedDataSetWriter.

This structure, similar to multiblock, may have a better performance scaling.

utkarsh.ayachit · September 18, 2023, 12:53pm

@nicolas.vuaille , the dataset here is VTS, so a structured grid. I’d stay away from converting to unstructured grid (which is what both these would entail) to avoid unnecessary conversions and bloat (both memory and performance wise).

nicolas.vuaille · September 18, 2023, 1:27pm

Indeed, I was focused on the suggested script that already converts to unstructured grid.

M1kol4j · September 18, 2023, 8:02pm

Dear @nicolas.vuaille and @utkarsh.ayachit, your comments put me on the right track and with a few changes to my pipeline, performance improved drastically. Here are things I have changed for those who might face similar problems:

Stage 1 [ from raw format to vtk ] It looks like saving structured grid via PyVista to *.vtk not *.vts is way faster. So, the first change I made is to use:

X,Y,Z = np.meshgrid(Xc,Yc,Zc) # Xc,Yc,Zc are axis after rotation for that titled cubes effect
# ... fill mesh 
mesh = pyvista.StructuredGrid(X,Y,Z)
mesh.save("%s/%04d/%s_%d.vtk" % (work_dir,p.rank,snap_base_name,id))

Since our raw format stores data in a few easy-to-read binary files, saving too many *.vtk files can be easily done in parallel, and it takes around 150.0 [s] to create 32k small *.vtk cubes of size 32^3 with 16 AMD EPYC cores.

Stage 2 [ many *.vtk to few *.vtu ] Since our data is from massive parallel simulation with well-balanced domain decomposition, each small mesh knows to which MPI rank it belongs. I have used that information to fuse all (now vtk meshes) from the same rank into one *.vtu file. For that, I used the following code:

  def combine2vtu(sub_dir): # one sub-dir per MPI rank with many *.vtk files in it 
    reader = vtkStructuredGridReader()
    append = vtkAppendFilter()

    filenames = glob.glob(sub_dir+'/VB6*.vtk')

    for file in filenames:
      reader.SetFileName(file)
      reader.Update()
      structured = vtkStructuredGrid()
      structured.ShallowCopy(reader.GetOutput())
      append.AddInputData(structured)

    append.Update()

    writer = vtkXMLUnstructuredGridWriter()
    writer.SetFileName(sub_dir+'/output.vtu')
    writer.SetInputData(append.GetOutput())
    writer.Write()

    # clean up
    for file in filenames:
      os.remove(file)

    return True

Again, this can be done in parallel // 160s for that 32k elements fused into 24 vtu files on the same machine.

Stage 3 [ creating vtm file ]

Finally, I created the final vtm file with as many blocks as many MPI ranks we have used for a particular simulation.

    with open(ffile_name, 'w') as f:
      f.write("<?xml version=\"1.0\"?>\n")
      f.write("<VTKFile type=\"vtkMultiBlockDataSet\" version=\"1.0\" byte_order=\"LittleEndian\" header_type=\"UInt32\" compressor=\"vtkZLibDataCompressor\">\n")
      f.write("\t<vtkMultiBlockDataSet>\n")

      for rank in id_rank.keys():
        f.write("\t\t<Block index=\"%d\">\n" % rank)
        master_file = "%s/%04d/output.vtu" % (work_dir,rank)
        f.write("\t\t\t<DataSet index=\"0\" time=\"%f\" name=\"rank-%d\" file=\"%s\"/>\n" % (snap_time,rank,master_file))
        f.write("\t\t</Block>\n")

      f.write("\t</vtkMultiBlockDataSet>\n")
      f.write("</VTKFile>\n")

And this works perfectly! Both initial I/O and slicing work within seconds. Here is a screen with the final domain decomposition and memory footprint //Although this near-perfect load balancing is inherited from the domain decomposition of the simulation.

Of course, there is room for improvement; for instance, it would be great if I could somehow skip creating so many *.vtk files i.e., maybe there is a way to create appropriate vtk objects before passing them to vtkXMLUnstructuredGridWriter()?

Ideas comments?

nicolas.vuaille · September 20, 2023, 9:16am

As far as I can see, pyvista object inherits from VTK one. So you can use your mesh object as a vtkStructuredGrid directly in append.AddInputData, without the .vtk I/O part.

Then, as Utkarsh stated, Unstructured Grid (i.e. the output of the Append filter) are the most generic data object we have and thus are the worst in term of memory and performance.

One way to explore to mitigate this is to use vtkExplicitStructuredGrid. As shown in this blog post, this is is a tradeoff between Structured and Unstructured data.
But this may be an optimization for the processing part, once in ParaView.