Loading data files that take too much memory

For Tomviz (built on top of ParaView), some users have asked that, if they try to load a data file that will take too much memory, we present a warning or error rather than crashing (which is what currently happens).

Some of our readers are implemented in VTK/ParaView. I’m wondering: is there a standard way, across readers, to check memory usage before loading a data file, or to catch a bad_alloc exception, so that this issue can be detected before a segmentation fault? Or is this something that would need to be implemented on a reader-by-reader basis?

Right now, I’m checking which types of files these users usually load to see if I can at least perform some kind of check on how much memory will be used before loading them.
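For single-file formats, a coarse application-side check along these lines might be enough as a stopgap. This is only a sketch: the function names, the 2x expansion factor, and the memory-budget parameter are all invented here, not Tomviz or ParaView API.

```cpp
#include <cstdint>
#include <filesystem>
#include <string>

// Pure heuristic: does a file of this size plausibly fit in the budget?
// Assumes the in-memory size is at most ~2x the on-disk size; this
// expansion factor is a guess and would need tuning per format.
bool fitsInBudget(std::uintmax_t fileBytes, std::uintmax_t budgetBytes)
{
  return fileBytes <= budgetBytes / 2;
}

// Hypothetical helper: return true if loading `path` looks safe. This is
// inexact -- compression, partial reads, and multi-file formats all make
// the on-disk size a poor proxy for the in-memory size.
bool looksSafeToLoad(const std::string& path, std::uintmax_t budgetBytes)
{
  std::error_code ec;
  const std::uintmax_t fileBytes = std::filesystem::file_size(path, ec);
  if (ec)
  {
    // Cannot stat the file (missing file, odd format, ...):
    // fall back to allowing the load rather than blocking it.
    return true;
  }
  return fitsInBudget(fileBytes, budgetBytes);
}
```

The check is deliberately permissive on error, so a failed stat never blocks a load that might have succeeded.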

Maybe https://gitlab.kitware.com/paraview/paraview/-/merge_requests/5994 is related, or some of its ideas could be reused? @mwestphal

The logic of the feature @Christos_Tsolakis linked tries to guess the output size in memory based on the input size. That is of course not possible for a reader, which has no upstream input dataset.

However, it would be possible to add similar logic for readers, where ParaView would check the size of the file on disk (the one passed to SetFileName) and guess a possible output size before the user presses Apply.

This will not be fully reliable, as some file formats are spread across multiple files.

To account for that, it would be possible to define an API on any VTK reader that guesses a possible output size, computed during RequestInformation.

It would be up to each reader to implement this API; ParaView could then use it to warn the user.
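The proposed pattern might look something like this. To keep the sketch self-contained it uses plain C++ rather than real VTK classes; every name here (Reader, GetEstimatedOutputSize, shouldWarn) is invented for illustration, not existing VTK/ParaView API.

```cpp
#include <cstdint>

// Hypothetical base-class hook: each reader overrides
// GetEstimatedOutputSize() and fills in a value during its
// RequestInformation pass; 0 means "unknown".
class Reader
{
public:
  virtual ~Reader() = default;

  // Best-effort guess of the output size in bytes, 0 if the reader
  // cannot estimate it.
  virtual std::uint64_t GetEstimatedOutputSize() const { return 0; }
};

// Example: a raw binary volume reader, where the estimate is exact
// (dimensions times bytes per scalar).
class RawVolumeReader : public Reader
{
public:
  RawVolumeReader(std::uint64_t nx, std::uint64_t ny, std::uint64_t nz,
                  std::uint64_t bytesPerScalar)
    : Size(nx * ny * nz * bytesPerScalar) {}

  std::uint64_t GetEstimatedOutputSize() const override { return this->Size; }

private:
  std::uint64_t Size;
};

// The application (ParaView, in this proposal) would warn before Apply
// whenever a known estimate exceeds the memory budget.
bool shouldWarn(const Reader& reader, std::uint64_t memoryBudgetBytes)
{
  const std::uint64_t estimate = reader.GetEstimatedOutputSize();
  return estimate != 0 && estimate > memoryBudgetBytes;
}
```

Readers that cannot produce an estimate simply keep the default of 0 and the application behaves exactly as it does today.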

> or capturing a bad_alloc exception, across the readers, in order to detect this issue before a segmentation fault?

VTK does not handle exceptions, but nothing prevents you from handling them in your own reader.
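A minimal, VTK-free sketch of that pattern: wrap the big allocation in a try/catch and report failure to the caller. In an actual reader the same idea would wrap the buffer allocations in RequestData, returning 0 (and emitting a vtkErrorMacro) instead of crashing.

```cpp
#include <cstddef>
#include <new>

// Attempt a large allocation and report failure instead of letting
// std::bad_alloc propagate (which typically terminates the process).
bool tryAllocate(std::size_t bytes)
{
  try
  {
    char* buffer = new char[bytes];
    delete[] buffer;
    return true;
  }
  catch (const std::bad_alloc&)
  {
    // Out of memory: tell the caller instead of crashing. Note that
    // std::bad_array_new_length derives from std::bad_alloc, so an
    // absurdly large request is caught here too.
    return false;
  }
}
```

One caveat worth knowing: on Linux with memory overcommit enabled, an allocation can succeed and the process can still be killed later when the pages are touched, so catching bad_alloc is helpful but not a complete defense.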


Yes, I think this sounds very reasonable. There are a few other potential issues with an estimate based only on file size, including that the data might be compressed on disk but uncompressed in memory, and that we may also be reading only a section of the data from the file.

So I agree that giving each reader the ability to guess a possible size of the output sounds like a great idea!

I think whether this is even possible would be reader specific. Examples of where it wouldn’t be possible are Exodus, CGNS, and possibly VTK. (I am not a .vtk expert.) Numerous issues make this decision impossible:

- Number of timesteps: a large mesh with one timestep and a small mesh with thousands of timesteps have memory footprints that are orders of magnitude apart, although the file sizes may be the same.
- A user may output dozens or hundreds of variables, but only read one or two.
- A dataset may have hundreds of blocks, but again the user may be OK reading in one or two.
- Structured vs. unstructured grids have vastly different file-to-memory usage ratios. (Exodus can hold unstructured and structured grids.)

My thoughts are that equating file size with memory usage is problematic at best and folly at worst.
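For formats like these, a more honest estimate would come from the file's metadata rather than its size, since only the timesteps and arrays the user actually selected occupy memory. A deliberately simplified sketch (the formula ignores cell arrays, connectivity, and ghost data, and the function name is invented):

```cpp
#include <cstdint>

// Estimate the in-memory footprint of one loaded timestep from mesh
// metadata instead of on-disk file size: one value per point for each
// point array the user selected. This is what makes a pure file-size
// heuristic misleading for formats like Exodus or CGNS, where most of
// the file may never be read.
std::uint64_t estimateBytesPerTimestep(std::uint64_t numPoints,
                                       std::uint64_t numSelectedPointArrays,
                                       std::uint64_t bytesPerComponent)
{
  return numPoints * numSelectedPointArrays * bytesPerComponent;
}
```

So a million-point mesh with two selected double-precision arrays needs roughly 16 MB per timestep in memory, regardless of how many unread variables or blocks the file contains.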