Parallel Aware readers

In ParaView, there are many readers; some are parallel aware, some aren’t.

The term “parallel aware” is not strictly defined anywhere; there is a single reference to it in the documentation:
https://docs.paraview.org/en/latest/ReferenceManual/parallelDataVisualization.html?highlight=“parallel%20aware”#parallel-processing-in-paraview-and-pvpython

and a few instances in the ParaView source code. I think we should define this term precisely.

Here is a suggestion:

"A parallel aware filter or reader is a filter or reader containing dedicated code that makes it work correctly in parallel; it would not work without that code. Regarding filters, the stream tracer is a parallel aware filter, as it would not work without dedicated code to transfer particles from one domain to the next. Regarding readers, all unstructured data readers need to be parallel aware to output distributed data, e.g. the Exodus reader."

If we agree on this definition, then we need to list which readers are parallel aware and which aren’t.

It should be standard in the documentation of each unstructured data reader to specify whether it is parallel aware or not. Ideally, this information should also be available as a list in the documentation.

That leads us to my final point: regarding readers, parallel aware is not enough.

Some readers are indeed parallel aware, but still require the whole dataset to be read from disk in order to perform the distribution.

e.g.:

  • With the PVTP reader, each node does not need to read the whole dataset, only the part that node needs
  • With the EnSight reader, each node needs to read the whole dataset before distributing

This is an important distinction, and users need to know about it in order to choose a file format to work with.

Let’s coin a supplementary term for this:

  • parallel-disk aware
  • parallel and disk aware
  • parallel-part aware
  • parallel and part aware

Any other ideas?

@utkarsh.ayachit @cory.quammen @nicolas.vuaille @Charles_Gueunet

I am not entirely sure I follow the terms you propose, but why not simply use the following:

  1. distributed reader or
  2. distributed reader that may read the entire file(s) on all ranks

A non-distributed reader is something like the legacy VTK reader, which doesn’t do either, i.e. it doesn’t split data among ranks and only reads the file on rank 0.

I think we need to move away from using “parallel” to imply distributed. That was true once upon a time, but not anymore with SMP, VTK-m, etc.

I was using the “parallel aware” term as it is the historic one, but I’m happy with a change.

Your proposition sounds good to me.

Adding in two points here:

  • The PVTP reader (as well as the PVTU reader) is only as parallel/distributed as the number of files the data was partitioned into when it was written out. In general more files are written out than pvservers will be used when reading it back in, but if the meta file points to only a single VTP or VTU file, the data will be read in by a single pvserver process no matter how many MPI pvserver processes are created.
  • Should ghost information be included in this discussion? Some readers produce it automatically while others do not. Without knowing which ones do and which don’t, making a pipeline work correctly in parallel requires a significant amount of knowledge. For example, the PVTU reader followed by the Point Data to Cell Data filter won’t work correctly in parallel, while the PVTI reader followed by the same filter will.
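For context on the first point: the .pvtp meta file is essentially an XML index of the piece files. A rough sketch (the piece file names and arrays are illustrative) looks like:

```xml
<VTKFile type="PPolyData" version="0.1" byte_order="LittleEndian">
  <PPolyData GhostLevel="0">
    <PPoints>
      <PDataArray type="Float32" Name="Points" NumberOfComponents="3"/>
    </PPoints>
    <!-- One <Piece> entry per partition file. If only one entry is
         present, a single pvserver rank gets all the data, no matter
         how many MPI ranks are running. -->
    <Piece Source="data_0.vtp"/>
    <Piece Source="data_1.vtp"/>
  </PPolyData>
</VTKFile>
```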

I believe this is true for all unstructured grid readers / file formats.

I am not sure this is reader-implementation specific and hence it is probably not related here. Any reader will provide ghost information if it is present in the file. We’re also working on streamlining ghost-cell generation, so eventually the ParaView pipeline should be able to provide/compute ghost cells with ease whenever needed. cc: @Yohann_Bearzi

I thought that the discussion here was about informing ParaView users of the behavior of parallel/distributed aware readers. I think most advanced users realize that readers outputting structured dataset formats behave differently with regard to cell distribution and ghost information than those outputting non-structured formats. Informing newer users of this subtlety would be a nice thing. If that’s not the goal of this discussion, then never mind.

After a few tests, I think that the EnSight reader is indeed capable of doing just that.

GhostInformation

IMO it should be included as well. Some readers are able to produce ghost cells, others aren’t, and documenting that behavior seems important to me.

That being said, there is no need for specific terms; just explaining the behavior in a standard way in the documentation is fine, imo.