Read performance problems (serial & parallel)

Erik_Keever · February 7, 2019, 3:24am

Hello,

I need to use PV to visualize/analyze >1TB data sets (900M cell structured grid time series) in which each stored data frame is approximately 50GB.

I am preparing by making sure I can efficiently access a much smaller (1.6GB/frame) test dataset. However, I have now tried reading the same data stored in Ensight Gold, .vtk, and .h5+xdmf and all take hugely longer than my media speed would imply, whether in serial or connecting to 1/2/4 pvservers on my laptop.

I have a 4 (physical) core Xeon, 32GB of DDR4 and the data is on an SSD capable of 2GBps, so I assume it should take on the order of a few seconds to read a few gigs. However I am consistently seeing times ten times longer or more.
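
As a rough sanity check on that expectation, the ideal streaming time is just size divided by sustained bandwidth. The sketch below is a back-of-the-envelope calculation only: the function name is made up for illustration, and it assumes a purely sequential read with no parsing or decompression cost.

```python
# Idealized read-time estimate: sequential streaming, no parsing overhead.
# ideal_read_seconds is a hypothetical helper, not part of ParaView.

def ideal_read_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Seconds needed to stream size_gb at a sustained bandwidth."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return size_gb / bandwidth_gb_per_s


# Figures taken from the post: 1.6 GB test frame, 50 GB production frame,
# SSD quoted at roughly 2 GB/s.
test_frame_s = ideal_read_seconds(1.6, 2.0)
full_frame_s = ideal_read_seconds(50.0, 2.0)

print(f"test frame: ~{test_frame_s:.1f} s ideal")
print(f"full frame: ~{full_frame_s:.1f} s ideal")
```

Anything far above these figures points at time spent outside raw I/O, for example in parsing or in redistributing the data.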

The best I got was with 4 pvservers reading the Ensight data, with the following timer logs:
PARAVIEW TIMER LOG:
RenderView::Update, 15.7664 seconds
ensight::GatherInformation, 21.3152 seconds <----------- ???
Contour::GatherInformation, 0.469148 seconds
(…)

REPRESENTATIVE SERVER TIMER LOG:
RenderView::Update, 15.7663 seconds
vtkPVView::Update, 15.7659 seconds
Execute vtkPGenericEnSightReader id: 8434, 12.0105 seconds
(…)
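
For longer logs, entries in this "name, N seconds" form can be ranked mechanically. The following is a minimal sketch that assumes every entry of interest looks like the lines above; the regular expression and function name are illustrative, not a ParaView API. Entries appear to be nested (RenderView::Update contains vtkPVView::Update, which contains the reader Execute), so the times should be compared rather than summed.

```python
import re

# Matches "<name>, <number> seconds", ignoring anything after "seconds".
ENTRY = re.compile(r"^\s*(?P<name>.+?),\s*(?P<secs>\d+(?:\.\d+)?)\s+seconds")


def parse_timer_log(text):
    """Return (name, seconds) pairs sorted from slowest to fastest."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line)
        if match:
            entries.append((match.group("name").strip(), float(match.group("secs"))))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


# Lines copied from the client and server logs in this post.
sample = """\
RenderView::Update, 15.7664 seconds
ensight::GatherInformation, 21.3152 seconds <----------- ???
Contour::GatherInformation, 0.469148 seconds
Execute vtkPGenericEnSightReader id: 8434, 12.0105 seconds
"""

for name, seconds in parse_timer_log(sample):
    print(f"{seconds:10.4f} s  {name}")
```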

Something seems to be wrong. Am I missing some magic compile flag or another?

wascott · February 7, 2019, 6:44pm

How many files are you spreading your data set into? Many of the readers won’t spread data, thus if you have one file, you could add thousands of pvservers, and not get any improvement.

Another thought is that you are probably data load/IO bound (as a guess). Adding pvservers just makes more processes asking for bandwidth. For four pvservers, each would read at 1/4 the speed of a single pvserver. Just to be clear, why do you think you are using 4 pvservers? Did you build them? If you didn’t custom build the pvservers with MPI and then launch them through a job submission, I suspect you are only running one pvserver under the covers. Or did you go into the Settings panel and turn on Enable Auto MPI? I find that actually slows down a visualization. (I am actually not sure why we expose that option.)

Don’t forget that with very big datasets (and 900 million cells structured is a very big dataset), clusters use parallel file systems that are extremely fast. They also use numerous complete motherboards (known as nodes) with a dozen cores each. Your laptop is nowhere near that capability.

This almost definitely isn’t a rendering issue, but rather a reader issue. Your graphics card really doesn’t matter. I will say that, assuming you actually ARE using multiple pvservers, compiled with OSMesa, you aren’t even using the rendering card. Rendering is being done in software.

I’m actually VERY impressed with the performance of your desktop. Really, what you need is a cluster, or put up with slower performance.

Alan

Erik_Keever · February 11, 2019, 12:52am

Hi Walter,

The data emitted “raw” from the big simulation is one file per MPI rank, currently .h5 by default. (The code uses one rank to drive all the GPUs on a node, so from my 900M cell big simulation I get 4 x 18GB files per saved step, each containing the work done by two K80s.) The test dataset is only one file/step.

I have export-to-paraview functionality that emits both Ensight Gold (single file per variable per step) and legacy vtk (single file per step), and have now managed to brew a working XDMF writer to give ParaView access to the original .h5 outputs.

I know I’m using 4 pvservers because my build is MPI enabled,

erik-k@eriks_lappy ~/manual_install/ParaView-v5.6.0/build $ grep USE_MPI CMakeCache.txt
ICET_USE_MPI:BOOL=ON
PARAVIEW_USE_MPI:BOOL=ON
PARAVIEW_USE_MPI_SSEND:BOOL=OFF
VTK_VPIC_USE_MPI:BOOL=ON
VTK_XDMF_USE_MPI:BOOL=OFF

(the cluster build is identical) and both the pvserver output and the PV timer log show that I’m connected to however many I mpirun (either on laptop or on cluster). Both h5+xdmf and Ensight show that the frame data is distributed (vtkProcessId is not constant), though of the two only h5+xdmf appears to distribute intelligently.

It’s definitely not a rendering issue (the timer logs show that when data is distributed, both contouring and actual rendering are fast). The smaller (1.6GB) test dataset I’m working with on my laptop is definitely not IO bound: The SSD I’m reading from is specified for 3200MBps sequential read, and four instances of dd reading four separate frame files netted 2.8GBps. I figure the bulk of the time has to do with processing the 3DSMesh rather than the eternity between words.
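
A measurement along the lines of the four concurrent dd runs can also be scripted. This is a sketch only, not the procedure used above: the helper names and the 16 MiB chunk size are assumptions, and files that were read recently may be served from the OS page cache, which would inflate the result well beyond what the SSD can deliver.

```python
import sys
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK = 16 * 1024 * 1024  # 16 MiB per read call (arbitrary choice)


def read_file(path):
    """Read one file start to finish, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb", buffering=0) as handle:
        while True:
            block = handle.read(CHUNK)
            if not block:
                break
            total += len(block)
    return total


def aggregate_throughput(paths):
    """Read all paths concurrently; return (total_bytes, bytes_per_second)."""
    if not paths:
        raise ValueError("need at least one file")
    start = time.perf_counter()
    # One thread per file; blocking file reads release the GIL.
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        total = sum(pool.map(read_file, paths))
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python read_bench.py frame0 frame1 frame2 frame3
    nbytes, rate = aggregate_throughput(sys.argv[1:])
    print(f"{nbytes / 1e9:.2f} GB at {rate / 1e9:.2f} GB/s aggregate")
```

If the aggregate rate measured this way stays near the drive's rated speed while the reader is many times slower, the time is being spent after the bytes leave the disk.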

Well, it seems that my issue lies in getting the data readers to access the data in parallel, then. I’m managing to get on well with xdmf, except that something deep inside the pvserver now segfaults when run in parallel.