pvbatch does not give speedup for .pvtu partitioned data

I have a dataset that is partitioned into 128 subdomains using the .pvtu framework, and I am trying to run some data processing that uses the Programmable Filter feature. I ran a trace in the GUI to capture this process, and the output script is attached (along with three extra lines that print how long it takes to run). I run it using mpirun -np [n] pvbatch pv_batch_test.py. Taking [n] to be 1 actually gives the shortest runtime of ~50 seconds, while taking [n] to be 6 gives a runtime of ~80 seconds reported by each of the processes. This tells me that the commands are not really being run in parallel.

A few more things:

  • I have heard that, in order to use an unstructured grid (which I am using) one needs to use the D3 filter to be able to distribute it across processes efficiently. However, I have also heard that already-partitioned unstructured grids (i.e. pvtu data) do not need this.
  • I don’t really understand how MPI interacts with the Python script, but I assume (since this is all generated in the GUI) that each of the filters ought to be parallelized, or parallelizable – am I wrong about this?
  • For context about the actual script, the input data are the degrees of freedom of a 3x3 traceless, symmetric tensor, and the script just finds two of the eigenvectors and eigenvalues – I figured the calculation ought to be reasonably beefy to test how things operate in parallel.

I haven’t found anything in the documentation which strictly dictates what will be automatically parallelized and what will not be, but any help on that front (or getting the script to work efficiently) is appreciated.

  • Lucas

pv_batch_test.py (19.7 KB)

Hi @lucasmyers97 ,

In my answer, “Distributed, Distribution” refers to spatial distribution of data with MPI, while “Parallel, Parallelisable” refers to multithreading and parallelization of work.

This tells you there is no speedup. Distributed != Faster. We need to check the root cause first.

I have heard that, in order to use an unstructured grid (which I am using) one needs to use the D3 filter to be able to distribute it across processes efficiently. However, I have also heard that already-partitioned unstructured grids (i.e. pvtu data) do not need this.

.pvtu is indeed a statically distributed format: each process will read only some of the parts. However, because the distribution is static, you should not use more processes than there are parts.

Using D3 (or, with recent versions of ParaView, RedistributeDataSet) will dynamically redistribute your data, taking the current number of processes into account, but this can be a very expensive operation. You do not want to do it on every run; rather, you want to do it once to generate statically distributed data.
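
For example, such a one-time redistribution run could look roughly like the sketch below (this assumes a ParaView version where RedistributeDataSet is exposed in paraview.simple; the file names are placeholders):

from paraview.simple import *

# Read the statically partitioned input (placeholder file name).
reader = XMLPartitionedUnstructuredGridReader(FileName=['data.pvtu'])

# Redistribute across however many MPI ranks pvbatch was launched with.
redistributed = RedistributeDataSet(Input=reader)

# Write the result once; later runs read this file instead of redistributing again.
SaveData('data_redistributed.pvtu', proxy=redistributed)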

I don’t really understand how MPI interacts with the Python script, but I assume (since this is all generated in the GUI) that each of the filters ought to be parallelized, or parallelizeable – am I wrong about this?

Your question is unclear here. MPI does not directly interact with the Python script. All MPI does is run each instance of pvbatch with enough information for the filters to know how many processes there are and which process they are running on.
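
If you want to see that information from the Python side, a small sketch like the following (e.g. pasted into a Programmable Filter’s Script; it assumes the vtkMultiProcessController API, which is importable from vtkmodules in pvbatch) prints each rank and the size of its piece:

# Sketch of a Programmable Filter Script body.
from vtkmodules.vtkParallelCore import vtkMultiProcessController

controller = vtkMultiProcessController.GetGlobalController()
rank = controller.GetLocalProcessId()
nranks = controller.GetNumberOfProcesses()

# inputs[0] is provided by the Programmable Filter; each rank only sees its own piece.
print("rank %d of %d: %d points in this piece"
      % (rank, nranks, inputs[0].GetNumberOfPoints()))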

All filters in ParaView “support” distributed processing, although some may provide incorrect results if not configured properly in that context.

Most filters in ParaView are parallelized, which means they will use all the available cores on a node through multithreading.

I haven’t found anything in the documentation which strictly dictates what will be automatically parallelized

Regarding multithreading, this is indeed missing from the documentation, although at this point all main filters are multithreaded.
For distributed processing, if your data is distributed, then all filters will work.

Looking at your script, the main part of the computation is part of a programmable filter using numpy:

for i in range(S.shape[0]):
    # Eigen-decompose the 3x3 tensor at point i.
    w, v = np.linalg.eig(Q_mat[:, :, i])
    # Sort eigenvalues and keep the two largest, with their eigenvectors.
    w_idx = np.argsort(w)
    S[i] = w[w_idx[-1]]
    P[i] = w[w_idx[-2]]
    n[i, :] = v[:, w_idx[-1]]
    m[i, :] = v[:, w_idx[-2]]

If you are running this script on a single node, I would not expect to see any speedup with MPI, for two reasons:

  1. numpy is already multithreaded and will use all the available cores on your node.
  2. Python itself is not truly multithreaded (the Global Interpreter Lock), which means the Python part of your code (the explicit for loop above) is executed serially; see the vectorized sketch below.
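
As an illustration of point 2, here is a rough sketch of how the loop could be pushed entirely into numpy; it assumes Q_mat is the (3, 3, N) stack of symmetric tensors from the attached script and uses np.linalg.eigh (valid for symmetric matrices, eigenvalues returned in ascending order):

import numpy as np

# Reorder to (N, 3, 3) so that eigh can batch over the first axis.
Q_batched = np.transpose(Q_mat, (2, 0, 1))

# Batched eigendecomposition; appropriate here because the tensors are symmetric.
w, v = np.linalg.eigh(Q_batched)   # w: (N, 3) ascending, v: (N, 3, 3)

S = w[:, -1]       # largest eigenvalue
P = w[:, -2]       # second-largest eigenvalue
n = v[:, :, -1]    # eigenvector of the largest eigenvalue
m = v[:, :, -2]    # eigenvector of the second-largest eigenvalue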

tbh, trying to optimize a Programmable Filter in ParaView is a bit antinomic. Programmable filters are great, but going through the whole Python stack will never be the best bet for performance.

That being said, you can compute eigenvalues and eigenvectors using the “Principal Component Analysis” filter or even using the “Tensor Glyph” filter.

hth,

@mwestphal,

Thanks for the quick reply! If I can ask for just a bit more clarification:

I’ve made a little bit of progress in optimizing, and some of the timing results (seemingly) contradict some of the points you make – I suppose I have something configured incorrectly. To your point about the .pvtu format, the data that I have is statically partitioned into 128 parts, and my machine locally maxes out at 6 processes, so no worries about that.

As far as actual timing, I have run:

  • mpirun -np 6 pvbatch ./app/examples/pv_batch_test.py (prints ~90s 6 times, top shows 6 processes at 100% CPU)
  • pvbatch ./app/examples/pv_batch_test.py (prints ~60s 1 time, top shows 1 process at 100% CPU)
  • mpiexec -n 6 pvbatch ./app/examples/pv_batch_test.py (prints ~17s 1 time, top shows 6 processes at 100% CPU)

Given the timings, my guess is that mpirun (for reasons I do not understand) has each process read in and apply each filter to the entire domain. Plain pvbatch, I would guess, uses just 1 process to apply each filter to the entire domain. mpiexec has each process read in 128/6 of the statically partitioned pieces of the domain and apply the filters to them.

Supposing these guesses are correct, it seems like the filters are not really multithreading in the plain pvbatch run or, at the very least, that the for loop in my programmable filter does not take advantage of numpy’s multithreading capabilities. Further, it appears that the programmable filter only gets applied to the statically partitioned data assigned to the given process (at least when called with mpiexec) – to me this seems like a potentially large speedup, given that one then does not have to worry about optimizing for numpy’s multithreading.

Any feedback on whether this reading of the timing + cpu usage data is correct?

A final question: if trying to optimize using the programmable filter is antinomic, what is the suggested alternative if the functionality is not necessarily available with the built-in filters? For context, I’d like to do the eigenvalue decomposition, calculate the gradient of the eigenvectors corresponding to the largest eigenvalues, and then query the mesh for those gradient values at some number of points. This is straightforward with the programmable filter, but I’m not sure how to look for built in filters to accomplish this task.
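
To make that concrete, the kind of pipeline I have in mind would look roughly like the sketch below; the filter names (Gradient, PointSource, ResampleWithDataset) are only my guesses at built-in equivalents from paraview.simple, and the property names may well differ between ParaView versions:

from paraview.simple import *

# Whatever produces the eigenvector field; currently my programmable filter,
# which stores the principal eigenvector as a point array named 'n'.
eigen = GetActiveSource()

# Gradient of that eigenvector field (filter/property names are guesses).
grad = Gradient(Input=eigen)
grad.ScalarArray = ['POINTS', 'n']

# A few query points at which to sample the gradient (placeholder values).
points = PointSource(Center=[0.0, 0.0, 0.0], NumberOfPoints=10, Radius=1.0)
sampled = ResampleWithDataset(SourceDataArrays=grad, DestinationMesh=points)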

– Lucas

Good.

Given the timings, my guess is that mpirun (for reasons I do not understand) has each process read in and apply each filter to the entire domain.

That is just not possible unless something is going very wrong. You can easily check it by showing the ProcessId and saving a screenshot.
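
For example, something along these lines from pvbatch (this assumes the Process Id Scalars filter in paraview.simple; in recent versions the filter and the generated array may be named slightly differently, e.g. 'ProcessIds', and the file names are placeholders):

from paraview.simple import *

# Read the partitioned data (placeholder file name).
reader = XMLPartitionedUnstructuredGridReader(FileName=['data.pvtu'])

# Tag every point with the rank that owns it.
pids = ProcessIdScalars(Input=reader)

view = CreateRenderView()
display = Show(pids, view)
ColorBy(display, ('POINTS', 'ProcessId'))
Render(view)

# Each color in the image corresponds to one MPI rank.
SaveScreenshot('process_ids.png', view)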

However, I see that you are using XMLPartitionedUnstructuredGridReader directly; why not use OpenDataFile?

Any feedback on whether this reading of the timing + cpu usage data is correct?

Check that the data is actually distributed first.

what is the suggested alternative if the functionality is not necessarily available with the built-in filters?

Write a C++ plugin to compute those things. But before doing that, you may want to ensure the problem indeed lies with the usage of the Programmable Filter, which I’m not sure of yet.