Consider this simple pipeline: reader → calculator. As the user constructs this pipeline, the geometry filter – which is automatically executed to generate polygons for rendering – executes twice on the same geometry + topology: first when the user hits Apply for the reader, and again when the user hits Apply for the calculator. Both times, the geometry filter generates exactly the same surfaces, since the mesh is unchanged; only the fields are changing.
This is just a simple illustration, but we can easily keep making changes to this pipeline to come up with scenarios where filters redo computation they have already done, over and over again.
A question that @Will_Schroeder and I have been brainstorming on and off over several months is how we can avoid this – ideally without making any cumbersome changes to the VTK pipeline or adding memory costs. Will started a discussion in a similar spirit here; however, it didn’t reach any firm conclusion.
Here’s a more detailed proposal. I pose it as a ParaView concern here; however, it’s broadly applicable to VTK use-cases too.
Let’s add a new global cache, say `vtkDataCache`:
```cpp
class vtkDataCache : ...
{
public:
  // This has to be a singleton to ensure cached data can
  // be shared between different instances of algorithms
  // with ease.
  static vtkDataCache* GetInstance();

  // API to access a cached item.
  template <typename T, typename... KeyT>
  T* GetCachedData(KeyT... key);

  // API to add an item to the cache.
  template <typename T, typename... KeyT>
  void AddToCache(T* data, vtkObject* context, KeyT... key);
};
```
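To make the keying concrete, here is a simplified, standalone sketch of such a cache in plain C++. This is an illustrative assumption, not the proposed VTK implementation: `DataCacheMock` is a made-up name, it uses `std::shared_ptr` instead of VTK reference counting, and it flattens key components (strings and object addresses) into one composite string rather than observing `vtkObject`s.

```cpp
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Simplified stand-in for the proposed vtkDataCache.
class DataCacheMock
{
public:
  static DataCacheMock& GetInstance()
  {
    static DataCacheMock instance;
    return instance;
  }

  // Returns the cached item for this composite key, or nullptr on a miss.
  template <typename T, typename... KeyT>
  std::shared_ptr<T> GetCachedData(KeyT&&... key)
  {
    auto it = this->Entries.find(MakeKey(std::forward<KeyT>(key)...));
    return it == this->Entries.end()
      ? nullptr
      : std::static_pointer_cast<T>(it->second);
  }

  template <typename T, typename... KeyT>
  void AddToCache(std::shared_ptr<T> data, KeyT&&... key)
  {
    this->Entries[MakeKey(std::forward<KeyT>(key)...)] = std::move(data);
  }

private:
  // Fold every key component (string, pointer, ...) into one string.
  template <typename... KeyT>
  static std::string MakeKey(KeyT&&... key)
  {
    std::ostringstream stream;
    ((stream << key << ';'), ...);
    return stream.str();
  }

  std::map<std::string, std::shared_ptr<void>> Entries;
};
```

In the real proposal the key components would be `vtkObject*`s whose lifetime and modification the cache observes, rather than stringified addresses; the mock only demonstrates the hit/miss behavior of a variadic composite key.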
Let’s consider an example: a modified version of `vtkDataSetSurfaceFilter` to extract the exterior surface from a `vtkStructuredGrid` – something the geometry filter in our illustrative example employs. The algorithm iterates over cells and generates an `originalCellIds` array to identify cells in the input dataset that are part of the exterior shell. For simplicity, let’s focus on just this step. The algorithm can easily be changed to avoid re-computation as follows:
```cpp
vtkDataSetSurfaceFilter* self = ...;
vtkStructuredGrid* sg = ...; // input structured grid
auto points = sg->GetPoints();
auto ghostCells = sg->GetCellData()->GetArray(
  vtkDataSetAttributes::GhostArrayName());

auto cache = vtkDataCache::GetInstance();
auto originalCellIds = vtk::MakeSmartPointer(
  cache->GetCachedData<vtkIdTypeArray>(
    "vtkDSF::OriginalCellIds", points, ghostCells));
if (originalCellIds == nullptr)
{
  // iterate over cells and compute a new originalCellIds array
  cache->AddToCache(originalCellIds, /*context=*/ self,
    /*key=*/ "vtkDSF::OriginalCellIds", points, ghostCells);
}
```
Let’s see how this works:
Avoid re-computation: this is fairly obvious. The key for the cache is built from the inputs that would change if the input structured grid were modified in any way that affects the result – here, the input points and ghost-cells arrays. If we have already computed the `originalCellIds` array for the chosen points and ghost cells, we won’t recompute it. And when we do have to recompute, we update the cache.
Avoid memory overhead: this is a tricky one. To understand it, we’ll have to dive into the implementation of `vtkDataCache`. `vtkDataCache` stores `vtkObject`s; hence it can manage the reference count for a stored object to either keep it around or release it for garbage collection.
When something is added to the cache, besides the object being cached, `AddToCache` takes an optional `vtkObject*` that we call the context, followed by a set of arguments that comprise the key. The key can comprise `vtkObject*`s and `std::string`s (any copyable, comparable type could be supported here, but for now let’s just use strings for simplicity). The cache adds `ModifiedEvent` and `DeleteEvent` observers to all `vtkObject`s in the key and context arguments. If any of those events fire, the cache simply releases its reference to the cached `vtkObject`, thus releasing it. That is it! This simple trick minimizes memory overhead. With the key, we are assured that if any item that affects the cached result changes, the cache entry will be flushed. With the context, we are assured that if the filter itself goes away, we don’t keep around cached data for it. The context could easily be part of the key; however, keeping it separate enables multiple instances of `vtkDataSetSurfaceFilter` to share the cache while still ensuring that if the filters go away, the cache is flushed.
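The observer-driven flushing described above can be sketched in plain, self-contained C++. This is again an assumption-laden mock, not VTK code: the hypothetical `Subject` stands in for a `vtkObject` firing `ModifiedEvent`, and `ObservingCache` for the proposed cache releasing its reference when the event fires.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Minimal stand-in for a vtkObject that can fire ModifiedEvent.
class Subject
{
public:
  void AddObserver(std::function<void()> callback)
  {
    this->Observers.push_back(std::move(callback));
  }
  void Modified()
  {
    for (auto& observer : this->Observers)
    {
      observer();
    }
  }

private:
  std::vector<std::function<void()>> Observers;
};

// Cache that drops an entry as soon as a Subject in its key is modified.
class ObservingCache
{
public:
  void AddToCache(const std::string& name, Subject* keyObject,
                  std::shared_ptr<void> data)
  {
    this->Entries[name] = std::move(data);
    // Mirrors vtkDataCache observing ModifiedEvent/DeleteEvent:
    // any modification releases the cache's reference automatically.
    keyObject->AddObserver([this, name]() { this->Entries.erase(name); });
  }
  bool HasEntry(const std::string& name) const
  {
    return this->Entries.count(name) != 0;
  }

private:
  std::map<std::string, std::shared_ptr<void>> Entries;
};
```

Once the cache’s reference is dropped, the cached object is reclaimed as soon as no one else holds it – no explicit `Remove` call is ever needed.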
The beauty of this approach is that we no longer need to invest in any changes to the VTK pipeline or filters to deal with static meshes. Computation results are now keyed only on the input arrays that affect the result and hence will not be recomputed unless necessary.
The cache can be used to store things like array ranges too. Currently, computing array ranges does not take masks or ghost arrays into account; we could easily start supporting that and use the cache to avoid recomputing ranges.
The cache can be used by readers too, to avoid re-reading data from disk when iterating through time or when, for example, the array selection changes.
The cache also works for caching things like point/cell locators without requiring any changes to the dataset API. Thus, there is no need to start storing point locators with `vtkPoints` to avoid rebuilding them; the cache can help with that effortlessly!
It’s not limited to a specific use-case either, as is the case with `vtkOpenGLVertexBufferObjectCache`. The proposed cache can store not only `vtkOpenGLVertexBufferObject`s, but also things like ranges, locators, you name it. Also, `vtkDataCache` does not require any explicit `Remove` calls: using the modification and delete events fired by key and context objects, it can automatically remove obsolete entries.
Thoughts? Suggestions? Critiques?