Memory Leak in Programmable Source

Hello,

I’m looking to read binary data based on example 13.2.4 in the ParaView Guide, but I seem to be running into a memory leak with the code I’ve adapted below. When I activate my virtualenv and run the script with pvpython, memory usage grows without bound until I kill the process manually or the OS does it for me. The data set consists of 19 files of 262144 “longs” (times 5 scalar variables); even at 8 bytes per value that is only about 200 MB, so it shouldn’t be consuming 50 GB+ of memory. See the attached screenshot for the output from Activity Monitor:

from paraview.simple import (
    _DisableFirstRenderCameraReset,
    GetRenderView,
    ProgrammableSource,
    Render,
    Show,
)
_DisableFirstRenderCameraReset()

request_info = """
# Activate the virtualenv with necessary packages
activate_this = '/Users/jslawitsky/Desktop/research/venv/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))

import numpy
import yaml

executive = self.GetExecutive()
outInfo = executive.GetOutputInformation(0)
with open("/Users/jslawitsky/Desktop/research/sample/input/config.yaml", "r") as fp:
    config = yaml.load(fp)

outInfo.Set(
    executive.WHOLE_EXTENT(),
    0, config["grid_points.x"] - 1,
    0, config["grid_points.y"] - 1,
    0, config["grid_points.z"] - 2,
)
outInfo.Set(
    vtk.vtkDataObject.SPACING(),
    config["domain_size.x"] / config["grid_points.x"],
    config["domain_size.y"] / config["grid_points.y"],
    config["domain_size.z"] / (config["grid_points.z"] - 1),
)
outInfo.Set(
    vtk.vtkDataObject.ORIGIN(),
    0,
    0,
    (config["domain_size.z"] / (config["grid_points.z"] - 1)) / 2,
)

outInfo.Remove(executive.TIME_STEPS())

start = config["output.instantaneous.skip"] * config["time_step_size"]
stop = config["number_of_time_steps"] * config["time_step_size"]
step = config["output.instantaneous.frequency"] * config["time_step_size"]

for timestep in numpy.linspace(start, stop, (stop - start) / step):
    outInfo.Append(executive.TIME_STEPS(), timestep)

outInfo.Remove(executive.TIME_RANGE())
outInfo.Append(executive.TIME_RANGE(), config["output.instantaneous.skip"])
outInfo.Append(executive.TIME_RANGE(), config["number_of_time_steps"])
"""

request_data = """
# Activate the virtualenv with necessary packages
activate_this = "/Users/jslawitsky/Desktop/research/venv/bin/activate_this.py"
execfile(activate_this, dict(__file__=activate_this))

import numpy
import os
import yaml
from distutils.util import strtobool

from vtk.numpy_interface import algorithms as algs
from vtk.numpy_interface import dataset_adapter as dsa

executive = self.GetExecutive()
outInfo = executive.GetOutputInformation(0)

with open("/Users/jslawitsky/Desktop/research/sample/input/config.yaml", "r") as fp:
    config = yaml.load(fp)

scalars = {"p", "u", "v", "w"}
if strtobool(config["output.instantaneous.write_eddy_viscosity"]):
    scalars.add("nu_t")

ts = outInfo.Get(executive.TIME_STEPS())
if outInfo.Has(executive.UPDATE_TIME_STEP()):
    t = int(outInfo.Get(executive.UPDATE_TIME_STEP()))
else:
    print "no update time available"
    t = 0

exts = [executive.UPDATE_EXTENT().Get(outInfo, i) for i in xrange(6)]
nbfiles = (
    (config["number_of_time_steps"] - config["output.instantaneous.skip"]) /
    config["output.instantaneous.frequency"]
)

dims = [
    exts[1] - exts[0] + 1,
    exts[3] - exts[2] + 1,
    exts[5] - exts[4] + 1,
]

output.SetExtent(exts)

data = numpy.empty(dims[0] * dims[1] * dims[2] * nbfiles, dtype=numpy.dtype('<f'))

points_per_step = (
    config["grid_points.x"] *
    config["grid_points.y"] *
    (config["grid_points.z"] - 1)
)

dir_name = "/Users/jslawitsky/Desktop/research/sample/output/instantaneous-fields/"

def AddScalarVariable(nbfiles, varname):
    for i in xrange(nbfiles):
        # example file format: 'dir_name/p-001.bin'
        fname = "{}{}-{}.bin".format(dir_name, varname, str(i).zfill(3))
        with open(fname, "rb") as fp:
            fp.seek(0)
            begin = points_per_step * i
            end = points_per_step * (i + 1)
            data[begin:end] = numpy.fromfile(fp, dtype=numpy.dtype("<f"), count=-1, sep="")

    output.PointData.append(data, varname)

for varname in scalars:
    AddScalarVariable(nbfiles, varname)

output.GetInformation().Set(output.DATA_TIME_STEP(), t)
"""

View1 = GetRenderView()
pSource1 = ProgrammableSource()
pSource1.OutputDataSetType = 'vtkImageData'
pSource1.ScriptRequestInformation = request_info
pSource1.Script = request_data
pSource1.UpdatePipelineInformation()

Show(pSource1, View1)
Render()

When I comment out output.PointData.append(data, varname), the code runs to completion, but nothing is displayed (obviously because I haven’t appended any data).

Any insights would be greatly appreciated.

Thanks,
Jon

You are right, 50GB is way more memory than is needed.

Try loading 1 file instead of 19. What is the memory usage then? Then try 2, then 3. Do you see a pattern in the memory growth?
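
If it helps to make the comparison concrete, here is a minimal sketch for printing peak memory from inside the script (report_peak_memory is a hypothetical helper; note that resource reports ru_maxrss in bytes on macOS and in kilobytes on Linux):

import resource

def report_peak_memory(label):
    # peak resident set size of this process so far
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print "{}: peak RSS = {}".format(label, peak)

report_peak_memory("after loading 1 file")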

Thanks for the response, Cory! I apologize, I should have added more detail to my initial post: I tried loading just a single file and the memory usage grew without bound in that scenario too. The growth in memory usage seemed slightly slower for one file, but I can’t say for sure.

What do you think a pattern in the memory usage would indicate? I can try doing some timed tests to see if there is indeed a pattern. Do you have any other suggestions on how I might debug this?

Thanks again,
Jon

The patterns I was thinking of:

  • a linear increase in memory with an increasing number of files read would indicate memory is not being freed somewhere

  • an exponential increase in memory might indicate a haywire append process where you are, for instance, appending the data to itself

Other things to check:

  • Are the dims values reasonable?

  • Are you allocating memory in data for the entire time series? It looks like you are: data = numpy.empty(dims[0] * dims[1] * dims[2] * nbfiles, dtype=numpy.dtype('<f')). I would load only the data for the current time step to reduce the memory usage; see the sketch below.
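
A minimal sketch of what that could look like in the RequestData script, assuming the names from your script (executive, outInfo, scalars, dir_name, output) and that the file index can be recovered from UPDATE_TIME_STEP:

import numpy

timesteps = outInfo.Get(executive.TIME_STEPS())
t = outInfo.Get(executive.UPDATE_TIME_STEP())
# index of the requested time step; argmin guards against floating-point drift
i = int(numpy.argmin(numpy.abs(numpy.array(timesteps) - t)))

for varname in scalars:
    # example file format: 'dir_name/p-001.bin', as in your script
    fname = "{}{}-{}.bin".format(dir_name, varname, str(i).zfill(3))
    # read one file's worth of data instead of the whole series
    data = numpy.fromfile(fname, dtype=numpy.dtype("<f"))
    output.PointData.append(data, varname)

output.GetInformation().Set(output.DATA_TIME_STEP(), timesteps[i])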

Ok, based on my tests the increase in memory looks exponential. However, I don’t think I’m appending data to itself. I did take your advice and now load only the data for the current step. I also noticed that I wasn’t actually updating the time step correctly, but I believe I’ve fixed that now. Unfortunately, the code still consumes memory exponentially.

Here is the updated code:

from paraview.simple import (
    _DisableFirstRenderCameraReset,
    GetRenderView,
    ProgrammableSource,
    Render,
    Show,
)
_DisableFirstRenderCameraReset()

request_info = """
# Activate the virtualenv with necessary packages
activate_this = '/Users/jslawitsky/Desktop/research/venv/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))

import numpy
import yaml
from vtk.numpy_interface import dataset_adapter as dsa
from vtk.numpy_interface import algorithms as algs

executive = self.GetExecutive()
outInfo = executive.GetOutputInformation(0)
with open("/Users/jslawitsky/Desktop/research/sample/input/config.yaml", "r") as fp:
    config = yaml.load(fp)

outInfo.Set(
    executive.WHOLE_EXTENT(),
    0, config["grid_points.x"] - 1,
    0, config["grid_points.y"] - 1,
    0, config["grid_points.z"] - 2,
)
outInfo.Set(
    vtk.vtkDataObject.SPACING(),
    config["domain_size.x"] / config["grid_points.x"],
    config["domain_size.y"] / config["grid_points.y"],
    config["domain_size.z"] / (config["grid_points.z"] - 1),
)
outInfo.Set(
    vtk.vtkDataObject.ORIGIN(),
    0,
    0,
    (config["domain_size.z"] / (config["grid_points.z"] - 1)) / 2,
)
outInfo.Remove(executive.TIME_STEPS())

start = config["output.instantaneous.skip"] * config["time_step_size"]
stop = config["number_of_time_steps"] * config["time_step_size"]
step = config["output.instantaneous.frequency"] * config["time_step_size"]

for timestep in numpy.linspace(start, stop, (stop - start) / step):
    outInfo.Append(executive.TIME_STEPS(), timestep)

outInfo.Remove(executive.TIME_RANGE())
outInfo.Append(executive.TIME_RANGE(), start)
outInfo.Append(executive.TIME_RANGE(), stop)
"""

request_data = """
# Activate the virtualenv with necessary packages
activate_this = "/Users/jslawitsky/Desktop/research/venv/bin/activate_this.py"
execfile(activate_this, dict(__file__=activate_this))

import numpy
import os
import yaml
from distutils.util import strtobool

from vtk.numpy_interface import algorithms as algs
from vtk.numpy_interface import dataset_adapter as dsa

executive = self.GetExecutive()
outInfo = executive.GetOutputInformation(0)

with open("/Users/jslawitsky/Desktop/research/sample/input/config.yaml", "r") as fp:
    config = yaml.load(fp)

exts = [executive.UPDATE_EXTENT().Get(outInfo, i) for i in xrange(6)]
nbfiles = (
    (config["number_of_time_steps"] - config["output.instantaneous.skip"]) /
    config["output.instantaneous.frequency"]
)

dims = [
    exts[1] - exts[0] + 1,
    exts[3] - exts[2] + 1,
    exts[5] - exts[4] + 1,
]

output.SetExtent(exts)

scalars = {"p", "u", "v", "w"}
if strtobool(config["output.instantaneous.write_eddy_viscosity"]):
    scalars.add("nu_t")

points_per_step = (
    config["grid_points.x"] *
    config["grid_points.y"] *
    (config["grid_points.z"] - 1)
)

dir_name = "/Users/jslawitsky/Desktop/research/sample/output/instantaneous-fields/"

def get_scalar_variable_data(i, varname):
    # example file format: 'dir_name/p-001.bin'
    fname = "{}{}-{}.bin".format(dir_name, varname, str(i).zfill(3))
    with open(fname, "rb") as fp:
        data = numpy.fromfile(fp, dtype=numpy.dtype("<f"), count=-1, sep="")

    return data

timesteps = outInfo.Get(executive.TIME_STEPS())
for i in xrange(nbfiles):
    if outInfo.Has(executive.UPDATE_TIME_STEP()):
        t = timesteps[i]
    else:
        t = 0

    for varname in scalars:
        data = get_scalar_variable_data(i, varname)
        # output.PointData.append(data, varname)

    output.GetInformation().Set(output.DATA_TIME_STEP(), t)
"""

View1 = GetRenderView()
pSource1 = ProgrammableSource()
pSource1.OutputDataSetType = 'vtkImageData'
pSource1.ScriptRequestInformation = request_info
pSource1.Script = request_data
pSource1.UpdatePipelineInformation()

Show(pSource1, View1)
# Render()

Here’s the outInfo, which looks correct to me:

vtkInformation (0x7fa432ecc560)
  Debug: Off
  Modified Time: 296594
  Reference Count: 3
  Registered Events: (none)
  UPDATE_TIME_STEP: 0
  COMBINED_UPDATE_EXTENT: 0 63 0 63 0 63
  CONSUMERS: vtkPVPostFilterExecutive(0x7fa435d2bf40) port 0
  UPDATE_EXTENT: 0 63 0 63 0 63
  TIME_RANGE: 2.5 50
  TIME_STEPS: 2.5 5.13889 7.77778 10.4167 13.0556 15.6944 18.3333 20.9722 23.6111 26.25 28.8889 31.5278 34.1667 36.8056 39.4444 42.0833 44.7222 47.3611 50
  SPACING: 0.0981748 0.0981748 0.015625
  WHOLE_EXTENT: 0 63 0 63 0 63
  ORIGIN: 0 0 0.0078125
  UPDATE_NUMBER_OF_GHOST_LEVELS: 0
  UPDATE_PIECE_NUMBER: 0
  UPDATE_NUMBER_OF_PIECES: 1
  DATA_OBJECT: vtkImageData(0x7fa432ecf7d0)
  PRODUCER: vtkPVCompositeDataPipeline(0x7fa432ecb5a0) port 0

However, when I print output.GetInformation() I see a difference:

vtkInformation (0x7f90265d7f60)
  Debug: Off
  Modified Time: 296606
  Reference Count: 2
  Registered Events: (none)
  DATA_EXTENT: 0 63 0 63 0 63
  DATA_EXTENT_TYPE: 1

Maybe this has something to do with it?

I’ll use this post to track what I’ve tried so far:

  1. Reshaping the ndarray to the grid dimensions and appending that to the output.
  2. Downgrading from the latest version of ParaView to version 3.98.

So far nothing has worked. Based on the documentation, I can append an ndarray to the output like so: output.PointData.append(ndarray, "data_name"). Or am I wrong about that?

That sounds right to me.

But the reshaping part doesn’t: your ndarray should have an entry for each point on the grid, but it should not be the shape of the grid. It should be (volume x 1) for scalar data and (volume x 3) for vector data, where volume is the number of voxels (points) in your grid.
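
A minimal sketch of the expected shapes, using the 64 x 64 x 64 extent from your outInfo dump ("velocity" is just a hypothetical vector array name):

import numpy

volume = 64 * 64 * 64  # number of points in the grid

scalar = numpy.zeros(volume, dtype=numpy.float32)       # flat, one value per point: OK
vector = numpy.zeros((volume, 3), dtype=numpy.float32)  # one 3-vector per point: OK
# grid_shaped = scalar.reshape((64, 64, 64))            # the shape of the grid: not what append expects

output.PointData.append(scalar, "p")
output.PointData.append(vector, "velocity")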

I uninstalled 5.5.2 and reinstalled it, and I no longer get the memory leaks. The grid and my scalars are visible in the client; however, when I run the animation there are no visual changes. When I print data for each scalar over each time step, the values are unique per time step and scalar. Below is the relevant code; can anyone tell me what I’m missing?

timesteps = outInfo.Get(executive.TIME_STEPS())
for i in xrange(nbfiles):
    t = timesteps[i]

    for varname in scalars:
        data = get_scalar_variable_data(i, varname)
        output.PointData.append(data, varname)

    output.GetInformation().Set(output.DATA_TIME_STEP(), t)

Would someone be willing to jump on a Skype call with me to debug this? I think it’s a trivial mistake, but I can’t seem to figure it out.

when I run the animation there are no visual changes

Try adding a call to RenderAllViews() at the point(s) where you want to see the changes.
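
For example, from pvpython you can step through the time values and force a render at each one (a sketch; it assumes the source registered its time steps in RequestInformation so the time keeper sees them):

from paraview.simple import GetAnimationScene, RenderAllViews

scene = GetAnimationScene()
scene.UpdateAnimationUsingDataTimeSteps()  # pick up TIME_STEPS from the source

for t in scene.TimeKeeper.TimestepValues:
    scene.AnimationTime = t  # triggers a pipeline update at time t
    RenderAllViews()         # redraw every view for this time step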

BTW, in case it’s related: in the past I tried using the Python calculator to import columns as time steps. I had no luck with it, in particular because every time I tried the calc script I had to restart ParaView; a second execution of the script simply did not work until I restarted. System: Windows 7 x64, at the time with the “stable” release of ParaView 5.5.

Try restarting.

Looks like you’re putting together a time-aware source. For those, you need to provide a script for RequestInformation too. Is that provided? Did you look at section 13.2.2 in the ParaView User’s Guide? It has an example of a CSV file series reader, with example code snippets for all the necessary parts.
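
For reference, the minimal shape of such a RequestInformation script, following the same pattern as the scripts above (a sketch; the three time values are placeholders):

executive = self.GetExecutive()
outInfo = executive.GetOutputInformation(0)

# advertise the available time steps and their overall range
outInfo.Remove(executive.TIME_STEPS())
for t in [0.0, 1.0, 2.0]:
    outInfo.Append(executive.TIME_STEPS(), t)

outInfo.Remove(executive.TIME_RANGE())
outInfo.Append(executive.TIME_RANGE(), 0.0)
outInfo.Append(executive.TIME_RANGE(), 2.0)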

P.S. Please start a new thread for a different question; it makes it easier for anyone to follow the conversation. Thanks.