
[Bug]: Memory not freed as expected after plotting heavy plot involving looping #27138


Closed
spacescientist opened this issue Oct 18, 2023 · 22 comments
Labels
Community support Users in need of help.

Comments

@spacescientist

spacescientist commented Oct 18, 2023

Solution

#27138 (comment)

Bug summary

I work with a large 1D dataset. For a statistical analysis, I am using the bootstrap method, resampling it many times.

I am interested in looping over all cases in order to put together on a single figure a specific result for all resamplings.
Memory issues occur, though (e.g. memory not freed until the very end of the script, or even outright leaks).

Here I document some things that at least partially address the issue. None is fully satisfactory though.
I am running the same script both from Python and as a Jupyter notebook (synchronised via jupytext). I am trying to get rid of the memory issues in both cases (the RAM usage easily reaches 16–32 GB once I start playing with enough data).

Code for reproduction

import matplotlib.pyplot as plt
import numpy as np

da = np.sort(np.random.random(int(3e6))) # Beware: please lower this if your system has less than 32 GB

def custom_plot(da, **kwargs):
    """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
    plt.yscale('log')
    n = len(da)
    return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs);

def resampling_method(da, case):
    """
    A complex thing in reality but for the MWE, let us simply return da itself.
    It will lead to the same memory problem.
    """
    return da

plt.figure(figsize=(15, 8), dpi=150)
plt.ylim((1e-10,1))

for case in np.arange(50):
    custom_plot(resampling_method(da, case)) # each time getting the curve for a different resampling of da
custom_plot(da) # curve for the original da

plt.savefig("output.png")
plt.show()

import gc
gc.collect();

print("Technically the programme would continue with more calculations.")
print("Notice how the memory won't be freed however until the entire script is finished.")
import time
time.sleep(120) # This simulates the fact that the programme continues with other calculations.
print("Now the programme exits")

Actual outcome

Memory issues are taking place no matter what I've tried so far. Depending on what is being attempted, it can lead to the memory either not being freed after the plot has been shown/is closed, or even memory leaks and massive swap usage.

Expected outcome

Memory freed well before the end of the programme. I would expect it to be freed soon after the figure is closed.

Additional information

NB: I did also try many other things (incl. plt.cla and the like), as well as changing backend (notably "Agg" and "Qt5Agg") but that did not solve the problem in the slightest, so I won't document them.

Things that have some effect

  1. If you do plt.show()
  • It will show the plot in a new window when run from the terminal with Python, but the memory usage related to the figure won't be freed after that; it will remain in use until the end of the entire script.
  • It will, however, be freed in Jupyter soon after the figure is displayed.
plt.show()
import gc
gc.collect();
  2. If you use block=False, time.sleep and close('all'), the memory will be freed after the plot has been created, both with Jupyter and Python. However, in Python, a window will be created (stealing focus) and nothing will ever appear in it (it will be closed after 5 seconds). It'd therefore be tempting to comment out plt.show(block=False), but if you do, Jupyter will no longer clear the memory...
plt.show(block=False) ## If you comment this out, then Jupyter will not clear the memory..
import time
time.sleep(5)
plt.close('all')
import gc
gc.collect();
  3. Given what precedes, let us check whether Jupyter or Python is being used.
def type_of_script():
    """source: https://stackoverflow.com/a/47428575/452522"""
    try:
        ipy_str = str(type(get_ipython()))
        if 'zmqshell' in ipy_str:
            return 'jupyter'
        if 'terminal' in ipy_str:
            return 'ipython'
    except NameError:  # get_ipython is not defined outside IPython
        return 'terminal'

if type_of_script() == "jupyter":
    plt.show(block=False) ## If you comment this out, then Jupyter will not clear the memory..
else:
    pass
import time
time.sleep(5)
plt.close('all')
import gc
gc.collect();

With this:

  • Jupyter will create a file and will also display the figure inline.
  • Python will only create a file and won't try to show a window.

Both will clear the memory after the figure has been closed (or rather, 5 seconds after). This is the most satisfactory one... not exactly nice though. Should one want the figure displayed when running python from the CLI, however, I haven't found a method where the memory doesn't remain in use until the very end of the entire script.

Some further notes:

Operating system

Ubuntu

Matplotlib Version

3.7.3

Matplotlib Backend

module://matplotlib_inline.backend_inline (default)

Python version

3.8.10

Jupyter version

6.5.2

Installation

pip

@jklymak
Member

jklymak commented Oct 18, 2023

Please provide a reproducible example. For instance how many cases do you have and how large is each one? Thanks.

@jklymak added the status: needs clarification label Oct 18, 2023
@spacescientist
Author

spacescientist commented Oct 18, 2023

@jklymak OK, I have added fake data. I have tested the updated script and it should work out of the box as it stands now.

Of course, you won't see that the RAM is not freed before the end of the script if the script ends as soon as the figure is closed or displayed. In reality there are more calculations after that.

EDIT: I have therefore added some time.sleep in my MWE toy script after the garbage collection, so that you can see that the memory load remains long after the image has been shown/closed, as it does in reality.

Thanks a lot for your time.

Please beware if your RAM is lower than 32 GB. With 3e6, I get a usage of a bit more than 17 GB.

Since you added 'needs clarification': the answer is ~3 million floats and ~50 samples. You'd get the same with either more or less than that. It simply always happens, no matter how much data I consider (but it's easier to see with a lot of data).

@tacaswell
Member

TL;DR Either using the 'agg' backend with pyplot + plt.close('all') or not using pyplot at all will fix the problem.

I disagree that this is distinct from #20300; in the case where you are using a GUI backend it is exactly the same problem.


I think part of the issue here is a confusion over the internal structure of pyplot. The inline backend (which is maintained by the Jupyter team) is different in some important ways from every other backend (particularly the ones we support directly). For a long discussion of these differences see matplotlib/ipympl#171 and a short summary ipython/matplotlib-inline#13 (comment). The most important one in this case is that the inline backend removes a figure from the pyplot global registry on show (that is, the Figure objects self-destruct at the end of a cell!).
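
For illustration, a minimal sketch of that self-destruct behaviour as seen from a plain script (this assumes matplotlib-inline and IPython are installed; the figure contents are arbitrary):

import matplotlib
matplotlib.use("module://matplotlib_inline.backend_inline")
import matplotlib.pyplot as plt

plt.figure()
plt.plot([1, 2, 3])
print(plt.get_fignums())  # [1] -- the figure is registered with pyplot
plt.show()                # the inline backend displays, then drops the figure
print(plt.get_fignums())  # []  -- nothing left in the global registry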

Because Python manages memory for us, it won't be freed until Python agrees that there are no references to the objects left. If you create the Figure via pyplot then we hold a hard reference to the Figure, and internally Matplotlib has a lot of circular references. Until all non-cyclic references are gone and Python runs gc at the right generation level, our objects won't be cleared. See discussion in #23712 for more details.

To solve this you have two options. The first is to make sure that there are no hard references to the objects that you do not control (so that the scoping / lifetime rules of Python apply as you expect!). The easiest way to do that is to not use pyplot. This example: https://matplotlib.org/stable/gallery/user_interfaces/web_application_server_sgskip.html#embedding-in-a-web-application-server-flask is written for using Matplotlib inside of a webserver, but that is a stand-in for a long-running process that will generate lots of figures a human will never look at (in that process).
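
A minimal sketch of that pyplot-free pattern, adapted here (as an assumption, not code from the linked example) to the MWE above:

import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg
from matplotlib.figure import Figure

da = np.sort(np.random.random(int(1e5)))  # small stand-in for the MWE data

def make_figure(da, fname):
    fig = Figure(figsize=(15, 8), dpi=150)
    FigureCanvasAgg(fig)  # attach an Agg canvas; nothing global registers it
    ax = fig.add_subplot()
    ax.set_yscale("log")
    ax.set_ylim(1e-10, 1)
    n = len(da)
    for case in range(50):  # stand-in for the resampling loop
        ax.plot(da, 1 - (np.arange(1, n + 1) / (n + 1)))
    fig.savefig(fname)
    # fig goes out of scope on return; no pyplot registry keeps it alive

make_figure(da, "output.png")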

The other option is to use a non-interactive backend that will not create any extraneous GUI related objects. For the interactive backends there are even more circular references because we need to hook up callbacks to the UI event system so that we can correctly react to user input by changing the Figure (pan / zoom / mouse over coords, picking, etc).

https://github.com/matplotlib/mpl-gui is some prototype work on how to make some of this easier to manage.

@spacescientist
Author

@tacaswell From that issue, I did read:
"This only happens if you work in a GUI backend, create new figures, but don't show them."

However in the case of this issue, I do want to show the figures. The issue does take place when using plt.show().

@spacescientist
Author

spacescientist commented Oct 18, 2023

That being said, I am of course going to read in more detail what you explain (I am far from being a matplotlib expert).

I really do appreciate that you have taken the time to write this, but I'm just afraid that you might too hastily close this — which is why I've wanted to react quickly to your comment.

@tacaswell
Member

The key detail is that the GUI event loop needs to be allowed to run (see https://matplotlib.org/stable/users/explain/figure/interactive_guide.html).

If you do plt.show(block=True), does the issue still happen?

Does adding a plt.pause(.01) after show fix the problem?

@spacescientist
Author

spacescientist commented Oct 18, 2023

EDIT: My apologies, I was too quick in my reply, so I have retracted my comment and decided to further check all this.

So I have tried what you just said (though I did already make some tests with these backends myself before opening this issue):

using the 'Agg' backend with pyplot + plt.close('all')

Does not work. I have also tried adding plt.show(block=True), or adding plt.pause, e.g.

import matplotlib.pyplot as plt
import numpy as np
import matplotlib
matplotlib.use('Agg')
# %matplotlib inline

da = np.sort(np.random.random(int(3e6))) # Beware: please lower this if your system has less than 32 GB

def custom_plot(da, **kwargs):
    """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
    plt.yscale('log')
    n = len(da)
    return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs);

def resampling_method(da, case):
    """
    A complex thing in reality but for the MWE, let us simply return da itself.
    It will lead to the same memory problem.
    """
    return da

plt.figure(figsize=(15, 8), dpi=150)
plt.ylim((1e-10,1))

for case in np.arange(50):
    custom_plot(resampling_method(da, case)) # each time getting the curve for a different resampling of da
custom_plot(da) # curve for the original da

plt.savefig("output.png")
#plt.show()
plt.show(block=True)

import gc
gc.collect();

plt.close("all")

#plt.pause(.01)
plt.pause(1)

print("Technically the programme would continue with more calculations.")
print("Notice how the memory won't be freed however until the entire script is finished.")
import time
time.sleep(120)
print("Now the programme exits")

Does this actually work for you @tacaswell ? When running python from the terminal, the memory is not freed.

@tacaswell
Member

matplotlib.use('Qt5Agg') ## likely requires: pip install PyQt5

This is using the Qt5Agg backend; I meant

matplotlib.use('Agg')

@spacescientist
Author

spacescientist commented Oct 18, 2023

Again, sorry: by accident I did send my previous reply too quickly.

But in fact, while it does not work with 'Agg' (see above), it seemed more promising with 'Qt5Agg', though it is not reliable.

This is why I was then confused.

Qt5Agg

if True:
    import matplotlib.pyplot as plt
    import numpy as np

    import matplotlib
    matplotlib.use('Qt5Agg') ## likely requires: pip install PyQt5
    # %matplotlib inline

    da = np.sort(np.random.random(int(3e6))) # Beware: please lower this if your system has less than 32 GB

    def custom_plot(da, **kwargs):
        """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
        plt.yscale('log')
        n = len(da)
        return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs);

    def resampling_method(da, case):
        """
        A complex thing in reality but for the MWE, let us simply return da itself.
        It will lead to the same memory problem.
        """
        return da

    plt.figure(figsize=(15, 8), dpi=150)
    plt.ylim((1e-10,1))

    for case in np.arange(50):
        custom_plot(resampling_method(da, case)) # each time getting the curve for a different resampling of da
    custom_plot(da) # curve for the original da

    plt.savefig("output.png")
    #plt.show()
    plt.show(block=True)

    import gc
    gc.collect();

    plt.close("all")

    #plt.pause(.01)
    plt.pause(1)

    print("Technically the programme would continue with more calculations.")
    print("Notice how the memory won't be freed however until the entire script is finished.")
    import time
    time.sleep(120)
    print("Now the programme exits")

In Jupyter it consistently frees the memory right after displaying the figure.

With python from the CLI however, it is not reliable enough:

  • it most often frees the memory if I do Alt+F4 without interacting at all with the figure.
  • however, sometimes the memory is not freed (it seems to happen if I interact with the figure in any way, e.g. click on it or change its size, before closing it). It is hard to identify exactly what triggers this; it might of course not be caused by interacting with the figure but by something else.

Still, what I can say with certainty is that it does not always free the memory either.

NB: I moreover tried moving exactly where gc is called; that didn't seem to help either.

@spacescientist
Author

So far, the only reliable solution that does not lead to memory problems in either Python from the CLI or Jupyter is hack number 3, which I mentioned when opening the issue.

@tacaswell
Member

If I move gc.collect() to after plt.pause() but before time.sleep(), it releases the memory for me with 'agg'.

time.sleep(120) is not actually a good stand-in for "doing other work" because it is one bytecode and then a C function, so it will not trip any of Python's automatic gc logic! See https://devguide.python.org/internals/garbage-collector/ for details about how that works, in particular https://devguide.python.org/internals/garbage-collector/#collecting-the-oldest-generation. The TL;DR is that you need to create enough container objects, living long enough, that the interpreter decides it is worth running the generation-2 sweep. With a sleep (or a simple hot loop like 1+1) that does not happen, and so the unreachable objects holding the large arrays never get cleaned up (notably, this was the same issue with #23712 and how we had unbounded growth). By using a slightly pathological test function that does nothing but make circularly referenced containers, we can force the issue and see the big chunk of memory drop off almost immediately.

import matplotlib.pyplot as plt
import numpy as np

import matplotlib

matplotlib.use("Agg")  ## likely requires: pip install PyQt5
# %matplotlib inline

da = np.sort(
    np.random.random(int(3e6))
)  # Beware: please lower this if your system has less than 32 GB


def custom_plot(da, **kwargs):
    """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
    plt.yscale("log")
    n = len(da)
    return plt.plot(da, 1 - (np.arange(1, n + 1) / (n + 1)), **kwargs)


def resampling_method(da, case):
    """
    A complex thing in reality but for the MWE, let us simply return da itself.
    It will lead to the same memory problem.
    """
    return da


plt.figure(figsize=(15, 8), dpi=150)
plt.ylim((1e-10, 1))

for case in np.arange(50):
    custom_plot(
        resampling_method(da, case)
    )  # each time getting the curve for a different resampling of da
custom_plot(da)  # curve for the original da

plt.savefig("output.png")
# plt.show()
plt.show(block=True)

import gc

gc.collect()

plt.close("all")

# plt.pause(.01)
plt.pause(1)

print("Technically the programme would continue with more calculations.")
print(
    "Notice how the memory won't be freed however until the entire script is finished."
)


def test():
    d = {}
    for j in range(50):
        d[j] = {"parent": d}
    return d


import time

deadline = time.monotonic() + 150
while time.monotonic() < deadline:
    test()

print("Now the programme exits")

@spacescientist
Author

spacescientist commented Oct 19, 2023

Indeed, thank you. However, the fact that using "Agg" doesn't always free the memory despite the use of:

  • not only plt.show()
  • but also gc.collect()
  • as well as plt.close("all")
  • to which we should add plt.pause(1)
  • and still remains dependent on exactly what code follows

really looks more like a bug than reasonable behaviour.

Luckily, in comparison, following your earlier comments, I have found that switching to the backend used by default in Jupyter — even in Python with CLI — is actually very satisfactory.

### Switch to the backend used by default in Jupyter.

### The best: it systematically frees the memory.

if True:
# if False:
    import matplotlib.pyplot as plt
    import numpy as np

    import matplotlib
    matplotlib.use('module://matplotlib_inline.backend_inline') # < USE ME
    print(matplotlib.get_backend())
    # %matplotlib inline

    da = np.sort(np.random.random(int(3e6))) # Beware: please lower this if your system has less than 32 GB

    def custom_plot(da, **kwargs):
        """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
        plt.yscale('log')
        n = len(da)
        return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs);

    def resampling_method(da, case):
        """
        A complex thing in reality but for the MWE, let us simply return da itself.
        It will lead to the same memory problem.
        """
        return da

    plt.figure(figsize=(15, 8), dpi=150)
    plt.ylim((1e-10,1))

    for case in np.arange(50):
        custom_plot(resampling_method(da, case)) # each time getting the curve for a different resampling of da
    custom_plot(da) # curve for the original da

    plt.savefig("output.png")
    
    ## one of these is needed with Jupyter; while Python is also fine with plt.close():
#     plt.show(block=True)
    plt.show()

    ## absolutely needed:
    import gc
    gc.collect();

    ## not necessary with default jupyter backend:
#     plt.close("all")
#     plt.pause(1)

    print("Technically the programme would continue with more calculations.")
#     print("Notice how the memory won't be freed however until the entire script is finished.")
    print("With the default jupyter backend: success!")
    import time
    time.sleep(60)
    print("Now the programme exits")

It requires less code: neither plt.close("all") nor plt.pause(1) is needed besides gc.collect().

And everything is properly cleaned up as soon as possible. Even if all that follows is a mere sleep, it matters not: it simply works wonderfully, both in Jupyter and as a python script executed from the terminal.

@tacaswell
Member

The reason it looks like it works is that you are manually calling gc.collect().

If you do not like this behavior I suggest you take it up with CPython ;)

@spacescientist
Author

Yes, maybe I should. Coming from C++, this is very surprising to me.

That Jupyter aggressively closes figures did help me find what I believe is a nice solution in the meantime.
I am just not sure at all why you wouldn't want this behaviour by default, especially as soon as e.g. plt.close() is called.

Thanks for your time!

@tacaswell
Member

I am just not sure at all why you wouldn't want this behaviour by default,

Because running the garbage collector can be very expensive in terms of run time: it has to freeze the world to do it. If you have a lot of other long-lived objects, you will waste a lot of time for possibly very small gains in how promptly memory is released.
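
A rough sketch of that cost (an illustration with assumed sizes; timings will vary by machine):

import gc
import time

junk = [{} for _ in range(3_000_000)]  # many long-lived container objects

t0 = time.perf_counter()
gc.collect()  # a full sweep must also walk all of `junk`
print(f"full collection took {time.perf_counter() - t0:.2f}s")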

@spacescientist
Author

I didn't mean calling the garbage collector; I meant freeing the memory in a clean way as soon as it makes sense. It seems to me that memory shouldn't remain in use if not explicitly requested by the programmer.

In C++ (e.g. rule of zero + STL containers or smart pointers), you can have the destructor called automatically and everything cleaned up as soon as the objects go out of scope — no need for a garbage collector.

In this matplotlib case, objects appear to be longer-lived than they reasonably should be.

Modified MWE, with scope

Here's the MWE, but with everything related to the figure put inside a custom function memory_issues_demo(), which does not return anything and which does show the plot before exiting. Even though the programmer makes no reference to the plot outside of memory_issues_demo(), the memory usually won't be freed by matplotlib when the function exits.

import matplotlib.pyplot as plt
import matplotlib
import numpy as np

def memory_issues_demo():

	da = np.sort(np.random.random(int(1e6)))

	def custom_plot(da, **kwargs):
	    """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
	    plt.yscale('log')
	    n = len(da)
	    return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs);

	def resampling_method(da, case):
	    """
	    A complex thing in reality but for the MWE, let us simply return da itself.
	    It will lead to the same memory problem.
	    """
	    return da

	plt.figure(figsize=(15, 8), dpi=150)
	plt.ylim((1e-10,1))

	for case in np.arange(50):
	    custom_plot(resampling_method(da, case)) # each time getting the curve for a different resampling of da
	custom_plot(da) # curve for the original da
	
	print("Displaying the plot in a window to the user.")
	plt.show()
	print()
	
	print("The window has been closed by the user.")
	
	# Optional: not freed consistently even when using gc manually:
	if True:
		import gc
		gc.collect()
		print("Optional extra step: moreover explicitly called gc.collect().")

print("Memory issues: not consistently freed.")
print("--------------------------------------")

print("Entering the subroutine.")
memory_issues_demo()
print("Subroutine exited.")
print()

# Expectation: what is inside the function is now out of scope
# Example: da is out of scope
#print(da)

print("Nothing from what the programmer wrote indicates that that specific plot should live on (no explicit reference to the plot outside of the function, and the window was closed before going out of scope).")
print()

print("Still, the original figure lives on in memory for some reason...")
print("Notice how the memory isn't going to be freed until the entire script is finished.")
print()


print("Let us do some more work")
memory_issues_demo()

print("Now the programme exits.")

@jklymak
Copy link
Member

jklymak commented Oct 21, 2023

Those print statements do not take any time at all. I'm not sure how you are diagnosing that the memory is not being freed "soon" after your method has finished.

@tacaswell
Member

Objects cannot be freed until all references to them are gone, or you get segfaults (just as if you new/malloc, hand the pointer off in C++, then delete/dealloc it, and then whatever is holding that pointer tries to use it). CPython does this with reference counting (pypy does it differently) and will free the memory when the count goes to 0. If there are circular references, then you need the garbage collector to break the cycles. All allocations in CPython are effectively heap allocations (as an implementation detail CPython maintains a PyObject pool) and there is no concept of stack allocation.

The upside of Python managing memory for you (via reference counting) is that segfaulting via a prematurely deallocated pointer is not possible (from Python... you can still introduce segfaults from c-extensions), but a cost is that you lose control of exactly when memory is freed.
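
A tiny sketch of the cycle problem (illustrative names; not code from this thread):

import gc
import weakref

class Node:
    pass

a = Node()
a.self_ref = a          # a reference cycle: the refcount can never reach 0
alive = weakref.ref(a)  # observe the object without keeping it alive
del a                   # no names left, but the cycle persists
print(alive() is None)  # False -- still in memory
gc.collect()            # the collector detects and breaks the cycle
print(alive() is None)  # True  -- now freed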

@spacescientist
Author

spacescientist commented Oct 22, 2023

@jklymak

Those print statements do not take any time at all. I'm not sure how you are diagnosing that the memory is not being freed "soon" after your method has finished.

Well, it's quite simple, really.
For this, you should of course look at the memory load (with top, htop, task manager, whatever you like)...

The key is that there are 2 function calls in this code:
memory_issues_demo()  # [A]
and then again later
memory_issues_demo()  # [B]
before the script finally ends [C].

[screenshot: memory load over the run]

If you monitor the RAM load, you will see [between A and B] that the memory will ramp up as the first plot is being made, followed by a plateau while the plot is being displayed (so far, so good).

Once the window is closed [B], however, the memory is usually not freed at all. And this happens even though we have actually left the function (da, which was used for the plot, is completely out of scope, mind you).

Nonetheless, you keep the same heavy memory load as when it was being displayed.

Now we move on: while the memory from the first plot has still not been freed, even though we left the function and made no attempt to keep the plot alive (e.g. by explicitly assigning it to a variable that lives on), we call the function a second time [between B and C]. On top of the memory load from the first plot, you then see the memory ramp up again, leading to an even higher plateau; the problem gets worse and worse the more matplotlib is used.

The memory is only freed when the entire script is finished [C].

The MWE is there, just try it out. And if you want to see the issue even more dramatically, just execute the function a few more times in a row, say 10 or 20 times. Your RAM and swap will soon be filled completely, leading to a crash.

@spacescientist
Author

spacescientist commented Oct 22, 2023

@tacaswell I thought that you understood what I meant, but maybe you didn't.

I can only imagine, ever since I opened the issue, that this situation is due to the internals of matplotlib, which keep a reference to the plot and refuse to let it go, even though the programmer never asked for this, and even when closing is explicitly requested. BTW: for good rationale regarding memory management, Rust is a good inspiration.

There are many Python libraries able to deal with tons of data without running into memory problems (e.g. numpy, pandas, xarray, ...); this problem really seems matplotlib-specific.

What I have been hinting at is that I believe this really ought to be reconsidered/addressed.

From your answers, it seemed that you were not interested, however, which is why I closed the issue (as soon as I found that the default backend used in Jupyter is at least such that the memory is correctly freed when it ought to be).

(edit: italics & bold are used for emphasis; you're the first person I've ever seen on the internet who confuses this with all caps...)

@tacaswell added the Community support label and removed the status: needs clarification label Oct 27, 2023
@tacaswell
Member

@spacescientist I do understand what you are saying. I understand you are frustrated and confused, but please do not "yell" at me with bold text. We take memory leaks very seriously and fix them whenever we find them; however, you have not identified any actually unexpected behavior or leak.

To try and summarize the state of things:

  1. If you do plt.figure (or anything that calls it for you under the hood) you are explicitly asking Matplotlib to keep a hard reference to the Figure object. This is logically required for the pyplot API to work: if plt.plot(...) called from anywhere in your code is going to add a line to the current plot, then we must have some global state around! If you do not want to opt in to that global state, then do not use pyplot.
  2. If you additionally create a GUI window associated with the Figure, then some additional global state must be created, because the GUI toolkit keeps track of its objects, we keep track of the widget we are drawing to plus the toolbar, and UI callbacks need to know about Matplotlib objects. Most GUI event loops demand to run on the main thread, and the Python interpreter also runs on the main thread (unless you do embedding, but that is its own can of worms), so they have to come to some sort of time-share agreement. Your options are to: a: let Python run the show (but then your GUI is non-responsive / might not even show you the plot); b: let the GUI run the show (but then your script blocks wherever you start the event loop in your code... you can use the event loop to call back into Python, but then you are writing a GUI application, not a script); c: find some way to let them share (either using PyOS_InputHook / prompt toolkit's input hook / ... to let the UI run while waiting for user input, or manually running the event loop in bursts). Because the GUI toolkits (notably, mostly written in C++) need to be careful about how they tear down widgets, closing a window does not immediately call the destructor but puts some "I'm about to close" events on the event loop that need to be flushed. If you close the figure but don't run the event loop, you may end up with objects living until you have run the event loop long enough for them to go down (maybe we could add a flush_events to our "shut the event loop down" code, but I would be skeptical that it would always fix the problem, because those APIs tend to only process events that were pending when you called them, so they always run in bounded time (processing an event cannot add an event to the queue and keep processing going)). Running the event loop will eventually sort this out, but until the C++ side sorts itself out, the Python side may be kept alive.
  3. Internal to the Figure the Artists are a double-linked tree (nodes know about both their parents and their children) so we have a significant number of circular references.
  4. Python manages the memory for you. There is no way to force Python to deallocate an object (see the warnings in https://docs.python.org/3/reference/datamodel.html#object.__del__). The upside is that there is a whole class of "use after deallocation" segfault bugs which you cannot have from Python (you can cause them with extension modules), but the trade-off is that you (the user) lose fine-grained control of when the destructor is called. My understanding is that "as soon as the refcount hits 0" is an implementation detail of CPython, and one I suspect is going to get mushy with the no-GIL work that is moving forward (my understanding is that it does per-thread reference counting with occasional consolidation, so you could have two threads each holding a ref that both drop them (so no hard refs left) but the object won't get deleted until the two threads reconcile their counts; I have only been following this work on the edges, so treat that as speculation).
  5. To deal with circular references CPython uses a garbage collector (this is not part of the Python spec and is considered an implementation detail... pypy does it differently) to detect and break cycles. However, running the collection process can be (time-wise) expensive, so CPython uses some heuristics to determine when to do the sweeps, along with a "generation" scheme to do less work but still clear short-lived objects. We used to call gc.collect() in our close-figure logic, but this was putting pathological delays in user code that had many long-lived objects. We then tried doing gc.collect(1) to split the difference, and that caused unbounded memory usage in some cases. Now we do not call gc.collect at all and leave that to the interpreter (which in turn provides a public API for users to override the heuristics; a short sketch of those knobs follows this list).
  6. We make a copy of the data passed into plot (as a loop that mutates and plots the same numpy array multiple times would otherwise break in surprising ways!), so the lifetime of da is a red herring.
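
A short sketch of the public gc knobs mentioned in (5) (the values here are illustrative, not recommendations):

import gc

print(gc.get_threshold())    # CPython defaults, e.g. (700, 10, 10)
gc.set_threshold(100, 5, 5)  # sweep more eagerly: CPU time for promptness
gc.collect(2)                # or explicitly run the oldest-generation sweep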

If I simplify out the extra indirection and run both the event loop and gc.collect:

import gc

import matplotlib.pyplot as plt
import numpy as np


def memory_issues_demo():
    gc.collect()
    plt.pause(2)

    N = 1e6

    da = np.sort(np.random.random(int(N)))

    plt.figure()

    for j in np.arange(51):
        plt.yscale("log")
        plt.plot(da, (j + 1) * (1 - (np.arange(1, N + 1) / (N + 1))))

    tmr = plt.gcf().canvas.new_timer(
        interval=5000, callbacks=[(lambda: plt.close("all"), (), {})]
    )
    tmr.single_shot = True
    tmr.start()
    plt.show()


for j in range(50):
    memory_issues_demo()

this will release all of its memory every time through. If you move your mouse through the window, we will keep an extra hard ref (deprecated in #25101, and I have a PR queued up to finish the removal) to one figure, but there will not be run-away memory usage.

The inline backend does not provide any interactivity so you avoid (2) from above.

@tacaswell closed this as not planned (won't fix, can't repro, duplicate, stale) Oct 27, 2023
@marty-sullivan

Another potential solution for this is to use a ProcessPoolExecutor (even with a single process) to loop over plots. This will work even better than gc.collect() in many cases.

An example might look like the following. Your create_plot function could output, for example, a PNG image of the plot, and then all references to pyplot resources will be fully freed when each subprocess terminates.

from concurrent.futures import ProcessPoolExecutor
from PIL import Image

def create_plot(data) -> Image.Image:
    ...

    return img

my_list_of_data = [...]

with ProcessPoolExecutor(max_workers=...) as executor:
    images = executor.map(create_plot, my_list_of_data)

for image in images:
    image.show()
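
A runnable sketch of this idea, with an assumed create_plot body (returning PNG bytes rather than PIL images, purely for illustration):

from concurrent.futures import ProcessPoolExecutor

import numpy as np


def create_plot(data) -> bytes:
    # Import inside the worker so all pyplot state lives only in the subprocess
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt
    from io import BytesIO

    fig, ax = plt.subplots(figsize=(15, 8), dpi=150)
    ax.set_yscale("log")
    n = len(data)
    ax.plot(data, 1 - (np.arange(1, n + 1) / (n + 1)))
    buf = BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    return buf.getvalue()  # PNG bytes pickle cleanly back to the parent


if __name__ == "__main__":
    my_list_of_data = [np.sort(np.random.random(100_000)) for _ in range(5)]
    with ProcessPoolExecutor(max_workers=1) as executor:
        images = list(executor.map(create_plot, my_list_of_data))
    for i, png in enumerate(images):
        with open(f"plot_{i}.png", "wb") as f:
            f.write(png)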
