[Bug]: Memory not freed as expected after plotting heavy plot involving looping #27138
Please provide a reproducible example. For instance, how many cases do you have and how large is each one? Thanks.
@jklymak OK, I have added fake data. I have tested the updated script and it should work out of the box as it stands now. Of course, you won't see that the RAM is not freed before the end of the script if the script ends as soon as the figure is closed or displayed; in reality there are more calculations after that. EDIT: I have therefore added some time.sleep in my MWE toy script after the garbage collection, so that you can see that the memory load remains long after the image has been shown/closed, as it does in reality. Thanks a lot for your time. Please beware if your RAM is lower than 32 GB: with 3e6 points, I get a usage of a bit more than 17 GB. Since you added 'needs clarification': the answer is ~3 million floats and ~50 samples. You'd get the same for either more or less than that; it simply always happens, no matter how much data I consider (but it's easier to see when there is a lot of data).
TL;DR: Either use the object-oriented API without pyplot, or use a non-interactive backend (details below). I disagree that this is distinct from #20300; in the case where you are using a GUI backend it is exactly the same problem.

I think part of the issue here is a confusion over the internal structure of pyplot. Because Python manages memory for us, it won't be freed until Python agrees that there are no references to the objects left. If you create the Figure via pyplot (e.g. `plt.figure()`), pyplot keeps a hard reference to it in its internal state until it is explicitly removed via `plt.close`.

To solve this you have two options. The first is to make sure that there are no hard references to the objects that you do not control (so that the scoping / lifetime rules of Python apply as you expect!). The easiest way to do that is to not use pyplot. This example: https://matplotlib.org/stable/gallery/user_interfaces/web_application_server_sgskip.html#embedding-in-a-web-application-server-flask is written for using Matplotlib inside of a web server, but that is a stand-in for any long-running process that will generate lots of figures a human will never look at (in that process). The other option is to use a non-interactive backend that will not create any extraneous GUI-related objects. For the interactive backends there are even more circular references because we need to hook up callbacks to the UI event system so that we can correctly react to user input by changing the Figure (pan / zoom / mouse-over coords, picking, etc.).

https://github.com/matplotlib/mpl-gui is some prototype work on how to make some of this easier to manage.
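For readers landing here: the pyplot-free pattern that the linked example relies on boils down to something like this (a minimal sketch; `make_figure` and the sample data are illustrative names, not from the issue):

```python
import numpy as np
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

def make_figure(data, fname):
    # No pyplot: the Figure is an ordinary Python object, so normal
    # scoping rules apply and nothing global keeps it alive.
    fig = Figure(figsize=(15, 8), dpi=150)
    FigureCanvasAgg(fig)  # attaches itself to fig for rendering
    ax = fig.add_subplot()
    ax.set_yscale("log")
    ax.plot(data)
    fig.savefig(fname)
    # On return, fig (and its canvas) go out of scope and can be freed.

make_figure(np.sort(np.random.random(10_000)), "out.png")
```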
@tacaswell I did read that issue. However, in the case of this issue, I do want to show the figures, and the problem still takes place when I do.
That being said, I am of course going to read in more detail what you explain (I am far from being a matplotlib expert). I really do appreciate that you have taken the time to write this, but I'm just afraid that you might too hastily close this — which is why I wanted to react quickly to your comment.
The key detail is that the GUI event loop needs to be allowed to run (see https://matplotlib.org/stable/users/explain/figure/interactive_guide.html). Adding a `plt.pause` gives the event loop a chance to actually process the window teardown before the script moves on.
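Concretely, the suggested ordering is something like the following (a minimal sketch; the pause length is arbitrary):

```python
import gc
import matplotlib.pyplot as plt

plt.plot([1, 2, 3])    # stand-in for the real (heavy) figure
plt.show(block=False)  # hand control to the GUI without blocking
plt.pause(0.5)         # let the GUI event loop process pending events
plt.close("all")       # drop pyplot's hard references to the figures
gc.collect()           # break the remaining GUI-related reference cycles
```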
EDIT: My apologies, I was too quick in my reply, so I have retracted my comment and decided to check all this further. I have now tried what you just said (though I had already made some tests with these backends myself before opening this issue).

**Using the 'Agg' backend with pyplot + plt.close('all')**

Does not work. I have also tried adding `plt.pause`:

```python
import matplotlib.pyplot as plt
import numpy as np
import matplotlib
matplotlib.use('Agg')
# %matplotlib inline

da = np.sort(np.random.random(int(3e6)))  # Beware: please lower this if your system has less than 32 GB

def custom_plot(da, **kwargs):
    """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
    plt.yscale('log')
    n = len(da)
    return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs)

def resampling_method(da, case):
    """
    A complex thing in reality but for the MWE, let us simply return da itself.
    It will lead to the same memory problem.
    """
    return da

plt.figure(figsize=(15, 8), dpi=150)
plt.ylim((1e-10, 1))
for case in np.arange(50):
    custom_plot(resampling_method(da, case))  # each time getting the curve for a different resampling of da
custom_plot(da)  # curve for the original da
plt.savefig("output.png")

# plt.show()
plt.show(block=True)

import gc
gc.collect()
plt.close("all")
# plt.pause(.01)
plt.pause(1)

print("Technically the programme would continue with more calculations.")
print("Notice how the memory won't be freed however until the entire script is finished.")

import time
time.sleep(120)
print("Now the programme exits")
```

Does this actually work for you @tacaswell? When running python from the terminal, the memory is not freed.
This is using the Qt5Agg backend, I mean.
Again, sorry: by accident I sent my previous reply too quickly. In fact, while it does not work with 'Agg' (see above), it seemed more promising with 'Qt5Agg', though it is not reliable. This is why I was then confused.

**Qt5Agg**

```python
if True:
    import matplotlib.pyplot as plt
    import numpy as np
    import matplotlib
    matplotlib.use('Qt5Agg')  # likely requires: pip install PyQt5
    # %matplotlib inline

    da = np.sort(np.random.random(int(3e6)))  # Beware: please lower this if your system has less than 32 GB

    def custom_plot(da, **kwargs):
        """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
        plt.yscale('log')
        n = len(da)
        return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs)

    def resampling_method(da, case):
        """
        A complex thing in reality but for the MWE, let us simply return da itself.
        It will lead to the same memory problem.
        """
        return da

    plt.figure(figsize=(15, 8), dpi=150)
    plt.ylim((1e-10, 1))
    for case in np.arange(50):
        custom_plot(resampling_method(da, case))  # each time getting the curve for a different resampling of da
    custom_plot(da)  # curve for the original da
    plt.savefig("output.png")

    # plt.show()
    plt.show(block=True)

    import gc
    gc.collect()
    plt.close("all")
    # plt.pause(.01)
    plt.pause(1)

    print("Technically the programme would continue with more calculations.")
    print("Notice how the memory won't be freed however until the entire script is finished.")

    import time
    time.sleep(120)
    print("Now the programme exits")
```

In Jupyter it consistently frees the memory right after displaying the figure. With python from the CLI, however, it is not reliable enough:
Still, what I can say with certainty is that it does not always free the memory either. NB: I moreover tried moving around exactly where gc is invoked; that didn't seem to work either.
So far, the only reliable solution that does not lead to memory problems, in either Python from the CLI or Jupyter, is hack number 3, which I mentioned when opening the issue.
If I move the tail of the script from a passive `time.sleep` to a loop that keeps allocating objects (so that CPython's cyclic garbage collector actually gets a chance to run), the memory is freed:

```python
import matplotlib.pyplot as plt
import numpy as np
import matplotlib

matplotlib.use("Agg")
# %matplotlib inline

da = np.sort(
    np.random.random(int(3e6))
)  # Beware: please lower this if your system has less than 32 GB


def custom_plot(da, **kwargs):
    """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
    plt.yscale("log")
    n = len(da)
    return plt.plot(da, 1 - (np.arange(1, n + 1) / (n + 1)), **kwargs)


def resampling_method(da, case):
    """
    A complex thing in reality but for the MWE, let us simply return da itself.
    It will lead to the same memory problem.
    """
    return da


plt.figure(figsize=(15, 8), dpi=150)
plt.ylim((1e-10, 1))
for case in np.arange(50):
    custom_plot(
        resampling_method(da, case)
    )  # each time getting the curve for a different resampling of da
custom_plot(da)  # curve for the original da
plt.savefig("output.png")

# plt.show()
plt.show(block=True)

import gc

gc.collect()
plt.close("all")
# plt.pause(.01)
plt.pause(1)

print("Technically the programme would continue with more calculations.")
print(
    "Notice how the memory won't be freed however until the entire script is finished."
)


def test():
    # Build self-referencing dicts: pure reference cycles that only
    # the cyclic garbage collector can reclaim.
    d = {}
    for j in range(50):
        d[j] = {"parent": d}
    return d


import time

deadline = time.monotonic() + 150
while time.monotonic() < deadline:
    test()

print("Now the programme exits")
```
Indeed, thank you. However, the fact that using "Agg" doesn't always free the memory, despite the use of `plt.close("all")`, `gc.collect()` and `plt.pause`, really looks more like a bug than reasonable behaviour. Luckily, in comparison, following your earlier comments, I have found that switching to the backend used by default in Jupyter — even in Python from the CLI — is actually very satisfactory.

```python
### Switch to the backend used by default in Jupyter.
### The best: it systematically frees the memory.
if True:
# if False:
    import matplotlib.pyplot as plt
    import numpy as np
    import matplotlib
    matplotlib.use('module://matplotlib_inline.backend_inline')  # < USE ME
    print(matplotlib.get_backend())
    # %matplotlib inline

    da = np.sort(np.random.random(int(3e6)))  # Beware: please lower this if your system has less than 32 GB

    def custom_plot(da, **kwargs):
        """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
        plt.yscale('log')
        n = len(da)
        return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs)

    def resampling_method(da, case):
        """
        A complex thing in reality but for the MWE, let us simply return da itself.
        It will lead to the same memory problem.
        """
        return da

    plt.figure(figsize=(15, 8), dpi=150)
    plt.ylim((1e-10, 1))
    for case in np.arange(50):
        custom_plot(resampling_method(da, case))  # each time getting the curve for a different resampling of da
    custom_plot(da)  # curve for the original da
    plt.savefig("output.png")

    ## one of these is needed with Jupyter; while Python is also fine with plt.close():
    # plt.show(block=True)
    plt.show()

    ## absolutely needed:
    import gc
    gc.collect()

    ## not necessary with the default Jupyter backend:
    # plt.close("all")
    # plt.pause(1)

    print("Technically the programme would continue with more calculations.")
    # print("Notice how the memory won't be freed however until the entire script is finished.")
    print("With the default jupyter backend: success!")

    import time
    time.sleep(60)
    print("Now the programme exits")
```

It requires less code: no `plt.close("all")` and no `plt.pause` needed. And everything is properly cleaned up, as expected, as soon as possible. Even if one adds a mere sleep at the end, it matters not: it simply works wonderfully, both in Jupyter and as a python script executed from the terminal.
The reason it looks like it works is that you are manually calling `gc.collect()`. If you do not like this behavior I suggest you take it up with CPython ;)
Yes, maybe I should. Coming from C++, this is very surprising to me. That Jupyter aggressively closes figures did help me find what I believe is a nice solution in the meanwhile. Thanks for your time!
Because running the garbage collector can be very expensive in terms of run time (it has to freeze the world to do it), if you have a lot of other long-lived objects you will waste a lot of time for possibly very small gains in how promptly memory is released.
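A quick way to see that cost for yourself (a self-contained sketch, not from the thread; timings will vary by machine):

```python
import gc
import time

# A large population of long-lived container objects that every full
# collection has to traverse.
keep_alive = [{"i": i, "payload": [i] * 5} for i in range(1_000_000)]

start = time.perf_counter()
gc.collect()  # full, stop-the-world pass over all tracked objects
print(f"gc.collect() with many live objects: {time.perf_counter() - start:.3f} s")
```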
I didn't mean calling the garbage collector; I meant freeing the memory in a clean way as soon as it makes sense. It seems to me that it shouldn't remain in memory if not explicitly requested by the programmer. In C++ (e.g. rule of zero + STL containers or smart pointers), you can have the destructor called automatically and everything cleaned up as soon as the objects go out of scope — no need for a garbage collector. In this matplotlib case, objects appear to be longer-lived than they reasonably should be.

**Modified MWE, with scope**

Here's the MWE, but where everything related to the figure is put inside a custom function `memory_issues_demo`:

```python
import matplotlib.pyplot as plt
import matplotlib
import numpy as np


def memory_issues_demo():
    da = np.sort(np.random.random(int(1e6)))

    def custom_plot(da, **kwargs):
        """da is a 1-D ordered xarray.DataArray or numpy array (containing tons of data)"""
        plt.yscale('log')
        n = len(da)
        return plt.plot(da, 1 - (np.arange(1, n+1) / (n+1)), **kwargs)

    def resampling_method(da, case):
        """
        A complex thing in reality but for the MWE, let us simply return da itself.
        It will lead to the same memory problem.
        """
        return da

    plt.figure(figsize=(15, 8), dpi=150)
    plt.ylim((1e-10, 1))
    for case in np.arange(50):
        custom_plot(resampling_method(da, case))  # each time getting the curve for a different resampling of da
    custom_plot(da)  # curve for the original da

    print("Displaying the plot in a window to the user.")
    plt.show()
    print()
    print("The window has been closed by the user.")

    # Optional: not freed consistently even when using gc manually:
    if True:
        import gc
        gc.collect()
        print("Optional extra step: moreover explicitly called gc.collect().")

    print("Memory issues: not consistently freed.")


print("--------------------------------------")
print("Entering the subroutine.")
memory_issues_demo()
print("Subroutine exited.")
print()

# Expectation: what is inside the function is now out of scope
# Example: da is out of scope
# print(da)
print("Nothing from what the programmer wrote indicates that that specific plot should live on (no explicit reference to the plot outside of the function, and the window was closed before going out of scope).")
print()
print("Still, the original figure lives on in memory for some reason...")
print("Notice how the memory isn't going to be freed until the entire script is finished.")
print()
print("Let us do some more work")
memory_issues_demo()
print("Now the programme exits.")
```
Those print statements do not take any time at all. I'm not sure how you are diagnosing that the memory is not being freed "soon" after your method has finished.
Objects cannot be freed until all references to them are gone, or you would get segfaults (just as if you used `free` on still-referenced memory in C). The upside of Python managing memory for you (via reference counting) is that segfaulting via a prematurely deallocated pointer is not possible (from Python... you can still introduce segfaults from C extensions), but a cost is that you lose control of exactly when memory is freed.
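If you want to see which objects are still pinning a figure, the `gc` module can list them (a diagnostic sketch, not from the thread; note that the enclosing namespace itself shows up as a referrer via the `fig` name):

```python
import gc
import matplotlib
matplotlib.use("Agg")  # backend under test
import matplotlib.pyplot as plt

fig = plt.figure()
plt.plot([1, 2, 3])
plt.close(fig)  # removes pyplot's own reference

# Whatever prints here is still keeping the Figure reachable.
for referrer in gc.get_referrers(fig):
    print(type(referrer))

del fig
gc.collect()  # with the last name gone, the cycles can now be collected
```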
Well, it's quite simple, really. The key is that there are 2 calls to `memory_issues_demo()` in this code.

If you monitor the RAM load, you will see [between A and B] that the memory ramps up as the first plot is being made, followed by a plateau while the plot is being displayed (so far, so good). Once the window is closed [B], however, the memory is usually not freed at all. And this happens even though we have actually left the function (`memory_issues_demo`). Nonetheless, you keep the same heavy memory load as when the figure was being displayed.

Now we move on: while the memory from the first plot has still not been freed (even though we left the function and made no attempt to keep the plot alive, e.g. by explicitly assigning it to a variable that lives on), we call the function a second time [between B and C]. On top of the memory load from the first plot, you then see the memory ramp up again, leading to an even higher plateau, making the problem worse and worse the more matplotlib is used. The memory is only freed when the entire script is finished [C].

The MWE is there, just try it out. And if you want to see the issue even more dramatically, just execute the function a few more times in a row, say 10 or 20 times. Your RAM and swap will soon fill up completely, leading to a crash.
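For anyone who wants to reproduce the measurement rather than eyeball a system monitor, here is one way to log the resident set size at the A/B/C points (a sketch assuming the third-party `psutil` package and the `memory_issues_demo` function from the MWE above):

```python
import psutil

_proc = psutil.Process()

def log_rss(label):
    # Current resident set size of this process, in GiB.
    print(f"[{label}] RSS = {_proc.memory_info().rss / 2**30:.2f} GiB")

log_rss("A: before the first call")
memory_issues_demo()
log_rss("B: after the first call")
memory_issues_demo()
log_rss("C: after the second call")
```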
@tacaswell I thought that you understood what I meant, but maybe you didn't. I can only imagine, ever since I opened the issue, that this situation is due to the internals of matplotlib, which keep a reference to the plot and refuse to let it go, even when the programmer never asked for this, and even when closing is explicitly requested.

BTW: for good rationale regarding memory management, Rust is a good inspiration. There are many Python libraries able to deal with tons of data without leading to memory problems (e.g. numpy, pandas, xarray, ...); this problem really seems matplotlib specific. What I have been hinting at is that I believe this really ought to be reconsidered/addressed. From your answers it seemed that you were not interested, however, which is why I closed the issue (as soon as I found that the default backend used in Jupyter is at least such that the memory is correctly freed when it ought to be).

(edit: italic & bold are used for emphasis; you're the first person I've ever seen on the internet who confuses this with all caps...)
@spacescientist I do understand what you are saying. I understand you are frustrated and confused, but please do not "yell" at me with bold text. We take memory leaks very seriously and fix them whenever we find them; however, you have not identified any actually unexpected behavior or leak.

To try and summarize the state of things:

1. pyplot keeps a hard reference to every figure it creates until `plt.close` is called on it.
2. the interactive backends add circular references between the Figure and the GUI event system, so even after `plt.close` the memory is only actually released once the event loop and the cyclic garbage collector have both had a chance to run.
If I simplify out the extra indirection and run both the event loop and the garbage collector:

```python
import gc
import matplotlib.pyplot as plt
import numpy as np


def memory_issues_demo():
    gc.collect()
    plt.pause(2)
    N = 1e6
    da = np.sort(np.random.random(int(N)))
    plt.figure()
    for j in np.arange(51):
        plt.yscale("log")
        plt.plot(da, (j + 1) * (1 - (np.arange(1, N + 1) / (N + 1))))
    # Close the window automatically 5 s after it is shown.
    tmr = plt.gcf().canvas.new_timer(
        interval=5000, callbacks=[(lambda: plt.close("all"), (), {})]
    )
    tmr.single_shot = True
    tmr.start()
    plt.show()


for j in range(50):
    memory_issues_demo()
```

this will release all of its memory every time through. If you move your mouse through the window, we will keep an extra hard ref (deprecated in #25101, and I have a PR queued up to finish the removal) to one figure, but there will not be run-away memory usage.

The inline backend does not provide any interactivity, so you avoid (2) from above.
Another potential solution for this is to use a `ProcessPoolExecutor` (even with a single process) to loop over plots. This works even better than collecting garbage in-process, since all of the memory is returned to the OS when the worker process exits. An example might look like the following, where your `create_plot` function builds the figure and returns the rendered image:

```python
from concurrent.futures import ProcessPoolExecutor

from PIL import Image


def create_plot(data) -> Image.Image:
    ...
    return img


my_list_of_data = [...]

with ProcessPoolExecutor(max_workers=...) as executor:
    images = executor.map(create_plot, my_list_of_data)
    for image in images:
        image.show()
```
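A filled-in version of that skeleton could look like this (a sketch under stated assumptions: Agg rendering in the workers and a PNG round-trip through a buffer to build the `PIL` image; other conversions are possible):

```python
import io
from concurrent.futures import ProcessPoolExecutor

import numpy as np
from PIL import Image


def create_plot(data) -> Image.Image:
    # Import matplotlib inside the worker so each process sets up
    # its own (non-GUI) backend.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig = plt.figure(figsize=(15, 8), dpi=150)
    n = len(data)
    plt.yscale("log")
    plt.plot(data, 1 - (np.arange(1, n + 1) / (n + 1)))

    buf = io.BytesIO()
    fig.savefig(buf, format="png")
    plt.close(fig)
    buf.seek(0)
    img = Image.open(buf)
    img.load()  # force the pixel data to be read before buf goes away
    return img


if __name__ == "__main__":
    my_list_of_data = [np.sort(np.random.random(10_000)) for _ in range(4)]
    with ProcessPoolExecutor(max_workers=1) as executor:
        for image in executor.map(create_plot, my_list_of_data):
            image.show()
```

All of the figure-related memory lives and dies in the worker processes; the parent only ever holds the decoded images.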
Solution
#27138 (comment)
Bug summary
I work with a large 1D dataset. For a statistical analysis, I am using the bootstrap method, resampling it many times.
I am interested in looping over all cases in order to put together on a single figure a specific result for all resamplings.
Memory issues take place though (e.g. memory not being freed before the very end of the script, or even outright leaks).
Here I document some things that at least partially address the issue. None is fully satisfactory though.
I am running the same script both from Python and as a Jupyter notebook (synchronised via jupytext). I am trying to get rid of the memory issues in both cases (the RAM usage easily reaches 16–32 GB once I start playing with enough data).
Code for reproduction
Actual outcome
Memory issues are taking place no matter what I've tried so far. Depending on what is being attempted, it can lead to the memory either not being freed after the plot has been shown/is closed, or even memory leaks and massive swap usage.
Expected outcome
Memory freed well before the end of the programme. I would expect it to be freed soon after the figure is closed.
Additional information
NB: I did also try many other things (incl. `plt.cla` and the like), as well as changing backend (notably "Agg" and "Qt5Agg"), but that did not solve the problem in the slightest, so I won't document them.

**Things that have some effect**

With `plt.show(block=False)`, `time.sleep` and `close('all')`, the memory will be freed after the plot has been created, both with Jupyter and Python. However, in Python, a window will be created (stealing focus) and nothing will ever appear in it (it will be closed after 5 seconds). It'd therefore be tempting to comment out `plt.show(block=False)`, but if you do, Jupyter will no longer clear the memory...

With this:

Both will clean the memory after that figure has been closed (or rather, 5 seconds after). This is the most satisfactory one... not exactly nice though.

Should one want to have the figure displayed when running python from the CLI, however, I haven't found a method where the memory wouldn't remain in use until the very end of the entire script.

Some further notes:

There are known memory issues with matplotlib and looping, such as http://datasideoflife.com/?p=1443, but here I do not create a figure at each iteration of a loop; rather, I accumulate plots from a loop and plot the end result. The solution put forward there (i.e. use `plt.close(fig)`) does not work in this case.

This is also distinct from "Memory leak in plt.close() when unshown figures in GUI backends are closed" #20300.
Operating system
Ubuntu
Matplotlib Version
3.7.3
Matplotlib Backend
module://matplotlib_inline.backend_inline (default)
Python version
3.8.10
Jupyter version
6.5.2
Installation
pip