
FIX fixes memory leak seen in PyPy in C losses #27670


Merged: 7 commits merged into scikit-learn:main on Oct 31, 2023

Conversation

glemaitre
Member

partially addresses #27662

Avoid using np.asarray, which creates a reference to the memory view that does not seem to be garbage collected in PyPy.
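As a minimal sketch of the behaviour being avoided (plain NumPy here; the actual change is in scikit-learn's Cython loss code): `np.asarray` on a memoryview does not copy, it wraps the existing buffer and holds a reference to it, so the buffer stays alive as long as the wrapper does. CPython's reference counting frees it promptly; PyPy's GC can delay collection, which is what showed up as a leak.

```python
import numpy as np

buf = np.zeros(4)
view = memoryview(buf)

# np.asarray does not copy: the result wraps the existing buffer
# and keeps a reference to it, so the memory cannot be collected
# while `wrapped` is alive.
wrapped = np.asarray(view)
wrapped[0] = 42.0

print(buf[0])                    # 42.0: same underlying memory
print(wrapped.flags["OWNDATA"])  # False: wrapped does not own its data
```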

@github-actions

github-actions bot commented Oct 26, 2023

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 32e9f17.

@lesteve
Member

lesteve commented Oct 26, 2023

The PyPy tests now run up to 98%, almost there!

@betatim
Member

betatim commented Oct 27, 2023

Is the plan to merge this even before we get to 100%? From a quick look at the diff it looks uncontroversial. I think technically it changes the API of the loss functions (they now return None) but I think they are internal so it isn't something we need to worry about.

@lorentzenchr do you want to have a look and think about this (I think you wrote a lot of this code?)?

@lorentzenchr
Member

We can remove those return statements; I guess the only use is testing and the Python loss.py submodule.
I had them there because I don't like Fortran-style coding, and they made using those functions more Pythonic.

The docstrings need to be modified, too.

@betatim
Member

betatim commented Oct 30, 2023

I assume the Fortran style is needed to make sure we reuse the same array/memory each time, and that this results in a significant performance improvement. Is that the right assumption? I don't know enough about the history of this code and the discussions around it, so if this has been discussed before I don't want to restart an old discussion, just check that someone has thought about this topic. Because I agree that having a more Pythonic API would be nicer.

@lorentzenchr
Member

I assume the Fortran style is needed to make sure we reuse the same array/memory each time and results in a significant performance improvement.

We want to achieve ufunc-like behaviour, e.g. np.add(a, b, out=c), and pass in array c instead of creating it each time (c can be the element-wise loss or a gradient). This saves memory and the time to allocate that memory on each call. It also fits perfectly with our fitting algorithms, where we usually only need the current gradient (and sometimes the previous one, but no more).
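The `out=` pattern described above can be sketched with plain NumPy ufuncs (the squared-error computation below is only illustrative, not scikit-learn's actual loss code):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 1.5, 2.0])

# Pre-allocate the output buffer once...
loss_out = np.empty_like(y_true)

# ...then reuse it on every call, ufunc-style, instead of
# allocating a fresh array per iteration.
np.subtract(y_true, y_pred, out=loss_out)
np.multiply(loss_out, loss_out, out=loss_out)  # squared error, in place

print(loss_out)  # [0.25 0.25 1.  ]
```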

@glemaitre
Member Author

This saves memory and the time to allocate that memory each time

I also like the fact that we usually allocate the memory via NumPy in Python and just interact with it. We don't have to free the buffer ourselves.
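A sketch of that ownership split, with a hypothetical in-place gradient helper standing in for the Cython routines: Python allocates the buffer through NumPy and keeps ownership, while the helper only fills it, so nothing on the C side ever has to free memory.

```python
import numpy as np

def fill_gradient(y_true, raw_prediction, gradient_out):
    """Hypothetical stand-in for a Cython loss routine: fills
    gradient_out in place (gradient of half squared error)."""
    np.subtract(raw_prediction, y_true, out=gradient_out)

y_true = np.array([1.0, 2.0])
raw_prediction = np.array([1.5, 2.5])

# The caller allocates through NumPy and owns the buffer; Python's
# memory management frees it when it goes out of scope.
gradient = np.empty_like(y_true)
fill_gradient(y_true, raw_prediction, gradient)

print(gradient)  # [0.5 0.5]
```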

glemaitre and others added 2 commits October 30, 2023 11:44
Co-authored-by: Tim Head <betatim@gmail.com>
Co-authored-by: Tim Head <betatim@gmail.com>
@lesteve lesteve merged commit a5fed0d into scikit-learn:main Oct 31, 2023
@lesteve
Member

lesteve commented Oct 31, 2023

Merging, thanks! Let's see if the PyPy build manages to complete from time to time with this fix 🤞 !

REDVM pushed a commit to REDVM/scikit-learn that referenced this pull request Nov 16, 2023