-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+3] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) #8002
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Thanks for tackling this, @raghavrv . Did you run memory benchmarks to verify that safe_realloc fixes the issue? I thought it was more deep seated than that...(though I'd be happy if it was not 😄 ) |
I ran the snippet from #7811 (comment) on this PR and the memory leak seems to be taken care of. Not a cython expert so I won't comment about the actual changes introduced by this PR. |
Do you think such a snippet could be added as a test? I don't think there is precedence for memory leak tests. But I think we should have... Maybe as a more general test as a part of #4841? |
@nelson-liu Yeah! Loic's snippet seems to be stable in this PR... |
It takes plus than one minute to run so the short answer is probably not in its current form. We have some memory usage based tests in joblib. Look at https://github.com/joblib/joblib/blob/master/joblib/test/test_numpy_pickle.py#L368 and https://github.com/joblib/joblib/blob/master/joblib/test/common.py#L40 for example. In our experience they tend to be a bit brittle (you need the memory peak to be not too short-lived for memory_profiler to have a chance to see it ...). |
From a cython POV this looks good, but I haven't tested it to make sure it fixes the issue. |
if array == NULL: | ||
# no free; __dealloc__ handles that | ||
return -1 | ||
self.array_ = array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not quite relevant to the issue at hand, but do you think these same fixes should be applied for the other data structures in utils.pyx?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed it can be extended to all the data structures but I first need to confirm if releasing gil in safe_realloc
is safe... I can't figure out why it was not done so far... @glouppe or @jmschrei can help us figure that out maybe? :) If the current changes look okay, then I can replace all malloc
calls with safe_realloc
... There is also yet another line in tree.pyx
where safe_realloc
was explicitly avoided because it used to not release gil...
The thing is tree tests are not very extensive as they need to be and I am paranoid that these kind of clean ups extraneous to issue at hand might break user code in subtle ways...
Thx for raising the question!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be fine. I'm guessing it wasn't put in originally because there was no need to. If it's not using and GIL operations anyway then there shouldn't be a difference.
return -1 | ||
self.array_ = array | ||
with gil: | ||
safe_realloc(&self.array_, self.capacity) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a stupid question (I'm not an expert in Cython :[ ): why do you need with gil
here and not in l346 ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a non expert in cython, I had exactly the same thought :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that some of the changes proposed include editing safe_realloc to be a nogil function, so as a nonexpert in cython I feel like l346 is correct and the GIL block isn't necessary here...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I started out with a gil blocking safe_realloc
and tried to make it release gil to see how it performs. I forgot to update here... Will push a commit. Thanks for the catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No problem it's also a way to me to learn cython :)
raise MemoryError("could not allocate (%d * %d) bytes" | ||
% (nelems, sizeof(p[0][0]))) | ||
with gil: | ||
raise MemoryError("could not allocate (%d * %d) bytes" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These exceptions aren't propagated i.e. I don't think it will actually end the function's execution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try
%%cython
cdef a() nogil:
with gil:
raise ValueError('Something broke')
def b():
a()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True but having except *
ensures that it propagates back to the calling function (and then to interpreter)
%%cython
cdef void a() nogil except *:
with gil:
raise ValueError('Something broke')
def b():
a()
b()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I forgot those details. Haven't checked now. Just check that all nogil paths continue to propagate by that means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. Thanks!
And I think as long as gil is acquired when we raise the error, it will propagate when except *
construct is used... Infact raising cannot be done without gil for the same reason... I think this is why the whole safe_realloc
was not made to release gil? I now changed it to acquire gil only when the memory is not sufficient and hence error has to be raised...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With that confirmation could I have your +1 for merge? This would be nice to have as long as there aren't any side effects...
@jmschrei With Loic's confirmation that it fixes the issue, are you +1 for merge? |
@@ -341,8 +343,7 @@ cdef class WeightedPQueue: | |||
cdef void reset(self) nogil: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but doesn't this require an except
clause??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And same for push
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and every nogil function that calls them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually brings up a bigger issue. Will you need to change every function/method in the tree directory which calls safe_realloc to make sure it propogates the error back?
When we (@larsmans, IIRC) first implemented safe_realloc, we required GIL
to ensure error propagation. As far as I can tell, all callers of
safe_realloc atm use the GIL and are likely not called within nogil
contexts. This PR changes that.
…On 13 December 2016 at 05:20, Jacob Schreiber ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sklearn/tree/_utils.pyx
<#8002>:
> @@ -341,8 +343,7 @@ cdef class WeightedPQueue:
cdef void reset(self) nogil:
This actually brings up a bigger issue. Will you need to change every
function/method in the tree directory which calls safe_realloc to make sure
it propogates the error back?
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#8002>, or mute the
thread
<https://github.com/notifications/unsubscribe-auth/AAEz692iI0Vwfr4zGvGro2pq0yZSf0SQks5rHZBngaJpZM4LG1Il>
.
|
Sorry for the delay in the response... I think we should use Eitherways I think we need to fix this memory leak soon... |
Or for this PR let's have the |
%%cython
cdef void gil_noexcept():
raise ValueError('At gil_noexcept: This will not be propagated')
cdef void nogil_noexcept() nogil:
# Acquire gil only when you want to raise
with gil:
raise ValueError('At nogil_noexcept: This will also not be propagated')
cdef void gil_except() except *:
raise ValueError('At gil_except: This will be propagated')
cdef void nogil_except() nogil except *:
# Acquire gil only when you want to raise
with gil:
raise ValueError('At nogil_except: This will also be propagated')
cdef void super_function() except *:
for fn in (gil_noexcept, nogil_noexcept,
gil_except, nogil_except):
try:
fn()
except Exception as e:
print(e)
def python_fn():
super_function()
python_fn() At gil_except: This will be propagated
At nogil_except: This will also be propagated
Exception ignored in: '_cython_magic_c6c634898ce44ef1e9180f02f86d5f76.gil_noexcept'
ValueError: At gil_noexcept: This will not be propagated
Exception ignored in: '_cython_magic_c6c634898ce44ef1e9180f02f86d5f76.nogil_noexcept'
ValueError: At nogil_noexcept: This will also not be propagated |
So from cython docs -
I think it's safe to assume that all @jnothman @glouppe @jmschrei WDYT? Can I also request a comment from other cython experts (@larsmans or @jakevdp) here please :) |
(Manually checking which ones call functions (that call others maybe and so on) that raise an error is error-prone / could change in the future and we do not have tests for these kind of memory errors / leaks. Which is why I'm suggesting adding "except *" for all cdef declarations as a best practice in tree code...) |
be31ba0
to
086506c
Compare
I was thinking of merging this and fixing it in another PR... But appveyor is not green anyhow. I'll fix it here... And thanks for pointing out, that is correct |
And done... The appveyor seems to be stuck in the queue for a long time... |
(Or it is taking it's revenge on me for pushing so many times :@) |
A bit of both. I've just cancelled a whole lot of AppVeyor runs to see if it'll come unstuck. I've not yet found one in progress (not queued, not finished, not cancelled), so I don't know why it's stuck. |
I don' think the appveyor problem is related to scikit-learn. Matplotlib currently has the same problem. |
Indeed even the cancelling operation has very slow throughput. |
According to them, this build seems to be the issue... Anyway now I can see one build running... |
But things are starting again, now. |
Thanks!! Meanwhile can you take a look at the final commit and approve. The appveyor should pass in about 30 minutes. I'll merge in an hour... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think those duplicate errors are correctly removed.
Thanks a lot for the reviews everyone! |
Thanks for the fix! (and sorry about not having time for a review, my bandwidth has been quite limited these days for scikit-learn, unfortunately...) |
@jnothman Do we need a bugfix whatsnew for this? |
@glouppe No problem :) I'll have a new PR very soon for you ;P |
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002) * FIX MAE reg. criterion: Use safe_realloc to avoid memory leak * Release GIL in safe_realloc and clean up scaffolding * As gil is released in safe_realloc, no need of a with gil block * Use except * to propagate error in all cdef functions * Don't use except * for functions that return python objects * Don't use except * for the comparison function passed to qsort * Omissions and Errors * Use safe_realloc now that gil is released there * Fix realloc size * Acquire GIL only if we need to raise * Use except * more judiciously; Release gil only when raising; Add comments to clarify * Actually that was unneeded; realloc will also allocate for the first time * StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc * Use except -1 to propagate exceptions. This should avoid overheads * Fix docstrings and add return 0 to reset methods * TYPO * REVIEW Remove redundant MemoryError raising calls
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002) * FIX MAE reg. criterion: Use safe_realloc to avoid memory leak * Release GIL in safe_realloc and clean up scaffolding * As gil is released in safe_realloc, no need of a with gil block * Use except * to propagate error in all cdef functions * Don't use except * for functions that return python objects * Don't use except * for the comparison function passed to qsort * Omissions and Errors * Use safe_realloc now that gil is released there * Fix realloc size * Acquire GIL only if we need to raise * Use except * more judiciously; Release gil only when raising; Add comments to clarify * Actually that was unneeded; realloc will also allocate for the first time * StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc * Use except -1 to propagate exceptions. This should avoid overheads * Fix docstrings and add return 0 to reset methods * TYPO * REVIEW Remove redundant MemoryError raising calls
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002) * FIX MAE reg. criterion: Use safe_realloc to avoid memory leak * Release GIL in safe_realloc and clean up scaffolding * As gil is released in safe_realloc, no need of a with gil block * Use except * to propagate error in all cdef functions * Don't use except * for functions that return python objects * Don't use except * for the comparison function passed to qsort * Omissions and Errors * Use safe_realloc now that gil is released there * Fix realloc size * Acquire GIL only if we need to raise * Use except * more judiciously; Release gil only when raising; Add comments to clarify * Actually that was unneeded; realloc will also allocate for the first time * StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc * Use except -1 to propagate exceptions. This should avoid overheads * Fix docstrings and add return 0 to reset methods * TYPO * REVIEW Remove redundant MemoryError raising calls
Not sure whether this or #8623 is the more appropriate place to write this, but I believe I have an issue with RandomForestRegressor leaking memory when used with the MAE criterion. My setup is the following: I'm drawing some learning curves, so I'm training and evaluating the model on increasingly bigger fragments of the dataset (I divide it into 30 "chunks"). My dataset is 6640 x 7, numpy.float32. With every new iteration of train -> predict -> replace the model variable with a new instance, I get a significant increase in memory consumption, up to the point when I run out of memory (10GB, around 9GB free for the experiment). If I skip the learning curves and just train the model once on the whole dataset, it fits nicely and consumes around 3GB of memory. Likewise, if I change the criterion to MSE and leave the other parameters unchanged, I can run the whole program, including incremental training, and the memory usage doesn't go above 200MB. I believe the leak happens somewhere in the My current configuration:
Parameters passed to the RandomForestRegessor are: n_estimators=5, max_features='auto', criterion='mae' |
please check with scikit-learn version 0.19. thanks
…On 15 Aug 2017 4:57 am, "Tommalla" ***@***.***> wrote:
Not sure whether this or #8623
<#8623> is the more
appropriate place to write this, but I believe I have an issue with
RandomForestRegressor leaking memory when used with the MAE criterion.
My setup is the following: I'm drawing some learning curves, so I'm
training and evaluating the model on increasingly bigger fragments of the
dataset (I divide it into 30 "chunks").
My dataset is 6640 x 7, numpy.float32.
With every new iteration of train -> predict -> replace the model variable
with a new instance, I get a significant increase in memory consumption, up
to the point when I run out of memory (10GB, around 9GB free for the
experiment).
If I skip the learning curves and just train the model once on the whole
dataset, it fits nicely and consumes around 3GB of memory. Likewise, if I
change the criterion to MSE and leave the other parameters unchanged, I can
run the whole program, including incremental training, and the memory usage
doesn't go above 200MB.
I believe the leak happens somewhere in the predict method, as reducing
the number of predict calls after the model is trained significantly
reduces the memory consumption (around 800MB per predict for that
dataset).
My current configuration:
>>> import platform; print(platform.platform())
Linux-4.12.4-1-ARCH-x86_64-with-arch-Arch-Linux
>>> import sys; print("Python", sys.version)
Python 3.6.2 (default, Jul 20 2017, 03:52:27)
[GCC 7.1.1 20170630]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.19.1
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.2
Parameters passed to the RandomForestRegessor are: n_estimators=5,
max_features='auto', criterion='mae'
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#8002 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEz6wrI5Fvbc40It-t-SeXaKgZoPF21ks5sYJiSgaJpZM4LG1Il>
.
|
@jnothman My bad, missed the new release and thought this was included in 0.18.2. Works fine now. Thank you! |
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002) * FIX MAE reg. criterion: Use safe_realloc to avoid memory leak * Release GIL in safe_realloc and clean up scaffolding * As gil is released in safe_realloc, no need of a with gil block * Use except * to propagate error in all cdef functions * Don't use except * for functions that return python objects * Don't use except * for the comparison function passed to qsort * Omissions and Errors * Use safe_realloc now that gil is released there * Fix realloc size * Acquire GIL only if we need to raise * Use except * more judiciously; Release gil only when raising; Add comments to clarify * Actually that was unneeded; realloc will also allocate for the first time * StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc * Use except -1 to propagate exceptions. This should avoid overheads * Fix docstrings and add return 0 to reset methods * TYPO * REVIEW Remove redundant MemoryError raising calls
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002) * FIX MAE reg. criterion: Use safe_realloc to avoid memory leak * Release GIL in safe_realloc and clean up scaffolding * As gil is released in safe_realloc, no need of a with gil block * Use except * to propagate error in all cdef functions * Don't use except * for functions that return python objects * Don't use except * for the comparison function passed to qsort * Omissions and Errors * Use safe_realloc now that gil is released there * Fix realloc size * Acquire GIL only if we need to raise * Use except * more judiciously; Release gil only when raising; Add comments to clarify * Actually that was unneeded; realloc will also allocate for the first time * StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc * Use except -1 to propagate exceptions. This should avoid overheads * Fix docstrings and add return 0 to reset methods * TYPO * REVIEW Remove redundant MemoryError raising calls
Fixes #7811
safe_realloc
instead of master'salloc
/calloc
-and-forget-about-previously-held-memory approach took care of the memory leak.)safe_realloc
everywhere (AddedStackRecord* and PriorityHeapRecord* to fused type
realloc_ptr`)except *
when appropriate)cc: @glouppe @jmschrei @nelson-liu @agramfort @lesteve
Is it okay to release gil in
safe_realloc
? Any reason why it was not done before?Also ref my mails to cython-devel:
except *
in cdef