Skip to content

[MRG+3] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) #8002

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Jan 18, 2017

Conversation

raghavrv
Copy link
Member

@raghavrv raghavrv commented Dec 7, 2016

Fixes #7811

  • Fixes memory leak in MAE (Using safe_realloc instead of master's alloc/calloc-and-forget-about-previously-held-memory approach took care of the memory leak.)
  • Uniformly use safe_realloc everywhere (Added StackRecord* and PriorityHeapRecord* to fused type realloc_ptr`)
  • Acquire GIL only when an error needs to be raised
  • Propagate all errors to python interpreter level. (Use except * when appropriate)

cc: @glouppe @jmschrei @nelson-liu @agramfort @lesteve

Is it okay to release gil in safe_realloc? Any reason why it was not done before?

Also ref my mails to cython-devel:

@raghavrv raghavrv added the Bug label Dec 7, 2016
@raghavrv raghavrv added this to the 0.19 milestone Dec 7, 2016
@nelson-liu
Copy link
Contributor

Thanks for tackling this, @raghavrv . Did you run memory benchmarks to verify that safe_realloc fixes the issue? I thought it was more deep seated than that...(though I'd be happy if it was not 😄 )

@lesteve
Copy link
Member

lesteve commented Dec 7, 2016

I ran the snippet from #7811 (comment) on this PR and the memory leak seems to be taken care of.

Not a cython expert so I won't comment about the actual changes introduced by this PR.

@raghavrv
Copy link
Member Author

raghavrv commented Dec 7, 2016

Do you think such a snippet could be added as a test? I don't think there is precedence for memory leak tests. But I think we should have... Maybe as a more general test as a part of #4841?

@raghavrv
Copy link
Member Author

raghavrv commented Dec 7, 2016

@nelson-liu Yeah! Loic's snippet seems to be stable in this PR...

@lesteve
Copy link
Member

lesteve commented Dec 7, 2016

Do you think such a snippet could be added as a test?

It takes plus than one minute to run so the short answer is probably not in its current form.

We have some memory usage based tests in joblib. Look at https://github.com/joblib/joblib/blob/master/joblib/test/test_numpy_pickle.py#L368 and https://github.com/joblib/joblib/blob/master/joblib/test/common.py#L40 for example. In our experience they tend to be a bit brittle (you need the memory peak to be not too short-lived for memory_profiler to have a chance to see it ...).

@jmschrei
Copy link
Member

jmschrei commented Dec 7, 2016

From a cython POV this looks good, but I haven't tested it to make sure it fixes the issue.

if array == NULL:
# no free; __dealloc__ handles that
return -1
self.array_ = array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not quite relevant to the issue at hand, but do you think these same fixes should be applied for the other data structures in utils.pyx?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it can be extended to all the data structures but I first need to confirm if releasing gil in safe_realloc is safe... I can't figure out why it was not done so far... @glouppe or @jmschrei can help us figure that out maybe? :) If the current changes look okay, then I can replace all malloc calls with safe_realloc... There is also yet another line in tree.pyx where safe_realloc was explicitly avoided because it used to not release gil...

The thing is tree tests are not very extensive as they need to be and I am paranoid that these kind of clean ups extraneous to issue at hand might break user code in subtle ways...

Thx for raising the question!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be fine. I'm guessing it wasn't put in originally because there was no need to. If it's not using and GIL operations anyway then there shouldn't be a difference.

@raghavrv raghavrv requested a review from glouppe December 8, 2016 01:26
return -1
self.array_ = array
with gil:
safe_realloc(&self.array_, self.capacity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a stupid question (I'm not an expert in Cython :[ ): why do you need with gil here and not in l346 ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a non expert in cython, I had exactly the same thought :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that some of the changes proposed include editing safe_realloc to be a nogil function, so as a nonexpert in cython I feel like l346 is correct and the GIL block isn't necessary here...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started out with a gil blocking safe_realloc and tried to make it release gil to see how it performs. I forgot to update here... Will push a commit. Thanks for the catch!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem it's also a way to me to learn cython :)

raise MemoryError("could not allocate (%d * %d) bytes"
% (nelems, sizeof(p[0][0])))
with gil:
raise MemoryError("could not allocate (%d * %d) bytes"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These exceptions aren't propagated i.e. I don't think it will actually end the function's execution.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try

%%cython
cdef a() nogil:
  with gil:
    raise ValueError('Something broke')
def b():
    a()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True but having except * ensures that it propagates back to the calling function (and then to interpreter)

%%cython
cdef void a() nogil except *:
  with gil:
    raise ValueError('Something broke')
def b():
    a()
b()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I forgot those details. Haven't checked now. Just check that all nogil paths continue to propagate by that means.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Thanks!

And I think as long as gil is acquired when we raise the error, it will propagate when except * construct is used... Infact raising cannot be done without gil for the same reason... I think this is why the whole safe_realloc was not made to release gil? I now changed it to acquire gil only when the memory is not sufficient and hence error has to be raised...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With that confirmation could I have your +1 for merge? This would be nice to have as long as there aren't any side effects...

@raghavrv
Copy link
Member Author

From a cython POV this looks good, but I haven't tested it to make sure it fixes the issue.

@jmschrei With Loic's confirmation that it fixes the issue, are you +1 for merge?

@@ -341,8 +343,7 @@ cdef class WeightedPQueue:
cdef void reset(self) nogil:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but doesn't this require an except clause??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And same for push?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and every nogil function that calls them?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually brings up a bigger issue. Will you need to change every function/method in the tree directory which calls safe_realloc to make sure it propogates the error back?

@jnothman
Copy link
Member

jnothman commented Dec 13, 2016 via email

@raghavrv
Copy link
Member Author

Sorry for the delay in the response...

I think we should use except * by default for all compiled methods / functions written in cython... But releasing the gil for the whole function should not affect propagation of error. It's the except *... Releasing gil raises a compile time exception that errors cannot be raised without gil... Hence apart from the current changes (which acquire gil only when an error needs to be raised), I think adding except * for all cdef functions should solve it... I'll push a commit in a moment...

Eitherways I think we need to fix this memory leak soon...

@raghavrv
Copy link
Member Author

raghavrv commented Dec 20, 2016

Or for this PR let's have the except * only for functions that would end up calling safe_realloc and revisit this issue of adding "except * in every cdef" at a later time...

@raghavrv
Copy link
Member Author

raghavrv commented Dec 20, 2016

In nested calls, it seems like the calling function need not have except Arghh I was wrong the try.. except makes it print. The calling function and every higher level function needs to have this except *

%%cython

cdef void gil_noexcept():
    raise ValueError('At gil_noexcept: This will not be propagated')
    
cdef void nogil_noexcept() nogil:
    # Acquire gil only when you want to raise
    with gil:
        raise ValueError('At nogil_noexcept: This will also not be propagated')
    
cdef void gil_except() except *:
    raise ValueError('At gil_except: This will be propagated')

cdef void nogil_except() nogil except *:
    # Acquire gil only when you want to raise
    with gil:
        raise ValueError('At nogil_except: This will also be propagated')
            
cdef void super_function() except *:
    for fn in (gil_noexcept, nogil_noexcept,
               gil_except, nogil_except):
        try:
            fn()
        except Exception as e:
            print(e)

def python_fn():
    super_function()

python_fn()
At gil_except: This will be propagated
At nogil_except: This will also be propagated
Exception ignored in: '_cython_magic_c6c634898ce44ef1e9180f02f86d5f76.gil_noexcept'
ValueError: At gil_noexcept: This will not be propagated
Exception ignored in: '_cython_magic_c6c634898ce44ef1e9180f02f86d5f76.nogil_noexcept'
ValueError: At nogil_noexcept: This will also not be propagated

@raghavrv
Copy link
Member Author

So from cython docs -

If you don’t do anything special, a function declared with cdef that does not return a Python object has no way of reporting Python exceptions to its caller. If an exception is detected in such a function, a warning message is printed and the exception is ignored. If you want a C function that does not return a Python object to be able to propagate exceptions to its caller, you need to declare an exception value for it.

There is also a third form of exception value declaration:

cdef int spam() except *:

This form causes Cython to generate a call to PyErr_Occurred() after every call to spam, regardless of what value it returns. If you have a function returning void that needs to propagate errors, you will have to use this form, since there isn’t any return value to test. Otherwise there is little use for this form.

I think it's safe to assume that all cdef declarations need an except * (not except -1) since some functions return -1 and some have void as the return type...

@jnothman @glouppe @jmschrei WDYT? Can I also request a comment from other cython experts (@larsmans or @jakevdp) here please :)

@raghavrv
Copy link
Member Author

raghavrv commented Dec 21, 2016

(Manually checking which ones call functions (that call others maybe and so on) that raise an error is error-prone / could change in the future and we do not have tests for these kind of memory errors / leaks.

Which is why I'm suggesting adding "except *" for all cdef declarations as a best practice in tree code...)

@jnothman jnothman changed the title [MRG + 1] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) [MRG+2] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) Jan 18, 2017
@raghavrv
Copy link
Member Author

I was thinking of merging this and fixing it in another PR... But appveyor is not green anyhow. I'll fix it here... And thanks for pointing out, that is correct safe_realloc and except would suffice to raise MemoryErrors...

@raghavrv
Copy link
Member Author

And done... The appveyor seems to be stuck in the queue for a long time...

@raghavrv
Copy link
Member Author

(Or it is taking it's revenge on me for pushing so many times :@)

@jnothman
Copy link
Member

A bit of both. I've just cancelled a whole lot of AppVeyor runs to see if it'll come unstuck. I've not yet found one in progress (not queued, not finished, not cancelled), so I don't know why it's stuck.

@NelleV
Copy link
Member

NelleV commented Jan 18, 2017

I don' think the appveyor problem is related to scikit-learn. Matplotlib currently has the same problem.

@jnothman
Copy link
Member

Indeed even the cancelling operation has very slow throughput.

@raghavrv
Copy link
Member Author

raghavrv commented Jan 18, 2017

According to them, this build seems to be the issue... Anyway now I can see one build running...

@jnothman
Copy link
Member

But things are starting again, now.

@raghavrv
Copy link
Member Author

Thanks!! Meanwhile can you take a look at the final commit and approve. The appveyor should pass in about 30 minutes. I'll merge in an hour...

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think those duplicate errors are correctly removed.

@jnothman jnothman changed the title [MRG+2] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) [MRG+3] FIX Memory leak in MAE; Use safe_realloc; Acquire GIL only when raising; Propagate all errors to python interpreter level (#7811) Jan 18, 2017
@raghavrv raghavrv merged commit 4907029 into scikit-learn:master Jan 18, 2017
@raghavrv
Copy link
Member Author

Thanks a lot for the reviews everyone!

@raghavrv raghavrv deleted the mae_mem_leak branch January 18, 2017 23:59
@glouppe
Copy link
Contributor

glouppe commented Jan 19, 2017

Thanks for the fix!

(and sorry about not having time for a review, my bandwidth has been quite limited these days for scikit-learn, unfortunately...)

@raghavrv
Copy link
Member Author

@jnothman Do we need a bugfix whatsnew for this?

@raghavrv
Copy link
Member Author

(and sorry about not having time for a review, my bandwidth has been quite limited these days for scikit-learn, unfortunately...)

@glouppe No problem :) I'll have a new PR very soon for you ;P

sergeyf pushed a commit to sergeyf/scikit-learn that referenced this pull request Feb 28, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
@Przemo10 Przemo10 mentioned this pull request Mar 17, 2017
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
NelleV pushed a commit to NelleV/scikit-learn that referenced this pull request Aug 11, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
@Tommalla
Copy link

Not sure whether this or #8623 is the more appropriate place to write this, but I believe I have an issue with RandomForestRegressor leaking memory when used with the MAE criterion.

My setup is the following: I'm drawing some learning curves, so I'm training and evaluating the model on increasingly bigger fragments of the dataset (I divide it into 30 "chunks").

My dataset is 6640 x 7, numpy.float32.

With every new iteration of train -> predict -> replace the model variable with a new instance, I get a significant increase in memory consumption, up to the point when I run out of memory (10GB, around 9GB free for the experiment).

If I skip the learning curves and just train the model once on the whole dataset, it fits nicely and consumes around 3GB of memory. Likewise, if I change the criterion to MSE and leave the other parameters unchanged, I can run the whole program, including incremental training, and the memory usage doesn't go above 200MB.

I believe the leak happens somewhere in the predict method, as reducing the number of predict calls after the model is trained significantly reduces the memory consumption (around 800MB per predict for that dataset).

My current configuration:

>>> import platform; print(platform.platform())
Linux-4.12.4-1-ARCH-x86_64-with-arch-Arch-Linux
>>> import sys; print("Python", sys.version)
Python 3.6.2 (default, Jul 20 2017, 03:52:27) 
[GCC 7.1.1 20170630]
>>> import numpy; print("NumPy", numpy.__version__)
NumPy 1.13.1
>>> import scipy; print("SciPy", scipy.__version__)
SciPy 0.19.1
>>> import sklearn; print("Scikit-Learn", sklearn.__version__)
Scikit-Learn 0.18.2

Parameters passed to the RandomForestRegessor are: n_estimators=5, max_features='auto', criterion='mae'

@jnothman
Copy link
Member

jnothman commented Aug 15, 2017 via email

@Tommalla
Copy link

@jnothman My bad, missed the new release and thought this was included in 0.18.2. Works fine now.

Thank you!

paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
…en raising; Propagate all errors to python interpreter level (scikit-learn#7811) (scikit-learn#8002)

* FIX MAE reg. criterion: Use safe_realloc to avoid memory leak

* Release GIL in safe_realloc and clean up scaffolding

* As gil is released in safe_realloc, no need of a with gil block

* Use except * to propagate error in all cdef functions

* Don't use except * for functions that return python objects

* Don't use except * for the comparison function passed to qsort

* Omissions and Errors

* Use safe_realloc now that gil is released there

* Fix realloc size

* Acquire GIL only if we need to raise

* Use except * more judiciously; Release gil only when raising; Add comments to clarify

* Actually that was unneeded; realloc will also allocate for the first time

* StackRecord*, PriorityHeapRecord* to fused type realloc_ptr; Use safe_realloc

* Use except -1 to propagate exceptions. This should avoid overheads

* Fix docstrings and add return 0 to reset methods

* TYPO

* REVIEW Remove redundant MemoryError raising calls
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants