[MRG+2] Fix #6420 Cloning decision tree estimators breaks criterion objects #7680
Conversation
Hmm, travis-ci fails with:
I saw this error on my local machine (Python 2.7, Ubuntu 16.04), but after ...
Wait, I can reset the cache for you. Thanks for working on this. I'll review in a while.
please add a test!
     free(self.n_classes)

 def __reduce__(self):
     return (ClassificationCriterion,
             (self.n_outputs,
-             sizet_ptr_to_ndarray(self.n_classes, self.n_outputs)),
+             sizet_ptr_to_ndarray(self.n_classes, self.n_outputs).copy()),
Why do you make a copy?
Because the numpy array returned by sizet_ptr_to_ndarray(self.n_classes, self.n_outputs) doesn't own its underlying memory (I've added a comment in sizet_ptr_to_ndarray); it just holds a pointer to values stored in the C array self.n_classes. When you get this tuple from __reduce__ and the object you got it from is removed, you can't read the tuple anymore. (Actually you can, but self.n_classes gets freed, so the returned numpy array will show you garbage, because it still uses the pointer into that C array.) The copy method does a deep copy, and the resulting array owns its underlying memory.
I could alternatively just make arr = np.zeros(self.n_outputs) and copy all the values into it in a loop. But this approach with sizet_ptr_to_ndarray().copy() already existed in _tree.pyx before me, so I thought it would be more convenient to use it here too.
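For intuition, here is a minimal pure-Python illustration of the ownership distinction (the real helper uses the NumPy C API from Cython; `owner` below merely stands in for the C array self.n_classes):

import numpy as np

owner = np.arange(3, dtype=np.intp)   # owns its buffer, like the C array self.n_classes
view = owner[:]                       # non-owning view, analogous to PyArray_SimpleNewFromData
print(view.flags['OWNDATA'])          # False: `view` is only valid while `owner`'s buffer lives
safe = view.copy()                    # deep copy that owns its memory
print(safe.flags['OWNDATA'])          # True: safe to keep after `owner` is gone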
@@ -596,6 +602,12 @@ cdef class Gini(ClassificationCriterion):
         = 1 - \sum_{k=0}^{K-1} count_k ** 2
     """

+    def __reduce__(self):
Why did we not need these before? Is it not enough to use type(self) as the first member of the tuple in {Classification,Regression}Criterion?
Hmm, you're right, it's enough to use type(self) in {Classification,Regression}Criterion.
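A minimal pure-Python sketch of this point (assumed names, not the actual Cython classes): with type(self) in __reduce__, a Gini instance round-trips through copy/pickle as Gini instead of degrading to the base class.

import copy
import numpy as np

class ClassificationCriterion(object):
    def __init__(self, n_outputs, n_classes):
        self.n_outputs = n_outputs
        self.n_classes = np.asarray(n_classes)

    def __reduce__(self):
        # type(self), not a hard-coded class, preserves the concrete subclass
        return (type(self), (self.n_outputs, self.n_classes.copy()))

class Gini(ClassificationCriterion):
    pass

assert type(copy.deepcopy(Gini(1, [3]))) is Gini  # stays Gini, not the base class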
@@ -62,7 +62,10 @@ cdef inline UINT32_t our_rand_r(UINT32_t* seed) nogil:


 cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
-    """Encapsulate data into a 1D numpy array of intp's."""
+    """Encapsulate data into a 1D numpy array of intp's.
+
+    You should ensure that the provided data pointer is not freed while the returned array
It seems like everywhere this is used, it is now used with a copy. I propose we just do the copying here. WDYT? @raghavrv?
Thanks for both your great investigation and patch!
I'm not sure if we want a separate non-regression test for clone... And apart from these nitpicks and making sizet_ptr_to_ndarray do the copying, this LGTM.
@@ -1609,3 +1612,22 @@ def test_mae():
     dt_mae.fit([[3],[5],[3],[8],[5]],[6,7,3,4,3], [0.6,0.3,0.1,1.0,0.3])
     assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
     assert_array_equal(dt_mae.tree_.value.flat, [4.0, 6.0, 4.0])
+
+def test_criterion_copy():
PEP8: extra blank line, please
    n_classes = np.arange(3)
    for _, typename in CRITERIA_CLF.items():
        criteria = typename(n_outputs, n_classes)
        typename_, (n_outputs_, n_classes_), _ = copy(criteria).__reduce__()
Please check deepcopy too, just to be sure
        typename_, (n_outputs_, n_samples_), _ = copy(criteria).__reduce__()
        assert_equal(typename, typename_)
        assert_equal(n_outputs, n_outputs_)
        assert_equal(n_samples, n_samples_)
pedantic: should have newline at end of file
Force-pushed from 0b7992c to 04a5437.
@@ -65,7 +65,7 @@ cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
     """Encapsulate data into a 1D numpy array of intp's."""
     cdef np.npy_intp shape[1]
     shape[0] = <np.npy_intp> size
-    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data)
+    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data).copy()
@jnothman Thanks for suggesting this change. Indeed this is better.
@olologin Could you also remove the copy from this line and update the comment there to note that the function returns a copy and does not refer to the passed pointer...
Additionally it would be nice to change the docstring of this helper to note that we return a copy of the numpy array, rather than simply 'Encapsulate data' as it says currently...
Done.
    for copy_func in [copy.copy, copy.deepcopy]:
        for _, typename in CRITERIA_CLF.items():
            criteria = typename(n_outputs, n_classes)
            result = copy_func(criteria).__reduce__()
Rather, could you pickle and load all the Criteria?
Currently this fails in master...
import pickle
from sklearn.tree._criterion import FriedmanMSE
fmse = FriedmanMSE(1, 10)
pickle.loads(pickle.dumps(fmse))
I've added pickling to this list.
Pickle and copy could work differently in general, but in the case of the criteria they use the same __reduce__ method.
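A hedged sketch of the resulting test shape (not the merged test verbatim): copy, deepcopy and a pickle round-trip all go through the same __reduce__, so one loop can exercise all three.

import copy
import pickle
import numpy as np
from sklearn.tree._criterion import Gini

criterion = Gini(1, np.array([3], dtype=np.intp))
for round_trip in (copy.copy, copy.deepcopy,
                   lambda obj: pickle.loads(pickle.dumps(obj))):
    restored = round_trip(criterion)
    assert type(restored) is Gini  # subclass preserved; n_classes owns its memory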
 from sklearn import datasets

 from sklearn.utils import compute_sample_weight

 CLF_CRITERIONS = ("gini", "entropy")
-REG_CRITERIONS = ("mse", "mae")
+REG_CRITERIONS = ("mse", "mae", "friedman_mse")
Oh boy wasn't this tested at all ;(
LGTM, thanks!
Force-pushed from ac24db4 to bc3c530.
The build on one of the travis-ci configurations fails with:
I wonder if it's related to my changes. I didn't see this error on travis-ci after the first commits; it appeared only when I changed the new array creation to .copy(). Is it a coincidence? If not, I think I should rewrite ...
Yes, I was a little uncertain about the ...
I've cleared the cache for this PR and restarted the travis build. I don't think the current changes cause the failure.
@@ -542,13 +542,12 @@ cdef class Tree:
         reaching node i.
     """
     # Wrap for outside world.
-    # WARNING: these reference the current `nodes` and `value` buffers, which
+    # WARNING: these copy the current `nodes` and `value` buffers, which
No, they still reference. Only n_classes is copied. This comment should be left intact :)
I suspect they do :)
I pulled this PR and pushed it to my fork. Travis passes smoothly there. The restart-build option in travis doesn't clear the cache before retrying, which is why this PR is still failing after clearing its cache :) I think @olologin needs to force-push to rebuild this PR without any cache side effects...
Oh, okay. You're probably right.
Force-pushed from bc3c530 to a0fd37a.
@raghavrv mystical behaviour of Travis :), it fails with the same problems here again.
@olologin I've made a small PR (olologin#3) to your fork. Can you accept it and see if the travis failure vanishes?
@raghavrv, yep, let's try this.
Yohoo, it passes now. Now, could you delete my commit and the merge commit, and force-push to see if the cache is reset completely?
Force-pushed from cdb6133 to a895cd4.
Argh, it now reuses the old travis cache. Apologies for putting you through this... :/ As a final attempt, could you squash all commits into one and force-push? (I've cleared your cache again.)
what usually worked for me is to push trivial changes (a comment will do) to the files that weren't being recythonized, have them pass travis and get cached, then revert those changes.
Force-pushed from a895cd4 to da5a85b.
I was looking for an automated way to handle all such future failures and came up with this:
curl -Ls https://goo.gl/jhGwkV | git apply -v --index; git commit -m "NOMERGE recythonize all";
i think you need to also change that file so it gets recythonized
Force-pushed from b41f9e3 to a5fe72e.
@nelson-liu, I didn't change that file; I don't think it needs recompilation. I'll now try the approach proposed by @raghavrv.
hmm, the errors look like you should have to, but i just always recythonized all the tree files for good measure whenever i got that error.
@@ -56,6 +56,10 @@ Bug fixes
   <https://github.com/scikit-learn/scikit-learn/pull/6178>`_) by `Bertrand
   Thirion`_

+- Tree splitting criterion classes cloning/pickling are now memory safe
classes -> classes' ?
You're right, thanks!
Force-pushed from a5fe72e to 98eaad3.
Finally travis passes :D After #7708, one should be able to add ...
@jnothman please merge if happy with the changes...
Thanks again, @olologin
Reference Issue
Fixes #6420
What does this implement/fix? Explain your changes.
This fix changes the __reduce__ methods of the criterion objects, which are used for pickling/cloning. There was a memory management problem: the function sizet_ptr_to_ndarray from tree/_utils.pyx returned an np.array object, but that array didn't own its underlying memory. In some cases we call __reduce__ on a criterion object, this __reduce__ returns the info needed for pickling/cloning (for ClassificationCriterion this info includes the n_classes array), and then the Criterion object gets removed. If we read the info returned by __reduce__ after this, n_classes will be filled with garbage.
Also, I added __reduce__ methods for classes which didn't have them, because otherwise the class of the object after unpickling/cloning changes to a more general one. For example, after cloning a Gini criterion we would get a ClassificationCriterion.
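A hedged reproduction sketch of the reported issue (assumed setup, based on the issue title): clone() deep-copies a Criterion instance passed as the criterion parameter, which exercised the broken __reduce__ path before this fix.

import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree._criterion import Gini

crit = Gini(1, np.array([3], dtype=np.intp))
est = DecisionTreeClassifier(criterion=crit)
cloned = clone(est)  # pre-fix: the cloned criterion's n_classes could point at freed memory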
Any other comments?