[MRG+2] Fix #6420 Cloning decision tree estimators breaks criterion objects #7680


Merged: 1 commit into scikit-learn:master on Oct 19, 2016

Conversation


@olologin olologin commented Oct 16, 2016

Reference Issue

Fixes #6420

What does this implement/fix? Explain your changes.

This fix changes the __reduce__ methods of the criterion objects, which are used for pickling/cloning.

There was a memory management problem: the function sizet_ptr_to_ndarray from tree/_utils.pyx returned an np.array object that did not own its underlying memory. In some cases we call __reduce__ on a criterion object; this __reduce__ returns the info needed for pickling/cloning, including the n_classes array (for ClassificationCriterion), and then the Criterion object gets deallocated. If we read the info returned by __reduce__ after that, n_classes is filled with garbage.

from sklearn.tree import tree
import copy
import numpy as np

gini = tree._criterion.ClassificationCriterion(n_outputs=1, n_classes=np.array([2]))
gini.__reduce__()
copy.deepcopy(gini).__reduce__()
    Out[1]: (sklearn.tree._criterion.ClassificationCriterion, (1, array([4314681712])), {})

I also added __reduce__ methods to the classes that didn't have them, because otherwise the class of the object changes to a more general one after unpickling/cloning. For example, after cloning a Gini criterion we would get a ClassificationCriterion.

from sklearn.tree import tree
import copy
import numpy as np

gini = tree._criterion.Gini(n_outputs=1, n_classes=np.array([2]))
gini_copy = copy.deepcopy(gini)
gini_copy.__class__
    Out[1]: sklearn.tree._criterion.ClassificationCriterion

Any other comments?


@olologin

olologin commented Oct 16, 2016

@amueller, @jnothman

Hmm, travis-ci fails with:

ValueError: sklearn.tree._criterion.Criterion has the wrong size, try recompiling. Expected 160, got 152
ImportError: cannot import name _tree

I saw this error on my local machine (Python 2.7, Ubuntu 16.04), but it disappeared after make clean. Can we do something similar on the server? It looks like a cache-related problem to me.

@nelson-liu

@olologin, yup, that's an issue with the Travis cache. Maybe time to look into #7094 again?

@olologin olologin changed the title [WIP] Fix #6420 Cloning decision tree estimators breaks criterion objects [MRG] Fix #6420 Cloning decision tree estimators breaks criterion objects Oct 17, 2016
@raghavrv

Wait, I can reset the cache for you. Thanks for working on this; I'll review in a while.

@amueller amueller added this to the 0.18.1 milestone Oct 17, 2016
@amueller

Maybe @jmschrei also wants to check it out, or @glouppe ?

@amueller

please add a test!

        free(self.n_classes)

    def __reduce__(self):
        return (ClassificationCriterion,
                (self.n_outputs,
-                sizet_ptr_to_ndarray(self.n_classes, self.n_outputs)),
+                sizet_ptr_to_ndarray(self.n_classes, self.n_outputs).copy()),

Why do you make a copy?

@olologin olologin Oct 18, 2016

Because the numpy array returned by

sizet_ptr_to_ndarray(self.n_classes, self.n_outputs)

doesn't own its underlying memory (I've added a comment in sizet_ptr_to_ndarray); it just holds a pointer to the values stored in the C array self.n_classes. If you get this tuple from __reduce__ and the object you got it from is then deallocated, you can't read the tuple anymore (you can, actually, but self.n_classes has been freed, so the returned numpy array shows garbage, because it still points into that C array). The copy method makes a deep copy, and the resulting array owns its memory.

I could alternatively just create arr = np.zeros(self.n_outputs) and copy the values into it in a loop, but the sizet_ptr_to_ndarray().copy() approach already existed in _tree.pyx before my change, so I thought it would be more consistent to use it here too.
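The ownership issue above can be sketched in pure Python/NumPy (a toy illustration, not the actual Cython code): a view into another buffer reflects whatever later happens to that buffer, while a copy() is decoupled. Here an in-place overwrite of the base array stands in for the C array being freed:

```python
import numpy as np

base = np.array([2, 3], dtype=np.intp)  # stands in for the C array self.n_classes
view = base[:]                          # non-owning view, like sizet_ptr_to_ndarray's result
owned = view.copy()                     # deep copy that owns its own memory

base[:] = 0                             # simulate the buffer being freed/overwritten

print(view.tolist())   # [0, 0] -- the view shows the clobbered buffer
print(owned.tolist())  # [2, 3] -- the copy is unaffected
```

The `.flags.owndata` attribute distinguishes the two cases: it is False for the view and True for the copy.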

@@ -596,6 +602,12 @@ cdef class Gini(ClassificationCriterion):
= 1 - \sum_{k=0}^{K-1} count_k ** 2
"""

def __reduce__(self):

Why did we not need these before? Is it not enough to use type(self) as the first member of the tuple in {Classification,Regression}Criterion?


Hmm, you're right, it's enough to use type(self) in {Classification,Regression}Criterion.

@@ -62,7 +62,10 @@ cdef inline UINT32_t our_rand_r(UINT32_t* seed) nogil:


 cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
-    """Encapsulate data into a 1D numpy array of intp's."""
+    """Encapsulate data into a 1D numpy array of intp's.
+
+    You should ensure that the provided data pointer is not freed while the returned array

It seems like everywhere this is used it is now used with a copy. I propose we just do the copying here. WDYT? @raghavrv?

@jnothman

Thanks for both your great investigation and patch!

@jnothman jnothman left a comment

I'm not sure if we want a separate non-regression test for clone... Apart from these nitpicks and making sizet_ptr_to_ndarray do the copying, this LGTM.

@@ -1609,3 +1612,22 @@ def test_mae():
dt_mae.fit([[3],[5],[3],[8],[5]],[6,7,3,4,3], [0.6,0.3,0.1,1.0,0.3])
assert_array_equal(dt_mae.tree_.impurity, [7.0/2.3, 3.0/0.7, 4.0/1.6])
assert_array_equal(dt_mae.tree_.value.flat, [4.0, 6.0, 4.0])

def test_criterion_copy():

PEP8: extra blank line, please

n_classes = np.arange(3)
for _, typename in CRITERIA_CLF.items():
criteria = typename(n_outputs, n_classes)
typename_, (n_outputs_, n_classes_), _ = copy(criteria).__reduce__()

Please check deepcopy too, just to be sure

typename_, (n_outputs_, n_samples_), _ = copy(criteria).__reduce__()
assert_equal(typename, typename_)
assert_equal(n_outputs, n_outputs_)
assert_equal(n_samples, n_samples_)

pedantic: should have newline at end of file

@olologin olologin force-pushed the 6420_dt_criterion branch 4 times, most recently from 0b7992c to 04a5437 Compare October 18, 2016 23:46
@@ -65,7 +65,7 @@ cdef inline np.ndarray sizet_ptr_to_ndarray(SIZE_t* data, SIZE_t size):
     """Encapsulate data into a 1D numpy array of intp's."""
     cdef np.npy_intp shape[1]
     shape[0] = <np.npy_intp> size
-    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data)
+    return np.PyArray_SimpleNewFromData(1, shape, np.NPY_INTP, data).copy()

@jnothman Thanks for suggesting this change. Indeed this is better.

@olologin Could you also remove the copy from this line and update the comment there to note that the function returns a copy and does not refer to the passed pointers...

Additionally, it would be nice to change the docstring of this helper to note that we return a copy of the data as a numpy array, not simply 'Encapsulate data' as it currently says...


Done.

for copy_func in [copy.copy, copy.deepcopy]:
for _, typename in CRITERIA_CLF.items():
criteria = typename(n_outputs, n_classes)
result = copy_func(criteria).__reduce__()

Rather, could you pickle and load all the Criteria?

Currently this fails in master...

import pickle

from sklearn.tree._criterion import FriedmanMSE

fmse = FriedmanMSE(1, 10)
pickle.loads(pickle.dumps(fmse))


I've added pickling to this list. Pickle and copy can work differently, but for the criteria in particular they use the same __reduce__ method.

from sklearn import datasets

from sklearn.utils import compute_sample_weight

CLF_CRITERIONS = ("gini", "entropy")
-REG_CRITERIONS = ("mse", "mae")
+REG_CRITERIONS = ("mse", "mae", "friedman_mse")

Oh boy wasn't this tested at all ;(

@jnothman

LGTM, thanks!

@jnothman jnothman changed the title [MRG] Fix #6420 Cloning decision tree estimators breaks criterion objects [MRG+1] Fix #6420 Cloning decision tree estimators breaks criterion objects Oct 19, 2016
@olologin olologin force-pushed the 6420_dt_criterion branch 2 times, most recently from ac24db4 to bc3c530 Compare October 19, 2016 07:36
@olologin

olologin commented Oct 19, 2016

Build on one of travis-ci configurations fails with:

build_tools/travis/test_script.sh: line 23: 4113 Segmentation fault (core dumped) nosetests -s --with-timer --timer-top-n 20 sklearn

I wonder if it's related to my changes. I didn't see this error on travis-ci after the first commits; it appeared only when I changed the new-array creation to .copy(). Is it a coincidence? If not, I think I should rewrite sizet_ptr_to_ndarray so that it just copies all values from the C array into a freshly created ndarray, avoiding the fancy PyArray_SimpleNewFromData method.

@jnothman

Yes, I was a little uncertain about the .copy approach. Perhaps it's not so reliable, despite my LGTM. Use np.frombuffer?
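As a side note on that suggestion, a pure-Python sketch (a toy bytes buffer standing in for the C SIZE_t* pointer): np.frombuffer also returns a non-owning view, so an explicit copy() is still what produces a safe, owning array:

```python
import numpy as np

buf = np.arange(3, dtype=np.intp).tobytes()  # stand-in for a raw C buffer
view = np.frombuffer(buf, dtype=np.intp)     # zero-copy, read-only view into buf
arr = view.copy()                            # independent array that owns its data

print(view.flags.owndata)  # False
print(arr.flags.owndata)   # True
print(arr.tolist())        # [0, 1, 2]
```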

@raghavrv raghavrv left a comment

I've cleared the cache for this PR and restarted the travis build. I don't think the current changes cause the failure.

@@ -542,13 +542,12 @@ cdef class Tree:
            reaching node i.
         """
         # Wrap for outside world.
-        # WARNING: these reference the current `nodes` and `value` buffers, which
+        # WARNING: these copy the current `nodes` and `value` buffers, which

No, they still reference. Only n_classes is copied. This comment should be left intact :)

@jnothman

I don't think the current changes cause the failure.

I suspect they do :)

@raghavrv

I pulled this PR and pushed it to my fork; Travis passes smoothly there. The restart-build option in Travis doesn't clear the cache before retrying, which is why this PR is still failing after I cleared its cache :)

I think @olologin needs to force push to rebuild this PR without any cache side effects...

@jnothman

Oh, okay. You're probably right.


@olologin

@raghavrv mystical behaviour of Travis :), it fails with the same problems here again.

@raghavrv

raghavrv commented Oct 19, 2016

@olologin I've made a small PR (olologin#3) to your fork. Can you accept and see if the travis failure vanishes?

@olologin

@raghavrv, yep, let's try this.

@raghavrv

Yohoo, it passes now. Now, could you delete my commit and the merge commit, then force push to see if the cache is reset completely?

@olologin olologin force-pushed the 6420_dt_criterion branch 2 times, most recently from cdb6133 to a895cd4 Compare October 19, 2016 20:21
@raghavrv

raghavrv commented Oct 19, 2016

Argh, it now reuses the old Travis cache. Apologies for putting you through this... :/ As a final attempt, could you squash all commits into one and force push? (I've cleared your cache again.)

@nelson-liu

What usually worked for me is to push trivial changes (a comment will do) to the files that weren't being recythonized, let Travis pass and cache them, then revert those changes.

@raghavrv

I was looking for an automated way for all such future failures and came up with this

curl -Ls https://goo.gl/jhGwkV | git apply -v --index; git commit -m "NOMERGE recythonize all";

@nelson-liu

I think you also need to recythonize _splitter, no?

@olologin

@nelson-liu, I didn't change that file, so I don't think it needs recompilation. I'll try the approach proposed by @raghavrv now.

@nelson-liu

Hmm, the errors look like you should have to, but I always just recythonized all the tree files whenever I got the error, for good measure.

@@ -56,6 +56,10 @@ Bug fixes
<https://github.com/scikit-learn/scikit-learn/pull/6178>`_) by `Bertrand
Thirion`_

- Tree splitting criterion classes cloning/pickling are now memory safe

classes -> classes'?


You're right, thanks!

@raghavrv

Finally travis passes :D After #7708, one should be able to add [ci recythonize] and force a recythonizing of all the cython sources. That should save all of us some trouble in the future...

@raghavrv raghavrv changed the title [MRG+1] Fix #6420 Cloning decision tree estimators breaks criterion objects [MRG+2] Fix #6420 Cloning decision tree estimators breaks criterion objects Oct 19, 2016
@raghavrv

@jnothman please merge if happy with the changes...

@jnothman

Thanks again, @olologin

@jnothman jnothman merged commit 74e4c42 into scikit-learn:master Oct 19, 2016
amueller added a commit to amueller/scikit-learn that referenced this pull request Oct 25, 2016
amueller pushed a commit to amueller/scikit-learn that referenced this pull request Nov 9, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
yarikoptic added two further commits to yarikoptic/scikit-learn that referenced this pull request Jul 27, 2017 (same 144-commit list as above).
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this pull request Nov 15, 2017
Successfully merging this pull request may close these issues.

Cloning decision tree estimators breaks criterion objects
5 participants