Cloning decision tree estimators breaks criterion objects #6420

Closed
panisson opened this issue Feb 22, 2016 · 9 comments · Fixed by #7680
Labels
Bug · Moderate (Anything that requires some knowledge of conventions and best practices)

Comments

@panisson

I'm trying to implement different criteria for decision trees.
I've found that decision trees can accept a Criterion object as the criterion parameter:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/tree.py#L335
The easiest way to implement other criteria would be to subclass the tree._criterion.Criterion class.

The normal way to pass a criterion to a decision tree is by using its string name, and it works fine:

from sklearn import tree, model_selection, metrics, datasets
import numpy as np

X, y = datasets.make_classification(n_samples=1000, random_state=42)
cv = model_selection.KFold(n_splits=10, shuffle=True, random_state=43)

dtc = tree.DecisionTreeClassifier(criterion='gini', random_state=42)
print(np.mean(model_selection.cross_val_score(dtc, X, y, cv=cv)))

The mean score is 0.866.

However, if I use a Criterion object, it does not work anymore:

gini = tree._criterion.Gini(n_outputs=1, n_classes=np.array([2]))
dtc = tree.DecisionTreeClassifier(criterion=gini, random_state=42)
print(np.mean(model_selection.cross_val_score(dtc, X, y, cv=cv)))

The mean score is now 0.476.
It seems that cloning the decision tree breaks the criterion object in some way, because this code also fails:

from sklearn.base import clone

gini = tree._criterion.Gini(n_outputs=1, n_classes=np.array([2]))

dtc = tree.DecisionTreeClassifier(criterion=gini, random_state=42)
scores = []
for train_idx, test_idx in cv.split(X, y):
    estimator = clone(dtc)
    estimator.fit(X[train_idx], y[train_idx])
    scores.append(metrics.accuracy_score(y[test_idx], estimator.predict(X[test_idx])))
print(np.mean(scores))

but if I reset the criterion object of the estimator by, e.g.,

estimator.criterion = dtc.criterion

then the score values are back to normal.

I could not find where the cloning breaks the criterion object; any help would be welcome.

Thanks all for your effort on this project, sklearn is really great!

regards
André

@amueller amueller added the Bug label Oct 7, 2016
@amueller
Member

amueller commented Oct 7, 2016

ping @jmschrei @glouppe @arjoly ?

@jnothman
Member

jnothman commented Oct 8, 2016

>>> gini = tree._criterion.Gini(n_outputs=1, n_classes=np.array([2]))
>>> dtc = tree.DecisionTreeClassifier(criterion=gini, random_state=42)
>>> gini.__reduce__()
(sklearn.tree._criterion.ClassificationCriterion, (1, array([2])), {})
>>> base.clone(dtc).criterion.__reduce__()
(sklearn.tree._criterion.ClassificationCriterion, (1, array([5764616311966949241])), {})

clone relies on copy.deepcopy:

>>> copy.deepcopy(gini).__reduce__()
(sklearn.tree._criterion.ClassificationCriterion, (1, array([4314681712])), {})
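For context, here is a minimal pure-Python sketch (not scikit-learn's actual implementation; TinyEstimator and clone_sketch are hypothetical names) of why clone routes the criterion through copy.deepcopy: clone reads the constructor parameters via get_params and deep-copies each one before rebuilding the estimator, so any object passed as criterion must survive copy.deepcopy intact.

```python
# Hypothetical sketch of clone's parameter handling: every constructor
# parameter is deep-copied, then the estimator is rebuilt from the copies.
import copy

class TinyEstimator:
    def __init__(self, criterion):
        self.criterion = criterion

    def get_params(self):
        return {"criterion": self.criterion}

def clone_sketch(estimator):
    # Deep-copy each parameter so the clone shares no state with the original.
    params = {k: copy.deepcopy(v) for k, v in estimator.get_params().items()}
    return type(estimator)(**params)

original = TinyEstimator(criterion=[1, 2, 3])
cloned = clone_sketch(original)
```

If the parameter's deepcopy is broken (as it was for the Cython Criterion classes), the clone ends up holding a corrupted object even though the original is fine.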

@amueller
Member

amueller commented Oct 8, 2016

So deepcopy of criteria is broken?

@jnothman
Member

jnothman commented Oct 8, 2016

Yes, though I've not worked out why, despite spending some time tracing deepcopy.

@jnothman jnothman added Moderate Anything that requires some knowledge of conventions and best practices Need Contributor labels Oct 13, 2016
@jnothman
Member

Need Debugger label?

@amueller
Member

is need debugger the same as need contributor or as need reviewer or something else? lol

@olologin
Contributor

Looks like I found the problem: the np.array returned from __reduce__ doesn't own its underlying memory, so that memory gets freed once the ClassificationCriterion is destroyed.

copy.deepcopy(gini).__reduce__()

Here we get the __reduce__ result from the copy; the copy object is then destroyed, after which the __reduce__ result is filled with garbage.

Also, there are some other funny problems, like:

gini = tree._criterion.Gini(n_outputs=1, n_classes=np.array([2]))
gini_copy = copy.deepcopy(gini)
gini_copy.__class__
Out[1]: sklearn.tree._criterion.ClassificationCriterion

I'll try to fix this and make a PR.
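The fix pattern can be sketched in pure Python (the real Criterion classes are Cython, and a plain list stands in here for the NumPy n_classes buffer; the class names below are hypothetical): __reduce__ should return the instance's own class, so subclasses like Gini survive deepcopy, and an owning copy of the buffer, so the reconstruction arguments stay valid after the instance is destroyed.

```python
# Hypothetical sketch of the __reduce__ fix pattern for a picklable
# object that holds a buffer it owns.
import copy

class ClassificationCriterionSketch:
    def __init__(self, n_outputs, n_classes):
        self.n_outputs = n_outputs
        self.n_classes = list(n_classes)  # own our copy of the data

    def __reduce__(self):
        # 1) Return self.__class__, not a hard-coded base class, so
        #    deepcopy/pickle preserves subclasses such as Gini.
        # 2) Return an owning copy of n_classes, so the reconstruction
        #    args remain valid even after this instance is destroyed.
        return (self.__class__, (self.n_outputs, list(self.n_classes)))

class GiniSketch(ClassificationCriterionSketch):
    pass

gini = GiniSketch(n_outputs=1, n_classes=[2])
gini_copy = copy.deepcopy(gini)
```

This addresses both symptoms seen above: the copy keeps its subclass instead of degrading to the base class, and its state no longer points at freed memory.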

@olologin
Contributor

@panisson, could you check your original problem on the branch in #7680?

@panisson
Author

This branch solves my original problem. Thanks!

olologin added a commit to olologin/scikit-learn that referenced this issue Oct 19, 2016
olologin added a commit to olologin/scikit-learn that referenced this issue Oct 19, 2016
amueller added a commit to amueller/scikit-learn that referenced this issue Oct 25, 2016
amueller pushed a commit to amueller/scikit-learn that referenced this issue Nov 9, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this issue Jun 14, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this issue Jul 27, 2017
* tag '0.18.1': (144 commits)
  skip tree-test on 32bit
  do the warning test as we do it in other places.
  Replase assert_equal by assert_almost_equal in cosine test
  version bump 0.18.1
  fix merge conflict mess in whatsnew
  add the python2.6 warning to 0.18.1
  fix learning_curve test that I messed up in cherry-picking the "reentrant cv" PR.
  sync whatsnew with master
  [MRG] TST Ensure __dict__ is unmodified by predict, transform, etc (scikit-learn#7553)
  FIX scikit-learn#6420: Cloning decision tree estimators breaks criterion objects (scikit-learn#7680)
  Add whats new entry for scikit-learn#6282 (scikit-learn#7629)
  [MGR + 2] fix selectFdr bug (scikit-learn#7490)
  fixed whatsnew cherry-pick mess (somewhat)
  [MRG + 2] FIX LogisticRegressionCV to correctly handle string labels (scikit-learn#5874)
  [MRG + 2] Fixed parameter setting in SelectFromModel (scikit-learn#7764)
  [MRG+2] DOC adding separate `fit()` methods (and docstrings) for DecisionTreeClassifier and DecisionTreeRegressor (scikit-learn#7824)
  Fix docstring typo (scikit-learn#7844) n_features --> n_components
  [MRG + 1] DOC adding :user: role to whats_new (scikit-learn#7818)
  [MRG+1] label binarizer not used consistently in CalibratedClassifierCV (scikit-learn#7799)
  DOC : fix docstring of AIC/BIC in GMM
  ...
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this issue Jul 27, 2017
yarikoptic added a commit to yarikoptic/scikit-learn that referenced this issue Jul 27, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this issue Aug 19, 2017
maskani-moh pushed a commit to maskani-moh/scikit-learn that referenced this issue Nov 15, 2017
4 participants