Skip to content

FIX remove max_samples in RandomTreesEmbedding #15693

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Nov 26, 2019

Conversation

glemaitre
Copy link
Member

Reference Issues/PRs

closes #15670

What does this implement/fix? Explain your changes.

Remove the parameter max_samples since RandomTreesEmbedding does not allow to turn bootstrap=True. It should be ported in 0.22.

Any other comments?

@glemaitre glemaitre added this to the 0.22 milestone Nov 21, 2019
@adrinjalali
Copy link
Member

I think the CI fails cause:

/home/circleci/project/sklearn/ensemble/_forest.py:docstring of sklearn.ensemble.ExtraTreesRegressor:136: WARNING: Bullet list ends without a blank line;

otherwise LGTM.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough but actually I am not sure what this class should do. Shouldn't we make it possible to enable bootstrapping?

Also by reading the code I saw that we are using a uniformly random y internally, but the same target y for all the trees in the ensemble. Wouldn't it make more sense to use independently resampled ys for each tree to get a less redundant (correlated) while still neighbor preserving, high dimensional sparse representation of the original data?

Any opinion @glouppe or others?

@ogrisel
Copy link
Member

ogrisel commented Nov 22, 2019

Actually because max_feature=1 is hard-coded, maybe we get independent trees even with a shared fixed random y because the mse criterion is never ever used to compare the relative quality of 2 candidate splits? For each split in each tree we sample 1 feature and then 1 sample on that node at random and y is actually never used.

The ccp_alpha and pruning is probably meaning less for this estimator though. Since it wasn't released yet (flagged as 0.22) maybe we should remove it prior the final release.

@ogrisel
Copy link
Member

ogrisel commented Nov 22, 2019

So bootstrapping and max_samples are actually useless.

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unable to understand the failure from Circle CI, otherwise LGTM

:class:`ensemble.ExtraTreesRegressor`,
:class:`ensemble.RandomTreesEmbedding`. :pr:`14682` by
:user:`Matt Hancock <notmatthancock>` and
:class:`ensemble.ExtraTreesRegressor`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove ForestClassifier and ForestRegressor

- If int, then draw `max_samples` samples.
- If float, then draw `max_samples * X.shape[0]` samples. Thus,
`max_samples` should be in the interval `(0, 1)`.

.. versionadded:: 0.22
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also need to remove this

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from Hammond's comment lgtm

Copy link
Member

@qinhanmin2014 qinhanmin2014 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will merge when green

@qinhanmin2014 qinhanmin2014 merged commit 6419f65 into scikit-learn:master Nov 26, 2019
@NicolasHug
Copy link
Member

Don't we need a deprecation? We may be break users' code

@adrinjalali
Copy link
Member

I think this was added in 0.22, so removing it is safe

adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 26, 2019
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

RandomTreesEmbedding has a max_samples parameter but no bootstrap
6 participants