Fixes #11128 : Default n_estimator value should be 100 #11172

Closed
wants to merge 25 commits into from

Conversation

@Olamyy commented May 30, 2018

Fixes #11128

@Olamyy (Author) commented May 30, 2018

Upcoming commits with deprecation warning in progress.

@jnothman (Member) left a comment

Please see our documentation on deprecating default values. Changing it like this would break backwards compatibility

@jnothman (Member) left a comment

Please add a test that the warning works correctly. We have helpers like assert_warns.

@@ -157,6 +158,10 @@ def __init__(self,
self.warm_start = warm_start
self.class_weight = class_weight

if self.n_estimators == 10:
warnings.warn("'n_estimators' default value was changed to 100 in version 0.20.", DeprecationWarning)
Member

What we want is "n_estimators default value will change to 100 in version 0.20".

By convention, this warning should only be issued in fit, not __init__.
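The convention being described can be sketched with a toy class (illustrative only, not the actual scikit-learn code): `__init__` merely stores parameters, and the check plus warning happen at fit time.

```python
import warnings


class ToyForest:
    """Toy illustration of warning in fit(), not __init__()."""

    def __init__(self, n_estimators=10):
        # By scikit-learn convention, __init__ only stores parameters:
        # no validation and no warnings here.
        self.n_estimators = n_estimators

    def fit(self, X, y):
        # The deprecation check is deferred to fit time.
        if self.n_estimators == 10:
            warnings.warn("n_estimators default value will change to 100 "
                          "in version 0.20.", DeprecationWarning)
        return self
```

With this layout, simply constructing the estimator never warns; only `fit` does.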

Author

Fixed this in my latest commit.

@@ -142,6 +142,10 @@ Classifiers and regressors
several times in a row. :issue:`9234` by :user:`andrewww <andrewww>`
and :user:`Minghui Liu <minghui-liu>`.

- In both :class:`ensemble.RandomForestClassifier` and `ensemble.ExtraTreesClassifier`,
Member

This should be under API changes, rather than enhancements.

Author

Fixed this too in my latest commit.

@Olamyy (Author) commented Jun 6, 2018

@jnothman For the test, do I create a new, say, test_classifiers file and write the test function in there or how would you best advise this be done?

@jnothman (Member) commented Jun 6, 2018

add a function in sklearn/ensemble/tests/test_forest.py


@pytest.mark.parametrize('name', FOREST_ESTIMATORS)
def test_n_estimator_raises_warning(name):
check_raises_n_estimator_warning(name)
Member

There is no need to have a separate function.

Please also test that no warning is raised if the user explicitly passed a value

@@ -242,6 +243,11 @@ def fit(self, X, y, sample_weight=None):
-------
self : object
"""
# Validate hyperparameters
if self.n_estimators == 10:
warnings.warn("'n_estimators' default value will change to 100 in version 0.20", DeprecationWarning)
Member

We don't want to break user code.
Currently, with your PR, a user relying on the default value will see a change in behaviour, in particular an increased fit time.

We probably want to just warn about future change:

  • we should not change the value to 100
  • the DeprecationWarning should be a FutureWarning
  • the warning needs to say "will be changed in 0.22", instead of "has changed in 0.20"
  • the warning could also be more pedagogical, explaining this value needs to be changed manually, as suggested in the issue Change default n_estimators in RandomForest (to 100?) #11128 (comment)

Olamyy added 3 commits June 7, 2018 12:03
…20 to 0.22. Provided a simple helper on changing the default value to 100
…20 to 0.22. Provided a simple helper on changing the default value to 100
@@ -0,0 +1,118 @@

=======
Member

This doesn't belong here...

@@ -242,6 +243,10 @@ def fit(self, X, y, sample_weight=None):
-------
self : object
"""
# Validate hyperparameters
if self.n_estimators == 10:
warnings.warn("'n_estimators' default value will be changed to 100 in version 0.22. For optimal performance, you should make change this value to 100", FutureWarning)
Member

I don't think that you should have the statement about optimal performance.

Also line length must be <80

ForestClassifier = FOREST_CLASSIFIERS[name]
clf = ForestClassifier(n_estimators=100, random_state=1, max_features=1,
max_depth=1)
assert_warns(DeprecationWarning, func=clf.fit, X=iris.data, y=iris.target)
Member

shouldn't this be assert_no_warnings?

Author

Totally missed that.

I've fixed it in my latest commit.

@@ -242,6 +243,10 @@ def fit(self, X, y, sample_weight=None):
-------
self : object
"""
# Validate hyperparameters
if self.n_estimators == 10:
Member

Often we would do something like change the default to None, and treat it as "warn and use 10" to avoid warning when the user explicitly sets n_estimators=10
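The sentinel-default pattern suggested here can be sketched as a toy (names illustrative; the discussion below settles on the string 'warn' as the sentinel):

```python
import warnings


class ToyForest:
    """Toy sketch of the sentinel-default pattern for a changing default."""

    def __init__(self, n_estimators='warn'):
        # 'warn' is a sentinel meaning "the user did not set n_estimators".
        self.n_estimators = n_estimators

    def fit(self, X, y):
        n_estimators = self.n_estimators
        if n_estimators == 'warn':
            warnings.warn("The default value of n_estimators will change "
                          "from 10 to 100 in version 0.22.", FutureWarning)
            n_estimators = 10  # keep the old behaviour for now
        self.n_estimators_ = n_estimators
        return self
```

Because an explicit `ToyForest(n_estimators=10)` stores 10 rather than the sentinel, it never reaches the warning branch, so users who deliberately choose the old value are not warned.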

Member

Is this in the dev guide? I don't think so, right? Also that deprecations should be in fit is not, I think? New issue? ;)

Author

@jnothman, I'm not sure I understand you.
Are you advising I change the n_estimators value to None?

Author

@amueller I've created two issues to take care of both of these.

Member

Yes, defaulting n_estimators=None would be fine. n_estimators='warn'

@Olamyy (Author) commented Jun 26, 2018

Okay.
But I'm a bit confused now because @amueller suggested the default for n_estimators should still be 10
here.

Am I missing something, or isn't that a conflict, since you're suggesting it should be None?

@qinhanmin2014 (Member) commented

I think we already have some good PRs to serve as examples (e.g., #9397, #10331), and I think @TomDLT's comment (#11172 (comment)) has summarized what to do here.
I might be -1 on creating new rules, since the current rules seem fine (though maybe not perfect) and multiple rules would introduce unnecessary difficulties for future work.
@Olamyy Please take some time to revert unnecessary changes, otherwise it will be really difficult to review your PR.

@@ -46,7 +46,7 @@
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)

#import matplotlib.pyplot as plt
#import matlotlib.pyplot as plt
Member

?!

Member

Please revert this unrelated change

@@ -755,7 +760,7 @@ class RandomForestClassifier(ForestClassifier):

Parameters
----------
n_estimators : integer, optional (default=10)
n_estimators : integer, optional (default=100)
@amueller (Member) commented Jun 9, 2018

The default is still 10. There should be a note that it'll change in 0.22

@qinhanmin2014 (Member) commented Jun 10, 2018

@Olamyy
(1) There are still unrelated changes in doc/tutorial, please carefully check the diff in the PR.
(2) I think we've reached consensus to follow the ways in #9379, #10331, #11172 (comment), #11172 (comment), so you can have a try unless @amueller disagree.
(3) I don't like words like "deprecate" in the PR, since we're not deprecating anything.

@glemaitre (Member) commented

@Olamyy Are you still willing to fix this issue? Otherwise I will take care of it.

@Olamyy (Author) commented Jun 25, 2018

Hi @glemaitre, yes I am. Been a little busy.

I should be able to complete it tonight.

@glemaitre (Member) commented Jun 25, 2018 via email

@@ -46,7 +46,7 @@
cm = metrics.confusion_matrix(y_test, y_predicted)
print(cm)

#import matplotlib.pyplot as plt
#import matlotlib.pyplot as plt
Member

Please revert this unrelated change

@@ -44,8 +44,8 @@
# more useful.
# Fit the pipeline on the training set using grid search for the parameters

# TASK: print the cross-validated scores for the each parameters set
Member

Please revert this unrelated change

ForestClassifier = FOREST_CLASSIFIERS[name]
clf = ForestClassifier(n_estimators=10, random_state=1, max_features=1,
max_depth=1)
assert_warns(FutureWarning, func=clf.fit, X=iris.data, y=iris.target)
Member

Explicitly passing n_estimators=10 should not result in a warning.
Passing nothing for n_estimators should result in a warning.

@@ -242,6 +243,10 @@ def fit(self, X, y, sample_weight=None):
-------
self : object
"""
# Validate hyperparameters
if self.n_estimators == 10:
Member

Yes, defaulting n_estimators=None would be fine. n_estimators='warn'

@sklearn-lgtm commented

This pull request introduces 1 alert when merging e8147fc into 8083ea4 - view on LGTM.com

new alerts:

  • 1 for Unused import

Comment posted by LGTM.com

@jnothman (Member) left a comment

The default value doesn't matter so much. The default behaviour must be to warn and use 10 estimators. And the behaviour when a number is passed must be to not warn.
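The required behaviour can be sketched end to end with a toy fit function and a small helper to detect warnings (the real test suite uses sklearn's assert_warns / assert_no_warnings helpers instead):

```python
import warnings


def toy_fit(n_estimators='warn'):
    """Toy stand-in for fit(): warn and use 10 when the default sentinel
    is left in place; stay silent when any number is passed."""
    if n_estimators == 'warn':
        warnings.warn("n_estimators will change from 10 to 100 in 0.22.",
                      FutureWarning)
        n_estimators = 10
    return n_estimators


def warns(**kwargs):
    """Return True if calling toy_fit(**kwargs) raises a FutureWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        toy_fit(**kwargs)
    return any(issubclass(w.category, FutureWarning) for w in caught)
```

Usage: `warns()` is True (default sentinel left in place), while `warns(n_estimators=10)` and `warns(n_estimators=100)` are both False.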

@@ -204,6 +204,31 @@ def check_probability(name):
np.exp(clf.predict_log_proba(iris.data)))


@pytest.mark.parametrize('name', FOREST_CLASSIFIERS)
def test_warning_raised_with_deprecated_n_estimator(name):
Member

should be called "no_warning_raised".

@@ -242,6 +242,12 @@ def fit(self, X, y, sample_weight=None):
-------
self : object
"""
# Validate hyperparameters
if not self.n_estimators:
Member

should be "is None".

@@ -652,6 +652,10 @@ Linear, kernelized and related models
:class:`linear_model.LogisticRegression` when ``verbose`` is set to 0.
:issue:`10881` by :user:`Alexandre Sevin <AlexandreSev>`.

- In both :class:`ensemble.RandomForestClassifier` and `ensemble.ExtraTreesClassifier`,
Member

Missing :class: for the extra trees. Also, the message should read something like "n_estimators will be changed to 100 in version 0.22. A FutureWarning is raised when the default value is used." or similar; n_estimators=10 doesn't raise a warning (as is tested).

@jnothman (Member) left a comment

We need to have the default n_estimators=None.

We would like to have this finished before an upcoming release, as we believe it is important that we correct old wrongs.

I hope it is okay if another contributor completes this work

@qinhanmin2014 (Member) commented

Resolved in #11542, thanks @Olamyy for your work.

@Olamyy deleted the n_estimator_should_be_100 branch July 24, 2018 07:13
7 participants