[MRG+2] FIX adaboost estimators not randomising correctly #7411

Merged: 5 commits into scikit-learn:master on Sep 23, 2016

Conversation

@jnothman (Member) commented Sep 13, 2016

Fixes #7408, where AdaBoost estimators were given the same random state initialisation, resulting in poor performance.

At the same time, I realised the need for setting nested random_states. I think this should be a public utility. This PR also ensures nested random_state is set in ensembles and in common tests; a minimal sketch of the idea follows the caveat list below.

Caveats:

  • AdaBoost results will differ from previously, by necessity
  • Bagging results will differ from previously, unnecessarily, but this reduces code duplication and fixes an issue with nested estimators. Objections welcome.
  • testing random_states (via set_random_state) will differ from previously
  • sklearn.utils.randomize_estimator is a bad name for the new utility to set nested random_state. Suggestions welcome!
  • sklearn.utils.testing.set_random_state has a better name, but is subtly and valuably different from the new function:
    • it defaults to random_state=0 rather than system-wide random state
    • it ignores warnings
  • CV splitters should support {get,set}_params() if they are to be affected by the new utility. (If and when they do get this support, it could change the states set for other parts of the estimator by the new utility, due to parameter iteration order.) PR welcome, IMO
  • other things with random_state but without {get,set}_params are not affected by the new utility; this is noted in its docstring.
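
For concreteness, here is a minimal sketch of the kind of utility being described, using the double-underscore match the PR eventually settled on; the function name and the int32 seed range are illustrative, not the merged API:

    import numpy as np
    from sklearn.utils import check_random_state

    def set_nested_random_states(estimator, random_state=None):
        # Illustrative name and seed range: draw a distinct integer seed
        # for the top-level random_state and for every nested
        # *__random_state parameter exposed by get_params(deep=True).
        random_state = check_random_state(random_state)
        to_set = {}
        for key in sorted(estimator.get_params(deep=True)):
            if key == 'random_state' or key.endswith('__random_state'):
                to_set[key] = random_state.randint(np.iinfo(np.int32).max)
        if to_set:
            estimator.set_params(**to_set)

Iterating over sorted(...) keeps the seed-to-parameter assignment deterministic regardless of dict ordering.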

@amueller (Member):

hm looks good.
I'm a bit confused. I thought something similar recently happened in BaggingEstimator but I might have dreamt that...

    random_state = check_random_state(random_state)
    to_set = {}
    for key in sorted(estimator.get_params(deep=True)):
        if key == 'random_state' or key.endswith('_random_state'):
Member:

why do we want / need this?

Member Author:

Pipelines have random states too! This isn't necessarily the only way to do this, but having integer random_states in each component means that unit can be replicated.
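
A quick illustration, assuming a pipeline of PCA and SGDClassifier (the printed keys are indicative of the naming scheme, not an exhaustive list):

    from sklearn.pipeline import make_pipeline
    from sklearn.decomposition import PCA
    from sklearn.linear_model import SGDClassifier

    # get_params(deep=True) exposes nested random_state parameters under
    # '__'-separated paths, which is what the matching loop above picks up.
    pipe = make_pipeline(PCA(n_components=2), SGDClassifier())
    print(sorted(k for k in pipe.get_params(deep=True)
                 if k.endswith('random_state')))
    # e.g. ['pca__random_state', 'sgdclassifier__random_state']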

Member:

There the application would be trying to make an estimator deterministic. That is not really related to the issue we're trying to fix here, is it?

Member Author:

The random ensembles currently initialise each estimator with an integer random state derived from their local random state. However, currently they only use the random_state param. They should be setting every random_state param.
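
A sketch of that pattern (all names illustrative):

    import numpy as np
    from sklearn.utils import check_random_state

    # An ensemble seeded with random_state=0 draws one fresh integer seed
    # per member from its own RNG: each member is individually
    # reproducible, but members are no longer identical clones.
    rng = check_random_state(0)
    member_seeds = [rng.randint(np.iinfo(np.int32).max) for _ in range(3)]
    # each seed would then be assigned to one sub-estimator's random_state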

Member:

hm... true from a reproducibility perspective. The question is a bit how much of this can / should we put on the user.... Or on the Pipeline? The contract could also be "setting random_state on an object makes it deterministic". Then it would be the Pipelines problem. But that might be too much magic? It would be a pretty clear contract, though.

Member:

If an object depends on something that we don't control, then that object breaks the contract and it's not our fault ;)

@jnothman (Member Author) Sep 13, 2016:

Too much magic and highly impractical. It means every meta-estimator or wrapper needs a random_state param and needs to manage it to do just the same as this... Here we manage things like dict iteration order invariance; you're going to need a shared helper function anyway. We already define nested parameter management and the random_state convention. Let's use it.

That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs

Member:

> That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs

That already happens.

Hm but shouldn't this only match __random_state if we want to match the sub-estimator random state?

Member Author:

I thought there would be no harm in allowing something to have multiple distinct parameters ending in _random_state, while also matching __random_state. No, it's not tested; yes, I can change it.

@jnothman (Member Author):

@amueller and @GaelVaroquaux, let me know if you think I should make this more conservative, i.e. to only affect the ensembles.

@jnothman (Member Author):

Pushing a more conservative version. Adding a what's new entry in case you want to include the fix in 0.18-final.

@jnothman added this to the 0.18 milestone Sep 14, 2016
@ogrisel (Member) left a comment:

Besides the following minor comments, LGTM.

    random_state = check_random_state(random_state)
    to_set = {}
    for key in sorted(estimator.get_params(deep=True)):
        if key == 'random_state' or key.endswith('_random_state'):
Member:

Shouldn't it be a double _: key.endswith('__random_state') instead?

Member Author:

@amueller (I think) also questioned this. I wanted this (though did not test it) to be applicable to the hypothetical case where an estimator has multiple random_states for multiple purposes. Is that silly?

I suppose I'll change it so it doesn't look wrong...

Member:

The multiple random_states case sounds like a YAGNI to me.
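
To make the distinction concrete (the multi-seed parameter name below is hypothetical):

    # '__random_state' only matches nested parameters addressed through
    # the estimator__param convention; the looser '_random_state' suffix
    # would also catch a hypothetical top-level 'split_random_state'.
    keys = ['random_state',                  # top level, matched explicitly
            'base_estimator__random_state',  # nested, ends '__random_state'
            'split_random_state']            # hypothetical multi-seed param
    print([k for k in keys
           if k == 'random_state' or k.endswith('__random_state')])
    # ['random_state', 'base_estimator__random_state']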

    ensemble._make_estimator(append=False)

    assert_equal(3, len(ensemble))
    assert_equal(3, len(ensemble.estimators_))

    assert_true(isinstance(ensemble[0], Perceptron))
    assert_equal(ensemble[0].random_state, None)
    assert_true(isinstance(ensemble[1].random_state, int))
Member:

For completeness you could add:

    assert_true(isinstance(ensemble[2].random_state, int))

    @@ -113,6 +113,11 @@ def test_iris():
        assert score > 0.9, "Failed with algorithm %s and score = %f" % \
            (alg, score)

        # Check we used multiple estimators
        assert_true(len(clf.estimators_) > 1)
Member:

Better to use assert_greater(len(clf.estimators_), 1) to get more informative error messages (when using nosetests).
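
For example, assuming the sklearn.utils.testing helpers available at the time:

    from sklearn.utils.testing import assert_greater

    # On failure, assert_greater reports both operands (e.g. "1 not
    # greater than 1"), whereas assert_true only reports that the
    # expression evaluated falsy.
    assert_greater(len(['a', 'b']), 1)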

    # Check we used multiple estimators
    assert_true(len(clf.estimators_) > 1)
    # Check for distinct random states (see issue #7408)
    assert_true(len(set(est.random_state for est in clf.estimators_)) > 1)
Member:

    assert_greater(len(set(est.random_state for est in clf.estimators_)), 1)

    # Check we used multiple estimators
    assert_true(len(reg.estimators_) > 1)
    # Check for distinct random states (see issue #7408)
    assert_true(len(set(est.random_state for est in reg.estimators_)) > 1)
Member:

assert_greater

@jnothman (Member Author):

Changes made, thanks @ogrisel. Am I leaving the what's new in 0.18 or has that train passed?

@jnothman (Member Author):

rebasing and marking MRG+1 due to @ogrisel's LGTM.

@jnothman changed the title [MRG] FIX adaboost estimators not randomising correctly [MRG+1] FIX adaboost estimators not randomising correctly Sep 22, 2016
    # Check we used multiple estimators
    assert_greater(len(clf.estimators_), 1)
    # Check for distinct random states (see issue #7408)
    assert_greater(len(set(est.random_state
                           for est in clf.estimators_)), 1)
Member:

shouldn't they be equal to len(clf.estimators_) (with high probability)?

    # Check we used multiple estimators
    assert_true(len(reg.estimators_) > 1)
    # Check for distinct random states (see issue #7408)
    assert_greater(len(set(est.random_state for est in reg.estimators_)), 1)
Member:

same here
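
A runnable sketch of the stronger check being suggested, using the classifier case (dataset and ensemble size are illustrative):

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier

    # After the fix, every fitted sub-estimator carries its own distinct
    # integer random_state, so the set of states matches the ensemble size.
    X, y = load_iris(return_X_y=True)
    clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(X, y)
    states = [est.random_state for est in clf.estimators_]
    assert len(set(states)) == len(states)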

@amueller (Member):

LGTM apart from minor nitpick in tests.

@NelleV (Member) left a comment:

LGTM

@NelleV changed the title [MRG+1] FIX adaboost estimators not randomising correctly [MRG+2] FIX adaboost estimators not randomising correctly Sep 22, 2016
@GaelVaroquaux (Member):

LGTM.

Waiting for a couple of comments by @amueller in the tests to be addressed before merging (on the number of unique random_states). But everything looks good to me. +1 for merge.

@jnothman (Member Author):

Tests improved. Thanks for the reviews.

@jnothman (Member Author):

Merging. @amueller, please backport.

@jnothman merged commit 32d1236 into scikit-learn:master Sep 23, 2016
@jnothman (Member Author):

(Actually, now I'm wondering if I should have merged before CIs were green... I checked that those more specific tests passed locally. Please excuse my unjustified hurry.)

@amueller (Member):

would have probably been better to wait, but if master fails we'll know ;)

@amueller (Member):

I'll backport everything all at once, when we're ready to release, I think.

amueller pushed a commit that referenced this pull request Sep 25, 2016
* FIX adaboost estimators not randomising correctly

(fixes #7408)

FIX ensure nested random_state is set in ensembles

* DOC add what's new

* Only affect *__random_state, not *_random_state for now

* TST More informative assertions for ensemble tests

* More specific testing of different random_states
TomDLT pushed a commit to TomDLT/scikit-learn that referenced this pull request Oct 3, 2016
yarikoptic added commits to yarikoptic/scikit-learn that referenced this pull request Nov 10, 2016
Sundrique pushed a commit to Sundrique/scikit-learn that referenced this pull request Jun 14, 2017
paulha pushed a commit to paulha/scikit-learn that referenced this pull request Aug 19, 2017

Successfully merging this pull request may close these issues.

Bug in AdaBoostRegressor with randomstate