[MRG+2] FIX adaboost estimators not randomising correctly #7411
force-pushed from 0561679 to 8e9e0c1
hm looks good.
random_state = check_random_state(random_state)
to_set = {}
for key in sorted(estimator.get_params(deep=True)):
    if key == 'random_state' or key.endswith('_random_state'):
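For context, a self-contained sketch of what a helper along these lines might look like (the name set_nested_random_states, the int32 seed bound, and everything beyond the excerpt are assumptions for illustration, not the PR's actual diff; which suffix to match is debated further down):

import numpy as np
from sklearn.utils import check_random_state

def set_nested_random_states(estimator, random_state=None):
    # Draw one integer seed per (possibly nested) random_state parameter,
    # then apply them all in a single set_params call.
    random_state = check_random_state(random_state)
    to_set = {}
    for key in sorted(estimator.get_params(deep=True)):
        if key == 'random_state' or key.endswith('_random_state'):
            to_set[key] = random_state.randint(np.iinfo(np.int32).max)
    if to_set:
        estimator.set_params(**to_set)

Something like set_nested_random_states(pipeline, random_state=rng) would then seed every component in one call.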
why do we want / need this?
Pipelines have random states too! This isn't necessarily the only way to do this, but having integer random_states in each component means that unit can be replicated.
There the application would be trying to make an estimator deterministic. That is not really related to the issue we're trying to fix here, is it?
The random ensembles currently initialise each estimator with an integer random state derived from their local random state. However, they currently only set the random_state param. They should be setting every random_state param.
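A minimal sketch of the situation being described, assuming a Pipeline wrapping an SGDClassifier as the base estimator: get_params(deep=True) exposes the nested random_state, which is missed if the ensemble only assigns a top-level random_state param.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDClassifier

pipe = Pipeline([('scale', StandardScaler()), ('sgd', SGDClassifier())])
nested = [k for k in pipe.get_params(deep=True) if k.endswith('random_state')]
print(nested)  # ['sgd__random_state'] -- the Pipeline has no top-level 'random_state' to set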
hm... true from a reproducibility perspective. The question is a bit how much of this we can / should put on the user... or on the Pipeline? The contract could also be "setting random_state on an object makes it deterministic". Then it would be the Pipeline's problem. But that might be too much magic? It would be a pretty clear contract, though.
If an object depends on something that we don't control, then that object breaks the contract and it's not our fault ;)
Too much magic and highly impractical. It means every meta-estimator or wrapper needs a random_state param and needs to manage it to do just the same as this... Here we manage things like dict iteration order invariance; you're going to need a shared helper function anyway. We already define nested parameter management and the random_state convention. Let's use it.
That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs.
That having been said, I think it might be a good idea for RandomizedSearchCV to (optionally) manage the random_state of scipy.stats RVs
That already happens.
Hm, but shouldn't this only match __random_state if we want to match the sub-estimator random state?
I thought there would be no harm in allowing something to have multiple distinct parameters ending in _random_state, but also including __random_state. No, it's not tested; yes, I can change it.
@amueller and @GaelVaroquaux, let me know if you think I should make this more conservative, i.e. to only affect the ensembles.
Pushing more conservative version. Adding what's new in case you want to include the fix in 0.18-final.
force-pushed from 7d4bca5 to 5e275f6
Besides the following minor comments, LGTM.
random_state = check_random_state(random_state)
to_set = {}
for key in sorted(estimator.get_params(deep=True)):
    if key == 'random_state' or key.endswith('_random_state'):
Shouldn't it be a double _, i.e. key.endswith('__random_state'), instead?
@amueller (I think) also questioned this. I wanted, but did not test, this to be applicable to the hypothetical case that an estimator had multiple random_states for multiple purposes. Is that silly?
I suppose I'll change it so it doesn't look wrong...
The multiple random_states case sounds like a YAGNI to me.
ensemble._make_estimator(append=False)

assert_equal(3, len(ensemble))
assert_equal(3, len(ensemble.estimators_))

assert_true(isinstance(ensemble[0], Perceptron))
assert_equal(ensemble[0].random_state, None)
assert_true(isinstance(ensemble[1].random_state, int))
For completeness you could add:
assert_true(isinstance(ensemble[2].random_state, int))
@@ -113,6 +113,11 @@ def test_iris():
        assert score > 0.9, "Failed with algorithm %s and score = %f" % \
            (alg, score)

        # Check we used multiple estimators
        assert_true(len(clf.estimators_) > 1)
Better use assert_greater(len(clf.estimators_), 1) to get more informative error messages (when using nosetests).
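Roughly the difference being pointed at (an illustration, not part of the PR's diff): on failure, assert_greater reports both operands, while assert_true only reports that the expression was False.

from sklearn.utils.testing import assert_true, assert_greater

n_found = 1
# assert_true(n_found > 1)     -> "AssertionError: False is not true"
# assert_greater(n_found, 1)   -> "AssertionError: 1 not greater than 1"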
# Check we used multiple estimators
assert_true(len(clf.estimators_) > 1)
# Check for distinct random states (see issue #7408)
assert_true(len(set(est.random_state for est in clf.estimators_)) > 1)
assert_greater(len(set(est.random_state for est in clf.estimators_)), 1)
# Check we used multiple estimators
assert_true(len(reg.estimators_) > 1)
# Check for distinct random states (see issue #7408)
assert_true(len(set(est.random_state for est in reg.estimators_)) > 1)
assert_greater
Changes made, thanks @ogrisel. Am I leaving the what's new in 0.18 or has that train passed?
force-pushed from b3cc749 to 4751362
FIX adaboost estimators not randomising correctly (fixes scikit-learn#7408); FIX ensure nested random_state is set in ensembles
rebasing and marking MRG+1 due to @ogrisel's LGTM.
force-pushed from 4751362 to 3a1ba65
# Check we used multiple estimators
assert_greater(len(clf.estimators_), 1)
# Check for distinct random states (see issue #7408)
assert_greater(len(set(est.random_state for est in clf.estimators_)), 1)
shouldn't they be equal to len(clf.estimators_) (with high probability)?
# Check we used multiple estimators
assert_true(len(reg.estimators_) > 1)
# Check for distinct random states (see issue #7408)
assert_greater(len(set(est.random_state for est in reg.estimators_)), 1)
same here
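A sketch of the stricter check being suggested, assuming integer seeds are drawn from a range wide enough that collisions are negligible:

assert_equal(len(set(est.random_state for est in clf.estimators_)),
             len(clf.estimators_))
assert_equal(len(set(est.random_state for est in reg.estimators_)),
             len(reg.estimators_))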
LGTM apart from minor nitpick in tests.
LGTM
LGTM. Waiting for a couple of comments by @amueller in the tests before merging (on the len of unique random_state). But everything is good for me. +1 for merge
Tests improved. Thanks for the reviews.
Merging. @amueller, please backport.
(Actually, now I'm wondering if I should have merged before CIs were green... I checked that those more specific tests passed locally. Please excuse my unjustified hurry.)
would have probably been better to wait, but if master fails we'll know ;)
I'll backport everything all at once, when we're ready to release, I think.
* FIX adaboost estimators not randomising correctly (fixes #7408); FIX ensure nested random_state is set in ensembles
* DOC add what's new
* Only affect *__random_state, not *_random_state for now
* TST More informative assertions for ensemble tests
* More specific testing of different random_states
Fixes #7408 , where adaboost estimators were given the same random state initialisation, resulting in poor performance.
At the same time, I realised the need for setting nested random_states. I think this should be a public utility. This PR also ensures nested random_state is set in ensembles and in common tests.

Caveats:
* Results of the affected ensembles (and of common tests, via set_random_state) will differ from previously.
* sklearn.utils.randomize_estimator is a bad name for the new utility to set nested random_state. Suggestions welcome!
* sklearn.utils.testing.set_random_state has a better name, but is subtly and valuably different from the new function: it sets random_state=0 rather than system-wide random state.
* Nested objects need to support {get,set}_params() if they are to be affected by the new utility. (If and when they do get this support, it could change the states set for other parts of the estimator by the new utility, due to parameter iteration order.) PR welcome, IMO.
* Objects with random_state but without {get,set}_params are not affected by the new utility; this is noted in its docstring.
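As a quick sanity check of the fixed behaviour (a usage sketch, not part of the PR's diff), the sub-estimators of a seeded AdaBoost should now carry distinct integer random_state values:

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

iris = load_iris()
clf = AdaBoostClassifier(n_estimators=10, random_state=0).fit(iris.data, iris.target)
states = [est.random_state for est in clf.estimators_]
print(len(states), len(set(states)))  # expect e.g. "10 10": every estimator gets its own seed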