FIX make creation of dataset deterministic in SGD #19716
Conversation
Could you please add a non-regression test that checks that calling SGDRegressor twice on the same data without waiting for convergence (for instance using max_iter=1) gives the same model.coef_ and model.intercept_ attributes?
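For illustration, a minimal sketch of such a non-regression test could look like the following (the dataset and the exact parameter values are made up for the example, not taken from the final PR):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 5))
    y = rng.normal(size=100)

    # Two fits stopped long before convergence; with a fixed random_state
    # they should still produce identical parameters.
    reg_1 = SGDRegressor(max_iter=1, tol=None, random_state=42).fit(X, y)
    reg_2 = SGDRegressor(max_iter=1, tol=None, random_state=42).fit(X, y)

    np.testing.assert_allclose(reg_1.coef_, reg_2.coef_)
    np.testing.assert_allclose(reg_1.intercept_, reg_2.intercept_)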
      cv = ShuffleSplit(test_size=validation_fraction,
    -                   random_state=seed)
    +                   random_state=rng)
I do not understand the motivation behind those changes. Passing seed directly was fine, no?
Sorry I had not read the description of the PR...
I think we need to rework the test to make it less dependent on internal details. Not sure how though...
You can probably just move the creation of the validation split (the lines that define validation_mask and validation_score_cb) to just after the call to check_random_state.
It's still a bit dependent on internal details, but the test would appear less convoluted.
That's what I explain in the PR message.
If you use the seed, the results will be different because inside the _fit_regressor method the random object is run a first time with the given seed before the use of _make_validation_split. That is why, in the test, in order to reproduce that behavior, I create rng, which is run a first time before the validation split. But I am not really convinced.
Sorry if I am not clear!
The term "run" is probably wrong; in this case it means:

    rng.randint(1, np.iinfo(np.int32).max)
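As a standalone illustration (not code from the PR) of why that extra draw matters: two identically seeded generators only stay in sync as long as they have consumed the same number of values.

    import numpy as np

    rng_a = np.random.RandomState(42)
    rng_b = np.random.RandomState(42)

    # "Run" rng_a once, as the fit method does before the validation split.
    rng_a.randint(1, np.iinfo(np.int32).max)

    # The two generators have now diverged, so the test has to mimic that
    # extra draw to reproduce the estimator's internal split.
    print(rng_a.randint(0, 10), rng_b.randint(0, 10))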
Alright, I get it now...
Still, I don't like the way this test is written. I think we should rewrite it completely, for instance by fitting a model on a random target, with and without early stopping on a validation split. The accuracy measured on (X_train, y_train) and the value of the n_iter_ attribute of the model without validation-set early stopping should be much higher than for the model with early stopping.
It would not test exactly the same thing, but it would be more maintainable.
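A hedged sketch of that alternative test (the classifier, dataset sizes and parameter values are illustrative; as noted in the reply below, the assertions are only very likely to hold, not guaranteed):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X_train = rng.normal(size=(500, 20))
    y_train = rng.randint(0, 2, size=500)  # pure-noise target

    params = dict(max_iter=1000, tol=1e-3, random_state=0)
    clf_plain = SGDClassifier(early_stopping=False, **params).fit(X_train, y_train)
    clf_es = SGDClassifier(
        early_stopping=True, validation_fraction=0.2, n_iter_no_change=5, **params
    ).fit(X_train, y_train)

    # Without early stopping the model keeps memorising the noise, so its
    # training accuracy and iteration count should be noticeably larger.
    assert clf_plain.score(X_train, y_train) > clf_es.score(X_train, y_train)
    assert clf_plain.n_iter_ > clf_es.n_iter_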
I made it such that there's no change in the model. I explained below that SGD regressors make no use of the seed attribute of the generated dataset. So I think it's better to just document that in the PR and leave this test as is. In addition, the test you propose is not guaranteed, only quite probable.
You will also need to document the fix in the changelog.
@PierreAttard gentle reminder that there are still some comments to address in this PR ;)
Yes, indeed! I'll try to continue the work next week!
Thanks a lot @glemaitre. I didn't have much time. I'll check as soon as possible.
I see that the CIs are failing. It should be linked to the fact that we changed the way the random state is used. @ogrisel WDYT?
Hi @glemaitre, I launched the test. Did you notice that too? Sorry again for the delay.
I solved the conflicts in this PR and made the test deterministic by creating a deterministic dataset, indeed.
LGTM
otherwise LGTM
Please merge with main.
Thank you for working on this fix, @PierreAttard.
Here are some suggestions, a few of which might be resolved by merging main into this PR branch.
Hello @PierreAttard, do you still have some time to work on this PR?
Signed-off-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Turns out that SGD makes no use of the seed of the generated dataset. Hence, passing the random_state to make_dataset has no impact on the fitted model. I updated the doc of make_dataset accordingly.
I added a new non-regression test and I confirm that there was no non-determinism bug.
I closed #19603 because there is no bug in the end but +1 for merging this PR to make the code less surprising.
Note, I have checked that the new test passes with
LGTM. Thank you @PierreAttard and @jeremiedbb.
The failure should be unrelated to the PR (see #23014) but it's not triggered in other PRs. I'm starting it again to see if it's deterministic.
That random failure was really weird...
Thanks @PierreAttard!
FIX make creation of dataset deterministic in SGD (#19716)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Reference Issues/PRs
Fixes #19603
What does this implement/fix? Explain your changes.
check_random_state is moved up before make_dataset so that the same random state is used during the whole process. This change occurs in the fit method of the SGDRegressor and SparseSGDRegressor classes.
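As a rough, self-contained mock-up of the new ordering (the helper bodies below are stand-ins, not the real scikit-learn internals; only the names mirror the code discussed in this thread):

    import numpy as np
    from sklearn.utils import check_random_state

    def make_dataset_stub(random_state):
        # The real make_dataset draws one integer seed from random_state.
        return random_state.randint(1, np.iinfo(np.int32).max)

    def fit_path(seed):
        random_state = check_random_state(seed)         # checked once, up front
        dataset_seed = make_dataset_stub(random_state)  # consumes one draw
        validation_draw = random_state.randint(0, 100)  # stands in for the split
        return dataset_seed, validation_draw

    # Same seed in, same sequence of draws out: the whole path is deterministic.
    assert fit_path(0) == fit_path(0)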
Any other comments?
I had to adapt the existing unit test test_validation_set_not_used_for_training in test_sgd.py. With this change, the random state is "used" one time before the following line:

    validation_mask = self._make_validation_split(y)

So, I reproduce this effect in the unit test with these changes:

    rng = np.random.RandomState(seed)
    rng.randint(1, np.iinfo(np.int32).max)
    cv = ShuffleSplit(test_size=validation_fraction, random_state=rng)

I am not totally satisfied, what do you think?
If you're OK, I will add an entry to the what's new doc and a non-regression test to check that the model is fully deterministic when seeding the random state, as suggested by @ogrisel.
Thanks in advance for your responses!