FIX make creation of dataset deterministic in SGD #19716
Conversation
Could you please add a non-regression test that checks that calling SGDRegressor twice on the same data without waiting for convergence (for instance using max_iter=1) gives the same model.coef_ and model.intercept_ attributes?
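For illustration, a minimal sketch of such a non-regression test could look like the following (the dataset and the exact parameter values are made up for the example, not taken from the final PR):

    import numpy as np
    from sklearn.linear_model import SGDRegressor

    rng = np.random.RandomState(0)
    X = rng.normal(size=(100, 5))
    y = rng.normal(size=100)

    # Two fits stopped long before convergence; with a fixed random_state
    # they should still produce identical parameters.
    reg_1 = SGDRegressor(max_iter=1, tol=None, random_state=42).fit(X, y)
    reg_2 = SGDRegressor(max_iter=1, tol=None, random_state=42).fit(X, y)

    np.testing.assert_allclose(reg_1.coef_, reg_2.coef_)
    np.testing.assert_allclose(reg_1.intercept_, reg_2.intercept_)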
      cv = ShuffleSplit(test_size=validation_fraction,
    -                   random_state=seed)
    +                   random_state=rng)
I do not understand the motivation behind those changes. Passing seed directly was fine, no?
Sorry I had not read the description of the PR...
I think we need to rework the test to make it less dependent on internal details. Not sure how though...
You can probably just move the creation of the validation split (the lines that define validation_mask and validation_score_cb) to just after the call to check_random_state.
It's still a bit dependent on internal details, but the test would appear less convoluted.
That's what I explain in the PR message.
If you use the seed, the results will be different because inside the _fit_regressor method the random object is run a first time with the given seed before the use of _make_validation_split. That is why, in the test, in order to reproduce that behavior, I create rng, which is run a first time before the validation split. But I am not really convinced.
Sorry if I am not clear!
The term "run" is probably wrong; in this case it means:

    rng.randint(1, np.iinfo(np.int32).max)
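As a standalone illustration (not code from the PR) of why that extra draw matters: two identically seeded generators only stay in sync as long as they have consumed the same number of values.

    import numpy as np

    rng_a = np.random.RandomState(42)
    rng_b = np.random.RandomState(42)

    # "Run" rng_a once, as the fit method does before the validation split.
    rng_a.randint(1, np.iinfo(np.int32).max)

    # The two generators have now diverged, so the test has to mimic that
    # extra draw to reproduce the estimator's internal split.
    print(rng_a.randint(0, 10), rng_b.randint(0, 10))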
Alright, I get it now...
Still, I don't like the way this test is written. I think we should rewrite it completely, for instance by fitting a model on a random target, with and without early stopping on a validation split. The accuracy measured on (X_train, y_train) and the value of the n_iter_ attribute of the model without validation-set early stopping should be much higher than for the model with early stopping.
It would not test exactly the same thing, but it would be more maintainable.
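A hedged sketch of that alternative test (the classifier, dataset sizes and parameter values are illustrative; as noted in the reply below, the assertions are only very likely to hold, not guaranteed):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X_train = rng.normal(size=(500, 20))
    y_train = rng.randint(0, 2, size=500)  # pure-noise target

    params = dict(max_iter=1000, tol=1e-3, random_state=0)
    clf_plain = SGDClassifier(early_stopping=False, **params).fit(X_train, y_train)
    clf_es = SGDClassifier(
        early_stopping=True, validation_fraction=0.2, n_iter_no_change=5, **params
    ).fit(X_train, y_train)

    # Without early stopping the model keeps memorising the noise, so its
    # training accuracy and iteration count should be noticeably larger.
    assert clf_plain.score(X_train, y_train) > clf_es.score(X_train, y_train)
    assert clf_plain.n_iter_ > clf_es.n_iter_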
I made it such that there's no change in the model. I explained below that SGD regressors make no use of the seed attribute of the generated dataset. So I think it's better to just document that in the PR and leave this test as is. In addition, the test you propose is not guaranteed, only quite probable.
You will also need to document the fix in the changelog.
@PierreAttard gentle reminder that there are still some comments to address in this PR ;)
Yes, indeed! I'll try to continue the work next week!
Thanks a lot @glemaitre. I didn't have much time. I'll check as soon as possible.
I see that the CIs are failing. It should be linked to the fact that we changed the way the random state is used. @ogrisel WDYT?
Hi @glemaitre, I launched the test. Did you notice that too? Sorry again for the delay.
I solved the conflicts in this PR and made the test deterministic by creating a deterministic dataset, indeed.
LGTM
otherwise LGTM
Please merge with main.
Thank you for working on this fix, @PierreAttard.
Here are some suggestions, a few of which might be resolved by merging main into this PR branch.
Hello @PierreAttard, do you still have some time to work on this PR?
Signed-off-by: jeremie du boisberranger <jeremiedbb@yahoo.fr>
Turns out that SGD makes no use of the seed of the generated dataset. Hence, passing the random_state to make_dataset has no impact on the fitted model. I updated the doc of make_dataset accordingly.
I added a new non-regression test and I confirm that there was no non-determinism bug.
I closed #19603 because there is no bug in the end but +1 for merging this PR to make the code less surprising.
Note, I have checked that the new test passes with
LGTM. Thank you @PierreAttard and @jeremiedbb.
The failure should be unrelated to the PR (see #23014) but it's not triggered in other PRs. I'm starting it again to see if it's deterministic.
That random failure was really weird...
Thanks @PierreAttard!
FIX make creation of dataset deterministic in SGD (#19716)
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Jérémie du Boisberranger <34657725+jeremiedbb@users.noreply.github.com>
Reference Issues/PRs
Fixes #19603
What does this implement/fix? Explain your changes.
check_random_state is moved up before make_dataset so that the same random state is used during the whole process. This change occurs in the fit method of the SGDRegressor and SparseSGDRegressor classes.
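As a rough, self-contained mock-up of the new ordering (the helper bodies below are stand-ins, not the real scikit-learn internals; only the names mirror the code discussed in this thread):

    import numpy as np
    from sklearn.utils import check_random_state

    def make_dataset_stub(random_state):
        # The real make_dataset draws one integer seed from random_state.
        return random_state.randint(1, np.iinfo(np.int32).max)

    def fit_path(seed):
        random_state = check_random_state(seed)         # checked once, up front
        dataset_seed = make_dataset_stub(random_state)  # consumes one draw
        validation_draw = random_state.randint(0, 100)  # stands in for the split
        return dataset_seed, validation_draw

    # Same seed in, same sequence of draws out: the whole path is deterministic.
    assert fit_path(0) == fit_path(0)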
Any other comments?
I had to adapt the existing unit test test_validation_set_not_used_for_training in test_sgd.py. With this change, the random state is "used" one time before the following line:

    validation_mask = self._make_validation_split(y)

So, I reproduce this effect in the unit test with these changes:

    rng = np.random.RandomState(seed)
    rng.randint(1, np.iinfo(np.int32).max)
    cv = ShuffleSplit(test_size=validation_fraction, random_state=rng)

I am not totally satisfied, what do you think?
If you're OK, I will add an entry to the what's new doc and a non-regression test to check that the model is fully deterministic when seeding the random state, as suggested by @ogrisel.
Thanks in advance for your responses!