Prevent division by zero in GPR when y_train is constant #18388
Conversation
…ror - fix line length to pass lint test
Thanks @boricles for your pull request.
A first comment.
sklearn/gaussian_process/_gpr.py
Outdated
y = (y - self._y_train_mean) / self._y_train_std
# Moreover, add a very small number to the y_train.std to
# avoid a divide by zero error.
y = (y - self._y_train_mean) / (self._y_train_std + 1E-19)
Line 203 will always add an epsilon, even when _y_train_std is not 0. Do you mind adding an if check, since this should only be done when _y_train_std == 0?
Didn't notice this one was open, so I made my own: #18397 - closed it now. My approach was:

if self.normalize_y:
    self._y_train_mean = np.mean(y, axis=0)
    self._y_train_std = np.std(y, axis=0)
    self._y_train_std = self._y_train_std if self._y_train_std else 1

(last line added). I was wondering whether it wouldn't be better to use 1 if std is 0. If std is 0, then all the data is completely uniform, so I feel it would make more sense to just use 1. That is also the approach used in scale in sklearn.
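To illustrate why falling back to 1 is safe, a small sketch (not from this PR): with the std forced to 1, the normalization reduces to plain centering, so a constant target simply maps to zeros.

import numpy as np

y = np.full(4, 7.0)            # constant target
mean, std = y.mean(), y.std()  # std == 0.0
std = std if std else 1        # fall back to 1 for constant data
y_norm = (y - mean) / std      # array([0., 0., 0., 0.]) - pure centering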
@cmarmo Thanks for your comment. I included your suggestion. However, the codecov/patch check is not passing ("Added line #L204 was not covered by tests") ... it seems we need to add a new test? @Yard1, as I said in my first comment, I am just starting to contribute to sklearn. However, I will go for consistency, so if you are telling us "that is also the approach used in scale in sklearn, i.e., just use 1", we should probably follow the same approach here.
@boricles, yes, an if condition has been added, so you have to test that it is working as expected. This test will be useful whichever implementation you decide to go for (you might want to use the failing case in #15782). @Yard1, I prefer the "adding a very small number" solution, but I'm not the expert here, so maybe @plgreenLIRU (the author of #15782) might want to comment. Anyway, rather than setting an arbitrary small amount, the …
sklearn/gaussian_process/_gpr.py
Outdated
# Moreover, add a very small number to the y_train.std to
# avoid a divide by zero error.
if self._y_train_std.all() == 0:
Indeed, I forgot you may have multiple targets: you might want to add the epsilon only to the targets having zero std.
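To make the multi-target point concrete, a small illustrative sketch (not from this PR): y_std.all() == 0 is already True as soon as one entry is zero, so an if-branch on it cannot act per target, whereas a boolean mask touches only the zero entries.

import numpy as np

# Two targets; the second is constant, so its std is zero.
y = np.array([[1.0, 3.0],
              [2.0, 3.0],
              [3.0, 3.0]])
y_std = y.std(axis=0)    # array([0.8165..., 0.])

# y_std.all() is False when any entry is zero, so y_std.all() == 0
# fires for the whole array; a boolean mask fixes only zero entries.
y_std[y_std == 0] = 1    # array([0.8165..., 1.])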
Hi, happy to help if needed. Potentially stupid question first: why would someone try to perform GP regression on a single data point?
Thanks! To answer your question, apparently this can happen in Bayesian optimization (see references in the corresponding issue, #18318). I'm not an expert, but I think a division by zero should be prevented here anyway... :). Is adding an epsilon a distortion of your contribution?
I've not had a chance to check the issue, but setting std equal to 1 seems like a clean solution; I don't think it really matters, though. I think the important thing would be for the comments to give a bit more detail about why the if statement has been included. It might be a bit vague at the moment?
Thanks @cmarmo, @Yard1, @plgreenLIRU for your feedback and help! I will go over the latest comments. I will push the agreed changes first, and the associated test next.
Thank you @boricles for your PR!
I have a few comments and suggestions.
sklearn/gaussian_process/_gpr.py
Outdated
self._y_train_std = np.asarray(
    [std if std else 1 for std in self._y_train_std])
It is preferable to use boolean masks:

- self._y_train_std = np.asarray(
-     [std if std else 1 for std in self._y_train_std])
+ self._y_train_std[self._y_train_std == 0] = 1
sklearn/gaussian_process/_gpr.py
Outdated
# assign _y_train.std to 1 when _y_train.std is zero
# to avoid divide by zero error
Nit:

- # assign _y_train.std to 1 when _y_train.std is zero
- # to avoid divide by zero error
+ # Assign a standard deviation of one to constant
+ # targets for avoiding division by zero errors
sklearn/gaussian_process/_gpr.py
Outdated
self._y_train_std = \
    self._y_train_std if self._y_train_std else 1
Although it would work, I think that for clarity we should use:

- self._y_train_std = \
-     self._y_train_std if self._y_train_std else 1
+ self._y_train_std = (
+     self._y_train_std if self._y_train_std != 0 else 1)
…e, and the use of boolean masks
It is required to test that, for constant target(s), the private attribute has a standard deviation of 1.
…ttribute has a standard deviation of 1
@alfaro96 thanks again! I included a test.
Hi @boricles, thanks for your patience! Are you still interested in finishing this pull request? If so, do you mind synchronizing with upstream? The renaming of the main branch caused some checks to fail. Thanks!
Can I create a new PR with the exact same changes and give all the credit to @boricles, just so this can get merged?! =)
Please also document the fix in doc/whats_new/v0.24.rst, targeting 0.24.2.
    self._y_train_std[self._y_train_std == 0] = 1
else:
    self._y_train_std = (
        self._y_train_std if self._y_train_std != 0 else 1)
Could you please use _handle_zeros_in_scale as done in #19361 instead?
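For context, a sketch of how that could look inside GaussianProcessRegressor.fit, assuming the private helper _handle_zeros_in_scale(scale, copy=True) from sklearn.preprocessing._data (a private API whose exact signature may differ between versions):

import numpy as np
from sklearn.preprocessing._data import _handle_zeros_in_scale

if self.normalize_y:
    self._y_train_mean = np.mean(y, axis=0)
    # Replace any zero standard deviation with 1 so that constant
    # targets are only centered, never divided by zero.
    self._y_train_std = _handle_zeros_in_scale(np.std(y, axis=0), copy=False)
    y = (y - self._y_train_mean) / self._y_train_std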
gpr.fit(X, y)
_y_train_std = gpr._y_train_std

assert_array_equal(_y_train_std, expected_std)
Please expand the test and call predict on the fitted model as done in #19361, and give credit to @sobkevich in the changelog.
@afonari feel free to do so and address the comments above if neither @boricles nor @sobkevich is available to update their PRs.
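A sketch of what the expanded test could look like (the test name and values are illustrative, not taken from this PR):

import numpy as np
from numpy.testing import assert_allclose, assert_array_equal
from sklearn.gaussian_process import GaussianProcessRegressor

def test_constant_target():
    # A constant target used to trigger a division by zero with
    # normalize_y=True; the fitted std should fall back to 1.
    X = np.array([[0.0], [1.0], [2.0]])
    y = np.full(3, 3.0)

    gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    assert_array_equal(gpr._y_train_std, 1.0)

    # predict should return finite values equal to the constant.
    y_pred, y_std = gpr.predict(X, return_std=True)
    assert_allclose(y_pred, y)
    assert np.all(np.isfinite(y_std))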
Reference Issues/PRs
Fixes #18318
Regression in GP standard deviation where y_train.std() == 0
The normalize_y=True option now divides out the standard deviation of the y data rather than just subtracting the mean. When there is just one data point, this results in a NaN.
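For reference, a minimal reproduction sketch (illustrative; before this fix the prediction below came back as NaN):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# A single training point, so y_train.std() == 0.
X = np.array([[1.0]])
y = np.array([5.0])

gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
# Normalization divides by the (zero) standard deviation here,
# which produced NaNs before the fix.
print(gpr.predict(X, return_std=True))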
What does this implement/fix? Explain your changes.
Add a very small number to the y_train.std to avoid a divide by zero error.
Any other comments?