Prevent division by zero in GPR when y_train is constant #18388

Closed

Conversation

@boricles (Contributor)

Reference Issues/PRs

Fixes #18318
Regression in GP standard deviation where y_train.std() == 0
The normalize_y=True option now divides out the standard deviation of the y data instead of just subtracting the mean. When there is only one data point, y_train.std() is zero and the normalization produces NaN.

What does this implement/fix? Explain your changes.

Add a very small number to y_train.std() to avoid a division-by-zero error.
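
For context, a minimal sketch that reproduces the failure when run against a scikit-learn version without the fix (illustrative only; the single-point setup and default kernel are assumptions, not code from this PR):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    X = np.array([[1.0]])
    y = np.array([2.0])

    # With a single training point, y_train.std() is 0; before this fix,
    # normalize_y=True divides by that 0 and the prediction becomes NaN.
    gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = gpr.predict(X, return_std=True)
    print(mean, std)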

Any other comments?

@cmarmo cmarmo left a comment (Contributor)

Thanks @boricles for your pull request.
A first comment:

y = (y - self._y_train_mean) / self._y_train_std
# Moreover, add a very small number to the y_train.std to
# avoid a divide by zero error.
y = (y - self._y_train_mean) / (self._y_train_std + 1E-19)

Line 203 will always add an epsilon, even when _y_train_std is not 0.
Do you mind adding an if check, so that this is done only when _y_train_std == 0?

@Yard1 commented Sep 14, 2020

Didn't notice this one was open, so I made my own (#18397); I closed it now. My approach was:

        if self.normalize_y:
            self._y_train_mean = np.mean(y, axis=0)
            self._y_train_std = np.std(y, axis=0)
            self._y_train_std = self._y_train_std if self._y_train_std else 1

(last line added)

I was wondering whether it wouldn't be better to use 1 when std is 0. If std is 0, all the data is completely uniform, so I feel it would make more sense to just use 1. That is also the approach used in scale in sklearn, so it would be consistent. Furthermore, because the value is saved (not just used once in the division) and equal to 1, it won't cause any issues when reversing the normalization, which happens later on. Thoughts?
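
For reference, a quick sketch of the behaviour in sklearn's scale that this would be consistent with (illustrative, not code from this PR): a constant column has zero standard deviation, and scale maps that zero to 1 internally, so the result is just mean-centered.

    import numpy as np
    from sklearn.preprocessing import scale

    y = np.array([[3.0], [3.0], [3.0]])
    # The zero std is replaced by 1 internally, so no NaN is produced:
    print(scale(y))  # [[0.] [0.] [0.]]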

@boricles (Contributor, Author)

@cmarmo Thanks for your comment. I included your suggestion. However, the codecov/patch check is not passing ("Added line #L204 was not covered by tests"); it seems we need to add a new test?

@Yard1, as I said in my first comment, I am just starting to contribute to sklearn. However, I will go for consistency, so if using 1 is also the approach taken in scale in sklearn, we should probably follow the same approach here.

@cmarmo (Contributor) commented Sep 16, 2020

... seems to be we need to add a new test?

@boricles, yes, an if condition has been added, so you have to test that it works as expected. This test will be useful whichever implementation you decide to go for (you might want to use the failing case from the comment in #15782).

@Yard1 I prefer the "adding a very small number" solution, but I'm not the expert here so maybe @plgreenLIRU (the author of #15782) might want to comment here.

Anyway, rather than setting an arbitrary small amount, numpy.finfo(dtype).eps would be useful here. It is often used in sklearn to define small quantities in close-to-zero issues, e.g.:

EPSILON = np.finfo(np.float32).eps
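
For context, a quick sketch of what those machine epsilons evaluate to (standard IEEE-754 values):

    import numpy as np

    # eps is the gap between 1.0 and the next larger representable float
    print(np.finfo(np.float32).eps)  # ~1.19e-07
    print(np.finfo(np.float64).eps)  # ~2.22e-16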


# Moreover, add a very small number to the y_train.std to
# avoid a divide by zero error.
if self._y_train_std.all() == 0:

Indeed, I forgot you may have multiple targets: you might want to add the epsilon only to the targets having zero std.

@plgreenLIRU (Contributor)

Hi, happy to help if needed. Potentially stupid question first: why would someone try to perform GP regression on a single data point?

@cmarmo (Contributor) commented Sep 17, 2020

Hi, happy to help if needed. Potentially stupid question first: why would someone try to perform GP regression on a single data point?

Thanks! To answer your question, apparently this could happen in Bayesian optimization (see the references in the corresponding issue #18318). I'm not an expert, but I think a division by zero should be prevented here anyway... :). Is adding an epsilon a distortion of your contribution?

@plgreenLIRU (Contributor)

I've not had a chance to check the issue, but setting std equal to 1 seems like a clean solution; I don't think it really matters, though. The important thing would be for the comments to give a bit more detail about why the if statement has been included; it might be a bit vague at the moment.

@cmarmo (Contributor) commented Sep 18, 2020

@boricles , I'm happy with y_std=1 then, as long as you take into account the fact that you may have multiple outputs (see also #18300 and the related issue #18065) and you add a test. Thanks for your patience!

@kiudee mentioned this pull request Sep 18, 2020
@boricles (Author) commented Sep 19, 2020

Thanks @cmarmo, @Yard1, @plgreenLIRU for your feedback and help! I will go over the latest comment:

@boricles , I'm happy with y_std=1 then, as long as you take into account the fact that you may have multiple outputs (see also #18300 and the related issue #18065) and you add a test. Thanks for your patience!

I will push the agreed changes first, and the associated test next. I will probably need help with the test; if I have any questions I will let you know.

@alfaro96 alfaro96 left a comment (Member)

Thank you @boricles for your PR!

I have a few comments and suggestions.

Comment on lines 202 to 203
self._y_train_std = np.asarray(
    [std if std else 1 for std in self._y_train_std])

It is preferable to use boolean masks:

Suggested change
self._y_train_std = np.asarray(
    [std if std else 1 for std in self._y_train_std])
self._y_train_std[self._y_train_std == 0] = 1
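
For illustration, a tiny sketch of what that boolean-mask line does on a multi-output std vector (values are made up):

    import numpy as np

    y_train_std = np.array([0.0, 2.5, 0.0])
    y_train_std[y_train_std == 0] = 1  # in-place, only the zero-std targets
    print(y_train_std)  # [1.  2.5 1. ]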

Comment on lines 199 to 200
# assign _y_train.std to 1 when _y_train.std is zero
# to avoid divide by zero error

Nit:

Suggested change
# assign _y_train.std to 1 when _y_train.std is zero
# to avoid divide by zero error
# Assign a standard deviation of one to constant
# targets for avoiding division by zero errors

Comment on lines 205 to 206
self._y_train_std = \
    self._y_train_std if self._y_train_std else 1

Although it would work, I think that for clarity we should use:

Suggested change
self._y_train_std = \
self._y_train_std if self._y_train_std else 1
self._y_train_std = (
    self._y_train_std if self._y_train_std != 0 else 1)

@boricles (Author)

Thanks @alfaro96 for your suggestions! I included all of them.
@cmarmo, after including @alfaro96's suggestions, I am not sure we still need to add a test. All checks seem to be passing now.

@alfaro96 (Member) commented Sep 20, 2020

Thanks @alfaro96 for your suggestions! I included all of them.
@cmarmo, after including @alfaro96's suggestions, I am not sure we still need to add a test. All checks seem to be passing now.

We need to test that, for constant target(s), the private _y_train_std attribute is set to 1 instead of 0, since that is the change introduced by this PR.

@boricles (Author)

@alfaro96 thanks again! I included a test.

Base automatically changed from master to main January 22, 2021 10:53
@cmarmo (Contributor) commented Feb 5, 2021

Hi @boricles, thanks for your patience! Are you still interested in finishing this pull request? If so, do you mind synchronizing with upstream? The renaming of the main branch made some checks fail. Thanks!

@afonari (Contributor) commented Mar 16, 2021

Can I create a new PR with the exact same changes and give all the credit to @boricles, just to get this merged?! =)

@ogrisel ogrisel left a comment (Member)

Please also document the fix in doc/whats_new/v0.24.rst targeting 0.24.2.

    self._y_train_std[self._y_train_std == 0] = 1
else:
    self._y_train_std = (
        self._y_train_std if self._y_train_std != 0 else 1)

Could you please use _handle_zeros_in_scale instead, as done in #19361?
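
For reference, a sketch of how that helper could be applied here (assuming the private helper sklearn.preprocessing._data._handle_zeros_in_scale; the import path is internal and may change between versions):

    import numpy as np
    from sklearn.preprocessing._data import _handle_zeros_in_scale

    y = np.array([[3.0, 1.0], [3.0, 2.0], [3.0, 3.0]])
    # Replaces zero entries of the scale with 1, covering both the scalar
    # and the multi-output cases in one call.
    y_train_std = _handle_zeros_in_scale(np.std(y, axis=0), copy=False)
    print(y_train_std)  # the constant first column's std 0.0 becomes 1.0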

gpr.fit(X, y)
_y_train_std = gpr._y_train_std

assert_array_equal(_y_train_std, expected_std)

Please expand the test and call predict on the fitted model as done in #19361 and give credit to @sobkevich in the changelog.
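
An expanded test along those lines might look like the following sketch (illustrative only; the data, expected_std, and assertions are assumptions, not the test that was eventually merged):

    import numpy as np
    from numpy.testing import assert_allclose, assert_array_equal
    from sklearn.gaussian_process import GaussianProcessRegressor

    X = np.array([[0.0], [1.0], [2.0]])
    y = np.full(3, 5.0)   # constant target: y.std() == 0
    expected_std = 1.0    # the fix maps the zero std to 1

    gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    assert_array_equal(gpr._y_train_std, expected_std)

    # predict must not produce NaNs on a constant target
    y_pred, y_std = gpr.predict(X, return_std=True)
    assert_allclose(y_pred, y)
    assert np.all(np.isfinite(y_std))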

@ogrisel (Member) commented Mar 17, 2021

Can I create a new PR with the exact same changes and give all the credit to @boricles, just to get this merged?! =)

@afonari feel free to do so and address the comments above if neither @boricles nor @sobkevich is available to update their PRs.

@ogrisel ogrisel changed the title from "add very small number to the y_train.std to avoid a divide by zero error" to "Prevent division by zero in GPR when y_train is constant" Mar 17, 2021
@cmarmo cmarmo added the "Superseded" label (PR has been replaced by a newer PR) and removed the "Waiting for Reviewer" label Mar 25, 2021
@boricles (Author)

Hi @cmarmo, @ogrisel, @afonari,
Apologies, but I am currently quite busy trying to finish a project. Thanks for your work!
If you need something from my side, I can reply on weekends from 1:00 AM CEST.

@thomasjpfan thomasjpfan closed this Apr 8, 2021
Labels
module:gaussian_process, Superseded (PR has been replaced by a newer PR)
Development

Successfully merging this pull request may close these issues.

Regression in GP standard deviation where y_train.std() == 0
8 participants