Prevent division by zero in GPR when y_train is constant #18388

Closed

Conversation

@boricles (Contributor)

Reference Issues/PRs

Fixes #18318
Regression in GP standard deviation where y_train.std() == 0
The normalize_y=True option now divides out the standard deviation of the y data instead of just subtracting the mean. When there is only one data point, y_train.std() is zero and the normalization produces NaN.

What does this implement/fix? Explain your changes.

Add a very small number to y_train.std() to avoid a division-by-zero error.
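
For context, a minimal sketch that reproduces the failure when run against a scikit-learn version without the fix (illustrative only; the single-point setup and default kernel are assumptions, not code from this PR):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    X = np.array([[1.0]])
    y = np.array([2.0])

    # With a single training point, y_train.std() is 0; before this fix,
    # normalize_y=True divides by that 0 and the prediction becomes NaN.
    gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mean, std = gpr.predict(X, return_std=True)
    print(mean, std)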

Any other comments?

@cmarmo cmarmo left a comment (Contributor)

Thanks @boricles for your pull request.
A first comment:

y = (y - self._y_train_mean) / self._y_train_std
# Moreover, add a very small number to the y_train.std to
# avoid a divide by zero error.
y = (y - self._y_train_mean) / (self._y_train_std + 1E-19)

Line 203 will always add an epsilon, even when _y_train_std is not 0.
Do you mind adding an if check, so that this is done only when _y_train_std == 0?

@Yard1 commented Sep 14, 2020

Didn't notice this one was open, so I made my own (#18397); I closed it now. My approach was:

        if self.normalize_y:
            self._y_train_mean = np.mean(y, axis=0)
            self._y_train_std = np.std(y, axis=0)
            self._y_train_std = self._y_train_std if self._y_train_std else 1

(last line added)

I was wondering whether it wouldn't be better to use 1 when std is 0. If std is 0, all the data is completely uniform, so I feel it would make more sense to just use 1. That is also the approach used in scale in sklearn, so it would be consistent. Furthermore, because the value is saved (not just used once in the division) and equal to 1, it won't cause any issues when reversing the normalization, which happens later on. Thoughts?
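
For reference, a quick sketch of the behaviour in sklearn's scale that this would be consistent with (illustrative, not code from this PR): a constant column has zero standard deviation, and scale maps that zero to 1 internally, so the result is just mean-centered.

    import numpy as np
    from sklearn.preprocessing import scale

    y = np.array([[3.0], [3.0], [3.0]])
    # The zero std is replaced by 1 internally, so no NaN is produced:
    print(scale(y))  # [[0.] [0.] [0.]]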

@boricles (Contributor, Author)

@cmarmo Thanks for your comment. I included your suggestion. However, the codecov/patch check is not passing ("Added line #L204 was not covered by tests"); it seems we need to add a new test?

@Yard1, as I said in my first comment, I am just starting to contribute to sklearn. However, I will go for consistency, so if using 1 is also the approach taken in scale in sklearn, we should probably follow the same approach here.

@cmarmo (Contributor) commented Sep 16, 2020

... seems to be we need to add a new test?

@boricles, yes, an if condition has been added, so you have to test that it works as expected. This test will be useful whichever implementation you decide to go for (you might want to use the failing case from the comment in #15782).

@Yard1 I prefer the "adding a very small number" solution, but I'm not the expert here so maybe @plgreenLIRU (the author of #15782) might want to comment here.

Anyway, rather than setting an arbitrary small amount, numpy.finfo(dtype).eps would be useful here. It is often used in sklearn to define small quantities in close-to-zero issues, e.g.:

EPSILON = np.finfo(np.float32).eps
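
For context, a quick sketch of what those machine epsilons evaluate to (standard IEEE-754 values):

    import numpy as np

    # eps is the gap between 1.0 and the next larger representable float
    print(np.finfo(np.float32).eps)  # ~1.19e-07
    print(np.finfo(np.float64).eps)  # ~2.22e-16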


# Moreover, add a very small number to the y_train.std to
# avoid a divide by zero error.
if self._y_train_std.all() == 0:

Indeed, I forgot you may have multiple targets: you might want to add the epsilon only to the targets having zero std.

@plgreenLIRU (Contributor)

Hi, happy to help if needed. Potentially stupid question first: why would someone try to perform GP regression on a single data point?

@cmarmo (Contributor) commented Sep 17, 2020

Hi, happy to help if needed. Potentially stupid question first: why would someone try to perform GP regression on a single data point?

Thanks! To answer your question, apparently this could happen in Bayesian optimization (see the references in the corresponding issue #18318). I'm not an expert, but I think a division by zero should be prevented here anyway... :). Is adding an epsilon a distortion of your contribution?

@plgreenLIRU (Contributor)

I've not had a chance to check the issue, but setting std equal to 1 seems like a clean solution; I don't think it really matters, though. The important thing would be for the comments to give a bit more detail about why the if statement has been included; it might be a bit vague at the moment.

@cmarmo (Contributor) commented Sep 18, 2020

@boricles , I'm happy with y_std=1 then, as long as you take into account the fact that you may have multiple outputs (see also #18300 and the related issue #18065) and you add a test. Thanks for your patience!

@kiudee mentioned this pull request Sep 18, 2020
@boricles (Author) commented Sep 19, 2020

Thanks @cmarmo, @Yard1, @plgreenLIRU for your feedback and help! I will go over the latest comment:

@boricles , I'm happy with y_std=1 then, as long as you take into account the fact that you may have multiple outputs (see also #18300 and the related issue #18065) and you add a test. Thanks for your patience!

I will push the agreed changes first, and the associated test next. I will probably need help with the test; if I have any questions I will let you know.

@alfaro96 alfaro96 left a comment (Member)

Thank you @boricles for your PR!

I have a few comments and suggestions.

Comment on lines 202 to 203
self._y_train_std = np.asarray(
    [std if std else 1 for std in self._y_train_std])

It is preferable to use boolean masks:

Suggested change
self._y_train_std = np.asarray(
    [std if std else 1 for std in self._y_train_std])
self._y_train_std[self._y_train_std == 0] = 1
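
For illustration, a tiny sketch of what that boolean-mask line does on a multi-output std vector (values are made up):

    import numpy as np

    y_train_std = np.array([0.0, 2.5, 0.0])
    y_train_std[y_train_std == 0] = 1  # in-place, only the zero-std targets
    print(y_train_std)  # [1.  2.5 1. ]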

Comment on lines 199 to 200
# assign _y_train.std to 1 when _y_train.std is zero
# to avoid divide by zero error

Nit:

Suggested change
# assign _y_train.std to 1 when _y_train.std is zero
# to avoid divide by zero error
# Assign a standard deviation of one to constant
# targets for avoiding division by zero errors

Comment on lines 205 to 206
self._y_train_std = \
    self._y_train_std if self._y_train_std else 1

Although it would work, I think that for clarity we should use:

Suggested change
self._y_train_std = \
self._y_train_std if self._y_train_std else 1
self._y_train_std = (
    self._y_train_std if self._y_train_std != 0 else 1)

@boricles (Author)

Thanks @alfaro96 for your suggestions! I included all of them.
@cmarmo, after including @alfaro96's suggestions, I am not sure we still need to add a test. All checks seem to be passing now.

@alfaro96 (Member) commented Sep 20, 2020

Thanks @alfaro96 for your suggestions! I included all of them.
@cmarmo, after including @alfaro96's suggestions, I am not sure we still need to add a test. All checks seem to be passing now.

We need to test that, for constant target(s), the private _y_train_std attribute is set to 1 instead of 0, since that is the change introduced by this PR.

@boricles (Author)

@alfaro96 thanks again! I included a test.

Base automatically changed from master to main January 22, 2021 10:53
@cmarmo (Contributor) commented Feb 5, 2021

Hi @boricles, thanks for your patience! Are you still interested in finishing this pull request? If so, do you mind synchronizing with upstream? The renaming of the main branch made some checks fail. Thanks!

@afonari (Contributor) commented Mar 16, 2021

Can I create a new PR with the exact same changes and give all the credit to @boricles, just to get this merged?! =)

@ogrisel ogrisel left a comment (Member)

Please also document the fix in doc/whats_new/v0.24.rst targeting 0.24.2.

    self._y_train_std[self._y_train_std == 0] = 1
else:
    self._y_train_std = (
        self._y_train_std if self._y_train_std != 0 else 1)

Could you please use _handle_zeros_in_scale instead, as done in #19361?
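
For reference, a sketch of how that helper could be applied here (assuming the private helper sklearn.preprocessing._data._handle_zeros_in_scale; the import path is internal and may change between versions):

    import numpy as np
    from sklearn.preprocessing._data import _handle_zeros_in_scale

    y = np.array([[3.0, 1.0], [3.0, 2.0], [3.0, 3.0]])
    # Replaces zero entries of the scale with 1, covering both the scalar
    # and the multi-output cases in one call.
    y_train_std = _handle_zeros_in_scale(np.std(y, axis=0), copy=False)
    print(y_train_std)  # the constant first column's std 0.0 becomes 1.0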

gpr.fit(X, y)
_y_train_std = gpr._y_train_std

assert_array_equal(_y_train_std, expected_std)

Please expand the test and call predict on the fitted model as done in #19361 and give credit to @sobkevich in the changelog.
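
An expanded test along those lines might look like the following sketch (illustrative only; the data, expected_std, and assertions are assumptions, not the test that was eventually merged):

    import numpy as np
    from numpy.testing import assert_allclose, assert_array_equal
    from sklearn.gaussian_process import GaussianProcessRegressor

    X = np.array([[0.0], [1.0], [2.0]])
    y = np.full(3, 5.0)   # constant target: y.std() == 0
    expected_std = 1.0    # the fix maps the zero std to 1

    gpr = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    assert_array_equal(gpr._y_train_std, expected_std)

    # predict must not produce NaNs on a constant target
    y_pred, y_std = gpr.predict(X, return_std=True)
    assert_allclose(y_pred, y)
    assert np.all(np.isfinite(y_std))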

@ogrisel (Member) commented Mar 17, 2021

Can I create a new PR with the exact same changes and give all the credit to @boricles, just to get this merged?! =)

@afonari feel free to do so and address the comments above if neither @boricles nor @sobkevich is available to update their PRs.

@ogrisel ogrisel changed the title from "add very small number to the y_train.std to avoid a divide by zero error" to "Prevent division by zero in GPR when y_train is constant" Mar 17, 2021
@cmarmo cmarmo added the "Superseded" label (PR has been replaced by a newer PR) and removed the "Waiting for Reviewer" label Mar 25, 2021
@boricles (Author)

Hi @cmarmo, @ogrisel, @afonari,
Apologies, but I am currently quite busy trying to finish a project. Thanks for your work!
If you need something from my side, I can reply on weekends from 1:00 AM CEST.

@thomasjpfan thomasjpfan closed this Apr 8, 2021
Labels
module:gaussian_process, Superseded (PR has been replaced by a newer PR)
Development

Successfully merging this pull request may close these issues.

Regression in GP standard deviation where y_train.std() == 0
8 participants