[MRG] Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1] #15782
Conversation
Just to make it clear, does this PR fix this issue? I just looked at the changed file but I don't see any changes that seem to be addressing this issue. Or maybe I'm not able to connect the dots for how the current changes address it.
Hi @Amir-Arsalan. So, to be clear, the kernel has no vertical length scale parameter. This is fine, so long as y has variance 1. To address this, we set y_norm = (y - mean(y)) / std(y). We then apply our GP to y_norm as usual, before undoing the normalisation after calculating the predictive mean and standard deviation. The bug is that the current code sets y_norm = y - mean(y).
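The mean-and-std normalisation described above can be sketched in a few lines (an illustrative sketch, not the actual `_gpr.py` code; the function names are made up for the example):

```python
import numpy as np

def normalize_targets(y):
    """Normalise y to zero mean and unit std (the proposed fix).

    The buggy code subtracted the mean but skipped the division
    by the standard deviation.
    """
    y_mean = np.mean(y, axis=0)
    y_std = np.std(y, axis=0)
    return (y - y_mean) / y_std, y_mean, y_std

def undo_normalization(mu_norm, std_norm, y_mean, y_std):
    # The predictive mean is scaled and shifted back; the predictive
    # standard deviation only needs the scale factor.
    return mu_norm * y_std + y_mean, std_norm * y_std
```

With this, a GP fitted to `y_norm` sees unit-variance targets regardless of the scale of `y`, and the un-normalised predictive standard deviation is no longer confined to [0, 1].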
Thanks for looking into this @plgreenLIRU !
The solution seems reasonable enough, but I should spend some time getting more familiar with this implementation and the math behind it.
On the API side I find that it's quite alarming that the default normalize_y=False
makes such strong assumptions about the mean and std of y
and silently produces wrong results if they are not met. Doing some failure analysis to understand how we got here, could be useful (not asking to do it as part of the review, I might look into it).
On the code style side, I have added your two examples under the details section of the PR description. Would you mind re-formulating them as unit tests that, for instance, compare the predictions with pre-calculated GPy predictions for the same input (checking with assert_allclose up to some tolerance), so they would run without needing GPy? You can include the GPy code used to generate the reference results in the associated code comments.
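A sketch of what such a test could look like (the function name, training data, and tolerance here are illustrative; the GPy snippet in the comment shows where the hard-coded reference values would come from, and a placeholder stands in for them so the sketch runs):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def test_predictions_match_gpy_reference():
    X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
    y = X.ravel() * np.sin(X.ravel())
    X_test = np.array([[2.0], [4.0], [9.0]])

    gpr = GaussianProcessRegressor(kernel=RBF(1.0), normalize_y=True)
    y_pred, y_std = gpr.fit(X, y).predict(X_test, return_std=True)

    # Reference values would be generated once with GPy (not run here):
    #   import GPy
    #   k = GPy.kern.RBF(input_dim=1)
    #   m = GPy.models.GPRegression(X, y[:, None], k)
    #   m.optimize()
    #   mu, var = m.predict(X_test)
    y_pred_gpy = y_pred.copy()  # placeholder for the pasted GPy values
    assert_allclose(y_pred, y_pred_gpy, rtol=0.07)

test_predictions_match_gpy_reference()
```

This way the reference numbers live in the test file as constants and GPy is only needed offline, when (re)generating them.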
Ah great, I'm really pleased that this is useful. That's a good idea using GPy as a reference in a test. I'm happy to work on this - my time's a bit random but hopefully a week or two will be enough. Feel free to message if you want to talk about maths :)
OK @rth I've had a go - let me know what you think when you can.
Just to be sure we're on the same track: can you otherwise learn this scaling as the parameter of a kernel?
Whether the target values y are normalized, i.e., the mean of the
observed target values becomes zero. This parameter should be set to
True if the target values' mean is expected to differ considerably from
zero. When enabled, the normalization effectively modifies the GP's
Why is this comment about the theoretical concerns of using this parameter removed?
Now this is a tricky one! Partly because discussions around the likelihood principle can go on forever...
Really my concern here is a practical one. The comment seems to tell the user that they should not normalise the data but, as we now know, the model only provides predictions whose standard deviation lies between 0 and 1. So normalisation is essential in this case, given the model's limitations. I think it's important that we don't put the user off using the normalisation as, for this implementation, it's really crucial.
sklearn/gaussian_process/_gpr.py
Outdated
Whether the target values y are normalized, the mean and variance of
the target values are set equal to 0 and 1 respectively. This is
recommended for cases where zero-mean, unit-variance priors are used.
Note that, in the following, the normalisation is reversed before the
What does "the following" refer to?
Sorry, I thought I'd replied to this. I just meant 'in the following code'. Basically, I'm trying to tell the user that the predictive results will be 'un-normalised' before being reported back.
I don't usually assume that the reader is looking in the code file, but often at pre-compiled documentation, so "in this implementation" might be more appropriate.
Please also add a `.. versionchanged:: 0.23` tag.
Yes I think we can. We would simply multiply the existing kernel by another hyperparameter so, for example, the RBF kernel would then be k(xi, xj) = \sigma_v^2 \exp(-1/(2 * l^2) * (xi - xj)^2) (where \sigma_v^2 is our extra hyperparameter). We can then learn \sigma_v^2 but, in fact, it's trivial as we can find a closed-form solution for the maximum likelihood \sigma_v^2. I've tried to put some maths into a picture here: I have to admit, I assumed that normalising y to have unit variance would be equivalent to setting \sigma_v = 1, but when I sat down and went through the maths (above) it's not quite the case. In fact, it looks like my assumption becomes increasingly valid for smaller length scales (until \tilde{K} becomes the identity). So including an extra hyperparameter, \sigma_v, could be another way of solving this issue. I'd be surprised if it was worth the extra complexity relative to just normalising y though.
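For illustration, this extra vertical variance hyperparameter can already be expressed in scikit-learn by multiplying the RBF by a ConstantKernel, so \sigma_v^2 is learned by maximum likelihood instead of normalising y (a sketch with made-up data, bounds, and jitter; not part of this PR):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Illustrative data whose variance is far from 1.
X = np.linspace(0.1, 9.9, 30)[:, None]
y = 10 * (X.ravel() * np.sin(X.ravel()))

# ConstantKernel * RBF gives k(xi, xj) = sigma_v^2 * exp(-d^2 / (2 l^2));
# sigma_v^2 is then optimised along with the length scale.
kernel = ConstantKernel(1.0, (1e-3, 1e4)) * RBF(1.0, (1e-2, 1e2))
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X, y)

# The fitted constant is the maximum-likelihood vertical variance.
sigma_v2 = gpr.kernel_.k1.constant_value
```

For targets of this amplitude the fitted `sigma_v2` comes out much larger than 1, which is exactly the scaling that normalising y would otherwise absorb.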
Thanks for the reminder @plgreenLIRU. Sorry, I won't have the availability to look at this PR in detail this week; next week will be better (and Joel mentioned a similar situation in general, I think). No worries if it takes longer for you to respond then.
Could this also be related to this one? #11663
@adrinjalali I will try to take a look this week, just a bit busy (already somehow!)
Here we test that, when normalize_y=True, our GP can produce a
sensible fit to training data whose variance is significantly
larger than unity. This test was made in response to issue #15612.
rm blank line
Done.
using GPy which, here, is treated as the 'gold standard'. Note that we
only investigate the RBF kernel here, as that is what was used in the
GPy implementation.
rm blank line
Done. Also corrected a small typo in the test description.
# Based on numerical experiments, it's reasonable to expect our
# GP's mean predictions to get within 7% of the predictions
# made by GPy.
assert_allclose(y_pred, y_pred_gpy, rtol=0.07, atol=0)
This assertion fails at master. Is it correct that we should expect the mean prediction to change with this PR?
By changing the amplitude of y, we also change the optimum hyperparameters so yes, this would also alter the mean predictions.
A really hand-wavy explanation: for the higher amplitude cases, two points that are next to each other on the x-axis will now be further apart on the y-axis, so can be viewed as 'less correlated'. Running the current code on data generated from
y = x * sin(x)
gives us lengthscale 1.587, but running the code on data generated from
y = 10 * (x * sin(x))
gives us lengthscale 0.93, for example. Like I said, very hand-wavy, but the optimum hyperparameters are different between the two cases.
It's worth noting that this change brings the predictions made by the current code and GPy much closer but they won't be the same, as GPy is also using the vertical lengthscale parameter in the radial basis function kernel. Initially I thought that the normalisation I'm suggesting here would make the optimum vertical lengthscale equal to 1 (so the difference wouldn't matter) but the maths I did earlier showed that this isn't the case. It is very close to 1 though.
I think that the normalisation helps to improve the fidelity of the GP a lot. After that, the introduction of a vertical lengthscale parameter would improve it even more, but not by as much. If you like, I'd be happy to take a look at introducing a vertical length scale parameter later, as a separate PR in the future?
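A quick sketch of the effect described above (illustrative data; the exact fitted values will differ from the 1.587 / 0.93 quoted earlier, so only the fact that the optimum length scale moves is checked):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Without a vertical variance hyperparameter and without normalisation,
# rescaling the targets changes the optimum length scale the optimizer
# finds, and hence the mean predictions.
X = np.linspace(0.1, 9.9, 20)[:, None]
y = X.ravel() * np.sin(X.ravel())

def fitted_length_scale(targets):
    gpr = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-6,
                                   normalize_y=False)
    return gpr.fit(X, targets).kernel_.length_scale

ls_unit = fitted_length_scale(y)        # data generated from x*sin(x)
ls_scaled = fitted_length_scale(10 * y) # same data, 10x the amplitude
```

The two fitted length scales differ, which is why this PR's normalisation changes the mean predictions as well as the standard deviations.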
I think this is good, as long as we're sure it's a bug fix.
Please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
Is there a reason the title still marks this as WIP? Is there other work you need to do before this is safe to merge?
Sorry, I can't get rid of the conflict with v0.23.rst?
doc/whats_new/v0.23.rst
Outdated
:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.

- |Fix| Fixed bug in :class:`gaussian_process.GaussianProcessRegressor` that
  caused predicted standard deviations to only be between 0 and 1. :pr:`15782`
If this was only an issue when normalize=True, please say so.
So you'd still have a predicted variance between [0, 1] whether or not the data is normalised; normalising the data is the solution. The only exception to this is if WhiteKernel is being used, but that would only be applied if the training data was noisy (measurement noise etc.)
I made a small change to the text.
Hi guys, would you like me to do anything else to this?
Probably! But you'll have to wait for another reviewer, unfortunately.
Very happy to have you contribute to the GP module. Hope you stick around ;)
doc/whats_new/v0.23.rst
Outdated
@@ -89,8 +89,13 @@ Changelog
...............................

- |Enhancement| :func:`gaussian_process.kernels.Matern` returns the RBF kernel when ``nu=np.inf``.
  :pr:`15503` by :user:`Sam Dixon <sam-dixon>`.

:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.
:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.
:pr:`15503` by :user:`Sam Dixon` <sam-dixon>.
I think there's a space added here by mistake :)
doc/whats_new/v0.23.rst
Outdated
:pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
:pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
please try to avoid unrelated changes in your future PRs.
Hi - I think the two changes above weren't me but came from a merge with master?
You have some merge conflicts which need to be fixed, then we can merge :)
OK, hopefully it's alright now - thank you for your patience! I've enjoyed working with you guys. I have a few more PRs kicking about and I'm happy to help with reviews in the future if you think I can help.
…y within [0, 1] (scikit-learn#15782)
* Initial work for WIP pull request
* First commit of full solution and new tests
* Changes from PR review
* Updated changelog and normalize_y description
* Updating v0.23.rst with upstream master
* Small update to whatsnew
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
I know this is already merged and closed, but a workaround with the previous versions was defining all kernels as a product of, e.g., RBF and a constant kernel, thus acquiring the vertical variance through likelihood maximization. Just noting here for future reference.
Thanks @canbooo. Yeah, I didn't appreciate that's how you could get the vertical length scale in there.
FWIW, this broke a few downstream users; see bayesian-optimization/BayesianOptimization#243
…ngle point - See: - scikit-optimize/scikit-optimize#939 - scikit-learn/scikit-learn#15782 - Alternative could be to change `y_normalize` given to `GaussianProcessRegressor` in `optimization.backends.skopt.engine` (https://github.com/HunterMcGushion/hyperparameter_hunter/blob/cf4a13765a6c871909418e59596260a7f0628fce/hyperparameter_hunter/optimization/backends/skopt/engine.py#L769) to False instead of True
Reference Issues/PRs
Fixes #15612
What does this implement/fix? Explain your changes.
Currently, kernels do not have a vertical length scale parameter and the 'noise variance' is not optimised during training.
This means that the current GP is only applicable to output data that has unit variance.
Unfortunately normalize_y only removes the mean from the signal - it should also divide by the standard deviation.
Any other comments?
A symptom of this bug is that the std of the predictions can only be between 0 and 1 (maths elaborated in original discussion, but happy to discuss further).
This is potentially quite a serious problem, as the GP will only work properly on output data that has unit variance. Fortunately, the fix should be quite simple.
Original discussion already showed that the fix provides much more sensible confidence bounds. For completeness, dev1.py validates the proposed change against GPy (which is a sensible benchmark). Note that the current implementation is a long way off the GPy predictions for this case.
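The symptom can be checked directly (a minimal sketch assuming the fixed normalize_y behaviour; the data and the far-away query point are made up): far from the training data, the predictive standard deviation should revert to roughly std(y) rather than being capped at 1.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Training data whose standard deviation is much larger than 1.
X = np.linspace(0, 5, 10)[:, None]
y = 100 * np.sin(X.ravel())

# alpha adds a little jitter for numerical stability of the Cholesky.
gpr = GaussianProcessRegressor(kernel=RBF(1.0), normalize_y=True,
                               alpha=1e-6).fit(X, y)

# Query far outside the training range, where the GP falls back to its
# prior; with the fix, the returned std is on the scale of std(y).
_, y_std = gpr.predict(np.array([[50.0]]), return_std=True)
```

Under the buggy mean-only normalisation, `y_std` here would have been at most 1 regardless of the amplitude of y.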
I'm happy to take this forward, but it's worth checking we're all on the same page before I put more time into it.