
[MRG] Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1] #15782


Merged
merged 8 commits into scikit-learn:master from plgreenLIRU:issue-15612 on Feb 25, 2020

Conversation

@plgreenLIRU (Contributor) commented Dec 4, 2019

Reference Issues/PRs

Fixes #15612

What does this implement/fix? Explain your changes.

  • Currently, kernels do not have a vertical length scale parameter and the 'noise variance' is not optimised during training.

  • This means that the current GP is only applicable to output data that has unit variance.

  • Unfortunately, normalize_y only removes the mean from the signal - it should also divide by the standard deviation.

Any other comments?

  • A symptom of this bug is that the std of the predictions can only be between 0 and 1 (the maths is elaborated in the original discussion and sketched below, but I'm happy to discuss further).

  • This is potentially quite a serious problem, as the GP will only work properly on output data that has unit variance. Fortunately, the fix should be quite simple.

  • The original discussion already showed that the fix provides much more sensible confidence bounds. For completeness, dev1.py (below) validates the proposed change against GPy, which is a sensible benchmark. Note that the current implementation is a long way off the GPy predictions for this case.
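For reference, the maths behind the [0, 1] bound (a sketch, assuming a kernel with k(x, x) = 1, such as the RBF, and no WhiteKernel): the predictive variance at a test point x_* is

\sigma^2(x_*) = k(x_*, x_*) - k_*^T K^{-1} k_* <= k(x_*, x_*) = 1

because K is positive definite, so the quadratic form k_*^T K^{-1} k_* is non-negative. The predictive standard deviation therefore never exceeds 1, whatever the scale of y.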

from sklearn.gaussian_process import GaussianProcessRegressor as GP
from sklearn.gaussian_process.kernels import RBF
from matplotlib import pyplot as plt
import numpy as np
import GPy

"""
Verifying the proposed solution against GPy. 
"""

# Set amplitude of signal that we'll use as training data
A = 10

# Pick a kernel
RBF_params = {'length_scale' : 1.0}
kernel = RBF(**RBF_params)

# Test function
def f(x):
    return A * np.sin(x)

# Test case (from test_gpr.py code)
X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
y = f(X).ravel()
X_star = np.vstack(np.linspace(1, 8, 100))

# GP (y normalised, using the proposed fix)
gp = GP(kernel=kernel, normalize_y=True)
gp.fit(X, y)
y_pred, y_pred_std = gp.predict(X_star, return_std=True)

# GPy analysis for verification
kernel_gpy = GPy.kern.RBF(input_dim=1, lengthscale=1.)
gpy = GPy.models.GPRegression(X, np.vstack(y), kernel_gpy)
gpy.optimize()
y_pred_gpy, y_var_gpy = gpy.predict(X_star)
y_pred_std_gpy = np.sqrt(y_var_gpy)

# Plot some results
plt.figure()
plt.plot(X_star, y_pred, 'k', label='sklearn (proposed solution)')
plt.plot(X_star, y_pred + 3 * y_pred_std, 'k')
plt.plot(X_star, y_pred - 3 * y_pred_std, 'k')
plt.plot(X_star, y_pred_gpy, 'b', label='GPy')
plt.plot(X_star, y_pred_gpy + 3 * y_pred_std_gpy, 'b')
plt.plot(X_star, y_pred_gpy - 3 * y_pred_std_gpy, 'b')
plt.plot(X, y, 'k o')
plt.legend()
plt.show()
  • Some of the tests will need to be rewritten as well. For example, dev2.py shows why the existing test_y_normalization test won't pass any more - a GP fitted with normalize_y=True will match one fitted with normalize_y=False only if the original data already has unit variance.
from sklearn.gaussian_process import GaussianProcessRegressor as GP
from sklearn.gaussian_process.kernels import RBF
from matplotlib import pyplot as plt
import numpy as np
import GPy

"""
Illustrating a problem with test_y_normalization
"""

# Set amplitude of signal that we'll use as training data
A = 10

# Pick a kernel
RBF_params = {'length_scale': 1.0}
kernel = RBF(**RBF_params)

# Test function
def f(x):
    return A * (x * np.sin(x))

# Test case (from test_gpr.py code)
X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
y = f(X).ravel()
X_star = np.vstack(np.linspace(1, 8, 100))

# GP1 (y not normalised)
gp1 = GP(kernel=kernel, normalize_y=False)
gp1.fit(X, y)
y_pred1, y_pred_std1 = gp1.predict(X_star, return_std=True)

# GP2 (y normalised)
gp2 = GP(kernel=kernel, normalize_y=True)
gp2.fit(X, y)
y_pred2, y_pred_std2 = gp2.predict(X_star, return_std=True)

# GPy analysis for verification
kernel_gpy = GPy.kern.RBF(input_dim=1, lengthscale=1.)
gpy = GPy.models.GPRegression(X, np.vstack(y), kernel_gpy)
gpy.optimize()
y_pred_gpy, y_var_gpy = gpy.predict(X_star)
y_pred_std_gpy = np.sqrt(y_var_gpy)

# Plot some results
plt.figure()
plt.plot(X_star, y_pred1, 'k', label='sklearn, normalize_y=False')
plt.plot(X_star, y_pred1 + 3 * y_pred_std1, 'k')
plt.plot(X_star, y_pred1 - 3 * y_pred_std1, 'k')
plt.plot(X_star, y_pred2, 'r', label='sklearn, normalize_y=True')
plt.plot(X_star, y_pred2 + 3 * y_pred_std2, 'r')
plt.plot(X_star, y_pred2 - 3 * y_pred_std2, 'r')
plt.plot(X_star, y_pred_gpy, 'b', label='GPy')
plt.plot(X_star, y_pred_gpy + 3 * y_pred_std_gpy, 'b')
plt.plot(X_star, y_pred_gpy - 3 * y_pred_std_gpy, 'b')
plt.plot(X, y, 'k o')
plt.legend()
plt.show()

I'm happy to take this forward, but it's worth checking that we're all on the same page before I put more time into it.

@Amir-Arsalan

@plgreenLIRU

Reference Issues/PRs

Fixes #15612

What does this implement/fix? Explain your changes.

* Currently, kernels do not have a vertical length scale parameters and the 'noise variance' is not optimised during training.

Just to make it clear, does this PR fix this issue? I just looked at the changed file, but I don't see any changes that seem to address it. Or maybe I'm not able to connect the dots for how the current changes address the length_scale parameter issue.

@plgreenLIRU (Contributor Author)

Hi @Amir-Arsalan. So, to be clear, the kernel has no vertical length scale parameter. This is fine, so long as y has variance 1. To address this, we set

y_norm = (y - mean(y)) / std(y)

We then apply our GP to y_norm as usual, before undoing the normalisation after calculating the predictive mean and standard deviation.

The bug is that the current code sets y_norm = y - mean(y)
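A minimal sketch of the fix (illustrative only - these helper names are hypothetical, not the actual scikit-learn source):

import numpy as np

def normalize_targets(y):
    # Record both statistics at fit time; the bug is that only the
    # mean was being removed, leaving the scale of y untouched.
    y_mean = np.mean(y)
    y_std = np.std(y)
    return (y - y_mean) / y_std, y_mean, y_std

def undo_normalization(mu_norm, std_norm, y_mean, y_std):
    # Reverse the transformation for the predictive mean and std.
    return y_std * mu_norm + y_mean, y_std * std_norm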

@rth self-requested a review on December 4, 2019
@rth (Member) left a comment:

Thanks for looking into this @plgreenLIRU !

The solution seems reasonable enough, but I should spend some time getting more familiar with this implementation and the math behind it.

On the API side, I find it quite alarming that the default normalize_y=False makes such strong assumptions about the mean and std of y and silently produces wrong results when they are not met. Doing some failure analysis to understand how we got here could be useful (not asking you to do it as part of the review; I might look into it).

On the code style side: I have added your two examples under a details section of the PR description. Would you mind re-formulating them as unit tests that, for instance, compare the predictions with pre-calculated GPy predictions for the same input (checking with assert_allclose up to some tolerance), so that they run without needing GPy? You can include the GPy code used to generate the reference results in the associated code comments.

@plgreenLIRU (Contributor Author)

Ah great, I'm really pleased that this is useful. Using GPy as a reference in a test is a good idea. I'm happy to work on this - my time's a bit unpredictable, but hopefully a week or two will be enough. Feel free to message if you want to talk about the maths :)

@plgreenLIRU (Contributor Author)

OK @rth I've had a go - let me know what you think when you can.

@jnothman (Member) left a comment:

Just to be sure we're on the same track: can you otherwise learn this scaling as the parameter of a kernel?


Whether the target values y are normalized, i.e., the mean of the
observed target values become zero. This parameter should be set to
True if the target values' mean is expected to differ considerable from
zero. When enabled, the normalization effectively modifies the GP's
Member:

Why is this comment about the theoretical concerns of using this parameter removed?

Contributor Author:

Now this is a tricky one! Partly because discussions around the likelihood principle can go on forever...

Really my concern here is a practical one. The comment seems to tell the user that they should not normalise the data but, as we now know, the model only produces predictions whose standard deviation lies between 0 and 1. So normalisation is essential here, given the model's limitations. I think it's important that we don't put users off the normalisation because, for this implementation, it's really crucial.

Whether the target values y are normalized, the mean and variance of
the target values are set equal to 0 and 1 respectively. This is
recommended for cases where zero-mean, unit-variance priors are used.
Note that, in the following, the normalisation is reversed before the
Member:

What does "the following" refer to?

Contributor Author:

Sorry, I thought I'd replied to this. I just meant 'in the following code'. Basically, I'm trying to tell the user that the predictive results will be 'un-normalised' before being reported back.

Member:

I don't usually assume that the reader is looking in the code file, but often at pre-compiled documentation, so "in this implementation" might be more appropriate.

Member:

Please also add a .. versionchanged:: 0.23 tag.

@plgreenLIRU (Contributor Author)

Just to be sure we're on the same track: can you otherwise learn this scaling as the parameter of a kernel?

Yes, I think we can. We would simply multiply the existing kernel by another hyperparameter so that, for example, the RBF kernel would become

k(x_i, x_j) = \sigma_v^2 \exp(-(x_i - x_j)^2 / (2 l^2))

(where \sigma_v^2 is our extra hyperparameter). We can then learn \sigma_v^2 but, in fact, it's trivial, as we can find a closed-form solution for the maximum-likelihood \sigma_v^2. I've tried to put some maths into a picture here:

[screenshot: handwritten derivation of the closed-form maximum-likelihood \sigma_v^2]
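For reference, my reconstruction of that derivation (a sketch, assuming a zero-mean GP prior over n noise-free observations; not necessarily identical to the screenshot): writing the covariance matrix as K = \sigma_v^2 \tilde{K}, the log marginal likelihood is

\log p(y) = -(1 / (2 \sigma_v^2)) y^T \tilde{K}^{-1} y - (n/2) \log \sigma_v^2 - (1/2) \log |\tilde{K}| - (n/2) \log(2\pi)

and setting its derivative with respect to \sigma_v^2 to zero gives the closed-form maximum-likelihood estimate

\hat{\sigma}_v^2 = (1/n) y^T \tilde{K}^{-1} y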

I have to admit, I assumed that normalising y to have unit variance would be equivalent to setting \sigma_v = 1, but when I sat down and went through the maths (above), it's not quite the case. In fact, it looks like my assumption becomes increasingly valid for smaller length scales (until \tilde{K} becomes the identity).

So, including an extra hyperparameter \sigma_v could be another way of solving this issue. I'd be surprised if it was worth the extra complexity relative to just normalising y, though.

@plgreenLIRU (Contributor Author)

Hi @jnothman, @rth - is there anything else you would like me to change? I've got a few free days coming up but will then be off for Christmas, so I thought I should check.

@rth (Member) commented Dec 17, 2019

Thanks for the reminder @plgreenLIRU. Sorry, I won't have the availability to look at this PR in detail this week; next week will be better (and I think Joel mentioned a similar situation). No worries if it then takes you longer to respond.

@adrinjalali (Member)

Could this also be related to this one? #11663

@plgreenLIRU (Contributor Author)

@adrinjalali I will try to take a look this week - just a bit busy (already, somehow!)

Here we test that, when noramlize_y=True, our GP can produce a
sensible fit to training data whose variance is significantly
larger than unity. This test was made in response to issue #15612.

Member:

rm blank line

Contributor Author:

Done.

using GPy which, here, is treated as the 'gold standard'. Note that we
only investigate the RBF kernel here, as that is what was used in the
GPy implementation.

Member:

rm blank line

Contributor Author:

Done. Also corrected a small typo in the test description.

# Based on numerical experiments, it's reasonable to expect our
# GP's mean predictions to be within 7% of those made by GPy.
assert_allclose(y_pred, y_pred_gpy, rtol=0.07, atol=0)
Member:

This assertion fails on master. Is it correct that we should expect the mean prediction to change with this PR?

Contributor Author:

By changing the amplitude of y, we also change the optimum hyperparameters, so yes, this also alters the mean predictions.

A really hand-wavy explanation: for the higher-amplitude cases, two points that are next to each other on the x-axis will now be further apart on the y-axis, so they can be viewed as 'less correlated'. Running the current code on data generated from

y = x * sin(x)

gives us lengthscale 1.587, but running the code on data generated from

y = 10 * (x * sin(x))

gives us lengthscale 0.93, for example. Like I said, very hand-wavy, but the optimum hyperparameters are different between the two cases.
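A quick sketch to reproduce this (exact values depend on the optimizer run):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
for A in (1, 10):
    y = A * (X * np.sin(X)).ravel()
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X, y)
    # The optimised length scale differs between the two amplitudes.
    print(A, gp.kernel_.length_scale)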

It's worth noting that this change brings the predictions made by the current code and GPy much closer, but they won't be identical, as GPy is also using the vertical length scale parameter in the radial basis function kernel. Initially I thought that the normalisation I'm suggesting here would make the optimum vertical length scale equal to 1 (so the difference wouldn't matter), but the maths I did earlier showed that this isn't the case. It is very close to 1, though.

I think the normalisation improves the fidelity of the GP a lot. After that, introducing a vertical length scale parameter would improve it further, but not by as much. If you like, I'd be happy to look at introducing a vertical length scale parameter later, as a separate PR.

@plgreenLIRU (Contributor Author)

Could this also be related to this one? #11663

@adrinjalali

Possibly... but I'm having trouble recreating that issue. I've commented on #11663 itself in case it isn't related to the current thread.

@jnothman added the High Priority label on Jan 20, 2020
@jnothman (Member) left a comment:

I think this is good, as long as we're sure it's a bug fix.

Please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors, if applicable) with :user:.


@jnothman (Member)

Is there a reason the title still marks this as WIP? Is there other work you need to do before this is safe to merge?

@plgreenLIRU changed the title from '[WIP] Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1]' to '[MRG] Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1]' on Jan 21, 2020
@plgreenLIRU (Contributor Author)

Sorry, I can't seem to get rid of the conflict with v0.23.rst?

:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.

- |Fix| Fixed bug in :class:`gaussian_process.GaussianProcessRegressor` that
caused predicted standard deviations to only be between 0 and 1. :pr:`15782`
Member:

If this was only an issue when normalize_y=True, please say so.

Contributor Author:

So you'd still have predicted standard deviations within [0, 1] whether or not the data is normalised - normalising the data is the solution. The only exception is if a WhiteKernel is being used, but that would only be applied if the training data were noisy (measurement noise etc.).

I made a small change to the text.
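To illustrate the exception mentioned above (a hedged sketch, not part of this PR): adding a WhiteKernel gives the optimiser a learnable noise variance, so the predictive standard deviation is no longer capped at 1 even without normalisation.

from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# k(x, x) = 1 + noise_level on the diagonal, so the sigma(x*) <= 1
# bound no longer holds; noise_level is optimised during fit.
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)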

@plgreenLIRU (Contributor Author)

Hi guys, would you like me to do anything else to this?

@jnothman added this to the 0.23 milestone on Feb 9, 2020
@jnothman (Member) commented Feb 9, 2020

Hi guys, would you like me to do anything else to this?

Probably! But you'll have to wait for another reviewer, unfortunately.

@adrinjalali (Member) left a comment:

Very happy to have you contribute to the GP module. Hope you stick around ;)

@@ -89,8 +89,13 @@ Changelog
...............................

- |Enhancement| :func:`gaussian_process.kernels.Matern` returns the RBF kernel when ``nu=np.inf``.
:pr:`15503` by :user:`Sam Dixon <sam-dixon>`.

:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.
Member:

Suggested change:
- :pr: `15503` by :user:`Sam Dixon` <sam-dixon>.
+ :pr:`15503` by :user:`Sam Dixon` <sam-dixon>.

I think there's a space added here by mistake :)

Comment on lines 145 to 150
:pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
:pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
Member:

please try to avoid unrelated changes in your future PRs.

Contributor Author:

Hiya - I think the two lines above weren't me; they came from a merge with master?

@adrinjalali (Member)

You have some merge conflicts which need to be fixed, then we can merge :)

@plgreenLIRU (Contributor Author)

OK hopefully alright now - thank you for your patience!

I've enjoyed working with you guys. I have a few more PRs kicking about, and I'm happy to help with reviews in the future if you think I'd be useful.

@adrinjalali merged commit 4c29be4 into scikit-learn:master on Feb 25, 2020
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1] (scikit-learn#15782)

* Initial work for WIP pull request

* First commit of full solution and new tests

* Changes from PR review

* Updated changelog and normalize_y description

* Updating v0.23.rst with upstream master

* Small update to whatsnew

Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
@plgreenLIRU deleted the issue-15612 branch on April 16, 2020
@canbooo commented May 12, 2020

I know this is already merged and closed, but a workaround with previous versions was defining kernels as a product of, e.g., RBF and a constant kernel, thus acquiring the vertical variance through likelihood maximization. Just noting it here for future reference.
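A sketch of that workaround (illustrative, with made-up data):

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

X = np.atleast_2d([1., 3., 5., 6., 7., 8.]).T
y = 10 * np.sin(X).ravel()

# The learned constant plays the role of the vertical variance sigma_v^2,
# absorbing the scale of y through likelihood maximisation.
kernel = ConstantKernel(1.0, (1e-3, 1e3)) * RBF(length_scale=1.0)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)
print(gp.kernel_)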

@plgreenLIRU (Contributor Author)

Thanks @canbooo. Yeah, I didn't appreciate that that's how you could get the vertical length scale in there.

gio8tisu pushed a commit to gio8tisu/scikit-learn that referenced this pull request May 15, 2020
Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1] (scikit-learn#15782)
@rkern commented Sep 1, 2020

FWIW, this broke a few downstream users where y_train.std() == 0 (e.g. with only one data point) when the GP is used for Bayesian optimization. This probably needs a guard for that case (which is common in the Bayesian optimization setting).

bayesian-optimization/BayesianOptimization#243
scikit-optimize/scikit-optimize#939
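A sketch of the kind of guard this needs (illustrative; not necessarily the eventual scikit-learn fix):

import numpy as np

def safe_normalize(y):
    # Fall back to std = 1 when y is constant (e.g. a single training
    # point), so only the mean is removed and we avoid dividing by zero.
    y_mean = np.mean(y)
    y_std = np.std(y)
    y_std = 1.0 if y_std == 0.0 else y_std
    return (y - y_mean) / y_std, y_mean, y_std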

Labels: High Priority, Waiting for Reviewer

Successfully merging this pull request may close these issues:
Predicted standard deviation values of Gaussian Processes are only within [0, 1]