[MRG] Fixes: Predicted standard deviation values of Gaussian Processes are only within [0, 1] #15782
Conversation
Just to make it clear, does this PR fix this issue? I just looked at the changed file but I don't see any changes that seem to be addressing this issue. Or maybe I'm not able to connect the dots for how the current changes address it.
Hi @Amir-Arsalan. So, to be clear, the kernel has no vertical length scale parameter. This is fine, so long as y has variance 1. To address this, we set y_norm = (y - mean(y)) / std(y). We then apply our GP to y_norm as usual, before undoing the normalisation after calculating the predictive mean and standard deviation. The bug is that the current code sets y_norm = y - mean(y).
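The mean-and-std normalisation described above can be sketched in a few lines (an illustrative sketch, not the actual `_gpr.py` code; the function names are made up for the example):

```python
import numpy as np

def normalize_targets(y):
    """Normalise y to zero mean and unit std (the proposed fix).

    The buggy code subtracted the mean but skipped the division
    by the standard deviation.
    """
    y_mean = np.mean(y, axis=0)
    y_std = np.std(y, axis=0)
    return (y - y_mean) / y_std, y_mean, y_std

def undo_normalization(mu_norm, std_norm, y_mean, y_std):
    # The predictive mean is scaled and shifted back; the predictive
    # standard deviation only needs the scale factor.
    return mu_norm * y_std + y_mean, std_norm * y_std
```

With this, a GP fitted to `y_norm` sees unit-variance targets regardless of the scale of `y`, and the un-normalised predictive standard deviation is no longer confined to [0, 1].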
Thanks for looking into this @plgreenLIRU !
The solution seems reasonable enough, but I should spend some time getting more familiar with this implementation and the math behind it.
On the API side I find that it's quite alarming that the default normalize_y=False
makes such strong assumptions about the mean and std of y
and silently produces wrong results if they are not met. Doing some failure analysis to understand how we got here, could be useful (not asking to do it as part of the review, I might look into it).
On the code style side, I have added your two examples under the details section of the PR description. Would you mind re-formulating them as unit tests that, for instance, compare the predictions with pre-calculated GPy predictions for the same input (checking with assert_allclose up to some tolerance), so they would run without needing GPy? You can include the GPy code used to generate the reference results in the associated code comments.
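A sketch of what such a test could look like (the function name, training data, and tolerance here are illustrative; the GPy snippet in the comment shows where the hard-coded reference values would come from, and a placeholder stands in for them so the sketch runs):

```python
import numpy as np
from numpy.testing import assert_allclose
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def test_predictions_match_gpy_reference():
    X = np.array([[1.0], [3.0], [5.0], [6.0], [7.0], [8.0]])
    y = X.ravel() * np.sin(X.ravel())
    X_test = np.array([[2.0], [4.0], [9.0]])

    gpr = GaussianProcessRegressor(kernel=RBF(1.0), normalize_y=True)
    y_pred, y_std = gpr.fit(X, y).predict(X_test, return_std=True)

    # Reference values would be generated once with GPy (not run here):
    #   import GPy
    #   k = GPy.kern.RBF(input_dim=1)
    #   m = GPy.models.GPRegression(X, y[:, None], k)
    #   m.optimize()
    #   mu, var = m.predict(X_test)
    y_pred_gpy = y_pred.copy()  # placeholder for the pasted GPy values
    assert_allclose(y_pred, y_pred_gpy, rtol=0.07)

test_predictions_match_gpy_reference()
```

This way the reference numbers live in the test file as constants and GPy is only needed offline, when (re)generating them.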
Ah great, I'm really pleased that this is useful. That's a good idea using GPy as a reference in a test. I'm happy to work on this - my time's a bit random but hopefully a week or two will be enough. Feel free to message if you want to talk about maths :)
OK @rth I've had a go - let me know what you think when you can.
Just to be sure we're on the same track: can you otherwise learn this scaling as the parameter of a kernel?
Whether the target values y are normalized, i.e., the mean of the
observed target values becomes zero. This parameter should be set to
True if the target values' mean is expected to differ considerably from
zero. When enabled, the normalization effectively modifies the GP's
Why is this comment about the theoretical concerns of using this parameter removed?
Now this is a tricky one! Partly because discussions around the likelihood principle can go on forever...
Really my concern here is a practical one. The comment seems to tell the user that they should not normalise the data but, as we now know, the model only provides predictions whose standard deviation lies between 0 and 1. So normalisation is essential in this case, given the model's limitations. I think it's important that we don't put the user off using the normalisation as, for this implementation, it's really crucial.
sklearn/gaussian_process/_gpr.py
Outdated
Whether the target values y are normalized, the mean and variance of
the target values are set equal to 0 and 1 respectively. This is
recommended for cases where zero-mean, unit-variance priors are used.
Note that, in the following, the normalisation is reversed before the
What does "the following" refer to?
Sorry, I thought I'd replied to this. I just meant 'in the following code'. Basically, I'm trying to tell the user that the predictive results will be 'un-normalised' before being reported back.
I don't usually assume that the reader is looking in the code file, but often at pre-compiled documentation, so "in this implementation" might be more appropriate.
Please also add a `.. versionchanged:: 0.23` tag.
Yes I think we can. We would simply multiply the existing kernel by another hyperparameter so, for example, the RBF kernel would then be k(xi, xj) = \sigma_v^2 \exp(-1/(2 * l^2) * (xi - xj)^2) (where \sigma_v^2 is our extra hyperparameter). We can then learn \sigma_v^2 but, in fact, it's trivial as we can find a closed-form solution for the maximum likelihood \sigma_v^2. I've tried to put some maths into a picture here: I have to admit, I assumed that normalising y to have unit variance would be equivalent to setting \sigma_v = 1, but when I sat down and went through the maths (above) it's not quite the case. In fact, it looks like my assumption becomes increasingly valid for smaller length scales (until \tilde{K} becomes the identity). So including an extra hyperparameter, \sigma_v, could be another way of solving this issue. I'd be surprised if it was worth the extra complexity relative to just normalising y though.
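For illustration, this extra vertical variance hyperparameter can already be expressed in scikit-learn by multiplying the RBF by a ConstantKernel, so \sigma_v^2 is learned by maximum likelihood instead of normalising y (a sketch with made-up data, bounds, and jitter; not part of this PR):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Illustrative data whose variance is far from 1.
X = np.linspace(0.1, 9.9, 30)[:, None]
y = 10 * (X.ravel() * np.sin(X.ravel()))

# ConstantKernel * RBF gives k(xi, xj) = sigma_v^2 * exp(-d^2 / (2 l^2));
# sigma_v^2 is then optimised along with the length scale.
kernel = ConstantKernel(1.0, (1e-3, 1e4)) * RBF(1.0, (1e-2, 1e2))
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-6).fit(X, y)

# The fitted constant is the maximum-likelihood vertical variance.
sigma_v2 = gpr.kernel_.k1.constant_value
```

For targets of this amplitude the fitted `sigma_v2` comes out much larger than 1, which is exactly the scaling that normalising y would otherwise absorb.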
Thanks for the reminder @plgreenLIRU. Sorry, I won't have the availability to look at this PR in detail this week; next week will be better (and Joel mentioned a similar situation in general, I think). No worries if it takes longer for you to respond then.
Could this also be related to this one? #11663
@adrinjalali I will try to take a look this week, just a bit busy (already somehow!)
Here we test that, when normalize_y=True, our GP can produce a
sensible fit to training data whose variance is significantly
larger than unity. This test was made in response to issue #15612.
rm blank line
Done.
using GPy which, here, is treated as the 'gold standard'. Note that we
only investigate the RBF kernel here, as that is what was used in the
GPy implementation.
rm blank line
Done. Also corrected a small typo in the test description.
# Based on numerical experiments, it's reasonable to expect our
# GP's mean predictions to get within 7% of the predictions
# made by GPy.
assert_allclose(y_pred, y_pred_gpy, rtol=0.07, atol=0)
This assertion fails at master. Is it correct that we should expect the mean prediction to change with this PR?
By changing the amplitude of y, we also change the optimum hyperparameters so yes, this would also alter the mean predictions.
A really hand-wavy explanation: for the higher amplitude cases, two points that are next to each other on the x-axis will now be further apart on the y-axis, so can be viewed as 'less correlated'. Running the current code on data generated from
y = x * sin(x)
gives us lengthscale 1.587, but running the code on data generated from
y = 10 * (x * sin(x))
gives us lengthscale 0.93, for example. Like I said, very hand-wavy, but the optimum hyperparameters are different between the two cases.
It's worth noting that this change brings the predictions made by the current code and GPy much closer but they won't be the same, as GPy is also using the vertical lengthscale parameter in the radial basis function kernel. Initially I thought that the normalisation I'm suggesting here would make the optimum vertical lengthscale equal to 1 (so the difference wouldn't matter) but the maths I did earlier showed that this isn't the case. It is very close to 1 though.
I think that the normalisation helps to improve the fidelity of the GP a lot. After that, the introduction of a vertical lengthscale parameter would improve it even more, but not by as much. If you like, I'd be happy to take a look at introducing a vertical length scale parameter later, as a separate PR in the future?
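A quick sketch of the effect described above (illustrative data; the exact fitted values will differ from the 1.587 / 0.93 quoted earlier, so only the fact that the optimum length scale moves is checked):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Without a vertical variance hyperparameter and without normalisation,
# rescaling the targets changes the optimum length scale the optimizer
# finds, and hence the mean predictions.
X = np.linspace(0.1, 9.9, 20)[:, None]
y = X.ravel() * np.sin(X.ravel())

def fitted_length_scale(targets):
    gpr = GaussianProcessRegressor(kernel=RBF(1.0), alpha=1e-6,
                                   normalize_y=False)
    return gpr.fit(X, targets).kernel_.length_scale

ls_unit = fitted_length_scale(y)        # data generated from x*sin(x)
ls_scaled = fitted_length_scale(10 * y) # same data, 10x the amplitude
```

The two fitted length scales differ, which is why this PR's normalisation changes the mean predictions as well as the standard deviations.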
I think this is good, as long as we're sure it's a bug fix.
Please add an entry to the change log at doc/whats_new/v0.23.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.
Is there a reason the title still marks this as WIP? Is there other work you need to do before this is safe to merge?
Sorry, I can't get rid of the conflict with v0.23.rst?
doc/whats_new/v0.23.rst
Outdated
:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.

- |Fix| Fixed bug in :class:`gaussian_process.GaussianProcessRegressor` that
  caused predicted standard deviations to only be between 0 and 1. :pr:`15782`
If this was only an issue when normalize=True, please say so.
So you'd still have a predicted variance between [0, 1] whether or not the data is normalised; normalising the data is the solution. The only exception to this is if WhiteKernel is being used, but that would only be applied if the training data was noisy (measurement noise etc.)
I made a small change to the text.
Hi guys, would you like me to do anything else to this?
Probably! But you'll have to wait for another reviewer, unfortunately.
Very happy to have you contribute to the GP module. Hope you stick around ;)
doc/whats_new/v0.23.rst
Outdated
@@ -89,8 +89,13 @@ Changelog
...............................

- |Enhancement| :func:`gaussian_process.kernels.Matern` returns the RBF kernel when ``nu=np.inf``.
  :pr:`15503` by :user:`Sam Dixon <sam-dixon>`.

:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.
:pr: `15503` by :user:`Sam Dixon` <sam-dixon>.
:pr:`15503` by :user:`Sam Dixon` <sam-dixon>.
I think there's a space added here by mistake :)
doc/whats_new/v0.23.rst
Outdated
:pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
:pr:`16076` by :user:`Guillaume Lemaitre <glemaitre>` and
please try to avoid unrelated changes in your future PRs.
Hi - I think the two changes above weren't me but came from a merge with master?
You have some merge conflicts which need to be fixed, then we can merge :)
OK, hopefully it's alright now - thank you for your patience! I've enjoyed working with you guys. I have a few more PRs kicking about and I'm happy to help with reviews in the future if you think I can help.
…y within [0, 1] (scikit-learn#15782)
* Initial work for WIP pull request
* First commit of full solution and new tests
* Changes from PR review
* Updated changelog and normalize_y description
* Updating v0.23.rst with upstream master
* Small update to whatsnew
Co-authored-by: Joel Nothman <joel.nothman@gmail.com>
I know this is already merged and closed, but a workaround with the previous versions was defining all kernels as a product of, e.g., RBF and a constant kernel, thus acquiring the vertical variance through likelihood maximization. Just noting here for future reference.
Thanks @canbooo. Yeah, I didn't appreciate that's how you could get the vertical length scale in there.
FWIW, this broke a few downstream users; see bayesian-optimization/BayesianOptimization#243
…ngle point - See: - scikit-optimize/scikit-optimize#939 - scikit-learn/scikit-learn#15782 - Alternative could be to change `y_normalize` given to `GaussianProcessRegressor` in `optimization.backends.skopt.engine` (https://github.com/HunterMcGushion/hyperparameter_hunter/blob/cf4a13765a6c871909418e59596260a7f0628fce/hyperparameter_hunter/optimization/backends/skopt/engine.py#L769) to False instead of True
Reference Issues/PRs
Fixes #15612
What does this implement/fix? Explain your changes.
Currently, kernels do not have a vertical length scale parameter and the 'noise variance' is not optimised during training.
This means that the current GP is only applicable to output data that has unit variance.
Unfortunately normalize_y only removes the mean from the signal - it should also divide by the standard deviation.
Any other comments?
A symptom of this bug is that the std of the predictions can only be between 0 and 1 (maths elaborated in original discussion, but happy to discuss further).
This is potentially quite a serious problem, as the GP will only work properly on output data that has unit variance. Fortunately, the fix should be quite simple.
Original discussion already showed that the fix provides much more sensible confidence bounds. For completeness, dev1.py validates the proposed change against GPy (which is a sensible benchmark). Note that the current implementation is a long way off the GPy predictions for this case.
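The symptom can be checked directly (a minimal sketch assuming the fixed normalize_y behaviour; the data and the far-away query point are made up): far from the training data, the predictive standard deviation should revert to roughly std(y) rather than being capped at 1.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Training data whose standard deviation is much larger than 1.
X = np.linspace(0, 5, 10)[:, None]
y = 100 * np.sin(X.ravel())

# alpha adds a little jitter for numerical stability of the Cholesky.
gpr = GaussianProcessRegressor(kernel=RBF(1.0), normalize_y=True,
                               alpha=1e-6).fit(X, y)

# Query far outside the training range, where the GP falls back to its
# prior; with the fix, the returned std is on the scale of std(y).
_, y_std = gpr.predict(np.array([[50.0]]), return_std=True)
```

Under the buggy mean-only normalisation, `y_std` here would have been at most 1 regardless of the amplitude of y.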
I'm happy to take this forward, but it's worth checking we're all on the same page before I put more time into it.