
FIX Use cho_solve when return_std=True for GaussianProcessRegressor #19939


Merged (8 commits, Apr 27, 2021)

Conversation

iwhalvic
Contributor

Reference Issues/PRs

Fixes Issue #19936

What does this implement/fix? Explain your changes.

Provides more consistency in GaussianProcessRegressor when computing the predictive standard deviation (often taken as a form of predictive uncertainty); specifically, consistency between return_std=True and return_cov=True. For certain kernel hyperparameters, return_std=True appears unstable.

return_cov=True can be used as an alternative, and the end user can quickly recover the standard deviation from the returned predictive covariance; however, if the prediction dataset is large, the covariance matrix may become too large for memory.

The key to the inconsistency appears to be that return_std=True explicitly forms the inverse of the training kernel matrix. In GP regression this inverse is typically never formed explicitly; instead the Cholesky decomposition is stored and Cholesky solves take the place of the explicit inverse (this is the approach return_cov=True uses).

This fix attempts to improve consistency and stability by using a Cholesky solve in both cases, while still avoiding the formation of the large predictive covariance matrix when return_std=True.

Changes are localized to _gpr.py

before:

elif return_std:
    # cache result of K_inv computation
    if self._K_inv is None:
        # compute inverse K_inv of K based on its Cholesky
        # decomposition L and its inverse L_inv
        L_inv = solve_triangular(self.L_.T,
                                 np.eye(self.L_.shape[0]))
        self._K_inv = L_inv.dot(L_inv.T)

    # Compute variance of predictive distribution
    y_var = self.kernel_.diag(X)
    y_var -= np.einsum("ij,ij->i",
                       np.dot(K_trans, self._K_inv), K_trans)

after:

elif return_std:
    # cache result of K_inv computation,
    # this is done for compatibility, K_inv should be avoided
    if self._K_inv is None:
        # compute inverse K_inv of K based on its Cholesky
        # decomposition L and its inverse L_inv
        L_inv = solve_triangular(self.L_.T,
                                 np.eye(self.L_.shape[0]))
        self._K_inv = L_inv.dot(L_inv.T)
    v = cho_solve((self.L_, True), K_trans.T)

    # Compute variance of predictive distribution
    # Use einsum to avoid large matrix
    y_var = self.kernel_.diag(X)
    y_var -= np.einsum("ij,ji->i",
                       K_trans, v)

The matrix v has the same dimensions as K_trans.T, so size should not be an issue.

To my understanding, the use of np.einsum avoids forming the (potentially large) covariance matrix np.dot(K_trans, v) just to extract its diagonal entries.

The computation of self._K_inv is left in the code, even though it is no longer used in the built-in workflows. This retains compatibility in case an end user is leveraging self._K_inv in some workflow. I will defer to more experienced developers on whether this is the correct course of action.

Any other comments?

Running the reproducing code in Issue #19936 now results in close agreement between the two computation methods:

[0.00754465 0.00633869 0.00642104 0.00834402]
[0.00754465 0.0063387  0.00642104 0.00834401]

@glemaitre
Member

Thanks for the PR. I can see that your PR also fixes the inconsistency that we saw here: #19703 (comment)

I am going to finish the work on that PR and then have a look at yours.
Once #19703 is merged, could you change the following test with a new parametrization:

@pytest.mark.parametrize("target", [y, np.ones(X.shape[0], dtype=np.float64)])
def test_predict_cov_vs_std(kernel, target):

In addition, it would be nice to add your minimal example as a non-regression test as well.

@glemaitre glemaitre added the Bug and To backport labels Apr 21, 2021
@glemaitre glemaitre added this to the 0.24.2 milestone Apr 21, 2021
@glemaitre
Member

Please add an entry to the change log at doc/whats_new/v*.rst. Like the other entries there, please reference this pull request with :pr: and credit yourself (and other contributors if applicable) with :user:.

You can use 0.24.2 for adding an entry.

@iwhalvic
Contributor Author

(First contribution to this project, let me know if I have messed up procedures)

Resolved the PR comments. Test cases aimed at testing K_inv existed in test_gpr.py; since K_inv is no longer computed, they have been removed.

Updated test_predict_cov_vs_std and added my minimal example to test_gpr.py; please let me know if this is not the correct place for it.

@glemaitre glemaitre left a comment (Member)

Mainly some style comments, but otherwise everything seems correct.

@glemaitre glemaitre left a comment (Member)

LGTM. I am happy with this PR. I did a quick test regarding speed and could not detect any performance regression. It looks like a win-win.

We need a second review to be able to merge this, but I would like to have it in the upcoming bug-fix release 0.24.2.

@jjerphan jjerphan left a comment (Member)

LGTM, thank you @iwhalvic.

I have included minor suggestions.

Comment on lines +560 to +561
Non-regression test for:
https://github.com/scikit-learn/scikit-learn/issues/19936

Suggested change
Non-regression test for:
https://github.com/scikit-learn/scikit-learn/issues/19936
Non-regression test for issue #19936.

Member


To be honest I prefer to have the URL :)

@jjerphan jjerphan (Member) Apr 22, 2021

Ah yes, I have just seen that both styles exist.

@iwhalvic: you can ignore this comment.

Comment on lines +563 to +575
kernel = (C(8.98576054e+05, (1e-12, 1e12)) *
          RBF([5.91326520e+02, 1.32584051e+03], (1e-12, 1e12)) +
          WhiteKernel(noise_level=1e-5))
gpr = GaussianProcessRegressor(kernel=kernel, alpha=0, optimizer=None)
X_train = np.array([[0., 0.], [1.54919334, -0.77459667], [-1.54919334, 0.],
                    [0., -1.54919334], [0.77459667, 0.77459667],
                    [-0.77459667, 1.54919334]])
y_train = np.array([[-2.14882017e-10], [-4.66975823e+00], [4.01823986e+00],
                    [-1.30303674e+00], [-1.35760156e+00],
                    [3.31215668e+00]])
gpr.fit(X_train, y_train)
X_test = np.array([[-1.93649167, -1.93649167], [1.93649167, -1.93649167],
                   [-1.93649167, 1.93649167], [1.93649167, 1.93649167]])
@jjerphan jjerphan (Member) Apr 22, 2021

Are those values the ones for the case you are describing?

Inconsistencies were observed when the kernel matrix cannot be inverted (or its inversion is not numerically stable).

Member

Yes. This kernel matrix is close to singular: #19939 (comment)
I assume that this is fine for a non-regression test.

Member

If the problem is about diagonal elements of gpr.L_ being close to zero, maybe we could make that explicit, e.g. with:

assert np.diag(gpr.L_).min() < 0.01

although I am not 100% sure this is the true cause of the problem.

@ogrisel ogrisel (Member) Apr 22, 2021

Or maybe:

assert np.linalg.eigvalsh(kernel(X_train)).min() < 1e-4

@jjerphan jjerphan (Member) Apr 22, 2021

The condition number of gpr.L_ might also be a good proxy for whether solving this system comes with numerical instability.

In this case, numerical instability is prone to be present since cond(gpr.L_) >> 1:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C, WhiteKernel

kernel = (C(8.98576054e+05, (1e-12, 1e12)) *
          RBF([5.91326520e+02, 1.32584051e+03], (1e-12, 1e12)) +
          WhiteKernel(noise_level=1e-5))

gpr = GaussianProcessRegressor(kernel=kernel, alpha=0, optimizer=None)

X_train = np.array([[0., 0.], [1.54919334, -0.77459667], [-1.54919334, 0.],
                    [0., -1.54919334], [0.77459667, 0.77459667],
                    [-0.77459667, 1.54919334]])

y_train = np.array([[-2.14882017e-10], [-4.66975823e+00], [4.01823986e+00],
                    [-1.30303674e+00], [-1.35760156e+00],
                    [3.31215668e+00]])

gpr.fit(X_train, y_train)

print(np.linalg.cond(gpr.L_))
# 720955.9866810681

Member

Yep, this is exactly what we can observe on the main branch when we try to invert L_.T with a constant target. I printed the condition number of L_: when the test fails for the constant target, it is because L_ is ill-conditioned. When the condition number >> 1, the inversion is numerically unstable even though L_ is not singular.

sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target0-kernel0] 3.6020373678666835
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target0-kernel1] 4.000792132354216
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target0-kernel2] 3.6020373678666835
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target0-kernel3] 20.254629250748767
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target0-kernel4] 20.254602147788138
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target0-kernel5] 1.0000016459179157
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target1-kernel0] 244949.11539454578
FAILED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target1-kernel1] 4.000792132354216
PASSED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target1-kernel2] 244948.4153273236
FAILED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target1-kernel3] 165965.62854978096
FAILED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target1-kernel4] 165956.90214640685
FAILED
sklearn/gaussian_process/tests/test_gpr.py::test_predict_cov_vs_std[target1-kernel5] 165975.30738621883
FAILED

@iwhalvic iwhalvic (Contributor, Author) Apr 22, 2021

Just adding my comments on what is occurring here:
Covariance matrices must be positive semi-definite, so all eigenvalues >= 0.
Inversion of a matrix requires all eigenvalues != 0.

So, theoretically, we could have a covariance matrix with a 0 eigenvalue, i.e. a non-invertible one.

Minimal example: if I ask you to invert A = [[1, 1], [1, 1]], you will likely not have an answer. By analogy, the previous code using solve_triangular(self.L_.T, np.eye(self.L_.shape[0])) is asking: "Solve Ax = [1, 0]". Once again, no solution for x.

In the fixed code, however, cho_solve((self.L_, True), K_trans.T) is used, and L_ and K_trans are both generated from the same kernel function, so we would not expect to ever be posed with such a problem. Instead we pose a problem such as "Solve Ax = [1, 1]", at which point you can probably provide an answer (if not quite a few).

Kernel functions such as the one in the test case form matrices like A = [[1, 1], [1, 1]] because the length scales of the RBF are quite large, so all points become highly correlated with all other points (almost regardless of their distance in X space). I believe a true linear fit using a GP with an RBF kernel would have a correlation length of inf.

The alpha parameter in GaussianProcessRegressor attempts to avoid this, and with a sufficiently large alpha value I expect it would in this case. But regardless, return_cov=True and return_std=True should be as consistent as possible.

But I would welcome a GP expert to weigh in.

@ogrisel ogrisel left a comment (Member)

Here is another round of reviews. I agree that it's good to make the two return_std / return_cov code paths consistent, but I am not familiar enough with GPs to assert that the way of computing return_cov is correct.

Can you please execute all the examples in:

https://scikit-learn.org/dev/auto_examples/index.html#gaussian-process-for-machine-learning

And report here whether or not the plots change significantly with the fix in your PR, and your analysis of whether this is expected/correct.

Here is a quick way to find those examples:

% git grep "return_std=" examples/gaussian_process 
examples/gaussian_process/plot_compare_gpr_krr.py:y_gpr = gpr.predict(X_plot, return_std=False)
examples/gaussian_process/plot_compare_gpr_krr.py:y_gpr, y_std = gpr.predict(X_plot, return_std=True)
examples/gaussian_process/plot_gpr_co2.py:y_pred, y_std = gp.predict(X_, return_std=True)
examples/gaussian_process/plot_gpr_noisy_targets.py:y_pred, sigma = gp.predict(x, return_std=True)
examples/gaussian_process/plot_gpr_noisy_targets.py:y_pred, sigma = gp.predict(x, return_std=True)
examples/gaussian_process/plot_gpr_prior_posterior.py:    y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)
examples/gaussian_process/plot_gpr_prior_posterior.py:    y_mean, y_std = gp.predict(X_[:, np.newaxis], return_std=True)

Ideally it would be great to find a second reviewer with more GP expertise than I have. Maybe @jmetzen or @plgreenLIRU could give this PR a quick look?

@glemaitre
Member

glemaitre commented Apr 22, 2021

And report here whether or not the plots change significantly with the fix in your PR, and your analysis of whether this is expected/correct.

I will have a quick look now. My feeling is that the previous way worked properly when L_ could be inverted. Issues arise when that is not the case, such as the two cases with the constant target or the specific kernel provided in the non-regression test.

But it is better to have a look at the examples.

@glemaitre glemaitre self-assigned this Apr 22, 2021
@glemaitre
Member

Actually, we have an interesting example: https://scikit-learn.org/dev/auto_examples/gaussian_process/plot_gpr_prior_posterior.html#sphx-glr-auto-examples-gaussian-process-plot-gpr-prior-posterior-py

You need to look at the standard deviation of the DotProduct kernel (Fig. 8). Computing the kernel is numerically unstable: we get the following warning:

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  _check_optimize_result("lbfgs", opt_res)
/home/circleci/project/sklearn/gaussian_process/_gpr.py:378: UserWarning: Predicted variances smaller than 0. Setting those variances to 0.
  warnings.warn("Predicted variances smaller than 0. "

So we force the negative std. dev. to 0, but we still have numerically unstable positive std. dev.

With the current PR, we don't have this issue: no warning and all std. dev. are close to 0 with a nicer plot:

[Figure_1: regenerated plot with the fix; std. dev. close to 0 and no warning]

@glemaitre
Member

All other examples are the same.

@iwhalvic
Contributor Author

Can someone help with the test errors? It seems to be just a timeout, which is hard to explain. Curious whether it is a test server problem. I do not believe I have the authority to restart the tests without making dummy commits.

@iwhalvic
Contributor Author

Will await the merge of PR #19964 to retest.

@glemaitre
Member

glemaitre commented Apr 23, 2021 via email

@glemaitre glemaitre self-requested a review April 26, 2021 10:29
@glemaitre glemaitre left a comment (Member)

I had a new look at it. LGTM. @ogrisel, do you mind having another look? We will address the documentation and further improvements in #19952.

@jeremiedbb jeremiedbb left a comment (Member)

LGTM. Thanks @iwhalvic !
