FIX GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True) #19706


Closed · wants to merge 10 commits

Conversation

@AidarShakerimoff commented Mar 17, 2021

Problem with GPR predictions on multitarget data. Fixes issues #17394 and #18065

What does this fix?

Initially, when multi-target data was given, _gpr.py was unable to undo the normalisation of y_cov and y_var in the predict() function (lines 355 and 381). In both cases the code tried to multiply two row vectors of different lengths and raised ValueError: operands could not be broadcast together with shapes (a,) (b,) (e.g. number of samples 'a' = 4, number of target components 'b' = 3). I solved this by reshaping the y_var vector, and the rows of y_cov, into column (vertical) form before multiplication with the y_train_std vector. As a result, we obtain y_std of shape (n_samples, n_output_dims) and y_cov of shape (n_samples, n_samples, n_output_dims). When targets are single values, the undoing of the normalisation is performed as before.
The solution follows Rasmussen and Williams' book for the case where the classes of outputs are independent of each other.

Comment

This solution will work only if the target classes are independent.
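
As an illustration, here is a minimal NumPy sketch of the reshaping described above. It is not the actual _gpr.py diff; the array values, and the use of y_train_std**2 to rescale variances and covariances, are illustrative assumptions.

import numpy as np

# Hypothetical stand-ins for the quantities inside predict():
y_train_std = np.array([2.0, 0.5, 3.0])   # per-target std, shape (n_output_dims,) = (3,)
y_var = np.ones(4)                         # normalised predictive variance, shape (n_samples,) = (4,)
y_cov = np.eye(4)                          # normalised predictive covariance, shape (n_samples, n_samples)

# y_var * y_train_std**2 raises "operands could not be broadcast together"
# because the shapes are (4,) and (3,). Reshaping into column form lets the
# undo-normalisation broadcast over the independent targets:
y_var_multi = y_var[:, np.newaxis] * y_train_std**2      # shape (n_samples, n_output_dims)
y_cov_multi = y_cov[:, :, np.newaxis] * y_train_std**2   # shape (n_samples, n_samples, n_output_dims)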

 Problem with GPR predictions on multitarget data.
@afonari
Contributor

afonari commented Mar 18, 2021

I don't have merging rights here, but I think you need tests for this one; see the bottom of #19703 for a test example.

@AidarShakerimoff
Author

I don't have merging rights here, but I think you need tests for this one; see the bottom of #19703 for a test example.

Thank you, I will do it

@cmarmo
Contributor

cmarmo commented Mar 18, 2021

@AidarShakerimoff thanks for your pull request. Linting errors are present. Do you mind running

$ flake8 sklearn/gaussian_process/_gpr.py

on your repository? This will display all the errors. See also this page for more details about flake8.

@AidarShakerimoff
Author

@AidarShakerimoff thanks for your pull request. Linting errors are present. Do you mind running

$ flake8 sklearn/gaussian_process/_gpr.py

on your repository? This will display all the errors. See also this page for more details about flake8.

Thank you for your advice!

@ogrisel
Member

@ogrisel left a comment


Please add a new test (or update an existing test) to check that if the target data is already normalized (before fit, for instance using sklearn.preprocessing.scale), passing normalize_y=True or normalize_y=False makes predict return similar results (on toy random datasets generated with make_regression for instance).

Please check with return_cov=True and with multi-target y.
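
For illustration, a rough sketch of such a test (the dataset sizes, variable names, and tolerances below are my own choices, not part of the review, and it assumes the fix gives the covariance the (n_samples, n_samples, n_output_dims) shape described in this PR):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.preprocessing import scale

# Toy multi-target data, standardised per target before fitting.
X, y = make_regression(n_samples=30, n_features=5, n_targets=2, random_state=0)
y = scale(y)  # zero mean, unit variance for each target

gpr_norm = GaussianProcessRegressor(normalize_y=True).fit(X, y)
gpr_raw = GaussianProcessRegressor(normalize_y=False).fit(X, y)

mean_norm, cov_norm = gpr_norm.predict(X, return_cov=True)
mean_raw, cov_raw = gpr_raw.predict(X, return_cov=True)

# On already-scaled targets, the normalisation should be (almost) a no-op.
np.testing.assert_allclose(mean_norm, mean_raw, atol=1e-7)

# With this PR, cov_norm has shape (n_samples, n_samples, n_output_dims) while
# cov_raw stays (n_samples, n_samples); each per-target slice should match it.
for k in range(cov_norm.shape[-1]):
    np.testing.assert_allclose(cov_norm[..., k], cov_raw, atol=1e-7)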

@ogrisel changed the title from "[FIX] of the issue scikit-learn#17394 and scikit-learn#18065" to "[FIX] GaussianProcessRegression(normalize_y=).predict(X, return_cov=True)" Mar 19, 2021
@ogrisel changed the title from "[FIX] GaussianProcessRegression(normalize_y=).predict(X, return_cov=True)" to "[FIX] GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True)" Mar 19, 2021
@AidarShakerimoff
Author

Please add a new test (or update an existing test) to check that if the target data is already normalized (before fit, for instance using sklearn.preprocessing.scale), passing normalize_y=True or normalize_y=False makes predict return similar results (on toy random datasets generated with make_regression for instance).

Please check with return_cov=True and with multi-target y.

Thank you very much, I will finish it soon.

@AidarShakerimoff
Author

By the way, should I also try to handle cases where there is a correlation between the output classes? Or should I keep assuming that the classes are independent?

@AidarShakerimoff
Author

How can I solve the changelog problem?

@AidarShakerimoff
Author

AidarShakerimoff commented Apr 10, 2021

Please add a new test (or update an existing test) to check that if the target data is already normalized (before fit, for instance using sklearn.preprocessing.scale), passing normalize_y=True or normalize_y=False makes predict return similar results (on toy random datasets generated with make_regression for instance).

Please check with return_cov=True and with multi-target y.

I am trying to implement the following code, which compares the results of two GPRs (normalized and not normalized) on a dataset that is already normalized.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn import preprocessing
from sklearn import datasets
import numpy as np
from sklearn.utils._testing \
    import assert_almost_equal
# toy dataset
X, y = datasets.make_regression(n_targets=1)
X_train, y_train = X[:50], y[:50]
X_test = X[50:]

# normalize the data before fitting the GP
y_train_norm = preprocessing.normalize(y_train.reshape(y_train.shape[0], -1), axis=0).reshape(y_train.shape)


gp1 = GaussianProcessRegressor(
    normalize_y=True,
)
gp2 = GaussianProcessRegressor(
    normalize_y=False,
)

gp1.fit(X_train, y_train_norm)
gp2.fit(X_train, y_train_norm)

# should provide the same results
v1, d1 = gp1.predict(X_test, return_std=True)
v2, d2 = gp2.predict(X_test, return_std=True)

assert_almost_equal(v1, v2)
assert_almost_equal(d1, d2)

_, c1 = gp1.predict(X_test, return_cov=True)

_, c2 = gp2.predict(X_test, return_cov=True)
assert_almost_equal(c1, c2)

The assertion fails even when I work with single-target data.

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

<ipython-input-9-66e30f5008d1> in <module>()
     28 v2, d2 = gp2.predict(X_test, return_std=True)
     29 
---> 30 assert_almost_equal(v1,v2)
     31 assert_almost_equal(d1,d2)
     32 

2 frames

/usr/local/lib/python3.7/dist-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    838                                 verbose=verbose, header=header,
    839                                 names=('x', 'y'), precision=precision)
--> 840             raise AssertionError(msg)
    841     except ValueError:
    842         import traceback

AssertionError: 
Arrays are not almost equal to 7 decimals

Mismatched elements: 50 / 50 (100%)
Max absolute difference: 0.00314954
Max relative difference: 1.22647451e+42
 x: array([0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495,
       0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495,
       0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495,...
 y: array([-1.3586266e-39, -1.3203828e-31,  1.0705813e-35, -1.2190218e-34,
        1.8102779e-31, -1.8333309e-42,  2.9537442e-35,  1.2073815e-32,
        8.7449675e-35, -6.3151308e-35, -6.7287241e-33, -1.2166517e-35,...

Does this imply that there is something wrong with how normalization is done in GPR (regardless of the target's dimensionality)?

@cmarmo added the Bug label Apr 11, 2021
@cmarmo
Contributor

cmarmo commented Apr 11, 2021

How can I solve the changelog problem?

The check verifies that this pull request is referenced in the changelog when necessary: this is a bugfix, so it is worth an entry in the changelog. But I suggest waiting for a reviewer to clarify whether this fix should go in the bugfix release 0.24.2 (as #19703) or in 1.0. Thanks for your patience.

@AidarShakerimoff
Author

How can I solve the changelog problem?

The check verifies that this pull request is referenced in the changelog when necessary: this is a bugfix, so it is worth an entry in the changelog. But I suggest waiting for a reviewer to clarify whether this fix should go in the bugfix release 0.24.2 (as #19703) or in 1.0. Thanks for your patience.

Thank you for the clarification!

@glemaitre
Member

I don't want to merge #19703 with code that might not be working properly. Instead, I will add it here:

    # Test multi-target data
    # This test should be uncommented when a solution to #19706 is found
    # n_samples = X.shape[0]
    # rng = np.random.RandomState(0)
    # Y_with_constant = np.concatenate([
    #     y[:, np.newaxis],  # non-constant target
    #     np.full(shape=(X.shape[0], 1), fill_value=2)  # constant target
    # ], axis=1)
    #
    # # Ensure no tracebacks
    # gpr.fit(X, Y_with_constant)
    # Y_pred, Y_cov = gpr.predict(X, return_cov=True)
    #
    # assert_all_finite(Y_pred)
    # assert_all_finite(Y_cov)
    #
    # # Assert correct shapes
    # assert Y_pred.shape == (n_samples, Y_with_constant.shape[1])
    # assert Y_cov.shape == (n_samples, n_samples, Y_with_constant.shape[1])

I will just add a FIXME and reference to this PR.

@glemaitre requested review from ogrisel and glemaitre and removed the request for ogrisel May 10, 2021 10:19
@glemaitre changed the title from "[FIX] GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True)" to "FIX GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True)" Dec 16, 2021
@glemaitre removed their request for review December 16, 2021 10:10
@glemaitre added the Superseded label (PR has been replaced by a newer PR) and removed the Bug label Dec 16, 2021
@glemaitre
Member

Superseded by #21996
I opened this subsequent PR to reuse the outer product that was already working for normalize_y=True.
I also added @AidarShakerimoff as a contributor to the subsequent PR.
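
For context, here is a minimal sketch of the outer-product idea (an assumption about the approach, not the actual #21996 diff): np.outer flattens its inputs, so the result is reshaped back with a trailing target axis, which works for any number of targets.

import numpy as np

y_train_std = np.array([2.0, 0.5, 3.0])   # per-target std, shape (n_output_dims,)
y_var = np.ones(4)                         # normalised predictive variance, shape (n_samples,)

# Rescale and reshape back to the original shape plus a trailing target axis;
# that axis can be squeezed away for single-target y.
y_var_rescaled = np.outer(y_var, y_train_std**2).reshape(*y_var.shape, -1)
# -> shape (n_samples, n_output_dims)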

@glemaitre
Member

Closing this PR then.

@glemaitre closed this Dec 16, 2021
Labels: module:gaussian_process, Superseded (PR has been replaced by a newer PR)

5 participants