FIX GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True) #19706


Closed · wants to merge 10 commits

Conversation

@AidarShakerimoff commented Mar 17, 2021

Problem with GPR predictions on multitarget data. Fixes issues #17394 and #18065

What does this fix?

Initially, when multi-target data was given, _gpr.py was unable to undo the normalisation of y_cov and y_var in the predict() function (lines 355 and 381). In both cases the code tried to multiply two row vectors of different lengths and raised ValueError: operands could not be broadcast together with shapes (a,) (b,) (e.g. number of samples 'a' = 4, number of target components 'b' = 3). I solved this by reshaping the y_var vector, and the rows of y_cov, into column (vertical) form before multiplication with the y_train_std vector. As a result, we obtain y_std of shape (n_samples, n_output_dims) and y_cov of shape (n_samples, n_samples, n_output_dims). When targets are single values, the undoing of the normalisation is performed as before.
The solution follows Rasmussen and Williams' book for the case where the classes of outputs are independent of each other.

Comment

This solution will work only if the target classes are independent.
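
As an illustration, here is a minimal NumPy sketch of the reshaping described above. It is not the actual _gpr.py diff; the array values, and the use of y_train_std**2 to rescale variances and covariances, are illustrative assumptions.

import numpy as np

# Hypothetical stand-ins for the quantities inside predict():
y_train_std = np.array([2.0, 0.5, 3.0])   # per-target std, shape (n_output_dims,) = (3,)
y_var = np.ones(4)                         # normalised predictive variance, shape (n_samples,) = (4,)
y_cov = np.eye(4)                          # normalised predictive covariance, shape (n_samples, n_samples)

# y_var * y_train_std**2 raises "operands could not be broadcast together"
# because the shapes are (4,) and (3,). Reshaping into column form lets the
# undo-normalisation broadcast over the independent targets:
y_var_multi = y_var[:, np.newaxis] * y_train_std**2      # shape (n_samples, n_output_dims)
y_cov_multi = y_cov[:, :, np.newaxis] * y_train_std**2   # shape (n_samples, n_samples, n_output_dims)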

 Problem with GPR predictions on multitarget data.
@afonari
Contributor

afonari commented Mar 18, 2021

I don't have merging rights here, but I think you need tests for this one; see the bottom of #19703 for a test example.

@AidarShakerimoff
Author

I don't have merging rights here, but I think you need tests for this one; see the bottom of #19703 for a test example.

Thank you, I will do it

@cmarmo
Contributor

cmarmo commented Mar 18, 2021

@AidarShakerimoff thanks for your pull request. Linting errors are present. Do you mind running

$ flake8 sklearn/gaussian_process/_gpr.py

on your repository? This will display all the errors. See also this page for more details about flake8.

@AidarShakerimoff
Author

@AidarShakerimoff thanks for your pull request. Linting errors are present. Do you mind running

$ flake8 sklearn/gaussian_process/_gpr.py

on your repository? This will display all the errors. See also this page for more details about flake8.

Thank you for your advice!

@ogrisel
Member

@ogrisel left a comment


Please add a new test (or update an existing test) to check that if the target data is already normalized (before fit, for instance using sklearn.preprocessing.scale), passing normalize_y=True or normalize_y=False makes predict return similar results (on toy random datasets generated with make_regression for instance).

Please check with return_cov=True and with multi-target y.
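
For illustration, a rough sketch of such a test (the dataset sizes, variable names, and tolerances below are my own choices, not part of the review, and it assumes the fix gives the covariance the (n_samples, n_samples, n_output_dims) shape described in this PR):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.preprocessing import scale

# Toy multi-target data, standardised per target before fitting.
X, y = make_regression(n_samples=30, n_features=5, n_targets=2, random_state=0)
y = scale(y)  # zero mean, unit variance for each target

gpr_norm = GaussianProcessRegressor(normalize_y=True).fit(X, y)
gpr_raw = GaussianProcessRegressor(normalize_y=False).fit(X, y)

mean_norm, cov_norm = gpr_norm.predict(X, return_cov=True)
mean_raw, cov_raw = gpr_raw.predict(X, return_cov=True)

# On already-scaled targets, the normalisation should be (almost) a no-op.
np.testing.assert_allclose(mean_norm, mean_raw, atol=1e-7)

# With this PR, cov_norm has shape (n_samples, n_samples, n_output_dims) while
# cov_raw stays (n_samples, n_samples); each per-target slice should match it.
for k in range(cov_norm.shape[-1]):
    np.testing.assert_allclose(cov_norm[..., k], cov_raw, atol=1e-7)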

@ogrisel changed the title from "[FIX] of the issue scikit-learn#17394 and scikit-learn#18065" to "[FIX] GaussianProcessRegression(normalize_y=).predict(X, return_cov=True)" Mar 19, 2021
@ogrisel changed the title from "[FIX] GaussianProcessRegression(normalize_y=).predict(X, return_cov=True)" to "[FIX] GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True)" Mar 19, 2021
@AidarShakerimoff
Author

Please add a new test (or update an existing test) to check that if the target data is already normalized (before fit, for instance using sklearn.preprocessing.scale), passing normalize_y=True or normalize_y=False makes predict return similar results (on toy random datasets generated with make_regression for instance).

Please check with return_cov=True and with multi-target y.

Thank you very much, I will finish it soon.

@AidarShakerimoff
Author

By the way, should I also try to handle cases where there is a correlation between the output classes? Or should I keep assuming that the classes are independent?

@AidarShakerimoff
Author

How can I solve the changelog problem?

@AidarShakerimoff
Author

AidarShakerimoff commented Apr 10, 2021

Please add a new test (or update an existing test) to check that if the target data is already normalized (before fit, for instance using sklearn.preprocessing.scale), passing normalize_y=True or normalize_y=False makes predict return similar results (on toy random datasets generated with make_regression for instance).

Please check with return_cov=True and with multi-target y.

I am trying to implement the following code, which compares the results of two GPRs (normalized and not normalized) on a dataset that is already normalized.

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn import preprocessing
from sklearn import datasets
import numpy as np
from sklearn.utils._testing \
    import assert_almost_equal
# toy dataset
X, y = datasets.make_regression(n_targets=1)
X_train, y_train = X[:50], y[:50]
X_test = X[50:]

# normalize the data before fitting the GP
y_train_norm = preprocessing.normalize(y_train.reshape(y_train.shape[0], -1), axis=0).reshape(y_train.shape)


gp1 = GaussianProcessRegressor(
    normalize_y=True,
)
gp2 = GaussianProcessRegressor(
    normalize_y=False,
)

gp1.fit(X_train, y_train_norm)
gp2.fit(X_train, y_train_norm)

# should provide the same results
v1, d1 = gp1.predict(X_test, return_std=True)
v2, d2 = gp2.predict(X_test, return_std=True)

assert_almost_equal(v1, v2)
assert_almost_equal(d1, d2)

_, c1 = gp1.predict(X_test, return_cov=True)

_, c2 = gp2.predict(X_test, return_cov=True)
assert_almost_equal(c1, c2)

The assertion fails even when I work with single-target data.

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

<ipython-input-9-66e30f5008d1> in <module>()
     28 v2, d2 = gp2.predict(X_test, return_std=True)
     29 
---> 30 assert_almost_equal(v1,v2)
     31 assert_almost_equal(d1,d2)
     32 

2 frames

/usr/local/lib/python3.7/dist-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
    838                                 verbose=verbose, header=header,
    839                                 names=('x', 'y'), precision=precision)
--> 840             raise AssertionError(msg)
    841     except ValueError:
    842         import traceback

AssertionError: 
Arrays are not almost equal to 7 decimals

Mismatched elements: 50 / 50 (100%)
Max absolute difference: 0.00314954
Max relative difference: 1.22647451e+42
 x: array([0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495,
       0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495,
       0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495, 0.0031495,...
 y: array([-1.3586266e-39, -1.3203828e-31,  1.0705813e-35, -1.2190218e-34,
        1.8102779e-31, -1.8333309e-42,  2.9537442e-35,  1.2073815e-32,
        8.7449675e-35, -6.3151308e-35, -6.7287241e-33, -1.2166517e-35,...

Does this imply that there is something wrong with how normalization is done in GPR (regardless of the target's dimensionality)?

@cmarmo added the Bug label Apr 11, 2021
@cmarmo
Contributor

cmarmo commented Apr 11, 2021

How can I solve the changelog problem?

The check verifies that this pull request is referenced in the changelog when necessary: this is a bugfix, so it is worth an entry in the changelog. But I suggest waiting for a reviewer to clarify whether this fix should go in the bugfix release 0.24.2 (as #19703) or in 1.0. Thanks for your patience.

@AidarShakerimoff
Author

How can I solve the changelog problem?

The check verifies that this pull request is referenced in the changelog when necessary: this is a bugfix, so it is worth an entry in the changelog. But I suggest waiting for a reviewer to clarify whether this fix should go in the bugfix release 0.24.2 (as #19703) or in 1.0. Thanks for your patience.

Thank you for the clarification!

@glemaitre
Member

I don't want to merge #19703 with code that might not be working properly. Instead, I will add it here:

    # Test multi-target data
    # This test should be uncommented when a solution to #19706 is found
    # n_samples = X.shape[0]
    # rng = np.random.RandomState(0)
    # Y_with_constant = np.concatenate([
    #     y[:, np.newaxis],  # non-constant target
    #     np.full(shape=(X.shape[0], 1), fill_value=2)  # constant target
    # ], axis=1)
    #
    # # Ensure no tracebacks
    # gpr.fit(X, Y_with_constant)
    # Y_pred, Y_cov = gpr.predict(X, return_cov=True)
    #
    # assert_all_finite(Y_pred)
    # assert_all_finite(Y_cov)
    #
    # # Assert correct shapes
    # assert Y_pred.shape == (n_samples, Y_with_constant.shape[1])
    # assert Y_cov.shape == (n_samples, n_samples, Y_with_constant.shape[1])

I will just add a FIXME and reference to this PR.

@glemaitre requested review from ogrisel and glemaitre and removed the request for ogrisel May 10, 2021 10:19
@glemaitre changed the title from "[FIX] GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True)" to "FIX GaussianProcessRegression(normalize_y=True).predict(X, return_cov=True)" Dec 16, 2021
@glemaitre removed their request for review December 16, 2021 10:10
@glemaitre added the Superseded label (PR has been replaced by a newer PR) and removed the Bug label Dec 16, 2021
@glemaitre
Member

Superseded by #21996
I opened this subsequent PR to reuse the outer product that was already working for normalize_y=True.
I also added @AidarShakerimoff as a contributor to the subsequent PR.
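
For context, here is a minimal sketch of the outer-product idea (an assumption about the approach, not the actual #21996 diff): np.outer flattens its inputs, so the result is reshaped back with a trailing target axis, which works for any number of targets.

import numpy as np

y_train_std = np.array([2.0, 0.5, 3.0])   # per-target std, shape (n_output_dims,)
y_var = np.ones(4)                         # normalised predictive variance, shape (n_samples,)

# Rescale and reshape back to the original shape plus a trailing target axis;
# that axis can be squeezed away for single-target y.
y_var_rescaled = np.outer(y_var, y_train_std**2).reshape(*y_var.shape, -1)
# -> shape (n_samples, n_output_dims)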

@glemaitre
Member

Closing this PR then.

@glemaitre closed this Dec 16, 2021
Labels: module:gaussian_process, Superseded (PR has been replaced by a newer PR)

5 participants