Merged

43 commits
812f41d
FIX Fixes test_scale_and_stability
thomasjpfan Nov 3, 2020
6637c3b
ENH Uses Nics suggestion
thomasjpfan Nov 3, 2020
665a6c8
STY Linting
thomasjpfan Nov 4, 2020
0f2aef8
REV Revert change
thomasjpfan Nov 4, 2020
679151c
REV Adjust weights before
thomasjpfan Nov 4, 2020
5e5cdaf
BUG Fix
thomasjpfan Nov 4, 2020
538798a
ENH Sets condition on pinv2
thomasjpfan Nov 4, 2020
235565e
DOC Adds whats new
thomasjpfan Nov 5, 2020
f96e865
DOC Only rank deficient X
thomasjpfan Nov 5, 2020
a6d4d99
DOC Rank-deficient in both
thomasjpfan Nov 5, 2020
43e5cba
REV Less diffs
thomasjpfan Nov 5, 2020
0fc771b
TST Splits test_scale_and_stability into 3 tests
thomasjpfan Nov 5, 2020
59d6d85
TST Only rank decifient y
thomasjpfan Nov 5, 2020
76876f5
DOC Update
thomasjpfan Nov 5, 2020
6bdd21b
DOC Update docstring for tests
thomasjpfan Nov 6, 2020
0aebfd4
TST Adds test that fails on master
thomasjpfan Nov 6, 2020
a4386c0
TST Back to one test
thomasjpfan Nov 6, 2020
b680691
DOC Update docstring
thomasjpfan Nov 6, 2020
9158c96
TST Updated with randomly generated X and y
thomasjpfan Nov 6, 2020
a22f454
TST Places back old tests
thomasjpfan Nov 6, 2020
7881351
Merge remote-tracking branch 'upstream/master' into test_scale_and_st…
thomasjpfan Nov 6, 2020
70cd30f
STY Linting
thomasjpfan Nov 6, 2020
ece83c5
TST Adds back original dataset
thomasjpfan Nov 6, 2020
9bfec22
REV Less diffs
thomasjpfan Nov 6, 2020
e9485b8
TST Do not think old dataset is needed
thomasjpfan Nov 6, 2020
2a4733e
ENH Removes unneeded code
thomasjpfan Nov 6, 2020
f2318b0
TST Increase tolerance
thomasjpfan Nov 6, 2020
ed30019
Merge remote-tracking branch 'upstream/master' into test_scale_and_st…
thomasjpfan Nov 6, 2020
5dbaaf9
REV Less diffs
thomasjpfan Nov 6, 2020
dd56c1b
TST Use atol
thomasjpfan Nov 6, 2020
bf588bb
REV Less diffs
thomasjpfan Nov 6, 2020
c39e682
TST Update tests
thomasjpfan Nov 7, 2020
c8aa37c
REV Places back all estimators
thomasjpfan Nov 7, 2020
8287d8c
ENH Adds original dataset
thomasjpfan Nov 7, 2020
3dbdd9e
Merge remote-tracking branch 'upstream/master' into test_scale_and_st…
thomasjpfan Nov 10, 2020
8a8860f
ENH Update
thomasjpfan Nov 10, 2020
08cb652
MNT Change eps
thomasjpfan Nov 10, 2020
8abba5a
REV Less diffs
thomasjpfan Nov 10, 2020
a02c2d1
Try to remove _UnstableArchMixin from CCA
ogrisel Nov 13, 2020
e76209f
Merge branch 'master' of github.com:scikit-learn/scikit-learn into te…
ogrisel Nov 13, 2020
e79ba18
[cd build]
ogrisel Nov 13, 2020
9d6296a
Remove noqa flag
ogrisel Nov 13, 2020
0d6d52d
[arm64]
ogrisel Nov 13, 2020
Conversations
3 changes: 3 additions & 0 deletions doc/whats_new/v0.24.rst
@@ -132,6 +132,9 @@ Changelog
predictions for `est.transform(Y)` when the training data is single-target.
:pr:`17095` by `Nicolas Hug`_.

- |Fix| Increases the stability of :class:`cross_decomposition.CCA` :pr:`18746`
Member
Specify that it's only for poorly conditioned matrices? So that not all users expect changes in their results.

@thomasjpfan (Member Author), Nov 5, 2020

The issue comes from:

https://github.com/scipy/scipy/blob/8e30f7797bd1ee442f4f1a25172e4402521c1e16/scipy/linalg/basic.py#L1385

when pinv2 is called. On Windows, with the PyPI build of numpy, the singular values for Xk are:

[0.7, 0.14, 9e-16]

which results in a rank of 3.

On other platforms, where the original test passes, the singular values are

[0.7, 0.14, 1.4e-17]

which results in a rank of 2. This is by design, as stated in the reference paper on page 12, where the rank should decrease when Xk gets updated.

Xk is updated here:

Xk -= np.outer(x_scores, x_loadings)

Given this, I do not think it's about poorly conditioned matrices. It is our usage of pinv2 that was flaky.
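The platform flip can be reproduced from the reported singular values alone. A minimal sketch follows; the cutoff rule `max(m, n) * eps * s.max()` is an assumption modeled on the default relative tolerance used by SVD-based routines such as numpy's `matrix_rank`, not scipy's exact pinv2 logic:

```python
import numpy as np

eps = np.finfo(np.float64).eps

# Singular values of Xk reported in this thread, for a 4x3 matrix whose
# numerical rank after one deflation should be 2.
s_windows = np.array([0.7, 0.14, 9e-16])    # Windows / PyPI numpy build
s_other = np.array([0.7, 0.14, 1.4e-17])    # other platforms

def numerical_rank(s, m, n):
    # Assumed default-style cutoff: singular values at or below
    # max(m, n) * eps * s.max() are treated as zero.
    tol = max(m, n) * eps * s.max()
    return int((s > tol).sum())

print(numerical_rank(s_windows, 4, 3))  # 3: 9e-16 sits just above the cutoff
print(numerical_rank(s_other, 4, 3))    # 2: 1.4e-17 is dropped
```

With this rule the tolerance is about 6.2e-16, so the Windows value 9e-16 lands just above the cutoff and the matrix is treated as rank 3 instead of 2, which is exactly the flakiness described above.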

Member

> when pinv2 is called. On Windows, with the PyPI build of numpy, the singular values for Xk are:

After which iteration? If this happens at the first iter, then this is indeed a problem of condition number, I believe.

Member

(by first iteration I mean the first component, or equivalently the first call to _get_first_singular_vectors_power_method)

Member Author

> After which iteration? If this happens at the first iter, then this is indeed a problem of condition number, I believe.

This happened on the second call to _get_first_singular_vectors_power_method.

Member

I think it would still be valuable to identify in which cases there's a stability improvement. Clearly, not all users will be impacted by this

Member Author

I think this happens randomly. I updated test_scale_and_stability to generate random ys and found that there were seeds that fail on my machine and on Windows.

From a user point of view, for some datasets they may have gotten an incorrect model.

Member

Both random y and X? Because otherwise the problem might just be coming from X.

I still believe this is related to poorly conditioned matrices, because what you pass to _get_first_singular_vectors_power_method at iteration i are the deflated matrices from iteration i - 1 (deflation = subtracting a rank-1 matrix to obtain a rank (r - 1) matrix).

I'm quite certain that the name of the parameter to pinv (cond or rcond) is related to the condition number
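The rank-reduction effect of deflation can be sketched numerically. Note the scores/loadings below are a stand-in (the leading singular component of X) rather than the actual PLS power-method output:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(10, 4)  # generic data matrix, full column rank (rank 4)

# Stand-in for x_scores / x_loadings: the leading singular component of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
x_scores = U[:, 0] * s[0]
x_loadings = Vt[0]

# Deflation: subtracting this rank-1 outer product removes one dimension,
# so the deflated matrix has rank r - 1.
X_deflated = X - np.outer(x_scores, x_loadings)

print(np.linalg.matrix_rank(X))           # 4
print(np.linalg.matrix_rank(X_deflated))  # 3
```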

Member Author

> Both random y and X? Because otherwise the problem might just be coming from X.

Updated with randomly generated X and y. The condition number always gets much bigger when the matrix becomes a rank r - 1 matrix.

When the rank becomes r - 1, pinv2 by default has trouble determining the rank of the matrix:

https://github.com/scipy/scipy/blob/8e30f7797bd1ee442f4f1a25172e4402521c1e16/scipy/linalg/basic.py#L1457-L1466

Setting `cond=10 * eps` helps in this case.

@thomasjpfan (Member Author), Nov 6, 2020

Specifically, when pinv2 thinks a rank r - 1 matrix is rank r, it divides by a very small singular value when computing the inverse.
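What "divides by a very small singular value" looks like can be sketched with numpy. This is an illustration, not the PR's code, and it assumes `np.linalg.pinv`'s `rcond` plays the same role as pinv2's `cond` (the exact scipy defaults differ):

```python
import numpy as np

# A 4x3 matrix with the singular values reported in this thread; the third
# one is floating-point noise, so the numerical rank is really 2.
A = np.zeros((4, 3))
A[0, 0], A[1, 1], A[2, 2] = 0.7, 0.14, 9e-16

# Too-permissive cutoff: the noise value is kept, and the pseudo-inverse
# divides by ~9e-16, producing entries around 1e15.
loose = np.linalg.pinv(A, rcond=1e-18)

# The fix discussed above, transposed to numpy: drop singular values below
# 10 * eps (relative to the largest), zeroing out the noise direction.
tight = np.linalg.pinv(A, rcond=10 * np.finfo(A.dtype).eps)

print(np.abs(loose).max())  # ~1.1e15 (≈ 1 / 9e-16)
print(np.abs(tight).max())  # ~7.14  (= 1 / 0.14)
```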

@ogrisel (Member), Nov 13, 2020

I think this is not just for CCA but also for the related PLSCanonical model, isn't it? PLSSVD uses a different fit method.

PLSCanonical uses mode="A", which does not rely on those pinv2 calls. So the what's new entry is correct.

by `Thomas Fan`_.

- |API| For :class:`decomposition.NMF`,
the `init` value, when 'init=None' and
n_components <= min(n_samples, n_features) will be changed from
7 changes: 3 additions & 4 deletions sklearn/cross_decomposition/_pls.py
@@ -12,7 +12,6 @@
from scipy.linalg import pinv2, svd

from ..base import BaseEstimator, RegressorMixin, TransformerMixin
from ..base import _UnstableArchMixin
from ..base import MultiOutputMixin
from ..utils import check_array, check_consistent_length
from ..utils.extmath import svd_flip
@@ -45,8 +44,8 @@ def _get_first_singular_vectors_power_method(X, Y, mode="A", max_iter=500,
# As a result, and as detailed in the Wegelin's review, CCA (i.e. mode
# B) will be unstable if n_features > n_samples or n_targets >
# n_samples
X_pinv = pinv2(X, check_finite=False)
Y_pinv = pinv2(Y, check_finite=False)
X_pinv = pinv2(X, check_finite=False, cond=10*eps)
Y_pinv = pinv2(Y, check_finite=False, cond=10*eps)

for i in range(max_iter):
if mode == "B":
@@ -683,7 +682,7 @@ def __init__(self, n_components=2, *, scale=True, algorithm="nipals",
max_iter=max_iter, tol=tol, copy=copy)


class CCA(_UnstableArchMixin, _PLS):
class CCA(_PLS):
"""Canonical Correlation Analysis, also known as "Mode B" PLS.

Read more in the :ref:`User Guide <cross_decomposition>`.
89 changes: 46 additions & 43 deletions sklearn/cross_decomposition/tests/test_pls.py
@@ -382,54 +382,57 @@ def test_copy(Est):
pls.predict(X.copy(), copy=False))


@pytest.mark.xfail
@pytest.mark.parametrize('Est', (CCA, PLSCanonical, PLSRegression, PLSSVD))
def test_scale_and_stability(Est):
# scale=True is equivalent to scale=False on centered/scaled data
# This allows to check numerical stability over platforms as well

def _generate_test_scale_and_stability_datasets():
"""Generate dataset for test_scale_and_stability"""
# dataset for non-regression 7818
rng = np.random.RandomState(0)

d = load_linnerud()
X1 = d.data
Y1 = d.target
# causes X[:, -1].std() to be zero
X1[:, -1] = 1.0

# From bug #2821
# Test with X2, Y2 s.t. clf.x_score[:, 1] == 0, clf.y_score[:, 1] == 0
# This test robustness of algorithm when dealing with value close to 0
X2 = np.array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
Y2 = np.array([[0.1, -0.2],
[0.9, 1.1],
[6.2, 5.9],
[11.9, 12.3]])

# Non-regression for https://github.com/scikit-learn/scikit-learn/pull/7819
n_samples = 1000
n_targets = 5
n_features = 10
Q = rng.randn(n_targets, n_features)
Y3 = rng.randn(n_samples, n_targets)
X3 = np.dot(Y3, Q) + 2 * rng.randn(n_samples, n_features) + 1
X3 *= 1000

for (X, Y) in [(X1, Y1), (X2, Y2), (X3, Y3)]:
X_std = X.std(axis=0, ddof=1)
X_std[X_std == 0] = 1
Y_std = Y.std(axis=0, ddof=1)
Y_std[Y_std == 0] = 1
X_s = (X - X.mean(axis=0)) / X_std
Y_s = (Y - Y.mean(axis=0)) / Y_std

X_score, Y_score = Est(scale=True).fit_transform(X, Y)
X_s_score, Y_s_score = Est(scale=False).fit_transform(X_s, Y_s)

assert_array_almost_equal(X_s_score, X_score)
assert_array_almost_equal(Y_s_score, Y_score)
Y = rng.randn(n_samples, n_targets)
X = np.dot(Y, Q) + 2 * rng.randn(n_samples, n_features) + 1
X *= 1000
yield X, Y

# Dataset where one of the features is constant
X, Y = load_linnerud(return_X_y=True)
# causes X[:, -1].std() to be zero
X[:, -1] = 1.0
yield X, Y

X = np.array([[0., 0., 1.],
[1., 0., 0.],
[2., 2., 2.],
[3., 5., 4.]])
Y = np.array([[0.1, -0.2],
[0.9, 1.1],
[6.2, 5.9],
[11.9, 12.3]])
yield X, Y

# Seeds that provide a non-regression test for #18746, where CCA fails
seeds = [530, 741]
for seed in seeds:
rng = np.random.RandomState(seed)
X = rng.randn(4, 3)
Y = rng.randn(4, 2)
yield X, Y


@pytest.mark.parametrize('Est', (CCA, PLSCanonical, PLSRegression, PLSSVD))
@pytest.mark.parametrize('X, Y', _generate_test_scale_and_stability_datasets())
def test_scale_and_stability(Est, X, Y):
"""scale=True is equivalent to scale=False on centered/scaled data
This allows to check numerical stability over platforms as well"""

X_s, Y_s, *_ = _center_scale_xy(X, Y)

X_score, Y_score = Est(scale=True).fit_transform(X, Y)
X_s_score, Y_s_score = Est(scale=False).fit_transform(X_s, Y_s)

assert_allclose(X_s_score, X_score, atol=1e-4)
assert_allclose(Y_s_score, Y_score, atol=1e-4)


@pytest.mark.parametrize('Est', (PLSSVD, PLSCanonical, CCA))