
[MRG+1] Enforce deterministic output in kernel PCA #13241


Merged
merged 8 commits into from
Feb 26, 2019

Conversation

bellet
Contributor

@bellet bellet commented Feb 25, 2019

Fixes #8798

This PR adds a call to svd_flip to ensure that the output of KPCA is deterministic, as done for PCA.

Example code:

import numpy as np
from sklearn.decomposition import KernelPCA

np.random.seed(42)
data = np.random.rand(12).reshape(4, 3)

for i in range(10):
    kpca = KernelPCA(n_components=2, eigen_solver='arpack')
    print(kpca.fit_transform(data)[0])

Output before this PR:

[-0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]
[ 0.42930193 -0.11840723]
[0.42930193 0.11840723]
[-0.42930193 -0.11840723]

With this PR:

[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]
[-0.42930193 -0.11840723]

Some comments:

  • Calling svd_flip directly requires copying the fitted self.alphas_, which is not ideal. The alternative would be to add an equivalent of svd_flip for eigendecompositions (taking a single input instead of two). I can easily do this if that is the preferred solution.
  • This deterministic behavior does not seem to be tested for PCA, so I did not add any test here either. Can do if needed.
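To make the fix concrete, here is a minimal NumPy sketch of the sign convention that svd_flip enforces (flip_signs is a hypothetical helper, not part of scikit-learn): in each column, the entry with the largest absolute value is made positive, which resolves the sign ambiguity of eigenvectors.

```python
import numpy as np

def flip_signs(u):
    # Hypothetical helper mimicking svd_flip's u-based convention:
    # in each column, make the entry with the largest absolute value
    # positive, so the sign ambiguity of eigenvectors is resolved.
    max_abs_rows = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_rows, np.arange(u.shape[1])])
    return u * signs

# Two sign-ambiguous copies of the same components resolve to the
# same canonical representation after flipping.
u = np.array([[-0.42930193, -0.11840723],
              [ 0.25,        0.60      ]])
print(flip_signs(u))
print(flip_signs(-u))
```

Both calls print the same array, which is exactly the determinism this PR adds to KernelPCA.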

@agramfort
Member

I confirm that it fixes the problem, but can you still add a test and update what's new?

yes it should be done for PCA too.

@bellet
Contributor Author

bellet commented Feb 25, 2019

I confirm that it fixes the problem, but can you still add a test and update what's new?

yes it should be done for PCA too.

Done.

@@ -71,6 +71,23 @@ def test_kernel_pca_consistent_transform():
assert_array_almost_equal(transformed1, transformed2)


def test_kernel_pca_deterministic_output():
state = np.random.RandomState(0)
Member

state -> rng

it's how we usually name RandomState instances.

Member

@agramfort agramfort left a comment

besides my nitpick

good to go from my end

@bellet bellet changed the title [MRG] Enforce deterministic output in kernel PCA [MRG+1] Enforce deterministic output in kernel PCA Feb 25, 2019
@bellet
Contributor Author

bellet commented Feb 25, 2019

Done!

# flip eigenvectors' sign to enforce deterministic output
# note: copying the second element is needed so that both inputs do
# not refer to the same object
self.alphas_, _ = svd_flip(self.alphas_, self.alphas_.copy().T)
Member

would it work to do

svd_flip(self.alphas_, np.empty_like(self.alphas_).T)

to avoid a memory copy?

Contributor Author

Correct, since the value of the second argument does not affect our output. Done.
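This can be seen with a minimal NumPy mimic of svd_flip's u-based branch (svd_flip_sketch is an illustrative stand-in for sklearn.utils.extmath.svd_flip with u_based_decision=True, not the real implementation): the signs are computed from the first argument only, so passing an uninitialized array as the second argument gives the same flipped components as passing a copy.

```python
import numpy as np

def svd_flip_sketch(u, v):
    # Sketch of svd_flip with u_based_decision=True: the signs come
    # solely from u, so the contents of v never influence the returned u.
    max_abs_cols = np.argmax(np.abs(u), axis=0)
    signs = np.sign(u[max_abs_cols, np.arange(u.shape[1])])
    return u * signs, v * signs[:, np.newaxis]

u = np.array([[ 0.6, -0.8],
              [-0.8, -0.6]])
u_copy, _ = svd_flip_sketch(u, u.copy().T)            # copying variant
u_empty, _ = svd_flip_sketch(u, np.empty_like(u).T)   # no-copy variant
```

Since only the flipped first output is kept in KernelPCA, the no-copy variant saves an allocation without changing the result.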

transformed_X[i, :] = kpca.fit_transform(X)[0]
i += 1

assert np.isclose(transformed_X, transformed_X[0, :]).all()
Member

Better to use numpy.testing.assert_allclose.

Member

I find that understanding what is done in this test takes some effort, because we check equality across both different random seeds and solvers (and we store the results in a matrix with a non-standard (n_random_seeds*n_solvers, n_components) shape). Maybe something like,

    for solver in eigen_solver:
        transformed_X = np.zeros((10, 2))
        for i in range(10):
            kpca = KernelPCA(n_components=2, eigen_solver=solver,
                             random_state=i)
            transformed_X[i, :] = kpca.fit_transform(X)[0]

        assert_allclose(transformed_X, transformed_X[0, :])

could be easier?

Member

Also, is it needed to set the random state to i rather than just using random_state=rng?

Contributor Author

random_state=rng would use the same seed for all runs. The point here is to check that the ARPACK random state does not result in different solutions because of sign ambiguity.
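The sign ambiguity being tested can be illustrated directly with SciPy's ARPACK wrapper (an illustrative sketch, independent of scikit-learn): eigenvectors are only defined up to sign, so two runs with different random starting vectors v0 may return negated copies of the same component.

```python
import numpy as np
from scipy.sparse.linalg import eigsh

M = np.random.RandomState(0).rand(6, 6)
A = M + M.T  # make the matrix symmetric so eigsh applies

# Different random starting vectors: ARPACK may converge to either
# sign of the same eigenvector.
_, v1 = eigsh(A, k=1, v0=np.random.RandomState(1).rand(6))
_, v2 = eigsh(A, k=1, v0=np.random.RandomState(2).rand(6))

# The components agree up to sign:
print(np.allclose(np.abs(v1), np.abs(v2)))
```

This is why the test varies the random state across runs: it exercises different ARPACK starting vectors and checks that svd_flip makes the final output identical anyway.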

Contributor Author

I used random_state=i to show explicitly that the random state changes from one run to the next. But removing it would be valid as well, as the random state would still change.

Member

I may be wrong, but I don't agree that random_state=rng would use the same seed, unless rng is an int. Here, rng is a RandomState object (initialized a few lines above), so it has internal state. For instance:

~{main}$ ipython
Python 3.7.2 (default, Feb 11 2019, 11:12:39)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.2.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: rng = np.random.RandomState(0)

In [3]: rng.randn(3)
Out[3]: array([1.76405235, 0.40015721, 0.97873798])

In [4]: rng.randn(3)
Out[4]: array([ 2.2408932 ,  1.86755799, -0.97727788])

Contributor Author

You're right, I can switch to that

transformed_X[i, :] = pca.fit_transform(X)[0]
i += 1

assert np.isclose(transformed_X, transformed_X[0, :]).all()
Member

Same comments as above

# note: copying the second element is needed so that both inputs do
# not refer to the same object
self.alphas_, _ = svd_flip(self.alphas_,
                           np.empty_like(self.alphas_).T)
Member

awesome, thanks! maybe remove the note: above since it's no longer relevant.

@bellet
Contributor Author

bellet commented Feb 25, 2019

I have simplified the tests as suggested by @vene @rth

@vene
Member

vene commented Feb 25, 2019

LGTM pending CI pass! Thanks @bellet

@agramfort agramfort merged commit 8647658 into scikit-learn:master Feb 26, 2019
@agramfort
Member

thx @bellet

@rth
Member

rth commented Feb 26, 2019

Thanks !

Does this mean we can remove the _UnstableOn32BitMixin from the KernelPCA class and tests will pass on 32 bit arch as well? (The corresponding estimator tag is non_deterministic)

@ogrisel
Member

ogrisel commented Mar 5, 2019

The cron jobs for the nightly builds on Linux 32-bit pass:

https://travis-ci.org/MacPython/scikit-learn-wheels/builds

xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
* enforce deterministic output in kernel PCA

* add tests and update whats new

* replace state by rng

* simplified assert

* avoid copy

* clarify tests

* remove now useless comment

* use rng as seed everywhere
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019