Precision errors in KernelPCA #5970

Closed
hlin117 opened this issue Dec 7, 2015 · 9 comments

@hlin117
Contributor

hlin117 commented Dec 7, 2015

The following triggers an error. The random state is set, but there are still some precision errors.

>>> from sklearn.datasets import make_circles
>>> from sklearn.decomposition import KernelPCA
>>> from sklearn.utils.testing import assert_array_almost_equal
>>>
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> kpca = KernelPCA(random_state=0).fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53, random_state=0).fit(X_circle)
>>>
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])
AssertionError:
Arrays are not almost equal to 6 decimals

(mismatch 80.0%)
 x: array([[  4.52347019e-02,  -7.41626885e-02,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          9.93712254e-01,   0.00000000e+00,   0.00000000e+00,...
 y: array([[ 0.0452347 , -0.07416269, -0.0008139 , -0.00213575,  0.00693525,
        -0.02749284,  0.05635691,  0.09004436,  0.00364872,  0.01541384],
       [ 0.00916891,  0.00235399, -0.04403946,  0.02472058,  0.07725798,
        -0.1409346 ,  0.38463443,  0.22568812,  0.01437247,  0.1645341 ]])

Old version of the issue, without using a random state.

The following raises an error, but I don't think it should:

>>> from sklearn.datasets import make_circles
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> from sklearn.decomposition import KernelPCA
>>> kpca = KernelPCA().fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53).fit(X_circle)
>>> 
>>> from sklearn.utils.testing import assert_array_almost_equal
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])

AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 90.0%)
 x: array([[-0.045235, -0.074163,  0.      ,  0.      ,  0.      , -0.157852,
         0.      ,  0.      ,  0.      ,  0.      ],
       [-0.009169,  0.002354, -0.07098 ,  0.003515, -0.015677, -0.354438,
        -0.18636 ,  0.056072,  0.017528, -0.178867]])
 y: array([[ 0.045235, -0.074163, -0.005973,  0.005636, -0.000913, -0.031506,
         0.006255,  0.047652, -0.015241, -0.018319],
       [ 0.009169,  0.002354, -0.175217, -0.019426,  0.011081, -0.130316,
        -0.056403, -0.004699, -0.175928, -0.041456]])

You can see that some signs are flipped, some numbers are rounded off, etc. Unless I'm confused about the theory of kernel PCA, this shouldn't raise an error, right?

@mblondel, what do you think?

@lucasdavid
Contributor

I don't think so; ARPACK's result depends on the initial guess. Have you tried passing random_state=0 or eigen_solver='dense'? What happens then?
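
For example, something like this (a minimal sketch of both options; the dense solver has no random initialization, so it needs no random_state):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(400, random_state=0, factor=0.3, noise=0.15)

# Option 1: fix ARPACK's random starting vector with random_state.
kpca_arpack = KernelPCA(n_components=53, eigen_solver='arpack',
                        random_state=0).fit(X)

# Option 2: use the dense LAPACK solver, which is deterministic.
kpca_dense = KernelPCA(n_components=53, eigen_solver='dense').fit(X)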

@mblondel
Member

I agree with @lucasdavid that this is most likely the culprit.

@hlin117
Contributor Author

hlin117 commented Dec 14, 2015

@lucasdavid: Well, it looks like passing in eigen_solver="dense" doesn't work =\

>>> from sklearn.datasets import make_circles
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> from sklearn.decomposition import KernelPCA
>>> kpca = KernelPCA(eigen_solver='dense').fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53, eigen_solver='dense').fit(X_circle)
>>> from sklearn.utils.testing import assert_array_almost_equal
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])

AssertionError:
Arrays are not almost equal to 6 decimals

(mismatch 90.0%)
 x: array([[-0.045235, -0.074163,  0.      ,  0.      ,  0.      , -0.157852,
         0.      ,  0.      ,  0.      ,  0.      ],
       [-0.009169,  0.002354, -0.07098 ,  0.003515, -0.015677, -0.354438,
        -0.18636 ,  0.056072,  0.017528, -0.178867]])
 y: array([[ 0.045235, -0.074163, -0.005973,  0.005636, -0.000913, -0.031506,
         0.006255,  0.047652, -0.015241, -0.018319],
       [ 0.009169,  0.002354, -0.175217, -0.019426,  0.011081, -0.130316,
        -0.056403, -0.004699, -0.175928, -0.041456]])

Also, it looks like KernelPCA doesn't accept a random_state keyword argument! @mblondel: Do you think this would be a good feature request?

@mblondel
Member

mblondel commented Dec 14, 2015 via email

@lucasdavid
Contributor

The docs seem to be outdated; if you look at the code, you'll see the parameter already exists.
Also, I've noticed that you are asking for n_components=53, but the circles dataset only has two features. I think it would be a cleaner test to run one of them with n_components=1 and the other with n_components=2, and then compare the first alphas, as in the sketch below.
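
A rough sketch of what I mean (using the lambdas_/alphas_ attribute names of the scikit-learn version at the time):

from numpy.testing import assert_array_almost_equal
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(400, random_state=0, factor=0.3, noise=0.15)

# Both fits share the leading eigenpair, so compare only that.
kpca1 = KernelPCA(n_components=1, random_state=0).fit(X)
kpca2 = KernelPCA(n_components=2, random_state=0).fit(X)

assert_array_almost_equal(kpca1.lambdas_[0], kpca2.lambdas_[0])
assert_array_almost_equal(kpca1.alphas_[:, 0], kpca2.alphas_[:, 0])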
That said, I ran your code with random_state=0 and it still fails (showing the first column of alphas_):

(mismatch 100.0%)
 x: array([[  4.523470e-02],
       [  9.168905e-03],
       [  1.961695e-02],...
 y: array([[ -4.523470e-02],
       [ -9.168905e-03],
       [ -1.961695e-02],...

Please let me know if you find out what's happening; this is really interesting.

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

Thanks for the check, @lucasdavid.

Also, I've noticed that you are asking for n_components=53, but the circles dataset only has two features. I think it would be a cleaner test to run one of them with n_components=1 and the other with n_components=2, and then compare the first alphas.

I wanted to use n_components=53 to capture a lot of the data in the output. The 10 numbers I chose were totally arbitrary.

I think the negative values in that test are permissible. According to the PCA docs:

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
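
So a fair comparison probably shouldn't penalize a per-column sign flip. Something like this helper would neutralize the signs before asserting (just an illustrative sketch; assert_equal_up_to_sign is not part of scikit-learn):

import numpy as np

def assert_equal_up_to_sign(A, B, decimal=6):
    """Compare two coefficient matrices column-wise, ignoring sign flips."""
    # Align each column of B with the corresponding column of A via the
    # sign of their inner product (assumes no pair is exactly orthogonal).
    signs = np.sign(np.einsum('ij,ij->j', A, B))
    np.testing.assert_array_almost_equal(A, B * signs, decimal=decimal)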

However, I'm still getting precision errors from this:

>>> from sklearn.datasets import make_circles
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> from sklearn.decomposition import KernelPCA
>>> kpca = KernelPCA(random_state=0).fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53, random_state=0).fit(X_circle)
>>> from sklearn.utils.testing import assert_array_almost_equal
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])
AssertionError:
Arrays are not almost equal to 6 decimals

(mismatch 80.0%)
 x: array([[  4.52347019e-02,  -7.41626885e-02,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          9.93712254e-01,   0.00000000e+00,   0.00000000e+00,...
 y: array([[ 0.0452347 , -0.07416269, -0.0008139 , -0.00213575,  0.00693525,
        -0.02749284,  0.05635691,  0.09004436,  0.00364872,  0.01541384],
       [ 0.00916891,  0.00235399, -0.04403946,  0.02472058,  0.07725798,
        -0.1409346 ,  0.38463443,  0.22568812,  0.01437247,  0.1645341 ]])

I'm going to update the OP with this error.

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

Does this issue seem worthy of the "bug" label?

@lucasdavid
Contributor

I can't see where the second row of your x array starts, but since the same thing is happening for me, I'm going to assume you're getting the same alphas_ for the first two columns. That is, assert_array_almost_equal(kpca.alphas_[:, :2], kpca2.alphas_[:, :2]) passes, whereas assert_array_almost_equal(kpca.alphas_[:, :3], kpca2.alphas_[:, :3]) fails:

(mismatch 33.33333333333333%)
 x: array([[ 0.045235, -0.074163, -0.01721 ],
       [ 0.009169,  0.002354, -0.006979],
       [ 0.019617, -0.030738,  0.062522],...
 y: array([[ 0.045235, -0.074163, -0.006221],
       [ 0.009169,  0.002354, -0.005059],
       [ 0.019617, -0.030738, -0.227136],...

Does this issue seem ...

I haven't studied KernelPCA thoroughly, but I know the alphas are the coefficients used to linearly combine the projections of the samples to form the eigenvectors (disregard that; still, I maintain my guess below). Considering that only two eigenvectors are sufficient to generate the original space of circles, we only need "correct" values in the first two columns of alphas_ to generate the correct first two eigenvectors (eq. 5).

Additionally, replacing circles with s_curve (3D) makes the first 3 columns of alphas_ equal (but not the following ones). Replacing circles with iris (4D) makes the first 4 columns of alphas_ equal (but not the following ones).
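
A rough sketch of that experiment (again using the alphas_ attribute of the scikit-learn version at the time, and tolerating per-column sign flips):

import numpy as np
from sklearn.datasets import make_circles, make_s_curve, load_iris
from sklearn.decomposition import KernelPCA

datasets = {
    'circles (2D)': make_circles(400, random_state=0, factor=0.3, noise=0.15)[0],
    's_curve (3D)': make_s_curve(400, random_state=0)[0],
    'iris (4D)': load_iris().data,
}

for name, X in datasets.items():
    a1 = KernelPCA(random_state=0).fit(X).alphas_
    a2 = KernelPCA(n_components=53, random_state=0).fit(X).alphas_
    matching = 0
    for j in range(min(a1.shape[1], a2.shape[1])):
        # A column counts as matching even if its sign is flipped.
        if (np.allclose(a1[:, j], a2[:, j], atol=1e-6)
                or np.allclose(a1[:, j], -a2[:, j], atol=1e-6)):
            matching += 1
        else:
            break
    print(name, '-> first', matching, 'columns of alphas_ agree')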

My only guess is that the algorithm was implemented to optimize at most N eigenvalues/eigenvectors (and therefore alphas_), since one would hardly (I think?) use dimensionality reduction to move from 2D to a higher-dimensional space. Hence I BELIEVE that there's no bug.

Let me know if you don't agree with this. Also, the algorithm didn't flip the signs once during my tests. Was it updated in the last few days?

@thomasjpfan
Member

I think this was fixed in #13241. lambdas_ and alphas_ are deprecated now, but the corresponding code snippet passes:

from numpy.testing import assert_allclose
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)

kpca = KernelPCA(random_state=0).fit(X_circle)
kpca2 = KernelPCA(n_components=53, random_state=0).fit(X_circle)

# kpca keeps only the 2 non-zero components, so compare against
# the first 2 components of kpca2.
assert_allclose(kpca.eigenvalues_, kpca2.eigenvalues_[:2])
assert_allclose(kpca.eigenvectors_, kpca2.eigenvectors_[:, :2])

The [:2] slicing at the end selects the first 2 components of kpca2 so they can be compared to kpca, which only has 2 components. With that in mind, I am closing this issue.
