Precision errors in KernelPCA #5970

Closed
hlin117 opened this issue Dec 7, 2015 · 9 comments

@hlin117
Contributor

hlin117 commented Dec 7, 2015

The following triggers an error. The random state is set, but there are still some precision errors.

>>> from sklearn.datasets import make_circles
>>> from sklearn.decomposition import KernelPCA
>>> from sklearn.utils.testing import assert_array_almost_equal
>>>
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> kpca = KernelPCA(random_state=0).fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53, random_state=0).fit(X_circle)
>>>
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])
AssertionError:
Arrays are not almost equal to 6 decimals

(mismatch 80.0%)
 x: array([[  4.52347019e-02,  -7.41626885e-02,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          9.93712254e-01,   0.00000000e+00,   0.00000000e+00,...
 y: array([[ 0.0452347 , -0.07416269, -0.0008139 , -0.00213575,  0.00693525,
        -0.02749284,  0.05635691,  0.09004436,  0.00364872,  0.01541384],
       [ 0.00916891,  0.00235399, -0.04403946,  0.02472058,  0.07725798,
        -0.1409346 ,  0.38463443,  0.22568812,  0.01437247,  0.1645341 ]])

Old version of the issue, without using a random state.

The following raises an error, but I don't think it should:

>>> from sklearn.datasets import make_circles
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> from sklearn.decomposition import KernelPCA
>>> kpca = KernelPCA().fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53).fit(X_circle)
>>> 
>>> from sklearn.utils.testing import assert_array_almost_equal
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])

AssertionError: 
Arrays are not almost equal to 6 decimals

(mismatch 90.0%)
 x: array([[-0.045235, -0.074163,  0.      ,  0.      ,  0.      , -0.157852,
         0.      ,  0.      ,  0.      ,  0.      ],
       [-0.009169,  0.002354, -0.07098 ,  0.003515, -0.015677, -0.354438,
        -0.18636 ,  0.056072,  0.017528, -0.178867]])
 y: array([[ 0.045235, -0.074163, -0.005973,  0.005636, -0.000913, -0.031506,
         0.006255,  0.047652, -0.015241, -0.018319],
       [ 0.009169,  0.002354, -0.175217, -0.019426,  0.011081, -0.130316,
        -0.056403, -0.004699, -0.175928, -0.041456]])

You can see that some signs are flipped, some numbers are rounded off, etc. Unless I'm confused about the theory of kernel PCA, this shouldn't raise an error, right?

@mblondel, what do you think?

@lucasdavid
Contributor

I don't think so; ARPACK's result depends on the initial guess. Have you tried passing random_state=0 or eigen_solver='dense'? What happens then?
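
For example, something like this (a minimal sketch of both options; the dense solver has no random initialization, so it needs no random_state):

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(400, random_state=0, factor=0.3, noise=0.15)

# Option 1: fix ARPACK's random starting vector with random_state.
kpca_arpack = KernelPCA(n_components=53, eigen_solver='arpack',
                        random_state=0).fit(X)

# Option 2: use the dense LAPACK solver, which is deterministic.
kpca_dense = KernelPCA(n_components=53, eigen_solver='dense').fit(X)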

@mblondel
Member

I agree with @lucasdavid that this is most likely the culprit.

@hlin117
Contributor Author

hlin117 commented Dec 14, 2015

@lucasdavid: Well, it looks like passing in eigen_solver="dense" doesn't work =\

>>> from sklearn.datasets import make_circles
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> from sklearn.decomposition import KernelPCA
>>> kpca = KernelPCA(eigen_solver='dense').fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53, eigen_solver='dense').fit(X_circle)
>>> from sklearn.utils.testing import assert_array_almost_equal
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])

AssertionError:
Arrays are not almost equal to 6 decimals

(mismatch 90.0%)
 x: array([[-0.045235, -0.074163,  0.      ,  0.      ,  0.      , -0.157852,
         0.      ,  0.      ,  0.      ,  0.      ],
       [-0.009169,  0.002354, -0.07098 ,  0.003515, -0.015677, -0.354438,
        -0.18636 ,  0.056072,  0.017528, -0.178867]])
 y: array([[ 0.045235, -0.074163, -0.005973,  0.005636, -0.000913, -0.031506,
         0.006255,  0.047652, -0.015241, -0.018319],
       [ 0.009169,  0.002354, -0.175217, -0.019426,  0.011081, -0.130316,
        -0.056403, -0.004699, -0.175928, -0.041456]])

Also, it looks like KernelPCA doesn't accept a random_state keyword argument! @mblondel: Do you think this would be a good feature request?

@mblondel
Member

mblondel commented Dec 14, 2015 via email

@lucasdavid
Contributor

The docs seem to be outdated; if you look at the code, you'll see the parameter already exists.
Also, I've noticed that you are asking for n_components=53, but the circles dataset only has two features. I think it would be a cleaner test to run one of them with n_components=1 and the other with n_components=2, and then compare the first alphas, as in the sketch below.
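
A rough sketch of what I mean (using the lambdas_/alphas_ attribute names of the scikit-learn version at the time):

from numpy.testing import assert_array_almost_equal
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(400, random_state=0, factor=0.3, noise=0.15)

# Both fits share the leading eigenpair, so compare only that.
kpca1 = KernelPCA(n_components=1, random_state=0).fit(X)
kpca2 = KernelPCA(n_components=2, random_state=0).fit(X)

assert_array_almost_equal(kpca1.lambdas_[0], kpca2.lambdas_[0])
assert_array_almost_equal(kpca1.alphas_[:, 0], kpca2.alphas_[:, 0])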
That said, I ran your code with random_state=0 and it still fails (showing the first column of alphas_):

(mismatch 100.0%)
 x: array([[  4.523470e-02],
       [  9.168905e-03],
       [  1.961695e-02],...
 y: array([[ -4.523470e-02],
       [ -9.168905e-03],
       [ -1.961695e-02],...

Please let me know if you find out what's happening; this is really interesting.

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

Thanks for the check, @lucasdavid.

Also, I've noticed that you are asking for n_components=53, but the circles dataset only has two features. I think it would be a cleaner test to run one of them with n_components=1 and the other with n_components=2, and then compare the first alphas.

I wanted to use n_components=53 to capture a lot of the data in the output. The 10 numbers I chose were totally arbitrary.

I think the negative values in that test are permissible. According to the PCA docs:

Due to implementation subtleties of the Singular Value Decomposition (SVD), which is used in this implementation, running fit twice on the same matrix can lead to principal components with signs flipped (change in direction). For this reason, it is important to always use the same estimator object to transform data in a consistent fashion.
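
So a fair comparison probably shouldn't penalize a per-column sign flip. Something like this helper would neutralize the signs before asserting (just an illustrative sketch; assert_equal_up_to_sign is not part of scikit-learn):

import numpy as np

def assert_equal_up_to_sign(A, B, decimal=6):
    """Compare two coefficient matrices column-wise, ignoring sign flips."""
    # Align each column of B with the corresponding column of A via the
    # sign of their inner product (assumes no pair is exactly orthogonal).
    signs = np.sign(np.einsum('ij,ij->j', A, B))
    np.testing.assert_array_almost_equal(A, B * signs, decimal=decimal)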

However, I'm still getting precision errors from this:

>>> from sklearn.datasets import make_circles
>>> X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)
>>> from sklearn.decomposition import KernelPCA
>>> kpca = KernelPCA(random_state=0).fit(X_circle)
>>> kpca2 = KernelPCA(n_components=53, random_state=0).fit(X_circle)
>>> from sklearn.utils.testing import assert_array_almost_equal
>>> assert_array_almost_equal(kpca.lambdas_[:50], kpca2.lambdas_[:50])
>>> assert_array_almost_equal(kpca.alphas_[:2, :10], kpca2.alphas_[:2, :10])
AssertionError:
Arrays are not almost equal to 6 decimals

(mismatch 80.0%)
 x: array([[  4.52347019e-02,  -7.41626885e-02,   0.00000000e+00,
          0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
          9.93712254e-01,   0.00000000e+00,   0.00000000e+00,...
 y: array([[ 0.0452347 , -0.07416269, -0.0008139 , -0.00213575,  0.00693525,
        -0.02749284,  0.05635691,  0.09004436,  0.00364872,  0.01541384],
       [ 0.00916891,  0.00235399, -0.04403946,  0.02472058,  0.07725798,
        -0.1409346 ,  0.38463443,  0.22568812,  0.01437247,  0.1645341 ]])

I'm going to update the OP with this error.

@hlin117
Contributor Author

hlin117 commented Dec 15, 2015

Does this issue seem worthy of the "bug" label?

@lucasdavid
Contributor

I can't see where the second row of your x array starts, but since the same thing is happening for me, I'm going to assume you're getting the same alphas_ for the first two columns. That is, assert_array_almost_equal(kpca.alphas_[:, :2], kpca2.alphas_[:, :2]) passes, whereas assert_array_almost_equal(kpca.alphas_[:, :3], kpca2.alphas_[:, :3]) fails:

(mismatch 33.33333333333333%)
 x: array([[ 0.045235, -0.074163, -0.01721 ],
       [ 0.009169,  0.002354, -0.006979],
       [ 0.019617, -0.030738,  0.062522],...
 y: array([[ 0.045235, -0.074163, -0.006221],
       [ 0.009169,  0.002354, -0.005059],
       [ 0.019617, -0.030738, -0.227136],...

Does this issue seem ...

I haven't studied KernelPCA thoroughly, but I know the alphas are the coefficients used to linearly combine the projections of the samples to form the eigenvectors (disregard that; still, I maintain my guess below). Considering that only two eigenvectors are sufficient to generate the original space of circles, we only need "correct" values in the first two columns of alphas_ to generate the correct first two eigenvectors (eq. 5).

Additionally, replacing circles with s_curve (3D) makes the first 3 columns of alphas_ equal (but not the following ones). Replacing circles with iris (4D) makes the first 4 columns of alphas_ equal (but not the following ones).
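
A rough sketch of that experiment (again using the alphas_ attribute of the scikit-learn version at the time, and tolerating per-column sign flips):

import numpy as np
from sklearn.datasets import make_circles, make_s_curve, load_iris
from sklearn.decomposition import KernelPCA

datasets = {
    'circles (2D)': make_circles(400, random_state=0, factor=0.3, noise=0.15)[0],
    's_curve (3D)': make_s_curve(400, random_state=0)[0],
    'iris (4D)': load_iris().data,
}

for name, X in datasets.items():
    a1 = KernelPCA(random_state=0).fit(X).alphas_
    a2 = KernelPCA(n_components=53, random_state=0).fit(X).alphas_
    matching = 0
    for j in range(min(a1.shape[1], a2.shape[1])):
        # A column counts as matching even if its sign is flipped.
        if (np.allclose(a1[:, j], a2[:, j], atol=1e-6)
                or np.allclose(a1[:, j], -a2[:, j], atol=1e-6)):
            matching += 1
        else:
            break
    print(name, '-> first', matching, 'columns of alphas_ agree')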

My only guess is that the algorithm was implemented to optimize at most N eigenvalues/eigenvectors (and therefore alphas_), since one would hardly (I think?) use dimensionality reduction to move from 2D to a higher-dimensional space. Hence I BELIEVE that there's no bug.

Let me know if you don't agree with this. Also, the algorithm didn't flip the signs once during my tests. Was it updated in the last few days?

@thomasjpfan
Member

I think this was fixed in #13241. lambdas_ and alphas_ are deprecated now, but the corresponding code snippet passes:

from numpy.testing import assert_allclose
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X_circle, y_circle = make_circles(400, random_state=0, factor=0.3, noise=0.15)

kpca = KernelPCA(random_state=0).fit(X_circle)
kpca2 = KernelPCA(n_components=53, random_state=0).fit(X_circle)

# kpca keeps only the 2 non-zero components, so compare against
# the first 2 components of kpca2.
assert_allclose(kpca.eigenvalues_, kpca2.eigenvalues_[:2])
assert_allclose(kpca.eigenvectors_, kpca2.eigenvectors_[:, :2])

The [:2] slicing at the end selects the first 2 components of kpca2 so they can be compared to kpca, which only has 2 components. With that in mind, I am closing this issue.
