
ENH: Faster Eigen Decomposition For Isomap & KernelPCA #31247


Status: Open · wants to merge 20 commits into base: main

Conversation

@yaichm (Contributor) commented Apr 24, 2025

Fixes #31246

Implemented _randomized_eigsh(selection="value") and integrated it into KernelPCA and Isomap.

  • Introduced a new eigenvalue decomposition mode, _randomized_eigsh(selection="value"), for faster computation.

  • Integrated this solver into both KernelPCA and Isomap as an alternative to the dense solvers.

  • Added comprehensive tests in test_extmath.py to validate the decomposition accuracy.

  • Benchmarked against existing solvers, comparing:

    • Execution time in KernelPCA and Isomap
    • Reconstruction error in Isomap

The benchmark result graphs comparing execution time and reconstruction error with existing solvers will be added in the comments below.
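For reference, a minimal usage sketch of the proposed option (this assumes this PR's branch is installed; the "randomized_value" solver string is this PR's proposal and does not exist in released scikit-learn):

    # Minimal sketch, assuming this PR's branch: "randomized_value" is the
    # proposed solver name, not a released scikit-learn option.
    from sklearn.datasets import make_swiss_roll
    from sklearn.decomposition import KernelPCA
    from sklearn.manifold import Isomap

    X, _ = make_swiss_roll(n_samples=2000, random_state=0)

    # KernelPCA with the proposed randomized eigenvalue solver
    kpca = KernelPCA(
        n_components=50, kernel="rbf", eigen_solver="randomized_value", random_state=0
    )
    X_kpca = kpca.fit_transform(X)

    # Isomap with the same proposed solver option
    iso = Isomap(n_components=50, eigen_solver="randomized_value")
    X_iso = iso.fit_transform(X)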

github-actions (bot) commented Apr 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: e8656f1.

@yaichm changed the title from "ENH: Faster Eigen Decomposition For & KernelPCA" to "ENH: Faster Eigen Decomposition For Isomap & KernelPCA" on Apr 24, 2025
@ogrisel (Member) commented Apr 25, 2025

Thanks for the PR @yaichm.

Please follow the instructions in the automated comment above to resolve the failing linter CI checks.

If you need help (e.g. if the instructions are not clear enough), please let us know with a specific description of what you attempted and the problem you faced.

The benchmark result graphs comparing execution time and reconstruction error with existing solvers will be added in the comments below.

Looking forward to it. Please feel free to ping me once done.

@Dlimim commented Apr 26, 2025

[Image: KPCA_Execution_Time_vs_components]

In this Kernel PCA benchmark (focus on the left side of the graph), our custom randomized_value solver shows better scalability compared to standard solvers when the number of components is large.

@Dlimim commented Apr 26, 2025

[Image: KPCA_Execution_Time_vs_Samples]

When increasing the number of samples, our randomized_value solver consistently achieves much lower execution times compared to the full solver. Even with large datasets, randomized_value remains highly efficient and scalable.

@Dlimim commented Apr 26, 2025

[Image: Isomap_Solvers_Comparaison]

In this Isomap benchmark, for a small number of components, both solvers show comparable execution times. However, as the number of components increases (> 10), randomized_value significantly outperforms the full solver, achieving much faster execution on large datasets.

@Dlimim commented Apr 26, 2025

[Image: Isomap_Solvers]

The projections obtained with the auto solver and the randomized solver are visually very similar, confirming that the randomized solver preserves the structure and quality of the embedding. This highlights its reliability alongside its faster execution.

@Dlimim commented Apr 26, 2025

[Image: Isomap_Reconstruction_Erreur]

The reconstruction error is very similar between the auto and randomized solvers across different sample sizes. This confirms that the randomized solver preserves the quality of the reconstruction while offering faster execution.

@yaichm (Contributor, Author) commented Apr 27, 2025

I’ve followed the automated instructions and fixed the linter issues.

The benchmark result graphs have also been added.

@ogrisel, feel free to review whenever you are available.

@smarie (Contributor) commented Apr 28, 2025

Thanks to the team for finalizing these results!

So, to summarize:

  • on KernelPCA, we see similar results with method 5.3 (randomized eigenvalues) as with the "trick" of using 4.3 (randomized SVD). This makes me quite confident that the implementation is correct.
  • on Isomap, where the "trick" could not be used (since the matrices are not guaranteed to be PSD; see the small illustration after this list):
    • when the number of components is low (2), there is no drawback: the speed is the same as with the current arpack-based method, and the result on the sample dataset is identical (same visual figure, same reconstruction error);
    • we see clear speed benefits when the number of components is high (right figure of this message, where 50 components are selected). @yaichm It would be interesting to see the reconstruction error corresponding to this situation, to make sure that the speed gain was not obtained at the cost of a higher reconstruction error.
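To make the PSD caveat concrete, here is a tiny numpy illustration (not from the PR) of why the randomized-SVD "trick" requires a PSD matrix: for a symmetric matrix, the singular values are the absolute values of the eigenvalues, so the signs are lost whenever the matrix is not PSD.

    import numpy as np

    M = np.diag([3.0, -5.0])                   # symmetric, but not PSD
    print(np.linalg.eigvalsh(M))               # [-5.  3.]  signed eigenvalues
    print(np.linalg.svd(M, compute_uv=False))  # [5. 3.]   the sign of -5 is lost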

@yaichm (Contributor, Author) commented Apr 28, 2025

As requested by @smarie, here is the image showing the reconstruction error for 50 components.
[Image: screenshot from 2025-04-28 showing the reconstruction error for 50 components]

@smarie (Contributor) commented May 5, 2025

@ogrisel I think the team is now ready for a first review (see summary #31247 (comment))

@@ -198,10 +198,6 @@ def test_randomized_eigsh(dtype):
# eigenvectors
assert eigvecs.shape == (4, 2)

# with 'value' selection method, the negative eigenvalue does not show up
Review comment (Contributor):

Please replace this section with:

    eigvals, eigvecs = _randomized_eigsh(X, n_components=2, selection="value")
    # eigenvalues
    assert eigvals.shape == (2,)
    assert_array_almost_equal(eigvals, [3.0, 1.0])  # signed ordering: positive eigenvalues first
    # eigenvectors
    assert eigvecs.shape == (4, 2)

Comment on lines +282 to +283
# make a random PSD matrix
X = make_sparse_spd_matrix(n_features, random_state=0)
Review comment (Contributor):

This is PSD, but we should also check the non-PSD case, so let's create a symmetric random matrix instead. If you wish, you can test both cases:

  • add a @pytest.mark.parametrize("is_psd", (True, False))
  • in the test, switch on if is_psd: to create the random X (see the sketch below).
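A minimal sketch of such a parametrized test (the test name, the n_features value, and the non-PSD construction are assumptions; selection="value" only exists on this PR's branch):

    import numpy as np
    import pytest
    from sklearn.datasets import make_sparse_spd_matrix
    from sklearn.utils.extmath import _randomized_eigsh


    @pytest.mark.parametrize("is_psd", (True, False))
    def test_randomized_eigsh_value_psd_and_non_psd(is_psd):
        n_features = 10
        if is_psd:
            # PSD case: sparse symmetric positive definite matrix
            X = make_sparse_spd_matrix(n_features, random_state=0)
        else:
            # non-PSD case: symmetrize a random matrix (mixed-sign eigenvalues)
            rng = np.random.RandomState(0)
            A = rng.randn(n_features, n_features)
            X = (A + A.T) / 2
        # selection="value" is implemented by this PR, not by released sklearn
        eigvals, eigvecs = _randomized_eigsh(X, n_components=2, selection="value")
        assert eigvals.shape == (2,)
        assert eigvecs.shape == (n_features, 2)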

Comment on lines +783 to +785
with bounded error. Unlike the 'module' strategy, it works efficiently with
non-positive semidefinite matrices, handling both positive and negative
eigenvalues directly.
Review comment (Contributor):

I would rather suggest a simpler comment here:

Suggested change
with bounded error. Unlike the 'module' strategy, it works efficiently with
non-positive semidefinite matrices, handling both positive and negative
eigenvalues directly.
with bounded error. Unlike the 'module' strategy, it returns the top `k` eigenvalues by decreasing value: all the positive ones first, then the negative ones if any.

And to add a more detailed comment in the Strategy "module" description:

Strategy 'module':
    (...existing definition...)
    Unlike the 'value' strategy, this returns the top `k` eigenvalues by decreasing **module**. Therefore, when `M` is non-positive semidefinite, large negative eigenvalues will be returned before small positive ones.
    When `M` is PSD, both strategies lead to the same results, as all eigenvalues are positive.
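To make the two orderings concrete, a small numpy illustration (not part of the PR) on a symmetric non-PSD matrix with eigenvalues {3, 1, -5}:

    import numpy as np

    vals = np.linalg.eigvalsh(np.diag([3.0, 1.0, -5.0]))   # symmetric, not PSD

    k = 2
    # 'value' ordering: decreasing signed value -> positive eigenvalues first
    by_value = np.sort(vals)[::-1][:k]                      # [3., 1.]
    # 'module' ordering: decreasing absolute value -> -5 comes before 1
    by_module = vals[np.argsort(np.abs(vals))[::-1][:k]]    # [-5., 3.]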

# -- eigenvectors comparison
assert eigvecs_lapack.shape == (n_features, k)
dummy_vecs = np.zeros_like(eigvecs).T
eigvecs, _ = svd_flip(eigvecs, dummy_vecs)
Review comment (Contributor):

Suggested change
eigvecs, _ = svd_flip(eigvecs, dummy_vecs)
# Fix the signs so that both solvers' eigenvectors have the same sign convention
eigvecs, _ = svd_flip(eigvecs, dummy_vecs)
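For context, a sketch of how this sign fix allows entrywise comparison of eigenvectors from two solvers, since eigenvectors are only defined up to sign (the align_signs helper is illustrative; only the public svd_flip is from scikit-learn):

    import numpy as np
    from sklearn.utils.extmath import svd_flip

    def align_signs(eigvecs):
        # svd_flip expects a (u, v) pair; a zero dummy v means only the column
        # signs of u are adjusted, based on each column's largest-magnitude entry
        u, _ = svd_flip(eigvecs, np.zeros_like(eigvecs).T)
        return u

    # eigvecs_a and eigvecs_b from two solvers can then be compared directly:
    # np.testing.assert_allclose(align_signs(eigvecs_a), align_signs(eigvecs_b))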

@@ -169,7 +171,7 @@ class Isomap(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
"n_neighbors": [Interval(Integral, 1, None, closed="left"), None],
"radius": [Interval(Real, 0, None, closed="both"), None],
"n_components": [Interval(Integral, 1, None, closed="left")],
"eigen_solver": [StrOptions({"auto", "arpack", "dense"})],
"eigen_solver": [StrOptions({"auto", "arpack", "dense", "randomized_value"})],
Review comment (Contributor):

"randomized" could probably be more straightforward as a parameter name.

@@ -57,6 +57,8 @@ class Isomap(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator):
'dense' : Use a direct solver (i.e. LAPACK)
for the eigenvalue decomposition.

'randomized_value' : Use randomized solver in order to reduce complexity.
Review comment (Contributor):

"randomized" could probably be more straightforward as a parameter name.

@@ -263,7 +263,9 @@ class KernelPCA(ClassNamePrefixFeaturesOutMixin, TransformerMixin, BaseEstimator
"kernel_params": [dict, None],
"alpha": [Interval(Real, 0, None, closed="left")],
"fit_inverse_transform": ["boolean"],
"eigen_solver": [StrOptions({"auto", "dense", "arpack", "randomized"})],
"eigen_solver": [
StrOptions({"auto", "dense", "arpack", "randomized", "randomized_value"})
Review comment (Contributor):

If we really add randomized_value to the KernelPCA solvers, it should be declared in the docstring of the parameters (as for Isomap). If we do this, it should be clear that we expect the results to be equivalent in speed and quality to the current "randomized" (SVD) method, except when a precomputed kernel matrix is provided that is actually not PSD. In that case, the current "randomized" may not lead to correct results, or may even raise an error (if a negative eigenvalue is returned by the solver).

@@ -363,6 +365,14 @@ def _fit_transform_in_place(self, K):
random_state=self.random_state,
selection="module",
)
elif eigen_solver == "randomized_value":
self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(
@smarie (Contributor) commented May 26, 2025:

Maybe we could add comments to this branch and the previous branch of the if, so that it is easier to remember the difference when maintaining the code:

# Use selection='module' for SVD decomposition (wrong if non-PSD precomputed kernel)

and

Suggested change
self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(
# Use selection='value' for direct eigenvalue decomposition (robust to non-PSD)
self.eigenvalues_, self.eigenvectors_ = _randomized_eigsh(

random_state=None,
):
"""
Approximate eigenvalue decomposition of a Hermitian matrix A ≈ U Λ U*.
Review comment (Contributor):

Question to core devs: is the symbol ≈ allowed in docstrings?

Suggested change
Approximate eigenvalue decomposition of a Hermitian matrix A ≈ U Λ U*.
Approximate eigenvalue decomposition of a Hermitian matrix A ≈ U Λ U*.

**Halko, Martinsson, and Tropp (2011)**: *Finding Structure with Randomness: Probabilistic Algorithms for Constructing
Approximate Matrix Decompositions*. This method provides a faster alternative to existing eigen decomposition techniques.
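For readers unfamiliar with the reference, here is a minimal self-contained sketch of a Halko-style randomized Hermitian eigensolver (illustrative only: the function name and defaults are made up, and this is not the PR's implementation):

    import numpy as np

    def randomized_eigh_sketch(A, k, n_oversamples=10, n_iter=4, random_state=0):
        """Approximate the top-k eigenpairs (by signed value) of symmetric A."""
        rng = np.random.RandomState(random_state)
        n = A.shape[0]
        # Range finder: capture the dominant subspace with a random projection,
        # refined by a few QR-normalized power iterations
        Q = rng.normal(size=(n, k + n_oversamples))
        for _ in range(n_iter):
            Q, _ = np.linalg.qr(A @ Q)
        # Rayleigh-Ritz: solve the small projected eigenproblem exactly
        eigvals, V = np.linalg.eigh(Q.T @ A @ Q)   # ascending order
        # Keep the k largest by signed value and map back to the full space
        idx = np.argsort(eigvals)[::-1][:k]
        return eigvals[idx], Q @ V[:, idx]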

- Integrated :func:`eigen_decomposition_one_pass` into :class:`sklearn.manifold.Isomap` and
  :class:`sklearn.decomposition.KernelPCA` as an additional option for eigen decomposition.
Review comment (Contributor):

Suggested change
- Integrated :func:`eigen_decomposition_one_pass` into :class:`sklearn.manifold.Isomap` and
- Integrated :func:`randomized_eigen_decomposition` into :class:`sklearn.manifold.Isomap` and


- Added a test suite comparing the new method to existing solvers (:obj:`arpack`, :obj:`dense`, etc.), ensuring numerical
Review comment (Contributor):

Suggested change
- Added a test suite comparing the new method to existing solvers (:obj:`arpack`, :obj:`dense`, etc.), ensuring numerical
- Added a test suite comparing the new method to existing solvers (``'arpack'``, ``'dense'``, etc.), ensuring numerical

Comment on lines +16 to +17
by :user: `Sylvain Marié<@smarie>`, `Mohamed yaich<@yaichm>`, `Oussama Er-rabie<@eroussama>`, `Mohamed Dlimi<@Dlimim>`,
`Hamza Zeroual<@HamzaLuffy>` and `Amine Hannoun<@AmineHannoun>`.
@smarie (Contributor) commented May 26, 2025:

You guys did most of the work and deserve to appear first :)

Suggested change
by :user: `Sylvain Marié<@smarie>`, `Mohamed yaich<@yaichm>`, `Oussama Er-rabie<@eroussama>`, `Mohamed Dlimi<@Dlimim>`,
`Hamza Zeroual<@HamzaLuffy>` and `Amine Hannoun<@AmineHannoun>`.
by `Mohamed yaich<@yaichm>`, `Oussama Er-rabie<@eroussama>`, `Mohamed Dlimi<@Dlimim>`,
`Hamza Zeroual<@HamzaLuffy>`, `Amine Hannoun<@AmineHannoun>` and :user: `Sylvain Marié<@smarie>`.

@@ -0,0 +1,17 @@
Feature
------------

Review comment (Contributor):

Generally, with this changelog I get the impression that it is too big/detailed compared to what the sklearn core team expects. @lesteve or @ogrisel, let us know if there is a guideline here or a reference example PR doing it right, or feel free to propose an alternative text directly.


What you can observe:
----------------------
- The `auto` solver provides a reference solution.
Review comment (Contributor):

The 'auto' behaviour should rather be improved to rely on some heuristic in order to solve the problem faster automatically. See what has been done in #27491
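A hypothetical sketch of what such a heuristic could look like (thresholds and branching are illustrative only; see #27491 for the approach actually taken):

    def choose_eigen_solver(n_samples, n_components):
        # Illustrative heuristic only, not scikit-learn's actual rule.
        if n_samples > 200 and n_components < 10:
            return "arpack"      # few components of a large matrix: iterative solver
        if n_samples >= 10 * n_components:
            return "randomized"  # randomized methods pay off when k << n
        return "dense"           # small problem: exact LAPACK solver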

@ogrisel (Member) commented May 26, 2025

Sorry for the lack of feedback, I plan to come back to review this PR soonish. In the meantime, could you please expand the right-hand side of this benchmark plot: #31247 (comment)

to go to even larger numbers of data points and/or components? Also, could you please try log-scaled axes to ease extrapolation, assuming the gap between the two methods is not a fixed constant?
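A sketch of the requested log-log presentation (the timing curves below are synthetic placeholders shaped like the expected asymptotics, not measurements); on log-log axes a power-law cost appears as a straight line, which makes extrapolating the gap straightforward:

    import matplotlib.pyplot as plt
    import numpy as np

    n = np.array([1_000, 2_000, 5_000, 10_000, 20_000], dtype=float)
    # Placeholder curves only: dense eigh is O(n^3), randomized roughly O(n^2 k)
    t_dense = 1e-9 * n**3
    t_rand = 1e-7 * n**2

    plt.loglog(n, t_dense, "o-", label="dense")
    plt.loglog(n, t_rand, "s-", label="randomized_value")
    plt.xlabel("n_samples")
    plt.ylabel("fit time (s)")
    plt.legend()
    plt.show()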
